A Bioinformatics Approach to the Identification, Classification, and Analysis of Hydroxyproline-Rich Glycoproteins

The genomics era has produced vast amounts of biological data that await examination. In order to "mine" such data effectively, a bioinformatics approach can be utilized to identify genes of interest, subject them to various in silico analyses, and extract relevant biological information on them from various public databases. Examination of such data produces novel insights with respect to the genes in question and can be used to facilitate and guide further research in the field. Such is the case here, where bioinformatics tools were developed to identify, classify, and analyze members of the Hyp-rich glycoprotein (HRGP) superfamily encoded by the Arabidopsis (Arabidopsis thaliana) genome.
HRGPs are a superfamily of plant cell wall proteins that are subdivided into three families, arabinogalactan proteins (AGPs), extensins (EXTs), and Pro-rich proteins (PRPs), and have been extensively reviewed (Showalter, 1993; Kieliszewski and Lamport, 1994; Nothnagel, 1997; Cassab, 1998; José-Estanyol and Puigdomènech, 2000; Seifert and Roberts, 2007). However, it has become increasingly clear that the HRGP superfamily is perhaps better represented as a spectrum of molecules ranging from the highly glycosylated AGPs to the moderately glycosylated EXTs and finally to the lightly glycosylated PRPs. Moreover, hybrid HRGPs, composed of HRGP modules from different families, and chimeric HRGPs, composed of one or more HRGP modules within a non-HRGP protein, also can be considered part of the HRGP superfamily. Many HRGPs, particularly the EXTs and PRPs, are composed of repetitive protein sequences, and many, particularly the AGPs, have low sequence similarity to one another. As a result, BLAST searches typically identify only a few closely related family members and are not a particularly effective means of identifying members of the HRGP superfamily in a comprehensive manner.
Building upon the work of Schultz et al. (2002) that focused on the AGP family, a new bioinformatics software program, BIO OHIO, developed at Ohio University, makes it possible to search all 28,952 proteins encoded by the Arabidopsis genome and identify putative HRGP genes. Two distinct types of searches are possible with this program. First, the program can search for biased amino acid compositions in the genome-encoded protein sequences. For example, classical AGPs can be identified by their biased amino acid compositions of greater than 50% Pro (P), Ala (A), Ser (S), and Thr (T), as indicated by greater than 50% PAST. Similarly, arabinogalactan peptides (AG peptides) are identified by biased amino acid compositions of greater than 35% PAST, but the protein (i.e. peptide) must also be between 50 and 90 amino acids in length. Likewise, PRPs can be identified by a biased amino acid composition of greater than 45% PVKCYT. Second, the program can search for specific amino acid motifs that are commonly found in known HRGPs. For example, SP₄ pentapeptide and SP₃ tetrapeptide motifs are associated with EXTs, a fasciclin H1 motif is found in fasciclin-like AGPs (FLAs), and PPVX(K/T) (where X is any amino acid) and KKPCPP motifs are found in several known PRPs (Fowler et al., 1999). In addition to searching for HRGPs, the program can analyze proteins identified by a search. For example, the program checks for potential signal peptide sequences and glycosylphosphatidylinositol (GPI) plasma membrane anchor addition sequences, both of which are associated with HRGPs (Showalter, 1993, 2001; Youl et al., 1998; Sherrier et al., 1999; Svetek et al., 1999). Moreover, the program can identify repeated amino acid sequences within a protein and can search for biased amino acid compositions within a sliding window of user-defined size, making it possible to identify HRGP domains within a protein sequence.
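The composition-based searches described above can be illustrated with a minimal Python sketch. This is not the BIO OHIO implementation; the function names, the dictionary keys, and the use of strict "greater than" comparisons are assumptions made for illustration (elsewhere the text phrases the same thresholds as "at least").

```python
def composition_percent(seq: str, residues: str) -> float:
    """Percentage of positions in `seq` occupied by any residue in `residues`."""
    seq = seq.upper()
    if not seq:
        return 0.0
    wanted = set(residues)
    hits = sum(1 for aa in seq if aa in wanted)
    return 100.0 * hits / len(seq)


def composition_flags(seq: str) -> dict:
    """Apply the three biased-composition rules quoted in the text.

    Hypothetical helper: a True flag marks a candidate only, which would
    still need the individual follow-up examination the paper describes.
    """
    past = composition_percent(seq, "PAST")
    pvkcyt = composition_percent(seq, "PVKCYT")
    return {
        "classical_AGP_candidate": past > 50,
        "AG_peptide_candidate": past > 35 and 50 <= len(seq) <= 90,
        "PRP_candidate": pvkcyt > 45,
    }
```

A single sequence can satisfy more than one rule, which is consistent with the report that the 50% PAST search also returned EXTs and PRPs.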
Here, we report on the use of this bioinformatics program in identifying, classifying, and analyzing members of the HRGP superfamily (i.e. AGPs, EXTs, PRPs, hybrid HRGPs, and chimeric HRGPs) in the genetic model plant Arabidopsis. An overview of this bioinformatics approach is presented in Figure 1. In addition, public databases and programs were accessed and utilized to extract relevant biological information on these HRGPs in terms of their expression patterns, most similar sequences via BLAST analysis, available genetic mutants, and coexpressed HRGP, glycosyl transferase (GT), prolyl 4-hydroxylase (P4H), and peroxidase genes in Arabidopsis. This information provides new insight into the HRGP superfamily and can be used by researchers to facilitate and guide further research in the field. Moreover, the bioinformatics tools developed here can be readily applied to protein sequences from other species to analyze their HRGPs or, for that matter, any given protein family by altering the input parameters.

Finding and Classifying AGPs
The BIO OHIO program was used to identify potential classical AGPs, including the Lys-rich classical AGPs, AG peptides, and chimeric AGPs (i.e. FLAs and other chimeric AGPs) from the Arabidopsis proteome (Table I). The program initially identified 64 possible classical AGPs by searching for biased amino acid compositions of at least 50% PAST. Similarly, 86 potential AG peptides were identified by searching for proteins between 50 and 90 amino acids in length with biased amino acid compositions of at least 35% PAST. Finally, 25 potential FLAs were identified by searching for the following fasciclin H1 motif: [MALIT]T[VILS][FLCM][CAVT][PVLIS][GSTKRNDPEIV]+[DNS][DSENAGE]+[ASQM]. The 175 proteins identified by the program were further examined individually to determine if they appeared to be AGPs. The presence of a signal peptide was one such factor, as was the presence and location of AP, PA, SP, and TP repeats, since these dipeptide sequences are often present in known AGPs (Nothnagel, 1997). Finally, the presence of a GPI anchor addition sequence provided additional support, although not all AGPs have this sequence. By these criteria, 64 of the original 175 were classified as AGPs; moreover, they fall into several distinct classes: 20 classical AGPs, three Lys-rich (classical) AGPs, 16 AG peptides, 21 chimeric FLAs, three chimeric plastocyanin AGPs (PAGs), and one other chimeric AGP (Tables I and II). Additionally, one other AGP was documented in the literature, AGP30, a nonclassical or chimeric AGP, but was not identified by the program given that its PAST value of 34% was below the 50% threshold value used by the program (Baldwin et al., 2001; van Hengel and Roberts, 2003). Consequently, this AGP was added to the list of AGPs appearing in Table II but was not counted in Table I. In addition, four PRPs (PRP18, PRP5, PRP6, PRP16), 20 EXTs (EXT40, EXT17, EXT38, EXT19, EXT22, EXT18, EXT15, EXT7, EXT9, EXT10, EXT2, EXT11, EXT13, EXT16, EXT6, EXT12, EXT14, EXT8, EXT20, EXT21), and three hybrid AGP/EXTs (HAEs; HAE1, HAE3, HAE4) were identified by the program using the 50% PAST rule; further information on these HRGP sequences is presented below. Some AGPs, particularly chimeric AGPs, can be below the 50% PAST threshold but were identified by searching the Arabidopsis protein database annotations and then subjecting such proteins to further analysis (i.e. searching for signal peptides, AP, PA, SP, and TP repeats, or GPI anchor addition sequences). With this approach, 21 additional AGPs were found, including two classical AGPs (AGP50C and AGP57C), 14 PAGs, and five other chimeric AGPs, including AGP30. The locus identifiers of these sequences are indicated in italics in Table II.
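The fasciclin H1 motif quoted above is already written in a form that most regular-expression engines accept. The following Python sketch shows one way such a motif search, together with the location of AP, PA, SP, and TP dipeptides, might be expressed. The function names are hypothetical and the code is an illustration under those assumptions, not the program used in this study.

```python
import re

# The fasciclin H1 motif from the text, written as one regular expression.
FASCICLIN_H1 = re.compile(
    r"[MALIT]T[VILS][FLCM][CAVT][PVLIS][GSTKRNDPEIV]+[DNS][DSENAGE]+[ASQM]"
)


def has_fasciclin_h1(seq: str) -> bool:
    """True if the sequence contains at least one match to the H1 motif."""
    return FASCICLIN_H1.search(seq.upper()) is not None


def dipeptide_positions(seq: str, dipeptides=("AP", "PA", "SP", "TP")) -> dict:
    """Map each dipeptide to the 0-based start positions where it occurs."""
    seq = seq.upper()
    return {
        dp: [i for i in range(len(seq) - 1) if seq[i:i + 2] == dp]
        for dp in dipeptides
    }
```

Inspecting the returned positions indicates whether the repeats are spread throughout a protein or confined to one region, which is the distinction later used to separate classical from chimeric AGPs.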
With the addition of these AGPs from the protein database annotations, the total number of potential AGPs became 85 and included 22 classical AGPs, three Lys-rich classical AGPs, 16 AG peptides, 21 chimeric FLAs, 17 chimeric PAGs, and six other chimeric AGPs (Table II). Representative amino acid sequences of these potential AGPs, including the predicted locations of their signal peptides and GPI anchor addition sequences, are displayed in Figure 2 and Supplemental Figure S1. The classical AGPs ranged in size from 87 to 739 amino acids. The majority (19 of 22) were predicted to have a signal peptide, and many (14 of 22) were also predicted to have a GPI anchor. The Lys-rich classical AGPs ranged in size from 185 to 247 amino acids. All three were predicted to have a signal peptide, but only two were predicted to have a GPI anchor. The AG peptides ranged in size from 58 to 87 amino acids. All 16 AG peptides were predicted to have a signal peptide, but only 12 were predicted to have a GPI anchor. The FLAs ranged in size from 247 to 462 amino acids. The majority (20 of 21) were predicted to have a signal peptide, but only 11 were predicted to have a GPI anchor. The FLAs are a type of chimeric AGP; each FLA contains either one or two AGP domains. Such AGP domains were readily visualized with the BIO OHIO program by utilizing the sliding windows feature to search for biased amino acid sequences within a user-defined amino acid window size (e.g. 80% PAST in a 10-amino acid window) that slides along the protein sequence. Usually, such domains were also apparent by examining the location of the AP, PA, SP, and TP repeat units, which was easily done by the BIO OHIO program. The PAGs ranged in size from 177 to 370 amino acids. The 17 PAGs were all predicted to have a signal peptide, and 16 were predicted to have a GPI anchor. The other chimeric AGPs ranged in size from 222 to 826 amino acids. All but one (five of six) of these chimeric AGPs were predicted to have a signal peptide, and only one was predicted to have a GPI anchor as well as a signal peptide.

Figure 1. Bioinformatics workflow diagram summarizing the identification, classification, and analysis of HRGPs (AGPs, EXTs, and PRPs) in Arabidopsis. Classical AGPs were defined as containing greater than 50% PAST coupled with the presence of AP, PA, SP, and TP repeats distributed throughout the protein, Lys-rich AGPs were a subgroup of classical AGPs that included a Lys-rich domain, and chimeric AGPs were defined as containing greater than 50% PAST coupled with the localized distribution of AP, PA, SP, and TP repeats. AG peptides were defined to be 50 to 90 amino acids in length and containing greater than 35% PAST coupled with the presence of AP, PA, SP, and TP repeats distributed throughout the peptide. FLAs were defined as having a fasciclin domain coupled with the localized distribution of AP, PA, SP, and TP repeats. Extensins were defined as containing two or more SP₃ or SP₄ repeats coupled with the distribution of such repeats throughout the protein; chimeric extensins were similarly identified but were distinguished from the extensins by the localized distribution of such repeats in the protein; and short extensins were defined to be less than 200 amino acids in length coupled with the extensin definition. PRPs were identified as containing greater than 45% PVKCYT or two or more KKPCPP or PPVX(K/T) repeats coupled with the distribution of such repeats and/or PPV throughout the protein. Chimeric PRPs were similarly identified but were distinguished from PRPs by the localized distribution of such repeats in the protein. Hybrid HRGPs (i.e. AGP/EXT hybrids) were defined as containing two or more repeat units used to identify AGPs, extensins, or PRPs. The presence of a signal peptide was used to provide added support for the identification of an HRGP but was not used in an absolute fashion. Similarly, the presence of a GPI anchor addition sequence was used to provide added support for the identification of classical AGPs and AG peptides, which are known to contain such sequences. BLAST searches were also used to provide some support to our classification if the query sequence showed similarity to other members of an HRGP subfamily. Note that some AGPs, particularly chimeric AGPs, and PRPs were identified from an Arabidopsis database annotation search and that two chimeric extensins were identified from the primary literature as noted in the text.
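The sliding-window search for AGP domains (for example, 80% PAST in a 10-amino acid window) can be sketched as follows. This is a minimal illustration, not the BIO OHIO code. The function names are hypothetical, the treatment of the threshold as inclusive is an assumption, and the merging of overlapping windows into domains is an added convenience that the text does not describe.

```python
def biased_windows(seq, residues="PAST", window=10, min_percent=80.0):
    """Return 0-based start indices of windows whose composition of
    `residues` is at least `min_percent` (inclusive threshold assumed)."""
    seq = seq.upper()
    wanted = set(residues)
    starts = []
    if window <= 0 or len(seq) < window:
        return starts
    # Count hits in the first window, then update incrementally as it slides.
    count = sum(1 for aa in seq[:window] if aa in wanted)
    for start in range(len(seq) - window + 1):
        if start > 0:
            count -= int(seq[start - 1] in wanted)           # residue leaving
            count += int(seq[start + window - 1] in wanted)  # residue entering
        if 100.0 * count / window >= min_percent:
            starts.append(start)
    return starts


def merge_windows(starts, window):
    """Merge overlapping or adjacent windows into (start, end) spans,
    with `end` exclusive. `starts` must be in ascending order."""
    domains = []
    for s in starts:
        if domains and s <= domains[-1][1]:
            domains[-1] = (domains[-1][0], s + window)
        else:
            domains.append((s, s + window))
    return domains
```

Because a window can qualify while still containing a few non-PAST residues, the merged span may extend slightly beyond the biased region itself; a smaller window or a higher threshold tightens the boundaries.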
BLAST analysis was also conducted using The Arabidopsis Information Resource (TAIR) WU-Blast 2.0 to identify other potential AGP sequences and to provide insight into AGP sequences with the greatest similarity (Table II; Supplemental Table S1). BLAST searches were initially conducted with the filtering option on, but they were repeated with filtering off for those searches that found no other HRGPs. Such analysis showed that not all AGPs can be found with this method, but it did reveal sequences showing high degrees of similarity. BLAST was most successful for locating other FLAs and PAGs. In other words, a BLAST search using any one FLA sequence found most, but typically not all, other known FLA sequences.

AGP Gene Expression and Coexpressed HRGPs, GTs, P4Hs, and Peroxidases
In order to elucidate patterns of gene expression for these predicted AGPs, three public databases were searched: Genevestigator (https://www.genevestigator.ethz.ch/), the Arabidopsis Membrane Protein Library (http://www.cbs.umn.edu/arabidopsis/), and the Arabidopsis Massively Parallel Signature Sequencing (MPSS) Plus Database (http://mpss.udel.edu/at/). While about half of the AGPs had a broad range of expression throughout the plant, the other half showed organ-specific expression. Notably, several AGPs were specifically or preferentially expressed in the pollen, while others were expressed in roots, stems, leaves, and siliques (Table II; Supplemental Figs. S2-S5). Moreover, in examining the expression levels of all the AGP genes, the ones specifically or preferentially expressed in the pollen were the most highly expressed, as indicated by their high relative signal intensities. Furthermore, there was no observed correlation between organ-specific expression and a particular AGP class or between environmental stress-induced expression and a particular AGP class.
In order to elucidate HRGP gene networks and identify genes involved with AGP biosynthesis, the AGP genes were next examined with respect to coexpressed genes using The Arabidopsis Co-Response Database (http://csbdb.mpimp-golm.mpg.de/csbdb/dbcor/ath.html; Table III; Supplemental Table S2). Unfortunately, 39 of the 85 AGPs had no coexpression data available, so the following information was based on the 46 AGPs for which data were available. In analyzing the data, a focus was placed not only on other HRGPs but on GTs, P4Hs, and peroxidases, since GTs and P4Hs, and possibly peroxidases (Kjellbom et al., 1997), are responsible for posttranslational modification of AGPs. In terms of AGPs being expressed with other HRGPs, a total of 73 HRGPs were coexpressed with one or more AGPs. Among all HRGPs, FLA7 was coexpressed with the most AGPs, a total of 22 different AGPs. Interestingly, several different EXT and PRP genes were also coexpressed with numerous AGP genes. For the GTs, 27 of the 42 members of the GT2 family, 17 of the 42 members of the GT8 family, 11 of the 33 members of the GT47 family, and two of the three members of the GT29 family were coexpressed with various AGPs, to name just a few. Most notably, two members of the GT47 family (At5g22940 and At4g38040) were found to be coexpressed with 17 and 15 AGP genes, respectively. Also notable was the one member of the GT29 family (At1g08660) that was coexpressed with 14 different AGP genes and the three members of the GT8 family (At1g24170, At5g47780, At1g13250) that were coexpressed with 13, 11, and 10 different AGPs, respectively. In conducting this GT analysis, it was observed that not all of the CAZY members are annotated as GTs in the coexpression database. Consequently, coexpressed genes had to be cross-referenced against the gene identifiers listed in the CAZY database. For the P4Hs, five of 13 members of the P4H gene family were coexpressed with various AGPs. Among these, one P4H gene (At3g06300 or P4H2) was coexpressed with 10 different AGPs. Many peroxidase genes showed evidence of coexpression. The greatest amount of coexpression was exhibited by At4g26010, which was coexpressed with 13 different AGPs.
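The tallies reported above (the number of AGP genes with which each gene is coexpressed, followed by cross-referencing against a list of family members) amount to a simple counting step. The sketch below illustrates that step in Python under stated assumptions: the input format, the function names, and the placeholder gene names in the usage example are all hypothetical and do not come from the Co-Response Database or the CAZY database.

```python
from collections import Counter


def coexpression_tally(coexpressed):
    """Count, for each partner gene, how many query genes list it.

    `coexpressed` maps a query gene ID to an iterable of gene IDs reported
    as coexpressed with it (assumed input format).
    """
    tally = Counter()
    for query, partners in coexpressed.items():
        for gene in set(partners):      # ignore duplicate listings
            if gene != query:           # a gene is not its own partner
                tally[gene] += 1
    return tally


def restrict_to_family(tally, family_ids):
    """Keep only partners whose IDs appear in a reference list of family
    members (for example, identifiers taken from an external database)."""
    family = set(family_ids)
    return {gene: n for gene, n in tally.items() if gene in family}
```

A usage example with placeholder identifiers:

```python
toy = {
    "queryA": ["gt_x", "gt_y", "hrgp_z"],
    "queryB": ["gt_x", "hrgp_z", "hrgp_z"],
    "queryC": ["gt_x"],
}
counts = coexpression_tally(toy)
gt_counts = restrict_to_family(counts, ["gt_x", "gt_y", "gt_unused"])
```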

AGP Gene Organization and Mutants
Information was extracted from the TAIR and SALK Web sites with regard to the gene structure and available genetic mutants for each of the predicted AGP genes. The AGP genes contained few, if any, introns. Of the 85 AGPs, 46 had no introns and 32 had only one intron (Table II; Supplemental Table S3). One chimeric AGP (At5g21160 or AGP32I), however, was predicted to have 14 introns.
Examination of the various mutant lines available for research showed that nearly 99% (84 of 85) of the AGP genes had one or more mutants available. Of these mutants, 33% were in the promoter region, 19% were in the 5′ untranslated region (UTR), 25% were in an exon, 6% were in an intron, and 17% were in the 3′ UTR (Table II; Supplemental Table S4).

Finding and Classifying EXTs
The BIO OHIO program was used to identify potential EXTs by searching for SP₃ and SP₄ sequences repeated two or more times (Table IV). The program initially identified 114 and 63 potential EXTs by searching for these tetrapeptide and pentapeptide repeats, respectively.
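The repeat-based EXT search can be sketched in Python as a count of Ser-Pro₃ (SPPP) and Ser-Pro₄ (SPPPP) occurrences. The function names and dictionary keys are hypothetical, and this is an illustration rather than the program's actual code. Because every SPPPP contains an SPPP, any protein that passes the SP₄ test in this sketch also passes the SP₃ test.

```python
def sp_repeat_counts(seq: str) -> dict:
    """Count SP3 (SPPP) and SP4 (SPPPP) occurrences in a protein sequence.

    str.count is sufficient here: each motif starts with its only S, so two
    occurrences of the same motif can never overlap.
    """
    seq = seq.upper()
    return {"SP3": seq.count("SPPP"), "SP4": seq.count("SPPPP")}


def passes_ext_search(seq: str, motif: str = "SP3", min_repeats: int = 2) -> bool:
    """True if the chosen motif occurs at least `min_repeats` times."""
    return sp_repeat_counts(seq)[motif] >= min_repeats
```

As with the composition searches, a positive result marks a candidate only; the hits would still need individual examination.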
The 114 and 63 proteins identified by the program were further examined individually to determine if they appeared to be EXTs, with the realization that the 63 proteins are a subset of the 114. The presence of a signal peptide was one such factor, as was the presence and location of SP₃, SP₄, and SP₅ repeats, since these peptide sequences are often present in known EXTs. GPI anchor addition sequences are not known to be associated with EXTs; nonetheless, testing for the presence of such a sequence was performed out of curiosity. By these criteria, 57 of the 114 and 50 of the 63 proteins were classified as EXTs. The SP₄ criterion thus yielded a high percentage of EXT sequences but did not locate all potential EXTs, whereas the SP₃ criterion found more EXTs but with a higher rate of false positives. Subsequent analysis involved examining the 57 EXT sequences and attempting to classify them. Based upon the repeat sequences found in these EXTs, they were placed into nine classes: three SP₅ EXTs, two SP₅/SP₄ EXTs, 12 SP₄ EXTs, two SP₄/SP₃ EXTs, one SP₃ EXT, 12 short EXTs, 11 (chimeric) Leu-rich repeat EXTs (LRXs) that include pollen

Table II footnotes:
a Italics indicate a protein found using the Arabidopsis database annotation search.
b Boldface indicates a protein that was not previously identified by Schultz et al. (2002). The letter designations in the names represent the following: C, classical AGP; P, AG peptide; K, Lys-rich classical AGP; I, chimeric AGP.
c Signal peptide.
d Indicates the number of mutants available in each location: P, promoter; 5, 5′ UTR; E, exon; I, intron; 3, 3′ UTR.
e Underline indicates the result of a BLAST search with filtering turned off.
f nr, Not reported. This indicates that data for a particular protein are not found in Genevestigator, Arabidopsis Membrane Protein Library, or MPSS.
g Experimentally found to be GPI anchored (Schultz et al., 2004).
The Arabidopsis protein database annotations were searched, but no additional EXTs were found beyond those already identified by the program. Additionally, four other PERKs were documented in the literature but were not identified by the program, because three (At5g24400 or PERK2, At1g68690 or PERK9, At4g32710 or PERK14) were not included in the Arabidopsis protein database and one (At1g52290 or PERK15) found in the database contained only one SPP. The PERK14 sequence was subsequently found on the TAIR Web site but lacked SP₃/SP₄ repeats. Nonetheless, PERK14 and PERK15, being members of the PERK family and having publicly available sequences, were added in italics to the list of EXTs appearing in Table V and subjected to subsequent analyses. PERK2 and PERK9 were described as pseudogenes on the TAIR Web site and had no sequences available. Thus, they were not added to the table or analyzed further. In addition, two AGPs (AGP9C, AGP19K) and four HAEs (HAE1, HAE2, HAE3, HAE4) were identified by the program using the SP₃ rule. Analysis of these AGP sequences was already presented in the AGP section above; however, the four hybrid HRGPs were considered here along with the EXT family members.
The three other chimeric EXTs were annotated in the Arabidopsis protein database as late embryogenesis abundant protein (EXT50), expressed protein (EXT51), and plastocyanin-like protein (EXT52). EXT50, EXT51, and EXT52 contained five, seven, and three SP₄ repeats, respectively. EXT51 also contained numerous TP and SP repeats, reminiscent of AGPs.
A hybrid HRGP was defined as a protein that contains sequence characteristics of different HRGPs, such as EXT and AGP sequence modules, within the same protein. The four hybrid proteins identified in the EXT search had sequence characteristics of both EXTs and AGPs. Three of these hybrids, HAE1, HAE3, and HAE4, were identified because they passed an EXT test as well as the classical AGP test, having at least 50% PAST and multiple PA and TP repeats. The other hybrid, HAE2, contained two SP₄ repeats and one additional SP₃ module but did not pass the 50% PAST threshold, having only 43% PAST. Nonetheless, it contained multiple AP, PA, SP, and TP repeats, which are indicative of AGPs.
BLAST analysis was also conducted with each of the EXTs, chimeric EXTs, and HAEs to identify other related sequences and to provide insight into EXT sequences with the greatest similarity (Table V; Supplemental Table S1). Such analysis showed that not all EXTs were found with this method, but it did reveal sequences showing high degrees of similarity and clearly identified many more potential EXT sequences than the same strategy did for the AGPs. Such BLAST analysis of LRXs and PERKs proved especially effective, as a BLAST query using any one LRX or PERK resulted in the identification of all other members in their respective class. Analysis of the other chimeric EXTs revealed that only EXT52 resulted in BLAST hits; these hits were PAG17, PAG9, and PAG10. This result was expected, since EXT52 contains a plastocyanin domain along with the EXT motifs. BLAST analysis of the At4g11430 hybrid HRGP (HAE3) as the query sequence showed similarity to both AGP and EXT genes, providing support for its identification as a hybrid HRGP. BLAST results for the other HAEs were less informative, with HAE1 showing similarity to no other HRGPs and HAE2 and HAE4 showing similarity to only one PRP and multiple chimeric PRPs, respectively.
As seen in Table V and in Supplemental Figure S6, the 20 SP₅, SP₅/SP₄, SP₄, SP₄/SP₃, and SP₃ EXTs ranged in size from 212 to 1,018 amino acids. The majority (17 of 20) were predicted to have a signal peptide, and none was predicted to have a GPI anchor. The 12 short EXTs ranged in size from 96 to 181 amino acids. All but one was predicted to have a signal peptide, and surprisingly, seven were predicted to have a GPI anchor. The 11 LRXs ranged in size from 433 to 956 amino acids and consisted of an N-terminal Leu-rich repeat domain and a C-terminal EXT domain. All but two were predicted to have a signal peptide, and none was predicted to have a GPI anchor. The 13 PERKs ranged in size from 509 to 760 amino acids and consisted of an N-terminal EXT domain and a C-terminal kinase domain. None was predicted to have a signal peptide or a GPI anchor. The three chimeric EXTs contained three to seven diagnostic EXT repeats; two had signal peptides, and none contained GPI anchor addition sequences. The four HAEs contained 219 to 375 amino acids; three had a signal peptide and none had GPI anchor addition sequences. The EXT domains/motifs in the LRXs, PERKs, and other chimeric EXTs as well as the EXT/AGP hybrids were readily visualized with the BIO OHIO program by observing the locations of the SP₃, SP₄, and SP₅ repeat units.

EXT Gene Expression and Coexpressed HRGPs, GTs, P4Hs, and Peroxidases
In order to elucidate patterns of gene expression for these predicted EXTs, including the various chimeric EXTs and four HRGP hybrids, the same three public databases were searched as with the AGPs. While several EXTs had a broad range of expression throughout the plant, most of the EXT genes showed organ-specific expression. Notably, several EXTs were specifically or preferentially expressed in the root (27), while several others were specifically or preferentially expressed in the pollen/stamen (14) or siliques (one; Table V; Supplemental Figs. S7-S10). Moreover, in examining the expression levels of all the EXT genes, many of those specifically or preferentially expressed in the pollen were the most highly expressed ones, as indicated by their high relative signal intensities.
Next, the EXT and hybrid HRGP genes were examined with respect to coexpressed genes (Table VI; Supplemental Table S5). For EXTs, there was no information for 29 out of the 59 genes in The Arabidopsis Co-Response Database, and the four hybrid HRGP genes were also not listed in this database. In analyzing the data, a focus was placed not only on other HRGPs but on GTs, P4Hs, and peroxidases, since GTs, P4Hs, and EXT peroxidases are responsible for posttranslational modification of EXTs; this approach represents one potential avenue to identify genes involved in the posttranslational modification of EXTs. In terms of EXTs being expressed with other HRGPs, a total of 67 HRGPs were coexpressed with one or more EXTs. The most highly coexpressed HRGP was FLA2, which was coexpressed with a total of 15 EXTs, while FLA9 was next on the list, being coexpressed with 14 EXTs. As reported above, FLA2 and FLA9 were also coexpressed with many AGP genes. A number of EXT genes, including EXT9, EXT13, EXT14, EXT6, EXT10, EXT2, and LRX4, were also coexpressed with 10 or more EXT genes.
For the GTs, the most coexpressed was CslB04, a member of the GT2 family, which was coexpressed with nine EXTs. Also highly coexpressed were At1g24170 (GT8), At1g74380 (GT34), At4g15290 (GT2), and At5g22940 (GT47), all of which were coexpressed with seven EXTs. Notably, several of the GTs that were coexpressed with EXTs were also coexpressed with AGPs. For example, one member of the GT8 family, At1g24170, was coexpressed with seven different EXTs and 13 different AGPs. For the P4Hs, four of 13 members of the P4H gene family were coexpressed with various EXTs. Among these, one P4H gene (At3g06300 or P4H2) was coexpressed with six different EXTs. As reported above, this P4H gene was also coexpressed with 10 different AGPs. Many peroxidase genes were coexpressed, but the greatest amount of coexpression was exhibited by At1g05240, At3g49960, At4g26010, At5g17820, and At5g67400, which were all coexpressed with eight different EXTs. Interestingly, these same peroxidase genes were coexpressed with the greatest number of AGP genes as well (Table III). Given that EXTs are known to be cross-linked at YXY sequence motifs by an EXT peroxidase with an acidic pI, it was interesting to observe that the At3g03670-encoded peroxidase, which had a predicted endomembrane localization and a predicted pI of 4.8, was coexpressed with two of the three EXTs containing the greatest numbers of YXY sequence repeats (i.e. EXT20 and EXT21).

Table V footnotes:
a Italics indicates a protein that did not meet our search criteria but was identified previously in the primary literature.
b Boldface indicates a protein that was not previously identified in the primary literature or by Johnson et al. (2003b).
e Underline indicates the result of a BLAST search with filtering turned off.
f Not reported. This indicates that data for a particular protein are not found in Genevestigator, Arabidopsis Membrane Protein Library, or MPSS.

Bioinformatics of Hydroxyproline-Rich Glycoproteins. Plant Physiol. Vol. 153, 2010

EXT Gene Organization and Mutants
Information was extracted from the TAIR and SALK Web sites with regard to the gene structure and available genetic mutants for each of the predicted EXTs. With the exception of the PERK genes, EXT genes including the four HRGP hybrid genes contain few, if any, introns (Table V; Supplemental Table S6). Of the 46 non-PERK EXT genes, 36 had no introns and eight had only one or two introns. All four HAEs contained either zero or one intron. One chimeric EXT (At3g11030), however, was predicted to have four introns. In contrast, the PERK genes contained between six and eight introns.
Examination of the various mutant lines available for research showed that all of the EXT genes (including HAEs) had one or more mutants available. Of these mutants, 29% were in the promoter region, 17% were in the 5′ UTR, 30% were in an exon, 4% were in an intron, and 20% were in the 3′ UTR (Table V; Supplemental Table S7).

Finding and Classifying PRPs
The BIO OHIO program was used to identify potential PRPs primarily by searching for proteins with a biased amino acid composition of at least 45% PVKCYT. In addition, PRPs were identified by searching for KKPCPP and PPVX(K/T) sequences repeated two or more times (Fowler et al., 1999). The program initially identified 113 potential PRPs by searching for 45% PVKCYT and identified 13 and two potential PRPs by searching for the PPVX(K/T) and KKPCPP repeats, respectively. Eleven of these 13 potential PRPs and both of these two potential PRPs were also identified with the 45% PVKCYT search criteria (Table VII).
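The three PRP search criteria can be combined in a short Python sketch. The names below are hypothetical, the regular expression `PPV.[KT]` is one reading of PPVX(K/T) with X as any residue, and combining the criteria with a logical OR is an assumption made for illustration rather than a description of the actual program.

```python
import re

# PPVX(K/T): Pro-Pro-Val, any residue, then Lys or Thr.
PPVXKT = re.compile(r"PPV.[KT]")


def prp_evidence(seq: str) -> dict:
    """Collect the three pieces of PRP evidence named in the text."""
    seq = seq.upper()
    wanted = set("PVKCYT")
    percent = 100.0 * sum(1 for aa in seq if aa in wanted) / len(seq) if seq else 0.0
    return {
        "pvkcyt_percent": percent,
        "KKPCPP": seq.count("KKPCPP"),          # occurrences cannot overlap
        "PPVX(K/T)": len(PPVXKT.findall(seq)),  # non-overlapping matches
    }


def is_prp_candidate(seq: str, min_percent: float = 45.0, min_repeats: int = 2) -> bool:
    """True if any one criterion is met (OR combination is an assumption)."""
    ev = prp_evidence(seq)
    return (
        ev["pvkcyt_percent"] >= min_percent
        or ev["KKPCPP"] >= min_repeats
        or ev["PPVX(K/T)"] >= min_repeats
    )
```

Returning the evidence separately from the decision makes it easy to see which criterion flagged a given protein, which matters here because the composition search and the motif searches recover overlapping but not identical sets.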
The 113 proteins identified by the program were further examined individually to determine if they appeared to be PRPs. The presence of a signal peptide was one such factor, as was the presence and location of PPV repeats, since these peptide sequences are often present in known PRPs. The PRPs, like the EXTs, are not known to contain GPI anchor addition sequences, but the presence of such sequences was queried nonetheless. By these criteria, 15 of the 113 were classified as PRPs. The 45% PVKCYT search criteria failed to find all the potential PRP sequences and had a high rate of false positives. In addition to the 15 PRPs, nine AGPs (AGP45P, AGP56C, AGP9C, AGP7C, AGP4C, AGP18K, AGP19K, AGP30I, AGP33I), 31 EXTs (EXT40, EXT17, EXT32, EXT37, EXT41, LRX3, LRX1, EXT39, EXT20, EXT21, EXT3/5, EXT8, EXT7, EXT35, EXT9, EXT10, EXT2, EXT11, EXT13, EXT16, EXT15, EXT18, EXT1/4, EXT22, EXT19, EXT30, PEX3, EXT6, EXT12, EXT14, EXT51), and three hybrid HRGPs (HAE2, HAE3, HAE4) were found with the 45% PVKCYT search. In addition, two AGPs (AGP4C, AGP9C), one EXT (EXT1/4), and one hybrid HRGP (HAE3) were found with the two PPVX(K/T) repeat search; further information on these sequences was presented in the AGP and EXT sections above. Three additional PRPs (PRP8, PRP9, PRP11) did not pass the biased amino acid test but were found instead by a database annotation search. The locus identifiers of these sequences are indicated in italics in Table VIII. With these additional PRPs, 18 total PRPs were found and subjected to further analysis. Six of the 18 PRPs contained a non-HRGP domain along with a PRP domain and thus were classified as chimeric PRPs. The remaining 12 PRPs were not divided further into subclasses (Table VIII). Representative sequences of these two classes of PRPs are shown in Figure 4.
BLAST analysis was conducted to identify other potential PRP sequences and to provide insight into PRP sequences with the greatest similarity (Table VIII; Supplemental Table S1). BLAST was somewhat successful in identifying other PRPs, but not all PRPs can be found with a single BLAST search. Interestingly, the BLAST searches showed that six of the 18 PRPs are similar to AGP30, a nonclassical (chimeric) AGP. In fact, when AGP30 was used as the query sequence in a BLAST search, the top four hits were all PRPs rather than AGPs (Table II; Supplemental Table S1). Also consistent with these findings is the fact that AGP30 was not identified with the traditional 50% PAST search used for AGPs but was found with the 45% PVKCYT search used for PRPs.

PRP Gene Expression and Coexpressed HRGPs, GTs, P4Hs, and Peroxidases
In order to elucidate patterns of gene expression for these predicted PRPs, the same three public databases were searched as with the AGPs and EXTs. While most PRPs had a broad range of expression throughout the plant, several of the PRP genes showed organ-specific expression. Notably, several PRPs were specifically or preferentially expressed in the roots, while other individual PRPs were expressed in the endosperm, shoot apex, and petiole (Table VIII; Supplemental Figs. S12-S15). Moreover, in examining the expression levels of all the PRP genes, endosperm-specific At2g27380 (PRP6) was the most highly expressed one, as indicated by its high relative signal intensity.
Unlike the AGPs and EXTs, the PRPs displayed some common and dramatic (i.e. approximately 8-fold or more) patterns of environmental stress-induced gene expression. For example, eight of the PRP genes (PRP1, were down-regulated by ABA, while two of the PRP genes (PRP6 and -14) were up-regulated by ABA. In addition, three PRPs (PRP2, -3, and -11) were up-regulated by zeatin, three PRPs (PRP 4, were up-regulated by nematode infection, and two PRPs (PRP9 and -10) were up-regulated by Pseudomonas syringae infection.
Next, the PRP genes were examined with respect to coexpressed genes using The Arabidopsis Co-Response Database (Table IX; Supplemental Table S8). Twelve out of the 18 PRPs had data available. In analyzing the data, a focus was placed not only on other HRGPs but on GTs, P4Hs, and peroxidases, since these enzymes are responsible for posttranslational modification of PRPs; this approach represents one potential avenue to identify genes involved in the posttranslational modification of PRPs. In terms of PRPs being expressed with other HRGPs, 46 different HRGPs are coexpressed with at least one PRP. The HRGP showing greatest coexpression was FLA8, which was coexpressed with five PRPs; FLA8 was also coexpressed with 16 AGPs. FLA9 and FLA2, which were coexpressed with many AGPs and EXTs, were each coexpressed with three PRPs. For the GTs, At5g22940 of the GT47 family was coexpressed with six PRPs, twice as many as any other GT. Moreover, At1g24170, a GT8 family member that was coexpressed with many AGPs and EXTs, was not coexpressed with any PRPs. At3g14570 (Gsl04), a member of the GT family 48, was coexpressed with three PRPs; it was also coexpressed with four AGPs but no EXTs.
a Italics indicates a protein found using the Arabidopsis database annotation search.
b Boldface indicates a protein that was not previously identified in the primary literature.
c Signal peptide.
e Underline indicates the result of a BLAST search with filtering turned off.

Bioinformatics of Hydroxyproline-Rich Glycoproteins
Plant Physiol. Vol. 153, 2010
For the P4Hs, two of 13 members of the P4H gene family, At3g06300 (P4H2) and At5g18900 (P4H4), were coexpressed with two and one PRPs, respectively, as well as with many AGPs and EXTs. For the peroxidases, some peroxidase genes were coexpressed with PRPs. The greatest amount of coexpression was exhibited by At1g77490 (tAPX) and At2g22420 (PER17); each was coexpressed with two PRPs. Both of these peroxidases also were coexpressed with EXTs and AGPs.

PRP Gene Organization and Mutants
Information was extracted from the TAIR and SALK Web sites with regard to the gene structure and available genetic mutants for each of the predicted PRP genes. None of the 18 PRPs contained more than three introns, with most containing either zero (eight of 18) or one intron (seven of 18; Table VIII; Supplemental Table S9).
Examination of the various mutant lines available for research showed that all of the PRP genes have one or more mutants available. Of these mutants, 32% were in the promoter region, 14% were in the 5′ UTR, 30% were in an exon, 4% were in an intron, and 20% were in the 3′ UTR (Table VIII; Supplemental Table S10).

DISCUSSION
The BIO OHIO Program for Finding and Analyzing HRGP Genes Based on Biased Amino Acid Compositions and Amino Acid Sequence Motifs
As genomes are sequenced, bioinformatic tools need to be developed to analyze such data efficiently and accurately. Here, we describe one such tool for the purpose of identifying and analyzing HRGPs encoded by nucleic acid sequences. The BIO OHIO software has the ability to identify AGPs, EXTs, and PRPs as well as hybrid and chimeric HRGPs. This program requires only that the protein sequence data be available as a data file, which is routinely generated in a completed genome sequencing project. Here, the BIO OHIO program was used to search the 28,952 protein sequences encoded by the Arabidopsis genome. Several different strategies were used by the program to identify candidate HRGPs. Specifically, the program has the ability to identify proteins meeting a user-defined amino acid composition in full-length proteins or proteins of some defined size. This strategy was effective in identifying candidate classical AGPs, Lys-rich AGPs, AG peptides, and certain PRPs. The program can also be used to identify proteins containing specific, user-defined peptide sequences repeated any number of times. This strategy was used to identify candidate FLAs, EXTs, and certain PRPs. Both strategies were able to identify candidate hybrid and chimeric HRGPs. Another search strategy built into the program is to search for keywords within the annotated Arabidopsis protein database. This approach proved useful in finding some chimeric AGPs and PRPs not identified by the above approaches. In addition, the program can search for signal peptide sequences, GPI anchor addition sequences, and repeating sequences within proteins; such additional information in conjunction with careful examination of the protein sequence was used to manually identify candidate proteins as HRGPs.
This bioinformatics approach has advantages over conventional BLAST searches in terms of speed and accuracy. BLAST searches are time-consuming, requiring much postanalysis data acquisition and analysis after a list of "hits" to a query sequence is obtained. Furthermore, BLAST analyses fail to identify all members of an AGP, EXT, or PRP subfamily, since many of the subfamily members have limited amino acid sequence similarities and/or have various repeated amino acid sequence modules within a given sequence, complicating the alignment process. Nonetheless, BLAST analysis was used here to identify the most closely related sequences to a given HRGP, and by playing a version of the six degrees of separation game, it could be used to identify many, but not all, HRGP members in a time-consuming, convoluted, and laborious endeavor.
Schultz et al. (2002) previously utilized a bioinformatics approach to identify candidate AGP genes from Arabidopsis. In contrast to this study, that work identified only 52 AGPs (14 classical AGPs, three Lys-rich AGPs, 10 AG peptides, 21 [chimeric] FLAs, and four other chimeric AGPs). The additional AGPs found in this study are largely attributed to using an updated Arabidopsis protein database, altering the definition of an AG peptide to include up to 90 amino acids (compared with 75), and analyzing HRGP-related sequences based on annotations in the database. In addition, Schultz et al. (2002) also identified 19 candidate EXT genes as a by-product of searching for AGPs using the greater than 50% PAST amino acid bias. As explained by Johnson et al. (2003b), these 19 genes were subsequently examined for the presence of a signal peptide and SP3 and SP4 repeat units. In contrast, the additional EXTs found in this study are largely attributed to using an updated protein database, to searching for SP3 and SP4 repeats in all the proteins encoded by the genome (not just those proteins passing the 50% PAST test), and to analyzing HRGP-related sequences based on annotations in the database and literature. Johnson et al. (2003b) also reported the existence of 17 PRPs based on searching for proteins with greater than 49% PKVY and greater than 47% PKVL amino acid biases, similar to the findings obtained in this study.
While most of the AGP, EXT, and PRP genes fitting canonical sequence parameters are now identified, identifying chimeric HRGPs, particularly chimeric AGPs, remains a challenge, given that, as for the AGPs, no clear consensus sequence exists. Thus, while we have identified six chimeric AGPs in addition to the FLAs and PAGs, it is likely that other proteins contain AGP modules. For instance, two homologous Arabidopsis genes, At5g64080 and At2g13820, designated Arabidopsis XYLOGEN PROTEIN1 (AtXYP1) and AtXYP2, respectively, are known to contain AGP-like regions, but they were not identified in our searches. A glimpse of other such chimeric AGPs was provided in a previous study, where putative GPI-anchored proteins were identified by bioinformatics to reveal not only numerous GPI-anchored AGPs but also approximately 50 other proteins containing AGP sequence modules, but annotated as phytocyanins, stellacyanin-like, uclacyanin-like, early nodulin-like, COBRA, β-(1,3)-glucanases, aspartyl proteases, LTPL, SKU5, receptor-like kinases, and other unknown or hypothetical proteins (Borner et al., 2003).
In order to identify such chimeric AGPs, the sliding windows feature of the BIO OHIO program was utilized. Specifically, the Arabidopsis protein database was searched using windows of 10, 20, and 30 amino acids and searching for greater than 80%, 90%, and 95% PAST. In order to find all 85 AGPs identified in our searches with a sliding windows approach, an amino acid composition of greater than 60% PAST is required with a window size of 10 amino acids. While this approach finds all of the AGPs predicted by our searches, it produces many false positives in the process, making this approach of limited usefulness in initial searches on its own. However, the sliding windows feature is especially useful to identify single or multiple AGP modules in chimeric AGPs when identified by other approaches.
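The sliding windows search described above can be illustrated with a minimal Python sketch. It is not the BIO OHIO Perl code; the function name `past_windows`, its parameters, and the demonstration sequence are assumptions introduced only for this example. The defaults follow the text (a window of 10 amino acids and greater than 60% PAST).

```python
def past_windows(seq, window=10, threshold=0.60, residues="PAST"):
    """Return (start, subsequence, fraction) for every window whose
    fraction of `residues` is strictly greater than `threshold`.
    `start` is a 0-based index into `seq`."""
    hits = []
    # The last window starts at len(seq) - window; shorter inputs give no windows.
    for start in range(len(seq) - window + 1):
        sub = seq[start:start + window]
        frac = sum(sub.count(r) for r in residues) / window
        if frac > threshold:
            hits.append((start, sub, frac))
    return hits


# Hypothetical sequence for demonstration only (not an Arabidopsis protein).
demo = "MKLLVGQEDKSPAPSPTAPSPAGKDELLK"
for start, sub, frac in past_windows(demo):
    print(start, sub, f"{frac:.0%}")
```

Because every window is tested independently, a short PAST-rich stretch inside an otherwise unrelated protein is reported, which is consistent with both the usefulness of the approach for locating AGP modules and the many false positives noted above.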
Laboratory experimentation has verified and validated this in silico approach to identifying HRGPs. With respect to the AGPs, reports on several cloned AGP genes and/or characterized AGP glycoproteins in Arabidopsis exist and substantiate predictions made by the program (Schultz et al., 2000, 2004; Johnson et al., 2003a; van Hengel and Roberts, 2003; Sun et al., 2005; Liu and Mehdy, 2007; Yang et al., 2007). Moreover, at the protein level, several of the AGPs predicted here to have signal peptides and GPI anchors are substantiated in these reports. With respect to the EXTs, only three nonchimeric EXT genes (EXT1/4, EXT2, EXT3/5) and several LRXs and PERKs are cloned (Merkouropoulos et al., 1999; Yoshiba et al., 2001; Baumberger et al., 2003b; Nakhamchik et al., 2004). Moreover, both the LRXs and PERKs were previously examined using BLAST and other homology-based genomic tools to identify members of these two chimeric EXT classes, in agreement with the bioinformatics findings presented here (Baumberger et al., 2003a; Nakhamchik et al., 2004). In contrast to the AGPs, there is little information on the EXTs at the glycoprotein level in Arabidopsis. With respect to the PRPs, only four PRPs are cloned in Arabidopsis, namely PRP1, -2, -3, and -4, and little is known about any of the Arabidopsis PRPs from glycoprotein studies (Fowler et al., 1999). Thus, this work extends and consolidates the experimental inventory of HRGPs and makes testable predictions with respect to the presence (or absence) of signal peptides and GPI anchor addition sequences. Although the majority of HRGPs identified by this bioinformatics approach contain signal peptides, several HRGPs do not. It is unknown whether this represents limitations to the predictive power of the program or is due to the possibility that HRGPs lacking such a sequence remain inside the cells or are secreted by an alternative secretory pathway, as reported in some cases (Nickel, 2003; Lee et al., 2004). For instance, all PERKs lack a signal peptide but are localized to the plasma membrane, with the EXT region extending into the cell wall (Nakhamchik et al., 2004). Similarly, while GPI anchors predicted for many AGPs are experimentally verified in several instances, including in Arabidopsis, it was surprising to observe here and elsewhere that several EXTs and one PRP also have predicted GPI anchor addition sequences (Borner et al., 2003), which await biochemical and functional verification at the protein and cell biology levels, respectively.
Four hybrid HRGPs containing AGP and EXT sequence motifs also are encoded by the Arabidopsis genome. These hybrids, like the chimeric HRGPs, complicate the classification system. Indeed, it is human nature to classify things into discrete categories, but the chimeric and hybrid HRGPs remind us that nature cares little for the organizational principles coveted by the human mind. Consequently, it is perhaps best to view the HRGPs as a spectrum of molecules composed of some combination of hyperglycosylated AGP modules, moderately glycosylated EXT modules, lightly glycosylated or nonglycosylated PRP modules, and, in the case of chimeric HRGPs, other non-HRGP modules.

HRGP Gene Expression in Development and in Response to Biotic and Abiotic Stress
Microarray as well as MPSS data are valuable, publicly available genetic resources for the Arabidopsis community, effectively revealing developmental, organ-specific, and stress-specific patterns of gene expression for nearly all of the Arabidopsis genes. These resources can thus provide clues to possible HRGP functions and/or allow researchers to focus their research projects. For example, in looking for phenotypic alterations in a HRGP mutant plant, microarray or MPSS data can guide the researcher in terms of the particular developmental times, organs, or conditions to examine in order to reveal a phenotype. Microarray and MPSS data are available for all but a few HRGPs. The majority of the AGP and EXT genes demonstrate organ-specific expression, while the remaining genes are expressed in multiple organs. Many AGPs, including classical AGPs, AG peptides, and at least one FLA, show pollen-specific expression. Likewise, root-specific AGPs are found in each AGP class. In contrast, pollen-specific expression of the EXT genes is restricted to the chimeric EXTs, most notably to certain LRXs (i.e. PEXs) and PERKs. Root-specific expression is exhibited by certain members of virtually all EXT classes. Approximately half of the PRPs show organ-specific expression, mostly in roots, while the rest are more widely expressed. Clearly, the notion that HRGPs in a particular class have some common organ-specific function appears unlikely, although the idea that certain AGPs are markers of cellular identity is supported by the organ-specific expression patterns revealed here (Knox et al., 1989). Comparing published northern and reverse transcription-PCR data on selected HRGP genes in studies conducted by various researchers with the microarray and MPSS data has consistently resulted in good agreement between these various methods to determine patterns of gene expression.
The recently updated Genevestigator Web site has considerably simplified the process of examining stress-induced gene expression in Arabidopsis microarrays. Virtually all HRGP genes are up- and down-regulated by various abiotic and biotic stress conditions. With the exception of some of the PRP genes, which exhibit common regulatory responses to auxin, zeatin, and infection by nematodes and P. syringae, it is difficult to summarize the diverse array of responses exhibited by the various HRGP genes. However, the coexpression database analysis takes into account these data, making common patterns of regulation much easier to recognize and examine. Nonetheless, if one is interested in a particular HRGP gene or in regulation by a particular stress condition, the data collected here constitute an ideal starting point for verification of this stress-induced gene regulation and for formulating functional hypotheses for particular HRGP genes.

HRGP Networks and Genes Involved in Posttranslational Modification
One unique genetic resource available to Arabidopsis researchers is the coexpression database. This database reports genes that are coexpressed with a gene of interest based on hundreds of different microarray gene analysis experiments. For HRGPs, this coexpression database offers the opportunity to reveal networks of genes associated with a given HRGP gene. In this study, the focus was placed on elucidating HRGP gene networks and on identifying candidate genes involved with the posttranslational modification of HRGPs, including genes involved with prolyl hydroxylation, glycosylation, and cross-linking. With regard to HRGP networks, it was remarkable that certain FLAs, namely FLA2, -7, -8, and -9, were coexpressed with so many different AGPs, EXTs, and PRPs. One interpretation of this result is that these FLAs play important roles in coordinating activities among various HRGP molecules; however, this and other interpretations must await functional characterization of these FLAs. Clearly, HRGP gene networks likely exist, given that sets of HRGP genes appear to be coregulated by a variety of conditions. It is possible that such regulatory networks are controlled by common regulatory sequences found in the HRGP genes. Efforts are currently under way as an extension of this work to identify such sequences using bioinformatics to allow for subsequent experimental testing of these elements and the transcription factors that bind to them.
It was hypothesized that a number of GT genes are expressed in conjunction with various HRGP genes to allow for the coordinated glycosylation of the encoded core protein. Furthermore, it was hypothesized that particular GTs would be responsible for synthesis of the various sugar linkages associated with the arabinogalactan polysaccharides attached to noncontiguous Hyp residues in AGPs, while other GTs would be associated with synthesis of the short arabinoside oligosaccharide chains attached to contiguous Hyp residues in EXTs and PRPs according to the Hyp contiguity hypothesis (Tan et al., 2003). It was also hypothesized that GTs responsible for the addition of single Gal units to Ser residues in EXTs would be found. Moreover, based on the elucidated structures of dicot EXTs (Akiyama et al., 1980) and a well-characterized Hyp-AG isolated from transgenic tobacco (Nicotiana tabacum; Tan et al., 2004), and knowing the specificity of GTs, a minimum of 20 transferase activities are likely to be involved in the O-linked glycosylation of HRGPs. Specifically, for EXTs and PRPs, we predict one Ser-α-galactosyltransferase, at least one Hyp-β-arabinosyltransferase, one α-(1,2)arabinosyltransferase, and two β-(1,2)arabinosyltransferases, while for AGPs, we predict one Hyp-β-galactosyltransferase, one α-(1,5)arabinosyltransferase, at least four α-(1,3)arabinosyltransferases, at least three β-(1,3)galactosyltransferases, three β-(1,6)galactosyltransferases that add the three branch sites on the AG core, at least two β-(1,6)glucuronyltransferases, one α-(1,4)rhamnosyltransferase, and at least two α-(1,2)fucosyltransferases. Indeed, many GT genes are coexpressed with AGPs, EXTs, and PRPs. In fact, 36 different GTs representing 19 families were coexpressed with all three HRGP subfamilies, while some GTs are expressed only with two subfamilies or are restricted to one particular HRGP subfamily. While it is possible to speculate on the activities of these various GTs with respect to HRGPs based on their annotations and proposed mechanisms (i.e. inverting or retaining) in the CAZY database, such speculations would have to be tested by developing appropriate biochemical assays and/or obtaining and biochemically characterizing GT mutants. Indeed, such research is currently under way in a number of cell wall laboratories and is beginning to yield results. For example, it was recently shown that a mutant in the At2g35610 gene, encoding a GT77 family member, results in the production of underarabinosylated EXTs (Gille et al., 2009). Thus, the At2g35610 gene likely encodes one of the arabinosyltransferases required for EXT glycosylation and possibly for clustered Hyp residues in certain AGPs, consistent with the identification of this gene in the coexpression data presented here in Tables VI and III, respectively.
Although only four plant P4Hs are cloned and characterized to date (two [P4H1 and P4H2] from Arabidopsis [Hieta and Myllyharju, 2002; Tiainen et al., 2005], one from tobacco [Yuasa et al., 2005], and one from Chlamydomonas [Keskiaho et al., 2007]), 13 P4H genes are predicted to exist for Arabidopsis (Vlad et al., 2007). The coexpression analysis performed here shows that only one of these P4H genes, namely P4H2, was consistently coexpressed with numerous HRGPs. This indicates that this P4H likely acts on AGPs, EXTs, and PRPs and is not restricted to a particular HRGP subfamily. Unfortunately, no published reports on P4H2 mutants, or any P4H mutants in Arabidopsis, exist at present. Moreover, the genetic redundancy in the P4H family may make such mutant work difficult. Nonetheless, a report that a P4H gene silenced by RNA interference in Chlamydomonas has an altered wall phenotype should bolster similar work in Arabidopsis (Keskiaho et al., 2007).
An acidic EXT peroxidase with EXT cross-linking activity was isolated from tomato (Solanum lycopersicum; Schnabelrauch et al., 1996). It is also likely that PRPs and possibly AGPs undergo similar peroxidase-catalyzed cross-linking. In an effort to identify potential peroxidases involved with HRGP cross-linking, the coexpression database was used. Indeed, an acidic peroxidase (At3g03670) was identified using this approach and was coexpressed with the two most Tyr-rich EXTs. It will now be interesting to overexpress this enzyme for use in the EXT cross-linking assay and/or to obtain mutants in this gene and observe whether EXT is altered in these mutant plants in terms of more soluble EXTs, less cross-linked EXTs, or reduced amounts of the di-isodityrosine/pulcherosine cross-linking agent. It should be noted that several other peroxidase genes are also coexpressed and are worthy candidates for similar types of analysis.

HRGP Mutants Are Genetic Tools to Uncover HRGP Function
Genetic mutants are one of the most valuable resources available to the Arabidopsis community, as they provide insight into protein function and facilitate further research to elucidate the mechanism of action. This is clearly the case with HRGP research, where several genetic mutants in AGPs, EXTs, and PRPs are serving as useful tools to elucidate function. It should also be noted that for each informative HRGP mutant, there are many HRGP mutants that fail to reveal a phenotype. There are many potential reasons for such failure, including but not limited to one or more of the following: the existence of genetic redundancy or other genetic backup systems, the inability of certain mutants to adequately reduce mRNA or protein levels to reveal a phenotype, and the inability to examine the mutant under the proper environmental conditions to reveal its phenotype.
At present, several reports on HRGP mutants exist in Arabidopsis, including agp17 (Gaspar et al., 2004), agp18 (Acosta-Garcia and Vielle-Calzada, 2004), agp19 (Yang et al., 2007), sos5 (fla4; Shi et al., 2003), agp30 (van Hengel and Roberts, 2003; van Hengel et al., 2004), rsh-ext3 (Hall and Cannon, 2002), lrx1 (Baumberger et al., 2001), and perk13 (Humphrey et al., 2007). All these mutants have provided functional insights into the roles of various AGPs and EXTs. The agp17 mutant displays resistance to Agrobacterium tumefaciens transformation with reduced levels of AtAGP17 in the roots. An RNA interference approach was used to silence AGP18 and reveal its role in female gametogenesis. An agp19 mutant revealed that AGP19 plays a role in plant growth and development, specifically in cell division and expansion. Studies with the transposon-insertion mutant agp30 suggest that AGP30I has a role in root regeneration and seed germination. The sos5 mutant study indicates that FLA4 plays a role in cell expansion. The rsh-ext3 mutant shows that EXT3 plays an important role in embryo development and cell plate formation, while the lrx1 and perk13 mutants indicate roles for LRX1 and PERK13 in root hair formation and root cell elongation, respectively.
There are currently 1,442 mutant lines available, covering nearly every HRGP gene, as shown in Tables II, V, and VIII and in Supplemental Tables S4, S7, and S10. While this list is now current, new mutant lines are continually being added to the collection, some of which are now being made available as homozygous knockout lines, saving the researcher valuable time and effort. In any event, once the mutant seed lines are received, they must be planted and verified by PCR analysis to confirm the presence of the mutation in the gene of interest. Mutations existing in the exon regions generally offer the highest probability of obtaining a null mutant and when available should probably be examined first. If a phenotype is observed in the mutant, it is important to confirm that the mutant phenotype is caused by the mutated gene of interest and not by another mutation elsewhere in the genome. Such confirmation can be achieved by studying other mutant lines (i.e. allelic mutants) for a gene of interest and observing the same mutant phenotype or by complementing the original mutant with the wild-type version of the gene of interest. Although mutants affecting the HRGP core proteins allow for the assessment of a particular HRGP's functional role, obtaining mutants in the genes responsible for HRGP posttranslational modification (i.e. GTs, P4Hs, peroxidases) offers perhaps even greater opportunities to address and reveal HRGP function, as multiple HRGPs would be affected by such a mutation.

CONCLUSION
The BIO OHIO bioinformatics program reported here represents a valuable tool to mine genomic databases for HRGP genes, including AGPs, EXTs, PRPs, chimeric HRGPs, and hybrid HRGPs. While this program was utilized to mine the Arabidopsis proteome, it can now be utilized to examine proteomes resulting from other plant genome projects, namely poplar (Populus species), rice (Oryza sativa), Physcomitrella, and Chlamydomonas. Preliminary evidence indicates, not surprisingly, that poplar is most similar to Arabidopsis in terms of its HRGP inventory, while the other species have considerable differences from the dicot HRGP inventory. In Arabidopsis, there are many surprises with respect to the HRGP family members beyond just finding new putative HRGPs, including finding HRGPs that apparently lack signal peptides, the predicted existence of GPI anchor addition sequences in certain EXTs, the numerous HRGPs that show organ-specific expression, and the likely existence of coregulated HRGP networks. Depending upon an investigator's interest, there is now a wealth of information provided to guide future HRGP research. Many of these predictions will require verification or confirmation, but hypotheses can now be formed and specific experiments designed based on the information presented here to facilitate future HRGP research.
Refinements to the BIO OHIO program are possible. In particular, reducing the number of false positives during a search and improving or developing search strategies to identify the chimeric HRGPs, particularly chimeric AGPs and chimeric PRPs, represent two of the most challenging areas for improving the predictive power of the program. In addition to the sliding windows approach, other more novel approaches are being examined to improve the predictive power of the program, including hidden Markov models, neural networks, and supervised and unsupervised learning approaches.
Finally, while the program was specifically developed to identify HRGPs from plant genomic data, it can be readily adapted to identify other proteins or protein families. The ability to select any amino acid bias or sequence motif of interest should make this program attractive to other researchers, including those outside of the plant community, who wish to screen whole genome protein sequences meeting their desired criteria. In addition, this program can be used to screen virtually any protein database, including those created manually or from EST databases.

Development and Basic Operation of the BIO OHIO Bioinformatics Program
A Perl program, named BIO OHIO, was written that analyzes each predicted protein sequence in the Arabidopsis (Arabidopsis thaliana) genome. This program is available upon request along with a user manual describing the use and operation of this program; however, an abbreviated version of the program is accessible at http://132.235.14.51/functional_genomics.html. The database used (i.e. ATH1.pep) was dated June 10, 2004, and downloaded from The Institute for Genomic Research (ftp.tigr.org/pub/data/a_thaliana/ath1/SEQUENCES/). The program is able to categorize proteins based on various characteristics and patterns of amino acids as specified by the user/researcher. For each identified protein or "hit," the following information was provided: (1) the Arabidopsis Genome Initiative locus identifier and sequence name; (2) the entire protein sequence; (3) the length of the protein; (4) the total PAST percentage for each protein; (5) analysis for the presence of a signal peptide within the first 50 amino acid residues; and (6) analysis for the presence of a GPI anchor addition sequence. In addition, the program provided analysis of repeated sequences within the proteins. In particular, the presence of AP, PA, SP, and TP dipeptide repeats was noted, as these sequences are typically associated with known AGPs. Protein hits were classified as AGPs if they did not contain repeats associated with EXTs or PRPs (e.g. multiple SP4, SP3, or PPV repeats) but contained predominantly AP, PA, SP, or TP repeats. In order to verify the predictions easily, the program predicted signal peptides and GPI anchor addition sequences and also allowed direct connection to the SignalP Web site (http://www.cbs.dtu.dk/services/SignalP/) to verify signal peptides, the Plant big-PI predictor Web site (http://mendel.imp.ac.at/gpi/plant_server.html) to verify GPI anchor predictions, and the TAIR Web site (http://arabidopsis.org/) for gene and protein information. When conflicts arose between BIO OHIO and the SignalP Web site or the Plant big-PI predictor Web site, data from the SignalP Web site or the Plant big-PI predictor Web site were used.
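Part of the per-hit report described above (length, total PAST percentage, and AP, PA, SP, and TP dipeptide counts) can be sketched in Python as follows. This is an illustrative sketch rather than the BIO OHIO Perl implementation: the function name `summarize_hit`, the dictionary keys, and the locus label are hypothetical, and the signal peptide and GPI anchor predictions are deliberately not modeled here.

```python
AGP_DIPEPTIDES = ("AP", "PA", "SP", "TP")


def summarize_hit(locus, seq):
    """Collect basic report fields for one protein hit (assumed layout)."""
    if not seq:
        raise ValueError("empty protein sequence")
    past_count = sum(seq.count(r) for r in "PAST")
    return {
        "locus": locus,
        "sequence": seq,
        "length": len(seq),
        "past_percent": 100 * past_count / len(seq),
        # Each dipeptide has two different letters, so it cannot overlap
        # itself and str.count gives the full number of occurrences.
        "dipeptides": {dp: seq.count(dp) for dp in AGP_DIPEPTIDES},
    }


# Hypothetical locus label and sequence, for demonstration only.
report = summarize_hit("hypothetical_locus", "APAPSPTPGG")
print(report["length"], report["past_percent"], report["dipeptides"])
```

A report of this kind only supports the manual classification step; deciding that a hit is an AGP still depends on the absence of EXT- or PRP-associated repeats and on the external signal peptide and GPI anchor checks described in the text.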
Finding Classical AGPs and AG Peptides Using Biased Amino Acid Compositions and Finding FLAs by Searching for Fasciclin Motifs
Classical AGPs were identified as proteins of any length that consisted of 50% or greater of the amino acids P, A, S, and T (PAST). AG peptides were identified as proteins of 50 to 90 amino acids in length consisting of 35% or greater PAST. A reduced PAST level was used, since AG peptides usually contain an N-terminal signal peptide and possibly a C-terminal GPI anchor addition signal sequence, which can make up about half of the peptide and contain little PAST. FLAs were designated as proteins containing the consensus

Finding EXTs by Searching for SP4 and SP3 Repeat Motifs
The program allowed for searches of any given amino acid string written as a regular expression. Thus, EXTs were identified by searching for the occurrence of two or more SP4 (or SP3) repeats in the protein. Since some of these hits were already annotated as PERKs in the TAIR database, we also manually included other known members of this family from the published literature (Baumberger et al., 2003a; Nakhamchik et al., 2004). Hits were examined for the location and distribution of SP4 and SP3 repeats as well as for the occurrence of other repeating sequences, including YXY. In addition, these sequences were examined for potential signal peptides and GPI anchor addition sequences as described above.
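The regular expression search for EXT motifs can be sketched in Python with the standard `re` module. The patterns below are one plausible encoding of the SP4 (SPPPP) and SP3 (SPPP) motifs and are an assumption, as are the function names and the example sequences; the actual expressions used by BIO OHIO are not given in the text.

```python
import re

SP4 = re.compile(r"SP{4}")  # Ser followed by four Pro (SPPPP)
SP3 = re.compile(r"SP{3}")  # Ser followed by three Pro; also matches inside SPPPP


def motif_positions(pattern, seq):
    """Return 0-based start positions of non-overlapping matches."""
    return [m.start() for m in pattern.finditer(seq)]


def is_ext_candidate(seq, min_repeats=2):
    """Flag a sequence with two or more SP4 (or SP3) repeats."""
    return (len(motif_positions(SP4, seq)) >= min_repeats
            or len(motif_positions(SP3, seq)) >= min_repeats)


# Hypothetical sequences, for demonstration only.
print(motif_positions(SP4, "MSPPPPKYVYSPPPPH"), is_ext_candidate("MSPPPPKYVYSPPPPH"))
print(is_ext_candidate("MSPPPKAAA"))
```

Returning match positions as well as a yes/no flag mirrors the manual step in which hits were examined for the location and distribution of the repeats.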
Finding PRPs by Using Biased Amino Acid Compositions and by Searching for PPVX(K/T) and KKPCPP Repeat Motifs
PRPs were first identified by searching for a biased amino acid composition of greater than 45% PVKCYT (Fowler et al., 1999). PRPs were also identified by searching for the occurrence of two or more PPVX(K/T) (where X represents any amino acid) and KKPCPP motifs (Fowler et al., 1999). Hits were examined for the location and distribution of these repeats as well as PPV repeat units. In addition, these sequences were examined for potential signal peptides and GPI anchor addition sequences as described above.
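The two PRP criteria, a composition above 45% PVKCYT and repeated PPVX(K/T) or KKPCPP motifs, can be combined in a small sketch. This is a Python illustration under stated assumptions, not the BIO OHIO implementation: the function name `prp_evidence`, the returned dictionary layout, and the choice to report each criterion separately rather than merge them into one verdict are all hypothetical.

```python
import re

PVKCYT = frozenset("PVKCYT")
PPVXKT = re.compile(r"PPV.[KT]")   # PPVX(K/T), X = any amino acid
KKPCPP = re.compile(r"KKPCPP")


def prp_evidence(sequence, threshold=45.0, min_repeats=2):
    """Summarize the composition and motif evidence for one protein sequence."""
    seq = sequence.upper()
    n_biased = sum(1 for aa in seq if aa in PVKCYT)
    percent = 100.0 * n_biased / len(seq) if seq else 0.0
    n_ppvxkt = sum(1 for _ in PPVXKT.finditer(seq))
    n_kkpcpp = sum(1 for _ in KKPCPP.finditer(seq))
    return {
        "pvkcyt_percent": percent,
        "biased_composition": percent > threshold,   # "greater than 45%"
        "ppvxkt": n_ppvxkt,
        "kkpcpp": n_kkpcpp,
        "repeat_motifs": n_ppvxkt >= min_repeats or n_kkpcpp >= min_repeats,
    }


if __name__ == "__main__":
    toy = "PPVYKPPVEKPPVHT"  # made-up sequence
    print(prp_evidence(toy))
```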

Finding Amino Acid Sequence Repeats in a Protein Sequence
Operating on a Bio::Perl sequence object, a frequency function determines the repeating elements in a given protein sequence. The length of the repeating elements is set by two parameters, a minimum and a maximum element length, which allows a thorough examination of the sequence. For each length between the minimum and the maximum, a sliding window of that length is shifted across the sequence in increments of one amino acid, starting at position 1 and ending at the last position minus the length of the sliding window plus 1. The discovered elements are stored in a hash structure, with the subsequence of the sliding window as the key and the number of occurrences as the entry. From this hash structure, the percentages are computed and stored. This extended hash is then passed on to a visualization function that adds HTML tags around a currently highlighted pattern and thus allows the analysis of pattern distribution across the complete amino acid sequence.
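The sliding-window tally described above can be sketched in a few lines. The example uses Python and a plain string instead of Perl and a Bio::Perl sequence object, and all names are hypothetical. The text does not state what the percentages are relative to; this sketch assumes the share of all windows of the same length. The highlighting helper wraps matches in `<b>` tags purely as a stand-in for whatever HTML markup the program emits.

```python
from collections import Counter


def repeat_frequencies(sequence, min_len, max_len):
    """Map each subsequence of length min_len..max_len to (count, percent).

    Percent is relative to the number of windows of that length (assumption).
    """
    seq = sequence.upper()
    result = {}
    for k in range(min_len, max_len + 1):
        n_windows = len(seq) - k + 1   # last start position is len - k + 1 (1-based)
        if n_windows <= 0:
            break
        counts = Counter(seq[i:i + k] for i in range(n_windows))
        for element, n in counts.items():
            result[element] = (n, 100.0 * n / n_windows)
    return result


def highlight(sequence, element):
    """Wrap non-overlapping occurrences of `element` in HTML bold tags."""
    return sequence.replace(element, "<b>" + element + "</b>")


if __name__ == "__main__":
    toy = "SPSPSP"  # made-up sequence
    print(repeat_frequencies(toy, 2, 3))
    print(highlight(toy, "SP"))
```

A dictionary keyed by subsequence plays the role of the hash structure in the description, and keeping the count and percentage together corresponds to the "extended hash" that is handed to the visualization step.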

Figure 2. Protein sequences encoded by representative AGP gene classes in Arabidopsis. Colored sequences at the N and C termini indicate predicted signal peptide (green) and GPI anchor (light blue) addition sequences if present. AP, PA, SP, and TP repeats (yellow) and Lys-rich regions (olive) are also indicated.

Figure 3. Protein sequences encoded by representative EXT and hybrid HRGP gene classes in Arabidopsis. Colored sequences at the N and C termini indicate predicted signal peptide (green) and GPI anchor (light blue) addition sequences if present. SP 3 (blue), SP 4 (red), SP 5 (purple), and YXY (dark red) repeats are also indicated. AP, PA, SP, and TP (yellow) repeats are indicated on hybrid HRGPs only.

Figure 4. Protein sequences encoded by representative PRP gene classes in Arabidopsis. Colored sequences at the N terminus indicate predicted signal peptide (green). PPVX(K/T) (gray), KKPCPP (teal), and PPV (pink) repeats are also indicated.

Table I. AGPs identified from the Arabidopsis genome based on biased amino acid compositions, size, and the presence of fasciclin domains. The number in parentheses indicates the number of proteins that had a predicted signal peptide sequence.

Table II. Identification, characterization, and classification of the AGP genes in Arabidopsis


Table IV. EXTs identified from the Arabidopsis genome based on SP 3 and SP 4 amino acid repeat units. The number in parentheses indicates the number of proteins that had a predicted signal peptide sequence.

Table V. Identification, characterization, and classification of the EXT genes in Arabidopsis


Table VI.

Table VII. PRPs identified from the Arabidopsis genome based on biased amino acid composition and repeat units. The number in parentheses indicates the number of proteins that had a predicted signal peptide sequence.

Table VIII. Identification, characterization, and classification of the PRP genes in Arabidopsis

Table X. A summary of the HRGP superfamily in Arabidopsis. Boldface entries are subtotals for the various HRGP families.