Despite its importance, relatively little is known about the relationship between the structure, function, and evolution of proteins, particularly in land plant species. We have developed a database with predicted protein domains for five plant proteomes ( http://pfp.bio.nyu.edu ) and used both protein structural fold recognition and de novo Rosetta-based protein structure prediction to predict protein structure for Arabidopsis and rice proteins. Based on sequence similarity, we have identified ∼15,000 orthologous/paralogous protein family clusters among these species and used codon-based models to predict positive selection in protein evolution within 175 of these sequence clusters. Our results show that codons that display positive selection appear to be less frequent in helical and strand regions and are overrepresented in amino acid residues that are associated with a change in protein secondary structure. Like in other organisms, disordered protein regions also appear to have more selected sites. Structural information provides new functional insights into specific plant proteins and allows us to map positively selected amino acid sites onto protein structures and view these sites in a structural and functional context.
Although genomes remain complex structures that encode a large number of genetic entities, protein-coding genes arguably represent the largest and most important component of eukaryotic genomes. A large fraction of eukaryotic proteins are encoded by gene families that evolve by gene duplication and diversification ( Doolittle 1995 ; Conery and Lynch 2001 ; Lynch and Conery 2003 ; Conrad and Antonarakis 2007 ; Hahn et al. 2007 ; Sterck et al. 2007 ). In the model plant species Arabidopsis thaliana , for example, nearly 1,000 gene families have been identified, which together account for >8,000 (∼33%) of protein-coding loci, and the numbers for rice are comparable ( Yu et al. 2002 , 2005 ).
Adaptive evolution in organisms can proceed through the diversification of these protein-coding genes and gene families, and understanding the nature of evolutionary change requires us to understand how proteins evolve, both in structure and function. Methods of phylogenetic analysis that can reconstruct protein domain families are well described, including maximum parsimony and Bayesian maximum likelihood ( Yang 2006 ) methods, and several methods have been developed that can identify positive selection in key amino acid sites in evolving proteins ( Martinez-Castilla and Alvarez-Buylla 2003 ; Hernandez-Hernandez et al. 2007 ; Kelleher et al. 2007 ; Wagstaff and Begun 2007 ).
Despite the intense interest in protein family diversification, however, detailed evolutionary analyses have only been undertaken for a few plant gene families, including those that encode the myb-like ( Rosinski and Atchley 1999 ), homeodomain ( Bharathan et al. 1999 ), MADS-box ( Purugganan et al. 1995 ; Purugganan 1997 ; Martinez-Castilla and Alvarez-Buylla 2003 ; Kramer et al. 2004 ; Nam et al. 2004 ), and proteasomal ( Swaffield and Purugganan 1997 ) proteins. The potential exists for studying how selection of amino acids can occur in a structural context, and a few studies have started to incorporate structural information in the evolutionary analyses of gene families ( Roth and Liberles 2006 ; Zhao and Su 2010 ).
A major obstacle to studying the structural evolution of proteins is the lack of well-defined structures for the vast majority of eukaryotic proteins. Although genomics projects have become adept at obtaining the primary sequence of the entire complement of protein-coding genes in genomes, annotations that depict secondary or tertiary structures remain sparse. It is thus imperative that we develop methods to extend the annotations of genomes by incorporating protein structural information ( Eisenberg et al. 2000 ; Baliga et al. 2004 ; Weston et al. 2003 ).
We have developed methods that take whole-genome protein sequences and provide structural annotation of these proteins. These methods include matching regions of protein sequence to known structures in the Protein Data Bank (PDB) ( http://www.rcsb.org/pdb ) using PSI-BLAST ( Altschul and Koonin 1998 ) and fold recognition methods (Fold and Function Assignment System [FFAS]) ( Jaroszewski et al. 2000 , 2005 ; Rychlewski et al. 2000 ) in order to predict domain boundaries and annotate predicted domains with structure information. Select domains that elude identification by BLAST and fold recognition are then predicted using Pfam ( http://pfam.sanger.ac.uk ) and heuristic approaches and are considered contenders for Rosetta de novo structure prediction ( Kuhlman and Baker 2000 ; Chevalier et al. 2002 ; Gray et al. 2003a , 2003b ; Kuhlman et al. 2003 ; Rohl et al. 2004 ; Rohl 2005 ). Using these approaches, we have previously completed structural/functional annotation of 94 proteomes ( Drew et al. 2011 ), and Rosetta-generated structure predictions have specifically produced functional insights even when function is not evident from sequence-based analyses alone ( Bonneau et al. 2001 , 2002 ).
To date, several plant proteomes have been fully sequenced, with several more currently underway. Like other organisms, however, this increase in availability of sequence information has not been matched with an increase in known plant protein structures or known protein functions. In this work, we have applied our structure-based annotation method to plant genome data to examine the evolution of selected amino acids on plant protein families in the context of their structure. We focused on the angiosperms, which are arguably the most diverse major plant group on the planet, with over 260,000 living species, in more than 450 families ( Angiosperm Phylogeny Group 2003 ; Jiao et al. 2011 ).
We have predicted structural domains for all known proteins in five flowering plant proteomes— A. thaliana , Oryza sativa , Populus trichocarpa , Sorghum bicolour , and Vitis vinifera . To our knowledge, this is the largest database of inferred protein structures currently available for plants. Using OrthologID (OID) ( Chiu et al. 2006 ), we have also identified ∼15,000 gene families (categorized as alignments with at least two sequences) within these five proteomes. Using codon-based models, we have done selection analysis on amino acid sites ( Yang 2007 ) for 2,120 gene families and examined the structural context of these positively selected sites. Finally, we placed these positively selected sites in a structural context by highlighting and visualizing these sites on corresponding three-dimensional predicted protein structures.
Materials and Methods
Genomic and Proteomic Data
Phylogenies and alignments were generated from analysis of five plant species which have complete genome sequences available: A. thaliana ( http://www.arabidopsis.org/ Version: 9), O. sativa (cv. nipponbare) ( http://rice.plantbiology.msu.edu/ , Version: MSU6), V. vinifera ( www.Gramene.org , Version: 2007-12-IGGP), P. trichocarpa ( http://genome.jgi-psf.org/ Version: 2004-12-JGI), and S. bicolour ( http://genome.jgi-psf.org/ , Version: Sbi1). The complete proteomic and genomic sequences were downloaded from the Gramene Web site ( http://www.gramene.org/ ) using BioMart ( www.biomart.org ). Both A. thaliana and O. sativa annotations were listed as fully complete, with the remaining three taxa in draft assembly.
Protein/Gene Family Choice
To identify gene families, we modified OID ( Chiu et al. 2006 ), which is a semiautomated homology search and phylogeny reconstruction pipeline. OID was modified to remove the MAFFT alignment refinement step (Katoh et al. 2002) , which removes ambiguously aligned regions. This allowed for the OID protein alignments to be mapped correctly to their respective nucleotide alignments. Using OID, gene family alignments and phylogenies were generated from the annotated protein-coding genes in our study species using BLAST ( Altschul et al. 1990 ), with an expectation cutoff of e −20 . OID produced 14,822 putative gene families, with family sizes ranging from n = 2–1,600 gene sequences (see fig. 1 ). We define gene family based on sequence information with a BLAST analysis cutoff e −20 , a stringent cutoff has been used in previous studies (for example, Kinsella et al. 2003 ; Xiao et al. 2007 ) to delineate gene/protein families. Not included in this count are the 66,478 orphan sequences ( n = 1) produced by OID. Due to the high number of OID results that contained less than 10 gene sequences, we limited our final analysis to only those gene families with at least 10 sequences regardless of whether each alignment included a representative from each species. This reduced the data set to 2,230 gene families, each corresponding to a plant protein domain family.
Due to the large evolutionary distances between the organisms used in this study (∼120–200 Myr between monocots and dicots), the alignments were modified to remove excessive gapped regions. CodeML estimates ancestral sequences for alignments when estimating whether positive selection has occurred within the protein family. Excessive gapped regions make this task computationally difficult (i.e., computational run times >600 h) and increase the possibilities of misalignments between homologous sequences; thus, large gapped alignments are not suitable for use in CodeML. Using multiple cutoff options, for both columns and rows, we determined the threshold percentage of gaps that allowed for feasible use with CodeML. Based on this, we culled each alignment in two ways. If the length of an aligned sequence contained at least 70% gaps, the sequence was removed. Furthermore, if more than 30% of a column in the resulting alignment was comprised of gaps, that column was removed from the alignment. Inspection of the data at this point showed alignments with greater than 100 sequences were still excessively gapped and would not be suitable for CodeML analysis, so we removed these alignments from our analysis. This further reduced our data set by 110 gene families, to 2,120 families. This left a count of 46,667 sequences in our final analysis, with alignment sizes ranging from 10 to 100 sequences.
Gene family groups with known function were initially identified using the known families listed on The Arabidopsis Information Resource (TAIR) Web site ( http://www.arabidopsis.org ). In addition, we annotated the gene families in our analysis with known functional groups gleaned from the literature, as well as homologous relationships found in the Gramene Web site. This led to the identification of 192 families with previously annotated/known functions and containing at least two members of each of the five species (ca. 30,000 sequences).
Positive Selection Analysis
Positively selected sites were predicted using CodeML from the PAML package ( Yang 2007 ). CodeML uses different evolutionary models to account for differences in transition/transversion rates in DNA, and also codon usage biases found within degenerate codons, and uses maximum likelihood to estimate the fit of levels of sequence divergence in homologous sequences given these models ( Bishop et al. 2000 ; Yang et al. 2000 ). Five models were used in this analysis: M0-M3 and M7-M8 with ambiguous residue positions included and using the F3X4 codon frequency model.
CodeML calculates the ratio of synonymous (d S ) and nonsynonymous (d N ) changes that have occurred at each codon with a d N /d S > 1 suggestive of positive selection. Each CodeML model builds on the preceding one and adds additional d N /d S (ω) classes. The most basic model M0, assuming all sites are undergoing negative/deleterious selective pressure (d N /d S < 0; 1 class of sites). M1 allows for some sites to be under neutral selection (d N /d S = 1), whereas M2 allows for some sites to be under positive selection (d N /d S > 1). As each model is applied to the data more complex parameters are applied, allowing for multiple d N /d S classes in the data set. The most complex model we used was M8, which allows for 13 classes of ω sites. If no positive selection was found using the basic M2 and M3 selection models, we did not proceed to the detailed codon selection analysis using models M7 and M8. For a more detailed description of CodeML classes, models, and the statistical analyses involved, see Yang et al. (2000) .
Protein Domain Prediction
Ginzu ( Kim et al. 2005 ) was used to predict domains for all proteins in our five proteomes (213,587 proteins). Ginzu first searches for sequence matches to known three-dimensional protein structures using PSI-BLAST ( Altschul and Koonin 1998 ), providing structural information for predicted domains. Ginzu searches then for matches to experimental structures using fold recognition ( Jaroszewski et al. 2005 ). Domains in regions of protein not matched by PSI-BLAST or fold recognition are predicted using matches to Pfam and a heuristic ( Chivian et al. 2003 ) that predicts domain boundaries based on patterns within multiple sequence alignments (MSAs). These latter domains, which lack structural information, are then exported (if <165 amino acids) for external structure prediction via Rosetta, resulting in domain and structural predictions of varying confidence for all proteins considered in the Plant Proteome Folding Pipeline (Plant PFP). Using these methods resulted in 409,017 predicted domains. Although we predicted domains for all five proteomes, protein-folding structure prediction was performed only on A. thaliana and O. sativa due to the large computational overhead required for folding proteins.
Prediction of Protein Structural Elements and Positive Site Mapping
PSIPRED ( McGuffin et al. 2000 ) and DISOPRED2 ( Ward et al. 2004 ) were used to predict protein folds and disordered regions, respectively. PSIPRED uses neural networks to analyze the position-specific scoring matrices produced from PSI-BLAST to infer secondary structure and is one of the top secondary structure prediction methods available. Disordered regions are defined as those regions that do not fold into a three-dimensional structure in their native state. Disordered regions are flexible, dynamic, and can be partially or completely unfolded in solution. DISOPRED2 uses known structural information, coupled with sequence records, to infer disordered regions.
For each residue position in an alignment, we use PSIPRED and DISOPRED2 to categorize specific amino acid sites into what secondary structure element they were found. If at least 80% of residues at a particular alignment position were predicted to be of the same fold/residue type, we classed the amino acid site as belonging to that type. For those sites that did not show outright support for a particular protein fold/class, we classified it as a mixed site.
The Dictionary of Secondary Structure Proteins (DSSP) ( Kabsch and Sander 1983 ) was also used for additional secondary structure annotation information. Using PDB atomic coordinates, DSSP defines secondary structure, geometrical features, and solvent exposure of proteins. We also obtained Structural Classification of Proteins (SCOP) structural information. The SCOP ( Murzin et al. 1995 ) database uses manual inspection, with the help of automated methods, to predicted structural and evolutionary relatedness.
The Plant PFP Web site is available at http://pfp.bio.nyu.edu and currently represents 213,587 proteins from our five chosen organisms along with their respective protein structure predictions. The Ginzu pipeline as described by Drew et al. (2011) was used to analyze 211,140 of these proteins, skipping 2,447 proteins due to their excessive length or high percentage of residues predicted to be in disordered regions. From these 211,140 proteins, Ginzu produced 409,017 domains (listed as “Domains” in table 1 ). The 173,820 domains predicted by PSI-BLAST and fold recognition methods were automatically associated with their top matching PDB structure. The remaining 235,197 domains were considered for Rosetta de novo structure prediction. Twenty-nine thousand two hundred and two domains were returned from Rosetta with predicted structures, 4,769 of which are considered to be high confidence, where high confidence is determined by a MAMMOTH Confidence Metric (MCM) score of 0.8 or greater, which correlates to the high atomic accuracy of the predicted structure in relation to the native structure (see table 2 ) ( Drew et al. 2011 ). To evaluate the accuracy of high-confidence structure predictions, a double-blind benchmarking of the structural analyses methods were used, and these correctly predicted 47% of structures using SCOP (v1.67) superfamily classifications ( Drew et al. 2011 ), which is high for computational structure prediction. Comparison of the predicted and experimentally determined structures showed a strong correlation in structure prediction ( Drew et al. 2011 ).
|Organism||Number of Proteins||Number of Domains||Domains with Known PDB Structures a||Domains without Known PDB Structures b|
|Organism||Number of Proteins||Number of Domains||Domains with Known PDB Structures a||Domains without Known PDB Structures b|
Based on PSI-BLAST, fold recognition methods.
Domain structures based on Pfam, MSA, and heuristic methods.
|Organism||All De Novo Domain Predictions||De Novo Domain Predictions (>0.8 confidence)|
|Organism||All De Novo Domain Predictions||De Novo Domain Predictions (>0.8 confidence)|
To facilitate the exploration of predicted sites of positive selection mapped onto structures and the exploration of our predicted domain families, we have constructed a web interface to our resource ( http://pfp.bio.nyu.edu ). This web interface allows searching via accession or ontology term, by sequence using BLAST, or by searching the list of predetermined functional families by name (i.e., “bHLH transcription factor”). User selection of a family group produces an initial page showing the phylogeny and sequence identifiers, with each identifier listing the mapped protein domains.
This initial window contains three tabs—JALview, JMol, and Phylowidget. JALview ( Waterhouse et al. 2009 ) is used to display the MSA of the proteins in the family, Phylowidget ( Jordan and Piel 2008 ) to display the family's phylogenetic tree, and JMol ( http://www.jmol.org ) to view the three-dimensional structure (either a PDB structure or a Rosetta de novo predicted structure) of the individual proteins of the family. The Web site also provides a display that shows the division of family proteins into domains via the Ginzu program and displays the methods by which Ginzu predicted each domain.
Selecting a mapped protein domain within a sequence that has a structure annotation and then selecting the “Load in JMol” link loads the domain into the JMol tab, with the sites of positive selection highlighted (see fig. 2 for a JMol view). If a domain maps to a known protein structure in PDB, a link to this protein is included. Positively selected sites are highlighted along the structure in blue, whereas the atoms of selected residues are circled in yellow, which enables the choosing of specific positively selected residues within the alignment to be viewed easily on the structure.
Positive Selection Prediction for Protein Families
Using likelihood ratio tests and d N /d S values from model M8 of CodeML, we determined selection had occurred in 175 gene families, which indicated that 8.2% of families show selection. In total, there were over 4,000 sites that showed positive selection. Ignoring families that did not contain any selected sites with >95% confidence in prediction ( P > 0.95 for d N /d S > 1), resulted in 938 selected sites in 122 gene families. The majority of families (97) had 10 or less sites of positive selection. Although this created a very conservative data set, it increased confidence in the resulting analyses. There were 10–95 sequences per alignment, with a mean size of 20. Alignment length ranged from 51 to 1,372 residues, with a mean sequence length of 321 residues. Mapping these results back to the families where a gene function has been associated, we found 43 gene families mapped to 19 known functions ( supplementary table 1 , Supplementary Material online).
Initial codeML undertook analyses using sequence alignments for gene families that contained both paralogues as well as orthologues. With the high level of gene duplication and polyploidization that has occurred within plant genomes, identification of orthologous genes can prove difficult. Previous work on a comparison of paralogous and orthologous evolutionary rates by Conant et al. (2007) found that the evolutionary rates do not differ significantly between orthologs and paralogs. Nevertheless, we also expanded on the current analysis, by separating gene families with positively selected sites into separate paralogous sequence alignments from each individual species. We used only alignments that 1) originally contained more than one species and 2) contained at least seven sequences from a species and undertook codeML analyses using the same parameters as previously described. From the 122 alignments, we found that 101 met our criteria. We then separated out 142 separate paralogous gene families of individual species from these 101 alignments and reanalyzed the individual sequence alignments for positive selection. A reanalysis for sites of positive selection that included alignments from multiple species (i.e., potential orthologues) had 923 sites of positive selection ( P > 0.95). We find that 722 identified selected sites overlapped in the two analyses, which suggests that our original analyses are able to identify ∼80% of the positively selected sites found in paralogous gene families (while also identifying sites that were selected between putative orthologues). It must be noted that analyzing only the paralogues also identified sites that were not observed in the larger data sets; however, we feel that the results from the larger data sets that also include putative orthologues provide greater power and confidence in the selection analyses.
One other issue is whether the sequences we were analyzing were too divergent, and the synonymous sites were saturated. Previous work suggests that codon-based models estimating selection on protein sequences are valid only when synonymous site divergence of d S < 5 ( Alvarez-Ponce et al. 2011 ). We calculated the mean pairwise d S for each of our gene families and only 12.4% having d S < 5. However, sequence alignments for which we found positive selection had 45.8% with a mean pairwise d S < 5, and the majority of these had d S values between 0 and 1. If we only consider sequence alignments at the range of d S < 5, we find that ∼21% of these gene families show evidence for positive selection.
Amino Acid Properties of Positively Selected Sites
Using amino acid properties from Lise and Jones (2005) , we categorized selected and nonselected residues using eight properties (hydrophobic, polar, small, aliphatic, aromatic, positive, negative, and unique). We changed the property “proline” from Lise and Jones to “unique,” which included glycine and proline, due to their distinctive properties.
Comparing selected versus nonselected residues for each property showed statistically significant differences in five properties ( P < 0.0001): hydrophobic, polar, small, aliphatic, and positive. There was a significant increase in selected residues with polar, small, and positive categories and a less than expected number of residues within aliphatic and hydrophobic categories. There was a slight increase in negatively charged residues within selected sites, but the results were not much different from expected (see table 3 ). These results indicate that hydrophilic, small, and positively charged residues are possibly more prone to evolve with positive selection than other residue types.
|Properties||Observed||Expected||P value a|
|Properties||Observed||Expected||P value a|
Significance of positive correlation between number of positively selected sites and polarity, size, and positive charge while hydrophobic and aliphatic residues showed a strong negative correlation with positive selection.
Clustering of Selected Sites
To examine the possibility that the length of the protein sequence impacts the relative number of selected sites, we examined the distribution of selected sites in relation to the length of the sequence alignments, both the original alignments and the alignments that were culled of excessively gapped regions. We used a Spearman's rank correlation test to examine the relationship between sequence length and number of selected sites (i.e., as protein length increases, does also the number of positively selected sites). For proteins with the same length, the mean number of selected sites was used. The closer the score is to ±1, the more correlated the two variables. Spearman’s rank correlation scores of 0.16 ( P = 0.07) and 0.33 ( P = 0.0002) were found for the culled alignment lengths and the original alignment lengths, respectively. These results indicate that, at least when considering the unculled alignment lengths, there does appear to be a moderately low but significant correlation between increased length and number of positively selected sites.
We looked to see whether the sites under positive selection showed evidence for clustering along the sequence. Using a pairwise distance measure, for each alignment we found the average distance between sites of selection along the primary sequence. We chose an equal number of sites at random and found the average pairwise distance in the data set, and we repeated this random selection of sites 1,000 times. Using a 95% cutoff, we found 40% of alignments had a smaller pairwise distance between selected sites compared with 95% of the random distances. This increased to 58% if we use a 90% cutoff. This suggests that there is significant sequence-space clustering of positively selected sites in a substantial number of proteins.
To corroborate this result, we did a sliding window analysis using a window size of five residues moving one amino acid at a time. Comparing the location of selected sites, with 10,000 permutations based on random site selection, we found that in only one case did the number of windows containing random amino acid sites equal the number of windows containing actual sites of selection. Together, these results indicate that the majority of selected sites are clustered within plant protein gene families.
Secondary Structure Prediction of Positively Selected Sites
We predicted the secondary structure of all sequences using PSIPRED and created secondary structure alignments. When comparing the secondary structure distribution of positively selected versus nonpositively selected residues, we found a significant reduction in predicted helical ( P < 0.0001) and strand residues ( P < 0.0001) that contain selected sites (see fig. 3 ). There appears to be slight reduction of predicted coiled residues with selected sites, although the results were not significantly different than expected. Interestingly, 66% of sites could not be classed within any definitive secondary structure type and were classed as of mixed structure.
Looking more closely at this “mixed” group, we divided the data into amino acid positions that showed only combinations of coils and sheets (CS), coils and helices (CH), sheets and helices (SH), and positions that still contained all types of secondary structure elements (ALL). In each of these mixed group categories, the positively selected sites were significantly overrepresented (CH, P < 0.006; CS and ALL, P < 0.0001; SH, P <0.03).
Initially if an amino acid position was found to occur in particular secondary structure type in at least 80% of the sequences in an alignment, we classified that amino acid position as being of that type. Modifying this to include 60%, 70%, 90%, and 100% cutoffs, we found all secondary structure element counts remained the same, except for counts for coils, and within the mixed group CS. As the stringency of the cutoff increased, the number of coiled-only regions decreased and the number of CS regions increased within both selected and nonselected positions, indicating that the change from coils to sheets or vice versa may be more flexible than other structural changes.
Using DISOPRED2 we found that positively selected amino acid sites (compared with nonselected sites) were found largely in disordered regions. We found an increased number of sites of positive selection predicted to be disordered than expected ( P < 0.0001) (see fig. 3 ). As disordered regions can undergo different folding conformations, our results could indicate that such flexible regions are under higher selective pressure. We found 65.7% of sites to be categorized as mixed. Residues with a mixture of different secondary structure element types and order/disordered etc. suggest that areas where structure changes from one fold type to another may be targets of adaptive evolution.
Using DSSP, we obtained the Relative Accessible Surface Area (RASA) for the plant proteins in our analysis. RASA provides a measure of how exposed to a solvent an amino acid residue is within a protein structure; the lower the RASA score, the more buried the residue. The RASA score is between 0 and 1 with 0 being completely buried. We had RASA scores for 17,738 amino acids in our data. Of the 938 positively selected residues, we found RASA scores for 454. If a particular residue had more than one RASA score, we averaged the scores for this residue, unless the residue has a PDB score, in which case we used score that only. Multiple RASA scores were the result of the multiple methods used in the structure prediction (Rosetta, Ginzu). We divided the residues from all alignments into three groups, selected sites, conservative sites (those sites which showed no amino acid replacement), and all others.
A Wilcoxon test showed a significant difference ( P = 0.0001) between the distribution of RASA scores within selected sites compared with conservative sites, with conservative sites being more buried, suggesting more selective pressure occurs on more solvent-exposed residues. This is consistent with our analysis on amino acid properties, which showed selected sites to be more hydrophilic than no selected sites. Comparison of selected sites with all other sites did not show a significant difference ( P = 0.2) (see fig. 4 ).
Case Study I: A Misannotated C2H2 Zinc-Finger Transcription Factor Family Includes TPR Domain Proteins
To demonstrate some of the capabilities of our structure database, we examine in greater detail two examples of gene families where we found evidence for positive selection. The first family is an OID group ( http://pfp.bio.nyu.edu/family/18894 ) that is a family of 11 A. thaliana proteins that are annotated on the TAIR family Web site as comprising C2H2 transcription factor proteins. C2H2 transcription factor proteins contain a zinc-finger domain and are involved in a wide range of functions, including DNA- and RNA-binding and protein–protein interactions ( Englbrecht et al. 2004 ). Of these 11 proteins, TAIR lists seven of them as containing a zinc-finger domain. Our analysis of the domain architecture of this family of 11 proteins suggests it has possibly been misannotated. Searching each sequence manually against GenBank revealed the sequences did not return a hit to zinc-finger domain–containing proteins.
Our Ginzu and Rosetta analyses predicted that these sequences had between four and seven domains, with all but two N- and C-terminal regions mapping to known domains in the PDB. Nonterminal regions were predominantly unassigned, as domain boundaries were identified using less confident heuristic methods. Seven unique domains were mapped to the N-terminal region. Checking the PDB structures for these domains, all were from proteins characterized as TPR domain–containing proteins. TPR domains, first classified in yeast, are a repeating helix-turn-helix motif containing approximately 34 amino acids that are involved in multiple biological functions ( Sikorski et al. 1990 ; Das et al. 1998 ; Main et al. 2003 ; Kajander et al. 2007 ). Functionally, TPR domains have been linked to many roles including Hsp90 mediation, scaffolding proteins, and transcription ( Das et al. 1998 ; Main et al. 2003 ; Wilson et al. 2005 ; Yang et al. 2005 ; Kajander et al. 2007 ). Within this alignment, we found 54 sites of positive selection with greater than 95% confidence, of which 41 occurred within the predicted N-terminal domain (see fig. 5 ). Within the C-terminal, domains from four PDB structures were mapped, which were described as having ubiquitin-related functions.
Case Study II: Selection in an F-Box Protein Family Is C-terminal to the F-box Domain
Our second case example is a family ( http://pfp.bio.nyu.edu/family/19187 ), which contains 23 sequences from S. bicolour and O. sativa and are annotated as F-box proteins. F-box proteins, first described as part of the SCF ( S kp, C ullin, F -box) complex, are involved in ubiquitination and are characterized by a structural motif of approximately 50 residues and constitute one of the largest plant multigene families ( Kipreos and Pagano 2000 ; Xu et al. 2009 ). F-box proteins are comprised of an N-terminal F-box domain, which interact with Skp1, a linker domain and varying C-terminal domains, which are used to recruit substrates ( Jin et al. 2004 ; Petroski and Deshaies 2005 ; Hao et al. 2007 ).
Ginzu and Rosetta mapped domains from 14 proteins in this family to domains in three PDB structures. Twelve proteins mapped to the same PDB structure: 2ovr, along the entire length of the sequence, which includes a WD40 domain at the C-terminal end. Of the other two mapped structures, 1p22 and 2e32, 1p22 also contains a WD40 domain. The third structure, 2e32, differs by having a sugar-binding domain (SBD) rather than a WD40 domain at the C-terminal end. Although these proteins differ in the C-terminal region, they were found to be in the same family due to the presence of a conserved F-box domain (see fig. 6 ).
The complete protein sequence alignment of this family is 408 residues in length, and we inferred 24 sites with positive selection at the 95% confidence level, all of which occurred within the latter half of this alignment. Adaptive evolution does not appear to target the N-terminal F-box domain but appears to be concentrated on the C-terminal domain. Analysis of the positive selection found within the SBD of the PDB structure 2e32 showed selection occurring in proximity to the substrate-binding region, although the residues involved in substrate binding where not themselves under selection ( Mizushima et al. 2004 , 2007 ).
The PDB structure 2ovr is the Fbw7-Skp1-CyclinE complex ( Hao et al. 2007 ). This protein complex is part of the CyclinE degradation pathway and is important in cell cycle regulation ( Zhang and Koepp 2006 ; Hao et al. 2007 ). The other PDB structure mapped, 1p22 is a β-TrCP1-Skp1-β-Catenin complex, also important in cell cycle regulation ( Wu et al. 2003 ). The WD40 domain contains sites of positive selection (see fig. 7 ). Although both 2ovr and 1p22 contain WD40 domains, they bind different substrates at this position, namely CyclinE and β-catenin, respectively. Prediction of positive selection within this domain could indicate that proteins with multiple binding affinities may be under increased selective pressure. Mapping positive selection to the PDB structure 2ovr, we found three amino acid sites involved in protein binding that are predicted to be positively selected.
Generating structure predictions for these proteomes allows us to delve further into the underlying occurrences of selection and selective pressure within plant proteins. Positive selection appears to be a nonrandom occurrence within proteins, occurring in cluster along the alignment length, which could be indicative of pressure on a particular fold or protein region to change. Our results demonstrate that positive selection also occurs more often within particular elements and areas of structural fold change within protein structures. In particular, the numbers of selected residues were less than expected in helical and strand elements. Interestingly, there were significantly more selected sites among residues that were associated with a change in protein structure (e.g., coil to strand or vice versa). Disordered residues also showed an increase in positively selected sites. It is known that many disordered regions become ordered upon binding ( Wright and Dyson 1999 ), as well as having affinity to bind multiple proteins ( Dunker et al. 2005 ). Within our analysis of plant protein structures, we found, similar to work previously done on Drosophila ( Ridout et al. 2010 ), that certain secondary structure elements plus disordered residues have more positive selection than others. Our results indicate that areas of possible evolutionary change, be it as a disordered region or a secondary structure region, may be under greater positive selective pressure than more structurally conserved ordered regions of the protein.
Previous studies on positive selection within plant genomes have usually focused on single, or small number, of plant families ( Zhang et al. 2001 ; Mondragon-Palomino et al. 2002 ; Mondragon-Palomino and Gaut 2005 ; Li et al. 2009 ; Palme et al. 2009 ; Kapralov et al. 2011 ; Moury and Simon 2011 ; Strasburg et al. 2011 ). To our knowledge, the largest current study published on plant adaptive evolution is by Roth and Liberles (2006) who used The Adaptive Evolution Database to predict positive selection in 4,216 seed plant gene families. They found 87 families showed positive selection; however, most of these families contained only two proteins. We have analyzed over 2,000 gene families with at least 10 sequences per family for positive selection, one of the largest plant protein family analyses to date, and found selection in 175 families.
In agreement with previous work ( Gossmann et al. 2010 ), the amount of positive selection we found within plant species appears to be low. Eight percent of our plant protein families appear to be undergoing selection, which is much smaller than estimates within nonplant species ( Smith and Eyre-Walker 2002 ; Halligan et al. 2010 ). This could be indicative of the effective population size of plants being smaller relative to other species ( Bustamante et al. 2002 ; Strasburg et al. 2011 ).
Mapping of positively selected sites to our known structures suggests that amino acid sites that show evidence for positive selection cluster within a protein sequence, a result also found in a previous analysis on Drosophila ( Ridout et al. 2010 ). Ridout et al. (2010) , however, found that N-, C-terminal regions appear to contain more positively selected sites, something we do not find in our plant analysis, where selection appears to be spread through out the protein. Due to the high amount of divergence between our original sequences, we culled our sequences by removing extraneous gapped regions. Most of these gapped deletions were in N-, C-terminals, giving us a possible sampling bias, which may have removed sites that could be undergoing selection and therefore missing additional selected regions, which could have mirrored the results found in Drosophila.
We also find that sites undergoing positive selection appear to be less buried than wholly conserved sites within protein structures. Previous work by Liu et al. (2008) , Roth and Liberles (2006) , and Petersen et al. (2007) , working with human single nucleotide polymorphism data, seed plants and Escherichia coli , respectively, all found similar results in their data, suggesting that less buried residues are under less selective pressure in multiple species.
Linking protein structure, function, and evolution has been one of the key goals of molecular evolution. The availability of whole-genome sequences has allowed investigators access to an inventory of all proteins in an organism, and the availability of data across multiple species allows for a comparative analysis in a phylogenetically informed approach. We have used several tools for protein structure domain identification and prediction that we had previously applied to multiple proteomes ( Drew et al. 2011 ) and have now developed a similar database for plant proteins. As we describe in this study, this provides us with both general trends in the evolution of plant protein families, as well as allows us to highlight evolution of specific examples—in this case a C2H2 and an F-box family. As investigators exploit this database, we may be able to identify even more compelling trends in the structural evolution of plant proteins.
We would like to thank the volunteers participating in the Human Proteome Folding Project on IBM’s World Community Grid and the FFAS03 development team, and the Rosetta Commons and the Yeast Resource Center. This work was supported in part by grant National Science Foundation (NSF) (DBI-0820757).