Abstract

Despite its importance, relatively little is known about the relationship between the structure, function, and evolution of proteins, particularly in land plant species. We have developed a database with predicted protein domains for five plant proteomes ( http://pfp.bio.nyu.edu ) and used both protein structural fold recognition and de novo Rosetta-based protein structure prediction to predict protein structure for Arabidopsis and rice proteins. Based on sequence similarity, we have identified ∼15,000 orthologous/paralogous protein family clusters among these species and used codon-based models to predict positive selection in protein evolution within 175 of these sequence clusters. Our results show that codons that display positive selection appear to be less frequent in helical and strand regions and are overrepresented in amino acid residues that are associated with a change in protein secondary structure. Like in other organisms, disordered protein regions also appear to have more selected sites. Structural information provides new functional insights into specific plant proteins and allows us to map positively selected amino acid sites onto protein structures and view these sites in a structural and functional context.

Introduction

Although genomes remain complex structures that encode a large number of genetic entities, protein-coding genes arguably represent the largest and most important component of eukaryotic genomes. A large fraction of eukaryotic proteins are encoded by gene families that evolve by gene duplication and diversification ( Doolittle 1995 ; Conery and Lynch 2001 ; Lynch and Conery 2003 ; Conrad and Antonarakis 2007 ; Hahn et al. 2007 ; Sterck et al. 2007 ). In the model plant species Arabidopsis thaliana , for example, nearly 1,000 gene families have been identified, which together account for >8,000 (∼33%) of protein-coding loci, and the numbers for rice are comparable ( Yu et al. 2002 , 2005 ).

Adaptive evolution in organisms can proceed through the diversification of these protein-coding genes and gene families, and understanding the nature of evolutionary change requires us to understand how proteins evolve, both in structure and function. Methods of phylogenetic analysis that can reconstruct protein domain families are well described, including maximum parsimony and Bayesian maximum likelihood ( Yang 2006 ) methods, and several methods have been developed that can identify positive selection in key amino acid sites in evolving proteins ( Martinez-Castilla and Alvarez-Buylla 2003 ; Hernandez-Hernandez et al. 2007 ; Kelleher et al. 2007 ; Wagstaff and Begun 2007 ).

Despite the intense interest in protein family diversification, however, detailed evolutionary analyses have only been undertaken for a few plant gene families, including those that encode the myb-like ( Rosinski and Atchley 1999 ), homeodomain ( Bharathan et al. 1999 ), MADS-box ( Purugganan et al. 1995 ; Purugganan 1997 ; Martinez-Castilla and Alvarez-Buylla 2003 ; Kramer et al. 2004 ; Nam et al. 2004 ), and proteasomal ( Swaffield and Purugganan 1997 ) proteins. The potential exists for studying how selection of amino acids can occur in a structural context, and a few studies have started to incorporate structural information in the evolutionary analyses of gene families ( Roth and Liberles 2006 ; Zhao and Su 2010 ).

A major obstacle to studying the structural evolution of proteins is the lack of well-defined structures for the vast majority of eukaryotic proteins. Although genomics projects have become adept at obtaining the primary sequence of the entire complement of protein-coding genes in genomes, annotations that depict secondary or tertiary structures remain sparse. It is thus imperative that we develop methods to extend the annotations of genomes by incorporating protein structural information ( Eisenberg et al. 2000 ; Baliga et al. 2004 ; Weston et al. 2003 ).

We have developed methods that take whole-genome protein sequences and provide structural annotation of these proteins. These methods include matching regions of protein sequence to known structures in the Protein Data Bank (PDB) ( http://www.rcsb.org/pdb ) using PSI-BLAST ( Altschul and Koonin 1998 ) and fold recognition methods (Fold and Function Assignment System [FFAS]) ( Jaroszewski et al. 2000 , 2005 ; Rychlewski et al. 2000 ) in order to predict domain boundaries and annotate predicted domains with structure information. Select domains that elude identification by BLAST and fold recognition are then predicted using Pfam ( http://pfam.sanger.ac.uk ) and heuristic approaches and are considered contenders for Rosetta de novo structure prediction ( Kuhlman and Baker 2000 ; Chevalier et al. 2002 ; Gray et al. 2003a , 2003b ; Kuhlman et al. 2003 ; Rohl et al. 2004 ; Rohl 2005 ). Using these approaches, we have previously completed structural/functional annotation of 94 proteomes ( Drew et al. 2011 ), and Rosetta-generated structure predictions have specifically produced functional insights even when function is not evident from sequence-based analyses alone ( Bonneau et al. 2001 , 2002 ).

To date, several plant proteomes have been fully sequenced, with several more currently underway. Like other organisms, however, this increase in availability of sequence information has not been matched with an increase in known plant protein structures or known protein functions. In this work, we have applied our structure-based annotation method to plant genome data to examine the evolution of selected amino acids on plant protein families in the context of their structure. We focused on the angiosperms, which are arguably the most diverse major plant group on the planet, with over 260,000 living species, in more than 450 families ( Angiosperm Phylogeny Group 2003 ; Jiao et al. 2011 ).

We have predicted structural domains for all known proteins in five flowering plant proteomes— A. thaliana , Oryza sativa , Populus trichocarpa , Sorghum bicolour , and Vitis vinifera . To our knowledge, this is the largest database of inferred protein structures currently available for plants. Using OrthologID (OID) ( Chiu et al. 2006 ), we have also identified ∼15,000 gene families (categorized as alignments with at least two sequences) within these five proteomes. Using codon-based models, we have done selection analysis on amino acid sites ( Yang 2007 ) for 2,120 gene families and examined the structural context of these positively selected sites. Finally, we placed these positively selected sites in a structural context by highlighting and visualizing these sites on corresponding three-dimensional predicted protein structures.

Materials and Methods

Genomic and Proteomic Data

Phylogenies and alignments were generated from analysis of five plant species which have complete genome sequences available: A. thaliana ( http://www.arabidopsis.org/ Version: 9), O. sativa (cv. nipponbare) ( http://rice.plantbiology.msu.edu/ , Version: MSU6), V. vinifera ( www.Gramene.org , Version: 2007-12-IGGP), P. trichocarpa ( http://genome.jgi-psf.org/ Version: 2004-12-JGI), and S. bicolour ( http://genome.jgi-psf.org/ , Version: Sbi1). The complete proteomic and genomic sequences were downloaded from the Gramene Web site ( http://www.gramene.org/ ) using BioMart ( www.biomart.org ). Both A. thaliana and O. sativa annotations were listed as fully complete, with the remaining three taxa in draft assembly.

Protein/Gene Family Choice

To identify gene families, we modified OID ( Chiu et al. 2006 ), which is a semiautomated homology search and phylogeny reconstruction pipeline. OID was modified to remove the MAFFT alignment refinement step (Katoh et al. 2002) , which removes ambiguously aligned regions. This allowed for the OID protein alignments to be mapped correctly to their respective nucleotide alignments. Using OID, gene family alignments and phylogenies were generated from the annotated protein-coding genes in our study species using BLAST ( Altschul et al. 1990 ), with an expectation cutoff of e −20 . OID produced 14,822 putative gene families, with family sizes ranging from n = 2–1,600 gene sequences (see fig. 1 ). We define gene family based on sequence information with a BLAST analysis cutoff e −20 , a stringent cutoff has been used in previous studies (for example, Kinsella et al. 2003 ; Xiao et al. 2007 ) to delineate gene/protein families. Not included in this count are the 66,478 orphan sequences ( n = 1) produced by OID. Due to the high number of OID results that contained less than 10 gene sequences, we limited our final analysis to only those gene families with at least 10 sequences regardless of whether each alignment included a representative from each species. This reduced the data set to 2,230 gene families, each corresponding to a plant protein domain family.

F IG . 1.—

Size distribution of plant protein families based on initial OID analyses.

F IG . 1.—

Size distribution of plant protein families based on initial OID analyses.

Due to the large evolutionary distances between the organisms used in this study (∼120–200 Myr between monocots and dicots), the alignments were modified to remove excessive gapped regions. CodeML estimates ancestral sequences for alignments when estimating whether positive selection has occurred within the protein family. Excessive gapped regions make this task computationally difficult (i.e., computational run times >600 h) and increase the possibilities of misalignments between homologous sequences; thus, large gapped alignments are not suitable for use in CodeML. Using multiple cutoff options, for both columns and rows, we determined the threshold percentage of gaps that allowed for feasible use with CodeML. Based on this, we culled each alignment in two ways. If the length of an aligned sequence contained at least 70% gaps, the sequence was removed. Furthermore, if more than 30% of a column in the resulting alignment was comprised of gaps, that column was removed from the alignment. Inspection of the data at this point showed alignments with greater than 100 sequences were still excessively gapped and would not be suitable for CodeML analysis, so we removed these alignments from our analysis. This further reduced our data set by 110 gene families, to 2,120 families. This left a count of 46,667 sequences in our final analysis, with alignment sizes ranging from 10 to 100 sequences.

Gene family groups with known function were initially identified using the known families listed on The Arabidopsis Information Resource (TAIR) Web site ( http://www.arabidopsis.org ). In addition, we annotated the gene families in our analysis with known functional groups gleaned from the literature, as well as homologous relationships found in the Gramene Web site. This led to the identification of 192 families with previously annotated/known functions and containing at least two members of each of the five species (ca. 30,000 sequences).

Positive Selection Analysis

Positively selected sites were predicted using CodeML from the PAML package ( Yang 2007 ). CodeML uses different evolutionary models to account for differences in transition/transversion rates in DNA, and also codon usage biases found within degenerate codons, and uses maximum likelihood to estimate the fit of levels of sequence divergence in homologous sequences given these models ( Bishop et al. 2000 ; Yang et al. 2000 ). Five models were used in this analysis: M0-M3 and M7-M8 with ambiguous residue positions included and using the F3X4 codon frequency model.

CodeML calculates the ratio of synonymous (d S ) and nonsynonymous (d N ) changes that have occurred at each codon with a d N /d S > 1 suggestive of positive selection. Each CodeML model builds on the preceding one and adds additional d N /d S (ω) classes. The most basic model M0, assuming all sites are undergoing negative/deleterious selective pressure (d N /d S < 0; 1 class of sites). M1 allows for some sites to be under neutral selection (d N /d S = 1), whereas M2 allows for some sites to be under positive selection (d N /d S > 1). As each model is applied to the data more complex parameters are applied, allowing for multiple d N /d S classes in the data set. The most complex model we used was M8, which allows for 13 classes of ω sites. If no positive selection was found using the basic M2 and M3 selection models, we did not proceed to the detailed codon selection analysis using models M7 and M8. For a more detailed description of CodeML classes, models, and the statistical analyses involved, see Yang et al. (2000) .

Protein Domain Prediction

Ginzu ( Kim et al. 2005 ) was used to predict domains for all proteins in our five proteomes (213,587 proteins). Ginzu first searches for sequence matches to known three-dimensional protein structures using PSI-BLAST ( Altschul and Koonin 1998 ), providing structural information for predicted domains. Ginzu searches then for matches to experimental structures using fold recognition ( Jaroszewski et al. 2005 ). Domains in regions of protein not matched by PSI-BLAST or fold recognition are predicted using matches to Pfam and a heuristic ( Chivian et al. 2003 ) that predicts domain boundaries based on patterns within multiple sequence alignments (MSAs). These latter domains, which lack structural information, are then exported (if <165 amino acids) for external structure prediction via Rosetta, resulting in domain and structural predictions of varying confidence for all proteins considered in the Plant Proteome Folding Pipeline (Plant PFP). Using these methods resulted in 409,017 predicted domains. Although we predicted domains for all five proteomes, protein-folding structure prediction was performed only on A. thaliana and O. sativa due to the large computational overhead required for folding proteins.

Prediction of Protein Structural Elements and Positive Site Mapping

PSIPRED ( McGuffin et al. 2000 ) and DISOPRED2 ( Ward et al. 2004 ) were used to predict protein folds and disordered regions, respectively. PSIPRED uses neural networks to analyze the position-specific scoring matrices produced from PSI-BLAST to infer secondary structure and is one of the top secondary structure prediction methods available. Disordered regions are defined as those regions that do not fold into a three-dimensional structure in their native state. Disordered regions are flexible, dynamic, and can be partially or completely unfolded in solution. DISOPRED2 uses known structural information, coupled with sequence records, to infer disordered regions.

For each residue position in an alignment, we use PSIPRED and DISOPRED2 to categorize specific amino acid sites into what secondary structure element they were found. If at least 80% of residues at a particular alignment position were predicted to be of the same fold/residue type, we classed the amino acid site as belonging to that type. For those sites that did not show outright support for a particular protein fold/class, we classified it as a mixed site.

The Dictionary of Secondary Structure Proteins (DSSP) ( Kabsch and Sander 1983 ) was also used for additional secondary structure annotation information. Using PDB atomic coordinates, DSSP defines secondary structure, geometrical features, and solvent exposure of proteins. We also obtained Structural Classification of Proteins (SCOP) structural information. The SCOP ( Murzin et al. 1995 ) database uses manual inspection, with the help of automated methods, to predicted structural and evolutionary relatedness.

Results

Database Overview

The Plant PFP Web site is available at http://pfp.bio.nyu.edu and currently represents 213,587 proteins from our five chosen organisms along with their respective protein structure predictions. The Ginzu pipeline as described by Drew et al. (2011) was used to analyze 211,140 of these proteins, skipping 2,447 proteins due to their excessive length or high percentage of residues predicted to be in disordered regions. From these 211,140 proteins, Ginzu produced 409,017 domains (listed as “Domains” in table 1 ). The 173,820 domains predicted by PSI-BLAST and fold recognition methods were automatically associated with their top matching PDB structure. The remaining 235,197 domains were considered for Rosetta de novo structure prediction. Twenty-nine thousand two hundred and two domains were returned from Rosetta with predicted structures, 4,769 of which are considered to be high confidence, where high confidence is determined by a MAMMOTH Confidence Metric (MCM) score of 0.8 or greater, which correlates to the high atomic accuracy of the predicted structure in relation to the native structure (see table 2 ) ( Drew et al. 2011 ). To evaluate the accuracy of high-confidence structure predictions, a double-blind benchmarking of the structural analyses methods were used, and these correctly predicted 47% of structures using SCOP (v1.67) superfamily classifications ( Drew et al. 2011 ), which is high for computational structure prediction. Comparison of the predicted and experimentally determined structures showed a strong correlation in structure prediction ( Drew et al. 2011 ).

Table 1

The Number of Proteins and Predicted Protein Domains for Each Organism within the Plant PFP Database

Organism Number of Proteins Number of Domains  Domains with Known PDB Structures a  Domains without Known PDB Structures b 
Arabidopsis thaliana 36,350 63,748 31,200 32,548 
Oryza sativa 67,393 150,986 57,828 93,158 
Populus trichocarpa 43,000 71,171 32,443 38,728 
Sorghum bicolour 36,410 68,644 27,722 40,922 
Vitis vinifera 30,434 54,468 24,627 29,841 
Organism Number of Proteins Number of Domains  Domains with Known PDB Structures a  Domains without Known PDB Structures b 
Arabidopsis thaliana 36,350 63,748 31,200 32,548 
Oryza sativa 67,393 150,986 57,828 93,158 
Populus trichocarpa 43,000 71,171 32,443 38,728 
Sorghum bicolour 36,410 68,644 27,722 40,922 
Vitis vinifera 30,434 54,468 24,627 29,841 
a

Based on PSI-BLAST, fold recognition methods.

b

Domain structures based on Pfam, MSA, and heuristic methods.

Table 2

De Novo Structure Predictions for Domain That Did Not Map to Known PDB Structures

Organism All De Novo Domain Predictions De Novo Domain Predictions (>0.8 confidence) 
Arabidopsis thaliana 9,631 1,618 
Oryza sativa 19,541 3,146 
Populus trichocarpa 14 
Sorghum bicolour 16 
Vitis vinifera 
Organism All De Novo Domain Predictions De Novo Domain Predictions (>0.8 confidence) 
Arabidopsis thaliana 9,631 1,618 
Oryza sativa 19,541 3,146 
Populus trichocarpa 14 
Sorghum bicolour 16 
Vitis vinifera 

Data Visualization

To facilitate the exploration of predicted sites of positive selection mapped onto structures and the exploration of our predicted domain families, we have constructed a web interface to our resource ( http://pfp.bio.nyu.edu ). This web interface allows searching via accession or ontology term, by sequence using BLAST, or by searching the list of predetermined functional families by name (i.e., “bHLH transcription factor”). User selection of a family group produces an initial page showing the phylogeny and sequence identifiers, with each identifier listing the mapped protein domains.

This initial window contains three tabs—JALview, JMol, and Phylowidget. JALview ( Waterhouse et al. 2009 ) is used to display the MSA of the proteins in the family, Phylowidget ( Jordan and Piel 2008 ) to display the family's phylogenetic tree, and JMol ( http://www.jmol.org ) to view the three-dimensional structure (either a PDB structure or a Rosetta de novo predicted structure) of the individual proteins of the family. The Web site also provides a display that shows the division of family proteins into domains via the Ginzu program and displays the methods by which Ginzu predicted each domain.

Selecting a mapped protein domain within a sequence that has a structure annotation and then selecting the “Load in JMol” link loads the domain into the JMol tab, with the sites of positive selection highlighted (see fig. 2 for a JMol view). If a domain maps to a known protein structure in PDB, a link to this protein is included. Positively selected sites are highlighted along the structure in blue, whereas the atoms of selected residues are circled in yellow, which enables the choosing of specific positively selected residues within the alignment to be viewed easily on the structure.

F IG. 2.—

Sample structure output in the Plant PFP. This example is of an MADS-box domain interacting with DNA, as depicted in PDB. The highlighted regions in blue spheres show regions of predicted positive selection. The gray indicates regions in the PDB structure that were not predicted to structurally map to the plant MADS-box domain or an alternate PDB chain.

F IG. 2.—

Sample structure output in the Plant PFP. This example is of an MADS-box domain interacting with DNA, as depicted in PDB. The highlighted regions in blue spheres show regions of predicted positive selection. The gray indicates regions in the PDB structure that were not predicted to structurally map to the plant MADS-box domain or an alternate PDB chain.

Positive Selection Prediction for Protein Families

Using likelihood ratio tests and d N /d S values from model M8 of CodeML, we determined selection had occurred in 175 gene families, which indicated that 8.2% of families show selection. In total, there were over 4,000 sites that showed positive selection. Ignoring families that did not contain any selected sites with >95% confidence in prediction ( P > 0.95 for d N /d S > 1), resulted in 938 selected sites in 122 gene families. The majority of families (97) had 10 or less sites of positive selection. Although this created a very conservative data set, it increased confidence in the resulting analyses. There were 10–95 sequences per alignment, with a mean size of 20. Alignment length ranged from 51 to 1,372 residues, with a mean sequence length of 321 residues. Mapping these results back to the families where a gene function has been associated, we found 43 gene families mapped to 19 known functions ( supplementary table 1 , Supplementary Material online).

Initial codeML undertook analyses using sequence alignments for gene families that contained both paralogues as well as orthologues. With the high level of gene duplication and polyploidization that has occurred within plant genomes, identification of orthologous genes can prove difficult. Previous work on a comparison of paralogous and orthologous evolutionary rates by Conant et al. (2007) found that the evolutionary rates do not differ significantly between orthologs and paralogs. Nevertheless, we also expanded on the current analysis, by separating gene families with positively selected sites into separate paralogous sequence alignments from each individual species. We used only alignments that 1) originally contained more than one species and 2) contained at least seven sequences from a species and undertook codeML analyses using the same parameters as previously described. From the 122 alignments, we found that 101 met our criteria. We then separated out 142 separate paralogous gene families of individual species from these 101 alignments and reanalyzed the individual sequence alignments for positive selection. A reanalysis for sites of positive selection that included alignments from multiple species (i.e., potential orthologues) had 923 sites of positive selection ( P > 0.95). We find that 722 identified selected sites overlapped in the two analyses, which suggests that our original analyses are able to identify ∼80% of the positively selected sites found in paralogous gene families (while also identifying sites that were selected between putative orthologues). It must be noted that analyzing only the paralogues also identified sites that were not observed in the larger data sets; however, we feel that the results from the larger data sets that also include putative orthologues provide greater power and confidence in the selection analyses.

One other issue is whether the sequences we were analyzing were too divergent, and the synonymous sites were saturated. Previous work suggests that codon-based models estimating selection on protein sequences are valid only when synonymous site divergence of d S < 5 ( Alvarez-Ponce et al. 2011 ). We calculated the mean pairwise d S for each of our gene families and only 12.4% having d S < 5. However, sequence alignments for which we found positive selection had 45.8% with a mean pairwise d S < 5, and the majority of these had d S values between 0 and 1. If we only consider sequence alignments at the range of d S < 5, we find that ∼21% of these gene families show evidence for positive selection.

Amino Acid Properties of Positively Selected Sites

Using amino acid properties from Lise and Jones (2005) , we categorized selected and nonselected residues using eight properties (hydrophobic, polar, small, aliphatic, aromatic, positive, negative, and unique). We changed the property “proline” from Lise and Jones to “unique,” which included glycine and proline, due to their distinctive properties.

Comparing selected versus nonselected residues for each property showed statistically significant differences in five properties ( P < 0.0001): hydrophobic, polar, small, aliphatic, and positive. There was a significant increase in selected residues with polar, small, and positive categories and a less than expected number of residues within aliphatic and hydrophobic categories. There was a slight increase in negatively charged residues within selected sites, but the results were not much different from expected (see table 3 ). These results indicate that hydrophilic, small, and positively charged residues are possibly more prone to evolve with positive selection than other residue types.

Table 3

Each Amino Acid in an Alignment Was Categorized into Eight Amino Acid Properties

Properties Observed Expected P value a 
Hydrophobic 12,671 14,589 <0.00001 
Polar 13,021 10,774 <0.00001 
Small 11,231 10,795 <0.00001 
Aliphatic 3,175 4,760 <0.00001 
Aromatic 2,449 2,459 <0.8384 
Positive 3,695 3,053 <0.00001 
Negative 2,683 2,511 <0.0002237 
Unique 2,513 2,489 <0.5602 
Properties Observed Expected P value a 
Hydrophobic 12,671 14,589 <0.00001 
Polar 13,021 10,774 <0.00001 
Small 11,231 10,795 <0.00001 
Aliphatic 3,175 4,760 <0.00001 
Aromatic 2,449 2,459 <0.8384 
Positive 3,695 3,053 <0.00001 
Negative 2,683 2,511 <0.0002237 
Unique 2,513 2,489 <0.5602 
a

Significance of positive correlation between number of positively selected sites and polarity, size, and positive charge while hydrophobic and aliphatic residues showed a strong negative correlation with positive selection.

Clustering of Selected Sites

To examine the possibility that the length of the protein sequence impacts the relative number of selected sites, we examined the distribution of selected sites in relation to the length of the sequence alignments, both the original alignments and the alignments that were culled of excessively gapped regions. We used a Spearman's rank correlation test to examine the relationship between sequence length and number of selected sites (i.e., as protein length increases, does also the number of positively selected sites). For proteins with the same length, the mean number of selected sites was used. The closer the score is to ±1, the more correlated the two variables. Spearman’s rank correlation scores of 0.16 ( P = 0.07) and 0.33 ( P = 0.0002) were found for the culled alignment lengths and the original alignment lengths, respectively. These results indicate that, at least when considering the unculled alignment lengths, there does appear to be a moderately low but significant correlation between increased length and number of positively selected sites.

We looked to see whether the sites under positive selection showed evidence for clustering along the sequence. Using a pairwise distance measure, for each alignment we found the average distance between sites of selection along the primary sequence. We chose an equal number of sites at random and found the average pairwise distance in the data set, and we repeated this random selection of sites 1,000 times. Using a 95% cutoff, we found 40% of alignments had a smaller pairwise distance between selected sites compared with 95% of the random distances. This increased to 58% if we use a 90% cutoff. This suggests that there is significant sequence-space clustering of positively selected sites in a substantial number of proteins.

To corroborate this result, we did a sliding window analysis using a window size of five residues moving one amino acid at a time. Comparing the location of selected sites, with 10,000 permutations based on random site selection, we found that in only one case did the number of windows containing random amino acid sites equal the number of windows containing actual sites of selection. Together, these results indicate that the majority of selected sites are clustered within plant protein gene families.

Secondary Structure Prediction of Positively Selected Sites

We predicted the secondary structure of all sequences using PSIPRED and created secondary structure alignments. When comparing the secondary structure distribution of positively selected versus nonpositively selected residues, we found a significant reduction in predicted helical ( P < 0.0001) and strand residues ( P < 0.0001) that contain selected sites (see fig. 3 ). There appears to be slight reduction of predicted coiled residues with selected sites, although the results were not significantly different than expected. Interestingly, 66% of sites could not be classed within any definitive secondary structure type and were classed as of mixed structure.

F IG. 3.—

Distribution of positively selected residue positions in protein structure elements. ( A ) Distribution according to secondary structural elements. ( B ) Distribution of positively selected residues in ordered and disordered protein regions. The gray bars are the observed and the black bars are expected numbers.

F IG. 3.—

Distribution of positively selected residue positions in protein structure elements. ( A ) Distribution according to secondary structural elements. ( B ) Distribution of positively selected residues in ordered and disordered protein regions. The gray bars are the observed and the black bars are expected numbers.

Looking more closely at this “mixed” group, we divided the data into amino acid positions that showed only combinations of coils and sheets (CS), coils and helices (CH), sheets and helices (SH), and positions that still contained all types of secondary structure elements (ALL). In each of these mixed group categories, the positively selected sites were significantly overrepresented (CH, P < 0.006; CS and ALL, P < 0.0001; SH, P <0.03).

Initially if an amino acid position was found to occur in particular secondary structure type in at least 80% of the sequences in an alignment, we classified that amino acid position as being of that type. Modifying this to include 60%, 70%, 90%, and 100% cutoffs, we found all secondary structure element counts remained the same, except for counts for coils, and within the mixed group CS. As the stringency of the cutoff increased, the number of coiled-only regions decreased and the number of CS regions increased within both selected and nonselected positions, indicating that the change from coils to sheets or vice versa may be more flexible than other structural changes.

Using DISOPRED2 we found that positively selected amino acid sites (compared with nonselected sites) were found largely in disordered regions. We found an increased number of sites of positive selection predicted to be disordered than expected ( P < 0.0001) (see fig. 3 ). As disordered regions can undergo different folding conformations, our results could indicate that such flexible regions are under higher selective pressure. We found 65.7% of sites to be categorized as mixed. Residues with a mixture of different secondary structure element types and order/disordered etc. suggest that areas where structure changes from one fold type to another may be targets of adaptive evolution.

Solvent Accessibility

Using DSSP, we obtained the Relative Accessible Surface Area (RASA) for the plant proteins in our analysis. RASA provides a measure of how exposed to a solvent an amino acid residue is within a protein structure; the lower the RASA score, the more buried the residue. The RASA score is between 0 and 1 with 0 being completely buried. We had RASA scores for 17,738 amino acids in our data. Of the 938 positively selected residues, we found RASA scores for 454. If a particular residue had more than one RASA score, we averaged the scores for this residue, unless the residue has a PDB score, in which case we used score that only. Multiple RASA scores were the result of the multiple methods used in the structure prediction (Rosetta, Ginzu). We divided the residues from all alignments into three groups, selected sites, conservative sites (those sites which showed no amino acid replacement), and all others.

A Wilcoxon test showed a significant difference ( P = 0.0001) between the distribution of RASA scores within selected sites compared with conservative sites, with conservative sites being more buried, suggesting more selective pressure occurs on more solvent-exposed residues. This is consistent with our analysis on amino acid properties, which showed selected sites to be more hydrophilic than no selected sites. Comparison of selected sites with all other sites did not show a significant difference ( P = 0.2) (see fig. 4 ).

F IG . 4.—

Distribution of RASA values across amino acid sites in protein structures. The distributions are shown for all sites, conserved, and positively selected amino acid sites. Although the distributions are similar, a t -test showed significant differences in RASA values between conserved and positively selected amino acid sites.

F IG . 4.—

Distribution of RASA values across amino acid sites in protein structures. The distributions are shown for all sites, conserved, and positively selected amino acid sites. Although the distributions are similar, a t -test showed significant differences in RASA values between conserved and positively selected amino acid sites.

Case Study I: A Misannotated C2H2 Zinc-Finger Transcription Factor Family Includes TPR Domain Proteins

To demonstrate some of the capabilities of our structure database, we examine in greater detail two examples of gene families where we found evidence for positive selection. The first family is an OID group ( http://pfp.bio.nyu.edu/family/18894 ) that is a family of 11 A. thaliana proteins that are annotated on the TAIR family Web site as comprising C2H2 transcription factor proteins. C2H2 transcription factor proteins contain a zinc-finger domain and are involved in a wide range of functions, including DNA- and RNA-binding and protein–protein interactions ( Englbrecht et al. 2004 ). Of these 11 proteins, TAIR lists seven of them as containing a zinc-finger domain. Our analysis of the domain architecture of this family of 11 proteins suggests it has possibly been misannotated. Searching each sequence manually against GenBank revealed the sequences did not return a hit to zinc-finger domain–containing proteins.

Our Ginzu and Rosetta analyses predicted that these sequences had between four and seven domains, with all but two N- and C-terminal regions mapping to known domains in the PDB. Nonterminal regions were predominantly unassigned, as domain boundaries were identified using less confident heuristic methods. Seven unique domains were mapped to the N-terminal region. Checking the PDB structures for these domains, all were from proteins characterized as TPR domain–containing proteins. TPR domains, first classified in yeast, are a repeating helix-turn-helix motif containing approximately 34 amino acids that are involved in multiple biological functions ( Sikorski et al. 1990 ; Das et al. 1998 ; Main et al. 2003 ; Kajander et al. 2007 ). Functionally, TPR domains have been linked to many roles including Hsp90 mediation, scaffolding proteins, and transcription ( Das et al. 1998 ; Main et al. 2003 ; Wilson et al. 2005 ; Yang et al. 2005 ; Kajander et al. 2007 ). Within this alignment, we found 54 sites of positive selection with greater than 95% confidence, of which 41 occurred within the predicted N-terminal domain (see fig. 5 ). Within the C-terminal, domains from four PDB structures were mapped, which were described as having ubiquitin-related functions.

F IG . 5.—

PDB structure 1na0:B, an idealized TRP domain, is one of the domains mapped to the N-terminal of this OID family. Highlighted regions in blue are sites of positive selection with at least 95% confidence of prediction support. In this example, the gray regions were not predicted by our analysis to map structurally to our protein.

F IG . 5.—

PDB structure 1na0:B, an idealized TRP domain, is one of the domains mapped to the N-terminal of this OID family. Highlighted regions in blue are sites of positive selection with at least 95% confidence of prediction support. In this example, the gray regions were not predicted by our analysis to map structurally to our protein.

Case Study II: Selection in an F-Box Protein Family Is C-terminal to the F-box Domain

Our second case example is a family ( http://pfp.bio.nyu.edu/family/19187 ), which contains 23 sequences from S. bicolour and O. sativa and are annotated as F-box proteins. F-box proteins, first described as part of the SCF ( S kp, C ullin, F -box) complex, are involved in ubiquitination and are characterized by a structural motif of approximately 50 residues and constitute one of the largest plant multigene families ( Kipreos and Pagano 2000 ; Xu et al. 2009 ). F-box proteins are comprised of an N-terminal F-box domain, which interact with Skp1, a linker domain and varying C-terminal domains, which are used to recruit substrates ( Jin et al. 2004 ; Petroski and Deshaies 2005 ; Hao et al. 2007 ).

Ginzu and Rosetta mapped domains from 14 proteins in this family to domains in three PDB structures. Twelve proteins mapped to the same PDB structure: 2ovr, along the entire length of the sequence, which includes a WD40 domain at the C-terminal end. Of the other two mapped structures, 1p22 and 2e32, 1p22 also contains a WD40 domain. The third structure, 2e32, differs by having a sugar-binding domain (SBD) rather than a WD40 domain at the C-terminal end. Although these proteins differ in the C-terminal region, they were found to be in the same family due to the presence of a conserved F-box domain (see fig. 6 ).

F IG . 6.—

SBD of F-box protein 2e32. Areas highlighted in blue are positively selected sites with high confidence and in red are those residues indicated to be involved in substrate binding ( Mizushima et al. 2004 , 2007 ). The gray regions were not predicted by our analysis to map structurally to our protein. Not included here is the F-box domain, which did not have sites inferred to have evolved with positive selection.

F IG . 6.—

SBD of F-box protein 2e32. Areas highlighted in blue are positively selected sites with high confidence and in red are those residues indicated to be involved in substrate binding ( Mizushima et al. 2004 , 2007 ). The gray regions were not predicted by our analysis to map structurally to our protein. Not included here is the F-box domain, which did not have sites inferred to have evolved with positive selection.

The complete protein sequence alignment of this family is 408 residues in length, and we inferred 24 sites with positive selection at the 95% confidence level, all of which occurred within the latter half of this alignment. Adaptive evolution does not appear to target the N-terminal F-box domain but appears to be concentrated on the C-terminal domain. Analysis of the positive selection found within the SBD of the PDB structure 2e32 showed selection occurring in proximity to the substrate-binding region, although the residues involved in substrate binding where not themselves under selection ( Mizushima et al. 2004 , 2007 ).

The PDB structure 2ovr is the Fbw7-Skp1-CyclinE complex ( Hao et al. 2007 ). This protein complex is part of the CyclinE degradation pathway and is important in cell cycle regulation ( Zhang and Koepp 2006 ; Hao et al. 2007 ). The other PDB structure mapped, 1p22 is a β-TrCP1-Skp1-β-Catenin complex, also important in cell cycle regulation ( Wu et al. 2003 ). The WD40 domain contains sites of positive selection (see fig. 7 ). Although both 2ovr and 1p22 contain WD40 domains, they bind different substrates at this position, namely CyclinE and β-catenin, respectively. Prediction of positive selection within this domain could indicate that proteins with multiple binding affinities may be under increased selective pressure. Mapping positive selection to the PDB structure 2ovr, we found three amino acid sites involved in protein binding that are predicted to be positively selected.

F IG . 7.—

PDB structures of the C-terminal WD40 domain in the F-box protein. Residues highlighted in blue are positively selected residues. Residues highlighted in red are substrate-binding residues ( Wu et al. 2003 ; Hao et al. 2007 ). The gray regions were not predicted by our analysis to map structurally to our protein. ( A ) The PDB structure of 1p22. ( B ) The WD40 C-terminal domain of the PDB structure 2ovr. Positively selected residues appear in close proximity to substrate-binding residues. Highlighted in black are three residues that show evidence for positive selection and are also involved in substrate binding.

F IG . 7.—

PDB structures of the C-terminal WD40 domain in the F-box protein. Residues highlighted in blue are positively selected residues. Residues highlighted in red are substrate-binding residues ( Wu et al. 2003 ; Hao et al. 2007 ). The gray regions were not predicted by our analysis to map structurally to our protein. ( A ) The PDB structure of 1p22. ( B ) The WD40 C-terminal domain of the PDB structure 2ovr. Positively selected residues appear in close proximity to substrate-binding residues. Highlighted in black are three residues that show evidence for positive selection and are also involved in substrate binding.

Discussion

Generating structure predictions for these proteomes allows us to delve further into the underlying occurrences of selection and selective pressure within plant proteins. Positive selection appears to be a nonrandom occurrence within proteins, occurring in cluster along the alignment length, which could be indicative of pressure on a particular fold or protein region to change. Our results demonstrate that positive selection also occurs more often within particular elements and areas of structural fold change within protein structures. In particular, the numbers of selected residues were less than expected in helical and strand elements. Interestingly, there were significantly more selected sites among residues that were associated with a change in protein structure (e.g., coil to strand or vice versa). Disordered residues also showed an increase in positively selected sites. It is known that many disordered regions become ordered upon binding ( Wright and Dyson 1999 ), as well as having affinity to bind multiple proteins ( Dunker et al. 2005 ). Within our analysis of plant protein structures, we found, similar to work previously done on Drosophila ( Ridout et al. 2010 ), that certain secondary structure elements plus disordered residues have more positive selection than others. Our results indicate that areas of possible evolutionary change, be it as a disordered region or a secondary structure region, may be under greater positive selective pressure than more structurally conserved ordered regions of the protein.

Previous studies on positive selection within plant genomes have usually focused on single, or small number, of plant families ( Zhang et al. 2001 ; Mondragon-Palomino et al. 2002 ; Mondragon-Palomino and Gaut 2005 ; Li et al. 2009 ; Palme et al. 2009 ; Kapralov et al. 2011 ; Moury and Simon 2011 ; Strasburg et al. 2011 ). To our knowledge, the largest current study published on plant adaptive evolution is by Roth and Liberles (2006) who used The Adaptive Evolution Database to predict positive selection in 4,216 seed plant gene families. They found 87 families showed positive selection; however, most of these families contained only two proteins. We have analyzed over 2,000 gene families with at least 10 sequences per family for positive selection, one of the largest plant protein family analyses to date, and found selection in 175 families.

In agreement with previous work ( Gossmann et al. 2010 ), the amount of positive selection we found within plant species appears to be low. Eight percent of our plant protein families appear to be undergoing selection, which is much smaller than estimates within nonplant species ( Smith and Eyre-Walker 2002 ; Halligan et al. 2010 ). This could be indicative of the effective population size of plants being smaller relative to other species ( Bustamante et al. 2002 ; Strasburg et al. 2011 ).

Mapping of positively selected sites to our known structures suggests that amino acid sites that show evidence for positive selection cluster within a protein sequence, a result also found in a previous analysis on Drosophila ( Ridout et al. 2010 ). Ridout et al. (2010) , however, found that N-, C-terminal regions appear to contain more positively selected sites, something we do not find in our plant analysis, where selection appears to be spread through out the protein. Due to the high amount of divergence between our original sequences, we culled our sequences by removing extraneous gapped regions. Most of these gapped deletions were in N-, C-terminals, giving us a possible sampling bias, which may have removed sites that could be undergoing selection and therefore missing additional selected regions, which could have mirrored the results found in Drosophila.

We also find that sites undergoing positive selection appear to be less buried than wholly conserved sites within protein structures. Previous work by Liu et al. (2008) , Roth and Liberles (2006) , and Petersen et al. (2007) , working with human single nucleotide polymorphism data, seed plants and Escherichia coli , respectively, all found similar results in their data, suggesting that less buried residues are under less selective pressure in multiple species.

Linking protein structure, function, and evolution has been one of the key goals of molecular evolution. The availability of whole-genome sequences has allowed investigators access to an inventory of all proteins in an organism, and the availability of data across multiple species allows for a comparative analysis in a phylogenetically informed approach. We have used several tools for protein structure domain identification and prediction that we had previously applied to multiple proteomes ( Drew et al. 2011 ) and have now developed a similar database for plant proteins. As we describe in this study, this provides us with both general trends in the evolution of plant protein families, as well as allows us to highlight evolution of specific examples—in this case a C2H2 and an F-box family. As investigators exploit this database, we may be able to identify even more compelling trends in the structural evolution of plant proteins.

Supplementary Material

Supplementary figure S1 is available at Genome Biology and Evolution online ( http://www.gbe.oxfordjournals.org/ ).

We would like to thank the volunteers participating in the Human Proteome Folding Project on IBM’s World Community Grid and the FFAS03 development team, and the Rosetta Commons and the Yeast Resource Center. This work was supported in part by grant National Science Foundation (NSF) (DBI-0820757).

References

Altschul
SF
Gish
W
Miller
W
Myers
EW
Lipman
DJ
Basic local alignment search tool
J Mol Biol.
 , 
1990
, vol. 
215
 (pg. 
403
-
410
)
Altschul
SF
Koonin
EV
Iterated profile searches with PSI-BLAST—a tool for discovery in protein databases
Trends Biochem Sci.
 , 
1998
, vol. 
23
 (pg. 
444
-
447
)
Alvarez-Ponce
D
Aguade
M
Rozas
J
Comparative genomics of the vertebrate insulin/TOR signal transduction pathway: a network-level analysis of selective pressures
Genome Biol Evol.
 , 
2011
, vol. 
3
 (pg. 
87
-
101
)
Angiosperm Phylogeny Group
An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: aPG II
Bot J Linn Soc.
 , 
2003
, vol. 
141
 (pg. 
399
-
436
)
Baliga
NS
, et al.  . 
Systems level insights into the stress response to UV radiation in the halophilic archaeon Halobacterium NRC-1
Genome Res.
 , 
2004
, vol. 
14
 (pg. 
1025
-
1035
)
Bharathan
G
Janssen
BJ
Kellogg
EA
Sinha
N
Phylogenetic relationships and evolution of the KNOTTED class of plant homeodomain proteins
Mol Biol Evol.
 , 
1999
, vol. 
16
 (pg. 
553
-
563
)
Bishop
JG
Dean
AM
Mitchell-Olds
T
Rapid evolution in plant chitinases: molecular targets of selection in plant-pathogen coevolution
Proc Natl Acad Sci U S A.
 , 
2000
, vol. 
97
 (pg. 
5322
-
5327
)
Bonneau
R
Tsai
J
Ruczinski
I
Baker
D
Functional inferences from blind ab initio protein structure predictions
J Struct Biol.
 , 
2001
, vol. 
134
 (pg. 
186
-
190
)
Bonneau
R
, et al.  . 
De novo prediction of three-dimensional structures for major protein families
J Mol Biol.
 , 
2002
, vol. 
322
 (pg. 
65
-
78
)
Bustamante
CD
, et al.  . 
The cost of inbreeding in Arabidopsis
Nature
 , 
2002
, vol. 
416
 (pg. 
531
-
534
)
Chevalier
BS
, et al.  . 
Design, activity, and structure of a highly specific artificial endonuclease
Mol Cell.
 , 
2002
, vol. 
10
 (pg. 
895
-
905
)
Chiu
JC
, et al.  . 
OrthologID: automation of genome-scale ortholog identification within a parsimony framework
Bioinformatics
 , 
2006
, vol. 
22
 (pg. 
699
-
707
)
Chivian
D
Robertson
T
Bonneau
R
Baker
D
Ab initio methods
Methods Biochem Anal.
 , 
2003
, vol. 
44
 (pg. 
547
-
557
)
Conant
GC
Wagner
GP
Stadler
PF
Modeling amino acid substitution patterns in orthologous and paralogous genes
Mol Phylogenet Evol.
 , 
2007
, vol. 
42
 (pg. 
298
-
307
)
Conery
JS
Lynch
M
Nucleotide substitutions and the evolution of duplicate genes
Pac Symp Biocomput.
 , 
2001
, vol. 
6
 (pg. 
167
-
178
)
Conrad
B
Antonarakis
SE
Gene duplication: a drive for phenotypic diversity and cause of human disease
Annu Rev Genomics Hum Genet.
 , 
2007
, vol. 
8
 (pg. 
17
-
35
)
Das
AK
Cohen
PW
Barford
D
The structure of the tetratricopeptide repeats of protein phosphatase 5: implications for TPR-mediated protein-protein interactions
EMBO J.
 , 
1998
, vol. 
17
 (pg. 
1192
-
1199
)
Doolittle
RF
The origins and evolution of eukaryotic proteins
Philos Trans R Soc Lond B Biol Sci.
 , 
1995
, vol. 
349
 (pg. 
235
-
240
)
Drew
K
, et al.  . 
The Proteome Folding Project: proteome-scale prediction of structure and function
Genome Res.
 , 
2011
, vol. 
21
 (pg. 
1981
-
1994
)
Dunker
AK
Cortese
MS
Romero
P
Iakoucheva
LM
Uversky
VN
Flexible nets. The roles of intrinsic disorder in protein interaction networks
FEBS J.
 , 
2005
, vol. 
272
 (pg. 
5129
-
5148
)
Eisenberg
D
Marcotte
EM
Xenarios
I
Yeates
TO
Protein function in the post-genomic era
Nature
 , 
2000
, vol. 
405
 (pg. 
823
-
826
)
Englbrecht
CC
Schoof
H
Bohm
S
Conservation, diversification and expansion of C2H2 zinc finger proteins in the Arabidopsis thaliana genome
BMC Genomics
 , 
2004
, vol. 
5
 pg. 
39
 
Gossmann
TI
, et al.  . 
Genome wide analyses reveal little evidence for adaptive evolution in many plant species
Mol Biol Evol.
 , 
2010
, vol. 
27
 (pg. 
1822
-
1832
)
Gray
JJ
, et al.  . 
Protein-protein docking predictions for the CAPRI experiment
Proteins
 , 
2003
, vol. 
52
 (pg. 
118
-
122
)
Gray
JJ
, et al.  . 
Protein-protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations
J Mol Biol.
 , 
2003
, vol. 
331
 (pg. 
281
-
299
)
Hahn
MW
Han
MV
Han
SG
Gene family evolution across 12 Drosophila genomes
PLoS Genet.
 , 
2007
, vol. 
3
 pg. 
e197
 
Halligan
DL
Oliver
F
Eyre-Walker
A
Harr
B
Keightley
PD
Evidence for pervasive adaptive protein evolution in wild mice
PLoS Genet.
 , 
2010
, vol. 
6
 pg. 
e1000825
 
Hao
B
Oehlmann
S
Sowa
ME
Harper
JW
Pavletich
NP
Structure of a Fbw7-Skp1-cyclin E complex: multisite-phosphorylated substrate recognition by SCF ubiquitin ligases
Mol Cell.
 , 
2007
, vol. 
26
 (pg. 
131
-
143
)
Hernandez-Hernandez
T
Martinez-Castilla
LP
Alvarez-Buylla
ER
Functional diversification of B MADS-box homeotic regulators of flower development: adaptive evolution in protein-protein interaction domains after major gene duplication events
Mol Biol Evol.
 , 
2007
, vol. 
24
 (pg. 
465
-
481
)
Jaroszewski
L
Rychlewski
L
Godzik
A
Improving the quality of twilight-zone alignments
Protein Sci.
 , 
2000
, vol. 
9
 (pg. 
1487
-
1496
)
Jaroszewski
L
Rychlewski
L
Li
Z
Li
W
Godzik
A
FFAS03: a server for profile–profile sequence alignments
Nucleic Acids Res.
 , 
2005
, vol. 
33
 (pg. 
W284
-
W288
)
Jiao
Y
, et al.  . 
Ancestral polyploidy in seed plants and angiosperms
Nature
 , 
2011
, vol. 
473
 (pg. 
97
-
100
)
Jin
J
Cardozo
T
Lovering
RC
Elledge
SJ
Pagano
M
Harper
JW
Systematic analysis and nomenclature of mammalian F-box proteins
Genes Dev.
 , 
2004
, vol. 
18
 (pg. 
2573
-
2580
)
Jordan
GE
Piel
WH
PhyloWidget: web-based visualizations for the tree of life
Bioinformatics
 , 
2008
, vol. 
24
 (pg. 
1641
-
1642
)
Kabsch
W
Sander
C
Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features
Biopolymers
 , 
1983
, vol. 
22
 (pg. 
2577
-
2637
)
Kajander
T
Cortajarena
AL
Mochrie
S
Regan
L
Structure and stability of designed TPR protein superhelices: unusual crystal packing and implications for natural TPR proteins
Acta Crystallogr D Biol Crystallogr.
 , 
2007
, vol. 
63
 (pg. 
800
-
811
)
Kapralov
MV
Kubien
DS
Andersson
I
Filatov
DA
Changes in Rubisco kinetics during the evolution of C4 photosynthesis in Flaveria (Asteraceae) are associated with positive selection on genes encoding the enzyme
Mol Biol Evol.
 , 
2011
, vol. 
28
 (pg. 
1491
-
1503
)
Katoh
K
Misawa
K
Kuma
K
Miyata
T
MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform
Nucleic Acids Res.
 , 
2002
, vol. 
30
 (pg. 
3059
-
3066
)
Kelleher
ES
Swanson
WJ
Markow
TA
Gene duplication and adaptive evolution of digestive proteases in Drosophila arizonae female reproductive tracts
PloS Genet.
 , 
2007
, vol. 
3
 pg. 
e148
 
Kim
DE
Chivian
D
Malmstrom
L
Baker
D
Automated prediction of domain boundaries in CASP6 targets using Ginzu and RosettaDOM
Proteins
 , 
2005
, vol. 
7
 
61 Suppl
(pg. 
193
-
200
)
Kinsella
RJ
Fitzpatrick
DA
Creevey
CJ
McInerney
JO
Fatty acid biosynthesis in Mycobacterium tuberculosis: lateral gene transfer, adaptive evolution, and gene duplication
Proc Natl Acad Sci U S A.
 , 
2003
, vol. 
100
 (pg. 
10320
-
10325
)
Kipreos
ET
Pagano
M
The F-box protein family
Genome Biol.
 , 
2000
, vol. 
1
  
reviews3002.1–3002.7
Kramer
EM
Jaramillo
MA
Di Stilio
VS
Patterns of gene duplication and functional evolution during the diversification of the AGAMOUS subfamily of MADS box genes in angiosperms
Genetics
 , 
2004
, vol. 
166
 (pg. 
1011
-
1023
)
Kuhlman
B
Baker
D
Native protein sequences are close to optimal for their structures
Proc Natl Acad Sci U S A.
 , 
2000
, vol. 
97
 (pg. 
10383
-
10388
)
Kuhlman
B
, et al.  . 
Design of a novel globular protein fold with atomic-level accuracy
Science
 , 
2003
, vol. 
302
 (pg. 
1364
-
1368
)
Li
W
, et al.  . 
Phylogenetic analysis, structural evolution and functional divergence of the 12-oxo-phytodienoate acid reductase gene family in plants
BMC Evol Biol.
 , 
2009
, vol. 
9
 pg. 
90
 
Lise
S
Jones
DT
Sequence patterns associated with disordered regions in proteins
Proteins
 , 
2005
, vol. 
58
 (pg. 
144
-
150
)
Liu
J
Zhang
Y
Lei
X
Zhang
Z
Natural selection of protein structural and functional properties: a single nucleotide polymorphism perspective
Genome Biol.
 , 
2008
, vol. 
9
 pg. 
R69
 
Lynch
M
Conery
JS
The evolutionary demography of duplicate genes
J Struct Funct Genomics.
 , 
2003
, vol. 
3
 (pg. 
35
-
44
)
Main
ER
Xiong
Y
Cocco
MJ
D'Andrea
L
Regan
L
Design of stable alpha-helical arrays from an idealized TPR motif
Structure
 , 
2003
, vol. 
11
 (pg. 
497
-
508
)
Martinez-Castilla
LP
Alvarez-Buylla
ER
Adaptive evolution in the Arabidopsis MADS-box gene family inferred from its complete resolved phylogeny
Proc Natl Acad Sci U S A.
 , 
2003
, vol. 
100
 (pg. 
13407
-
13412
)
McGuffin
LJ
Bryson
K
Jones
DT
The PSIPRED protein structure prediction server
Bioinformatics
 , 
2000
, vol. 
16
 (pg. 
404
-
405
)
Mizushima
T
, et al.  . 
Structural basis of sugar-recognizing ubiquitin ligase
Nat Struct Mol Biol.
 , 
2004
, vol. 
11
 (pg. 
365
-
370
)
Mizushima
T
, et al.  . 
Structural basis for the selection of glycosylated substrates by SCF(Fbs1)ubiquitin ligase
Proc Natl Acad Sci U S A.
 , 
2007
, vol. 
104
 (pg. 
5777
-
5781
)
Mondragon-Palomino
M
Gaut
BS
Gene conversion and the evolution of three leucine-rich repeat gene families in Arabidopsis thaliana
Mol Biol Evol.
 , 
2005
, vol. 
22
 (pg. 
2444
-
2456
)
Mondragon-Palomino
M
Meyers
BC
Michelmore
RW
Gaut
BS
Patterns of positive selection in the complete NBS-LRR gene family of Arabidopsis thaliana
Genome Res.
 , 
2002
, vol. 
12
 (pg. 
1305
-
1315
)
Moury
B
Simon
V
dN/dS-based methods detect positive selection linked to trade-offs between different fitness traits in the coat protein of Potato virus Y
Mol Biol Evol.
 , 
2011
, vol. 
28
 (pg. 
2707
-
2717
)
Murzin
AG
Brenner
SE
Hubbard
T
Chothia
C
SCOP: a structural classification of proteins database for the investigation of sequences and structures
J Mol Biol.
 , 
1995
, vol. 
247
 (pg. 
536
-
540
)
Nam
J
, et al.  . 
Type I MADS-box genes have experienced faster birth-and-death evolution than type II MADS-box genes in angiosperms
Proc Natl Acad Sci U S A.
 , 
2004
, vol. 
101
 (pg. 
1910
-
1915
)
Palme
AE
Pyhajarvi
T
Wachowiak
W
Savolainen
O
Selection on nuclear genes in a Pinus phylogeny
Mol Biol Evol.
 , 
2009
, vol. 
26
 (pg. 
893
-
905
)
Petersen
L
Bollback
JP
Dimmic
M
Hubisz
M
Nielsen
R
Genes under positive selection in Escherichia coli
Genome Res.
 , 
2007
, vol. 
17
 (pg. 
1336
-
1343
)
Petroski
MD
Deshaies
RJ
Function and regulation of cullin-RING ubiquitin ligases
Nat Rev Mol Cell Biol.
 , 
2005
, vol. 
6
 (pg. 
9
-
20
)
Purugganan
MD
The MADS-box floral homeotic gene lineages predate the origin of seed plants: phylogenetic and molecular clock estimates
J Mol Evol.
 , 
1997
, vol. 
45
 (pg. 
392
-
396
)
Purugganan
MD
Rounsley
SD
Schmidt
RJ
Yanofsky
MF
Molecular evolution of flower development: diversification of the plant MADS-box regulatory gene family
Genetics
 , 
1995
, vol. 
140
 (pg. 
345
-
356
)
Ridout
KE
Dixon
CJ
Filatov
DA
Positive selection differs between protein secondary structure elements in Drosophila
Genome Biol Evol.
 , 
2010
, vol. 
2
 (pg. 
166
-
179
)
Rohl
CA
Protein structure estimation from minimal restraints using Rosetta
Methods Enzymol.
 , 
2005
, vol. 
394
 (pg. 
244
-
260
)
Rohl
CA
Strauss
CE
Chivian
D
Baker
D
Modeling structurally variable regions in homologous proteins with rosetta
Proteins
 , 
2004
, vol. 
55
 (pg. 
656
-
677
)
Rosinski
JA
Atchley
WR
Molecular evolution of helix-turn-helix proteins
J Mol Evol.
 , 
1999
, vol. 
49
 (pg. 
301
-
309
)
Roth
C
Liberles
DA
A systematic search for positive selection in higher plants (Embryophytes)
BMC Plant Biol.
 , 
2006
, vol. 
6
 pg. 
12
 
Rychlewski
L
Jaroszewski
L
Li
W
Godzik
A
Comparison of sequence profiles. Strategies for structural predictions using sequence information
Protein Sci.
 , 
2000
, vol. 
9
 (pg. 
232
-
241
)
Sikorski
RS
Boguski
MS
Goebl
M
Hieter
P
A repeating amino acid motif in CDC23 defines a family of proteins and a new relationship among genes required for mitosis and RNA synthesis
Cell
 , 
1990
, vol. 
60
 (pg. 
307
-
317
)
Smith
NG
Eyre-Walker
A
Adaptive protein evolution in Drosophila
Nature
 , 
2002
, vol. 
415
 (pg. 
1022
-
1024
)
Sterck
L
Rombauts
S
Vandepoele
K
Rouze
P
Van de Peer
Y
How many genes are there in plants (…and why are they there)?
Curr Opin Plant Biol.
 , 
2007
, vol. 
10
 (pg. 
199
-
203
)
Strasburg
JL
, et al.  . 
Effective population size is positively correlated with levels of adaptive divergence among annual sunflowers
Mol Biol Evol.
 , 
2011
, vol. 
28
 (pg. 
1569
-
1580
)
Swaffield
JC
Purugganan
MD
The evolution of the conserved ATPase domain (CAD): reconstructing the history of an ancient protein module
J Mol Evol.
 , 
1997
, vol. 
45
 (pg. 
549
-
563
)
Wagstaff
BJ
Begun
DJ
Adaptive evolution of recently duplicated accessory gland protein genes in desert Drosophila
Genetics
 , 
2007
, vol. 
177
 (pg. 
1023
-
1030
)
Ward
JJ
McGuffin
LJ
Bryson
K
Buxton
BF
Jones
DT
The DISOPRED server for the prediction of protein disorder
Bioinformatics
 , 
2004
, vol. 
20
 (pg. 
2138
-
2139
)
Waterhouse
AM
Procter
JB
Martin
DM
Clamp
M
Barton
GJ
Jalview version 2—a multiple sequence alignment editor and analysis workbench
Bioinformatics
 , 
2009
, vol. 
25
 (pg. 
1189
-
1191
)
Weston
AD
Baliga
NS
Bonneau
R
Hood
L
Systems approaches applied to the sudy of Saccharomyces cerevisiae and Halobacterium sp
Cold Spring Harb Symp Quant Biol.
 , 
2003
, vol. 
68
 (pg. 
345
-
358
)
Wilson
CG
Kajander
T
Regan
L
The crystal structure of NlpI. A prokaryotic tetratricopeptide repeat protein with a globular fold
FEBS J.
 , 
2005
, vol. 
272
 (pg. 
166
-
179
)
Wright
PE
Dyson
HJ
Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm
J Mol Biol.
 , 
1999
, vol. 
293
 (pg. 
321
-
331
)
Wu
G
, et al.  . 
Structure of a beta- TrCP1-Skp1-beta-catenin complex: destruction motif binding and lysine specificity of the SCF(beta-TrCP1) ubiquitin ligase
Mol Cell.
 , 
2003
, vol. 
11
 (pg. 
1445
-
1456
)
Xiao
W
, et al.  . 
Mapping of genome-wide resistance gene analogs (RGAs) in maize (Zea mays L.)
Theor Appl Genet.
 , 
2007
, vol. 
115
 (pg. 
501
-
508
)
Xu
G
Ma
H
Nei
M
Kong
H
Evolution of F-box genes in plants: different modes of sequence divergence and their relationships with functional diversification
Proc Natl Acad Sci U S A.
 , 
2009
, vol. 
106
 (pg. 
835
-
840
)
Yang
J
, et al.  . 
Molecular basis for TPR domain-mediated regulation of protein phosphatase 5
EMBO J.
 , 
2005
, vol. 
24
 (pg. 
1
-
10
)
Yang
Z
Computational molecular evolution
 , 
2006
Oxford
Oxford University Press
Yang
Z
PAML 4: phylogenetic analysis by maximum likelihood
Mol Biol Evol.
 , 
2007
, vol. 
24
 (pg. 
1586
-
1591
)
Yang
Z
Nielsen
R
Goldman
N
Pedersen
AM
Codon-substitution models for heterogeneous selection pressure at amino acid sites
Genetics
 , 
2000
, vol. 
155
 (pg. 
431
-
449
)
Yu
J
, et al.  . 
A draft sequence of the rice genome (Oryza sativa L. ssp. indica)
Science
 , 
2002
, vol. 
296
 (pg. 
79
-
92
)
Yu
J
, et al.  . 
The Genomes of Oryza sativa: a history of duplications
PLoS Biol.
 , 
2005
, vol. 
3
 pg. 
e38
 
Zhang
L
Pond
SK
Gaut
BS
A survey of the molecular evolutionary dynamics of twenty-five multigene families from four grass taxa
J Mol Evol.
 , 
2001
, vol. 
52
 (pg. 
144
-
156
)
Zhang
W
Koepp
DM
Fbw7 isoform interaction contributes to cyclin E proteolysis
Mol Cancer Res.
 , 
2006
, vol. 
4
 (pg. 
935
-
943
)
Zhao
JP
Su
XH
Patterns of molecular evolution and predicted function in thaumatin-like proteins of Populus trichocarpa
Planta
 , 
2010
, vol. 
232
 (pg. 
949
-
962
)

Author notes

Associate editor: George Zhang
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.