Deborah A. Johnson, Michael A. Thomas, The Monosaccharide Transporter Gene Family in Arabidopsis and Rice: A History of Duplications, Adaptive Evolution, and Functional Divergence, Molecular Biology and Evolution, Volume 24, Issue 11, November 2007, Pages 2412–2423, https://doi.org/10.1093/molbev/msm184
Abstract
Current hypotheses of gene duplicate divergence propose that surviving members of a gene duplicate pair may evolve, under conditions of purifying or nearly neutral selection, in one of two ways: with new function arising in one duplicate while the other retains original function (neofunctionalization [NF]) or partitioning of the original function between the 2 paralogs (subfunctionalization [SF]). More recent studies propose that SF followed by NF (subneofunctionalization [SNF]) explains the divergence of many duplicate genes. In this analysis, we evaluate these hypotheses in the context of the large monosaccharide transporter (MST) gene families in Arabidopsis and rice. MSTs have an ancient origin, predating plants, and have evolved in the seed plant lineage to comprise 7 subfamilies. In Arabidopsis, 53 putative MST genes have been identified, with one subfamily greatly expanded by tandem gene duplications. We searched the rice genome for members of the MST gene family and compared them with the MST gene family in Arabidopsis to determine subfamily expansion patterns and estimate gene duplicate divergence times. We tested hypotheses of gene duplicate divergence in 24 paralog pairs by comparing protein sequence divergence rates, estimating positive selection on codon sites, and analyzing tissue expression patterns. Results reveal the MST gene family to be significantly larger (65) in rice with 2 subfamilies greatly expanded by tandem duplications. Gene duplicate divergence time estimates indicate that early diversification of most subfamilies occurred in the Proterozoic (2500–540 Myr) and that expansion of large subfamilies continued through the Cenozoic (65–0 Myr). Two-thirds of paralog pairs show statistically symmetric rates of sequence evolution, most consistent with the SF model, with half of those showing evidence for positive selection in one or both genes. 
Among 8 paralog pairs showing asymmetric divergence rates, most consistent with the NF model, nearly half show evidence of positive selection. Positive selection does not appear in any duplicate pairs younger than ∼34 Myr. Our data suggest that the NF, SF, and SNF models describe different outcomes along a continuum of divergence resulting from initial conditions of relaxed constraint after duplication.
Introduction
The ability to transport simple sugars across membranes through sugar transporter proteins arose very early in the evolution of life. Monosaccharide transporters (MSTs) are integral membrane proteins found in all 3 domains of life, whose functional importance cannot be overstated. In the seed plants, this ancient group of proteins is comprised of 7 subfamilies with evidence for lineage-specific subfamily expansions in the gymnosperm, monocot, and dicot lineages (Johnson et al. 2006). One subfamily of plant MSTs, the plastidic glucose translocators (pGlcT subfamily), is known to have greater similarity to some mammalian glucose transporters than to other plant MSTs (Weber et al. 2000). Three hexose transporters from the unicellular green alga Chlorella kessleri (CkHUP1–3) (Tanner 2000) form an outgroup at the base of the STP subfamily in higher plants (Williams et al. 2000). The STP subfamily in plants is the best studied, with all STPs characterized to date shown to be sugar/proton symporters located in the plasma membrane and expressed mostly in nonphotosynthetic “sink” tissues (Buttner and Sauer 2000). Some MSTs exhibit broad-spectrum transport function, transporting many different sugars at low affinity and rate, whereas others are more specific, transporting one or a few sugars at higher rates. In Arabidopsis thaliana, 53 MST genes form 2 large and 5 smaller subfamilies (Johnson et al. 2006).
The increasingly rapid pace of genome sequencing across the tree of life has revealed that most genes are members of gene families and that gene families evolve through repeated large- and small-scale gene duplication events. Large-scale duplication events include whole-genome duplications (WGDs) and segmental duplications (Van de Peer and Meyer 2005). In Arabidopsis, a model plant species long thought to be a classic diploid organism because of its small genome size and low chromosome number, there is evidence for multiple WGDs (Vision et al. 2000; Simillion et al. 2002; Blanc et al. 2003; Bowers et al. 2003). In rice (Oryza sativa L.), analyses of the recently completed genome sequence provide evidence for a WGD dating to 53–94 Myr (Guyot and Keller 2004; Paterson et al. 2004; Yu et al. 2005). Small-scale gene duplication events are also prevalent in many lineages and play an important role in gene family expansions (Taylor and Raes 2005). Tandem gene duplications may result in arrays of functionally related genes such as olfactory receptor genes in vertebrates (Niimura and Nei 2005) and secondary metabolism genes in plants (Ober 2005). In Arabidopsis and rice, 16.2% and 16.5% of genes, respectively, are tandem duplicates. In rice, tandem and other small-scale duplications together are estimated to contribute duplicates on a scale comparable to large segmental duplications (Yu et al. 2005).
Current models of gene duplicate divergence include the neofunctionalization (NF) model, in which one duplicate remains conserved, whereas the redundant copy, freed from selective constraint, diverges under relaxed purifying selection (Ohno 1970). In this model, most duplicates are lost due to the accumulation of degenerative mutations and new function arises only when rare beneficial mutations occur through neutral processes (Ohta 1988; Walsh 1995). An alternative model is subfunctionalization (SF), where both duplicates diverge through complementary degenerative mutations, resulting in a partitioning of multiple functions and/or expression of the progenitor gene (Force et al. 1999; Lynch and Force 2000). In this model, both copies of the gene experience an initial period of relaxed constraint and accelerated evolution after duplication but evolve under purifying selection. However, recent studies suggest that NF often accompanies SF. An analysis of genome-wide protein–protein interaction data from yeast reveals that SF is accompanied by prolonged NF in a large fraction of duplicates (He and Zhang 2005), and a simulation study has shown that a model allowing SF results in higher rates of NF than a model where SF is not permitted (Rastogi and Liberles 2005). A new model termed subneofunctionalization (SNF) has been proposed to describe these observations (He and Zhang 2005).
A number of studies have used asymmetric divergence, where one duplicate is evolving faster than the other, as an indication of NF (Van de Peer et al. 2001; Kondrashov et al. 2002; Kellis et al. 2004). However, asymmetric divergence may also occur in SF, depending upon the functional constraint acting upon each duplicate. Although most gene duplicate divergence models do not exclude positive selection in the evolution of new function, they propose that new function evolves primarily, especially in the early stages of divergence, through changes occurring under relaxed purifying selection. The role and importance of positive selection in the evolution of new function remains to be established (Zhang 2003).
The models of gene duplicate divergence described above are summarized in figure 1, modified from models proposed by He and Zhang (2005). He and Zhang propose a more nuanced view of NF, where one duplicate always retains the original function, while the other duplicate may diverge in 3 different ways (Models I, II, and III). In NF Model I, both duplicates retain all original function, whereas one duplicate also gains new function. In NF Model II, some original function is retained by the duplicate, which also gains new function. In NF Model III, the Ohno model, one duplicate retains all original function, whereas the other gains new function while losing all of its original function. In figure 1, we show a more detailed view of SF, which may be symmetric or asymmetric. In symmetric SF, each duplicate assumes an equal proportion of both highly conserved and less conserved functions. In asymmetric SF, one duplicate retains highly conserved functions, whereas the other assumes less conserved functions and is, thus, under more relaxed constraint. In either SF case, new function may arise later in one or both genes (SNF).

Fig. 1. Diagram depicting current models of gene duplicate divergence, modified from models described by He and Zhang (2005). Open circles represent genes, and squares represent modular gene functions or expression sites. Black-filled or black-striped squares represent highly conserved gene functions, and gray-filled or gray-striped squares represent gene functions that are less conserved. Stippled or grid-filled squares represent newly acquired function. The duplicate divergence path on the left shows possible outcomes of SF and later SNF and the path on the right possible outcomes of NF.
In this study, we compare the large MST gene families of Arabidopsis, a dicot flowering plant, and rice, a monocot. The monocots, a monophyletic clade within the greater angiosperm clade (Chase et al. 1995; Soltis et al. 2000), have recently been proposed to have diverged from their dicot ancestor 140–150 Myr ago (Chaw et al. 2004). These authors used 61 protein-coding genes from 12 fully sequenced chloroplast genomes, 3 reliable split events as calibration points, and an unequal-rate method to estimate the divergence date. Earlier studies using fewer genes from mitochondrial, chloroplast, and/or nuclear genomes placed the divergence between 140 and 238 Myr (Wolfe et al. 1989; Laroche et al. 1995; Yang et al. 1999; Sanderson and Doyle 2001; Chaw et al. 2004). We chose the midpoint of the 140–150 Myr range (145 Myr) as a calibration point in our estimation of gene duplicate divergence dates, given the large amount of gene sequence data used, the reliable calibration points, and the use of an unequal-rate method.
The objectives of our study are to define the MST gene family in rice and compare it with the MST gene family in Arabidopsis in order to: 1) determine subfamily structure and mechanisms of subfamily expansion in each lineage, 2) estimate divergence times of gene duplication events, and 3) test hypotheses of gene duplicate divergence, described above. Specifically, in 24 duplicate pairs of varying age, we estimate sequence divergence rates, looking for the presence or absence of rate symmetry, evaluate tissue expression profiles for overlapping expression, and estimate nonsynonymous and synonymous substitution rate ratios to detect the presence of positive selection as an indicator of adaptive evolution and likely NF.
Materials and Methods
Description of the MST Gene Family in Rice
A search for all MST genes in the rice genome was performed using the Blast search tool on the Gramene Web site (Ware et al. 2002) and previously constructed profile hidden Markov model (HMM) consensus protein sequences (Johnson et al. 2006) for each MST subfamily. Nucleotide coding and protein sequences for each rice locus with a significant match to an MST subfamily consensus sequence were downloaded. Each putative MST protein sequence was then used as a query sequence in a BlastP (Altschul et al. 1997) search against the Arabidopsis refseq protein database to find its best-match sequence. Each putative rice MST sequence was then analyzed in comparison with its best-match Arabidopsis ortholog. Rice sequences that appeared to be partial or to include non-MST sequence were removed from the analysis. Tandem duplications among the rice MSTs were identified using The Institute for Genomic Research (TIGR) Rice Genome Annotation Project's mapping tool and the TIGR locus tags for each putative MST locus. Segmental MST duplications were identified by searching for the putative MSTs in the list of collinear gene pairs found within each duplicated region identified by the TIGR Rice Genome Annotation Project (distance = 500 kb). Tandem and segmental MST gene duplications in Arabidopsis were identified in a previous study (Johnson et al. 2006).
Phylogenetic Analysis and Estimation of Gene Duplicate Divergence Times
To find outgroup sequences for the MST subfamilies, we searched the GenBank protein database for homologous MSTs in nonembryophyte (i.e., nonland plant) lineages using BlastP (Altschul et al. 1997) and subfamily profile HMM consensus sequences (Johnson et al. 2006). MST protein sequences having highly significant matches were further analyzed as potential outgroup sequences with a heuristic search for the maximum likelihood (ML) tree, using the JTT amino acid substitution matrix (Jones et al. 1992), discrete gamma model with 4 categories, and estimated alpha parameter with PHYML (Guindon and Gascuel 2003), including all rice and Arabidopsis MST sequences. Those sequences grouping at the base of a subfamily were chosen as outgroup sequences. A final combined phylogenetic analysis was then performed using the method described above, with 100 bootstrap replicates, after all outgroup, Arabidopsis and rice sequences were aligned with ClustalW (Thompson et al. 1994) using default settings. Of note is that we found no putative MST sequences within the land plants that would form an outgroup for any of the MST subfamilies.
Dates of gene duplicate divergence were estimated on protein sequences using ML with a local clock model and the JTT amino acid substitution matrix as implemented in PAML (Yang and Yoder 2003). Sequences were aligned using ClustalW with default settings. Analyses were run separately on 3 clades of MST protein sequences: the STP–XyloseTP homolog subfamilies (with the 3 green algal hexose transporters, the HXT7 gene from Saccharomyces cerevisiae, and D–xylose transporter from Lactobacillus brevis as outgroup sequences), the ERD-pGlcT subfamilies (with the Danio rerio and Homo sapiens outgroup sequences), and the INT-PLT-AZT subfamilies (with the ITR1 outgroup sequence from S. cerevisiae). A local clock model was chosen after likelihood ratio tests (LRTs) rejected a global clock model for each clade. Separate branch-rate groups were assigned for the well-defined bacterial, yeast, animal, green algal, dicot (Arabidopsis), and monocot (rice) clades, as applicable, in each analysis (figs. 3–5). Calibration points were the prokaryote–eukaryote divergence estimated at >3,000 Myr (Feng et al. 1997; Nei et al. 2001), the plant–animal–fungus divergence estimated at 1576 (±88 Myr) (Wang et al. 1999), the chlorophyte–charophyte divergence estimated at 1070 (±60 Myr) (Yoon et al. 2004), and the divergence between monocots and basal dicots estimated at 145 (±5 Myr) (Chaw et al. 2004). Estimation of divergence dates using the ML method is most accurate at nodes closest to the calibration points (Yang and Yoder 2003). Thus, estimated dates at nodes in which the monocot–dicot divergence is not recoverable in an ancestral lineage (because of loss in either rice or Arabidopsis) are likely too old because the nearest calibration points are quite distant. Because the ML date estimation method also allows only fixed calibration dates, upper and lower bounds on these dates could not be included in the analysis. 
Thus, the variances around the parameter estimates are underestimated and the uncertainties surrounding our dates are too narrow (Yang and Yoder 2003). However, we chose the ML method of date estimation over the Bayesian method (where upper and lower bounds can be incorporated into the analysis) because we needed to hold the multiple monocot–dicot divergence calibration points constant across the trees. The ML method with a local clock model has been shown to perform comparably to the Bayesian method when applied to analyses of multiple gene loci with multiple calibration points (Yang and Yoder 2003).
Estimation of Positive Selection at Codon Sites
ML estimates of the ratio of nonsynonymous (dN) to synonymous (dS) nucleotide substitution rates (dN/dS or ω) for paralog pairs at terminal nodes (in which each sequence arises from a common duplication event) in the Arabidopsis–rice MST protein phylogeny (star symbols in figs. 3–5) were calculated using the improved branch-site method (Zhang et al. 2005). We used the branch-site test of positive selection (Test 2), comparing the null model A (model = 2, NS sites = 2, with omega fixed to 1.0) with the alternative model A (model = 2, NS sites = 2) to find codon sites under probable positive selection. The null model allows sites in the background lineages to evolve under relaxed constraint and sites in the foreground lineages to evolve neutrally, whereas the alternative model allows some sites on the foreground lineages to evolve under positive selection. Omega values on codon sites were estimated using nucleotide sequence alignments and ML trees for each subfamily. Alignment columns containing gaps in a majority of sequences were removed, and branch lengths were estimated. Test 2 is a 1-tailed test for positive selection only, with significance cutoffs of 2.71 and 5.41 at the 5% and 1% levels, respectively (Zhang et al. 2005). We identified genes with positive selection at the 5% level and identified codon sites under probable positive selection using the Bayes Empirical Bayes method as implemented in PAML (Yang et al. 2005).
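The decision step of Test 2 can be illustrated with a minimal Python sketch. It assumes the test statistic is twice the difference in log-likelihood between the alternative and null model A fits, and it uses only the cutoffs quoted above (2.71 and 5.41). The function name and the log-likelihood values are hypothetical and are not taken from the study.

```python
from typing import Tuple

# Cutoffs quoted in the text for Test 2 (Zhang et al. 2005).
CUTOFF_5_PERCENT = 2.71
CUTOFF_1_PERCENT = 5.41


def branch_site_lrt(lnl_null: float, lnl_alt: float) -> Tuple[float, str]:
    """Return the statistic 2 * (lnL_alt - lnL_null) and a significance label."""
    statistic = 2.0 * (lnl_alt - lnl_null)
    # The models are nested, so small negative values can only come from
    # optimizer noise; clamp them to zero.
    statistic = max(statistic, 0.0)
    if statistic > CUTOFF_1_PERCENT:
        label = "significant at 1%"
    elif statistic > CUTOFF_5_PERCENT:
        label = "significant at 5%"
    else:
        label = "not significant"
    return statistic, label


if __name__ == "__main__":
    # Hypothetical log-likelihoods, for illustration only.
    stat, label = branch_site_lrt(lnl_null=-5230.40, lnl_alt=-5227.10)
    print(f"2*delta(lnL) = {stat:.2f}: {label}")
```

In this sketch a gene would be flagged as showing positive selection whenever the label is significant at the 5% level or better, matching the threshold stated above.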
Testing Symmetry in Amino Acid Divergence Rates in Gene Duplicate Pairs
Amino acid divergence rates were estimated using ML and the JTT substitution matrix (PAML). Multiple sequence alignments for each subfamily, excluding outgroup sequences, generated with ClustalW (default settings), were used along with a subfamily phylogenetic tree generated from the all-subfamily PHYML analysis described above. Gaps present in a majority of the subfamily sequences were removed from the alignments using JalView (Clamp et al. 2004). Gap removal causes some underestimate of divergence but reduces uncertainty in the reconstruction of sequences at internal nodes. Branch lengths were estimated with PAML for each analysis. We tested the null hypothesis that rates of amino acid sequence divergence for MSTs in gene duplicate pairs were equal against the alternative hypothesis that they were different using a local clock model in PAML. We performed an LRT using a χ² distribution with one degree of freedom, with the null hypothesis rejected at the 5% level. Rejection of the null hypothesis was deemed to indicate asymmetric divergence among genes in paralog pairs.
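The symmetry test above reduces to a standard likelihood ratio calculation, sketched here in Python. The sketch assumes the statistic is twice the log-likelihood gain of the model with separate rates for the two duplicates over the equal-rate model. The function name and the example log-likelihoods are hypothetical. For one degree of freedom the χ² tail probability equals erfc(sqrt(x/2)), so no statistics library is needed.

```python
import math
from typing import Tuple


def symmetry_lrt(lnl_equal_rates: float, lnl_separate_rates: float,
                 alpha: float = 0.05) -> Tuple[float, float, bool]:
    """Return (statistic, p_value, asymmetric) for a 1-df likelihood ratio test."""
    # Clamp tiny negative values that can arise from optimizer noise.
    statistic = max(2.0 * (lnl_separate_rates - lnl_equal_rates), 0.0)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value, p_value < alpha


if __name__ == "__main__":
    # Hypothetical log-likelihoods for one paralog pair, for illustration only.
    stat, p, asymmetric = symmetry_lrt(-1000.0, -998.0)
    verdict = "asymmetric" if asymmetric else "symmetric"
    print(f"statistic = {stat:.2f}, p = {p:.4f}, divergence is {verdict}")
```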
Analysis of Overlapping Tissue Expression in 24 Paralog Pairs
We evaluated MST tissue expression in Arabidopsis using the AtGenExpress microarray developmental data set (Schmid et al. 2005). We evaluated 42 tissue samples from separate organs and/or developmental stages (root, stem, leaf, individual floral organ, pollen, and seeds) (see supplementary material 2, Supplementary Material online) and used the Affymetrix present and absent flags as an indicator of expression (Weigel D, personal communication). In rice, we evaluated tissue samples from the 20-bp signatures in the rice MPSS database (Nakano et al. 2006). Of the 20 libraries available at the time of analysis, we chose 16 representing diverse tissues (leaf, root, mature pollen, germinating seed/seedling, panicle, ovary, and mature stigma), different developmental stages, abiotic stresses (cold, drought, and salt), and growth conditions (light and dark) (see supplementary material 2, Supplementary Material online). Our comparison of the Arabidopsis microarray and rice MPSS data with published MST gene functional studies reveals the microarray data are likely very sensitive to low expression levels, whereas the MPSS data are likely much less sensitive (unpublished data). Overlapping (i.e., shared) tissue expression for MST genes in each of the 24 paralog pairs was determined by calculating the ratio of the number of tissues in which both duplicates were expressed to the number of tissues in which at least one duplicate was expressed. Low values were taken as an indication of (nearly) complete SF or NF (Models II and III), whereas high values were taken to indicate likely SNF or Model I NF.
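The overlap measure described above is the number of tissues in which both duplicates are expressed divided by the number in which at least one is expressed. A minimal Python sketch follows. The function name and the tissue sets are hypothetical. When neither gene is detected in any tissue the ratio is undefined, so the sketch returns None rather than assume a reporting convention for that case.

```python
from typing import Iterable, Optional, Tuple


def expression_overlap(expressed_a: Iterable[str],
                       expressed_b: Iterable[str]) -> Tuple[int, int, Optional[float]]:
    """Return (tissues with both genes, tissues with either gene, ratio)."""
    a, b = set(expressed_a), set(expressed_b)
    shared = len(a & b)   # tissues in which both duplicates are expressed
    either = len(a | b)   # tissues in which at least one duplicate is expressed
    ratio = shared / either if either else None
    return shared, either, ratio


if __name__ == "__main__":
    # Hypothetical present calls for the two genes of one paralog pair.
    gene_1 = {"root", "stem", "leaf", "pollen"}
    gene_2 = {"leaf", "pollen", "seed"}
    shared, either, ratio = expression_overlap(gene_1, gene_2)
    print(f"{shared}/{either} tissues shared ({ratio:.0%})")
```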
Results and Discussion
Comparative Analysis of the MST Gene Family in Arabidopsis and Rice
We found a total of 65 MST genes in the rice genome (supplementary material 1, Supplementary Material online) that group into the 7 known Arabidopsis subfamilies upon combined phylogenetic analysis (fig. 2). Both the STP and PLT subfamilies are greatly expanded in rice with tandem duplicates comprising 48% and 67% of these subfamilies, respectively (fig. 2 and table 1). The ERD subfamily is much larger in Arabidopsis with tandem duplicates comprising 68% of this subfamily (table 1). The AZT, INT, pGlcT, and XyloseTP homolog subfamilies are comparatively small in both lineages, although the AZT subfamily in rice (6 genes) is twice as large as in Arabidopsis. Discernible segmental duplications are seen in all subfamilies, in both lineages, except in the pGlcT subfamily (fig. 2).
Subfamily | Number of Genes (Arabidopsis) | Number of Genes (Rice) | Number (%) of Tandem Duplicates (Arabidopsis) | Number (%) of Tandem Duplicates (Rice) |
STP | 14 | 29 | 2 (14%) | 14 (48%) |
XyloseTP | 3 | 2 | 0 | 0 |
PLT | 6 | 15 | 2 (33%) | 10 (67%) |
INT | 4 | 3 | 0 | 0 |
AZT | 3 | 6 | 0 | 0 |
ERD | 19 | 6 | 13 (68%) | 4 (67%) |
pGlcT | 4 | 4 | 0 | 0 |
Total | 53 | 65 | 17 (32%) | 28 (43%) |

Fig. 2. ML tree of all Arabidopsis and rice amino acid MST sequences created with PHYML using the JTT substitution matrix, gamma distribution with 4 categories, estimated alpha parameter and 100 bootstrap replicates. Outgroup sequences from bacteria (Lactobacillus brevis), yeast (Saccharomyces cerevisiae), green algae (Chlorella kessleri), zebra fish (Danio rerio), and humans (Homo sapiens) were included. Black squares indicate the divergence of monocots from basal dicots at ∼140–150 Myr. Red highlights on leaf text indicate tandem duplicates, and brackets indicate segmental duplicates.
Of interest is that large subfamily expansions in both lineages seem to be the result of one or more tandem arrays arising from a single ancestral gene lineage, sometimes in combination with one or more segmental duplications. For example, in the STP subfamily in rice, a single ancestral ortholog to the AtSTP5 gene radiated into 10 descendant genes as a result of multiple tandem duplications and one segmental duplication involving 2 tandem duplicates (figs. 2 and 4). In the ERD subfamily in Arabidopsis, 14 genes radiated from an ancestral gene lineage that was lost in rice (thick branch lines in fig. 3). This complex set of duplications involves tandem duplications on chromosomes 1, 3, 4, and 5 as well as an apparent segmental duplication involving 2 genes on chromosomes 3 and 5. In the PLT subfamily in rice, an ancestral gene lineage lost in Arabidopsis gave rise to a group of 6 genes, 4 of which are in a tandem array on chromosome 11 (figs. 2 and 5).

Fig. 3. Calibrated phylogenetic tree with gene duplicate divergence time estimates for the ERD-pGlcT clade. Dates were estimated using ML and a local clock with 5 branch-rate groups for background (basal internal branches), fish, human, dicot, and monocot lineages. Calibration points are indicated by geometric symbols: triangle = fungal/plant/animal divergence at ∼1,576 Myr and square = basal dicot–monocot divergence at ∼145 Myr. Star symbols at external nodes indicate paralog pairs included in the branch-site nucleotide substitution analysis. Bold branches indicate a subclade with a very distant calibration point resulting in estimated dates that are likely too old. Gray highlights on leaf text indicate tandem duplicates, and brackets indicate segmental duplicates.

Fig. 4. Calibrated phylogenetic tree with gene duplicate divergence time estimates for the STP-XyloseTP homolog clade. Dates were estimated using ML and a local clock with 6 branch-rate groups for background (basal internal branches), bacterial, yeast, green algal, dicot, and monocot lineages. Calibration points are indicated by geometric symbols: pentagon = bacteria–eukaryote divergence at ∼3,000 Myr, triangle = fungal/plant/animal divergence at ∼1,576 Myr, circle = chlorophyte–charophyte divergence at ∼1,070 Myr, and square = basal dicot–monocot divergence at ∼145 Myr. Star symbols at external nodes indicate paralog pairs included in the branch-site nucleotide substitution analysis. Gray highlights on leaf text indicate tandem duplicates, and brackets indicate segmental duplicates.

Fig. 5. Calibrated phylogenetic tree with gene duplicate divergence time estimates for the PLT-INT-AZT clade. Dates were estimated using ML and a local clock with 4 branch-rate groups for background (basal internal branches), yeast, dicot, and monocot lineages. Calibration points are indicated by geometric symbols: triangle = fungal/plant/animal divergence at ∼1,576 Myr and square = basal dicot–monocot divergence at ∼145 Myr. Star symbols at external nodes indicate paralog pairs included in the branch-site nucleotide substitution analysis. Gray highlights on leaf text indicate tandem duplicates, and brackets indicate segmental duplicates.
History of MST Gene Duplications
A search of the GenBank database for homologous transporters in nonembryophyte lineages reveals a number of hexose transporters from the unicellular green alga C. kessleri (CkHUP1–3: P15686, Q39524, and Q39525) and the yeast S. cerevisiae (HXT7: YDR342C) that fall at the root of the STP subfamily in the phylogenetic analysis (fig. 2). An inositol transporter from S. cerevisiae (ScINTR1: YDR497C) groups at the base of the INT subfamily, a facilitated glucose transporter (member 8-type) from zebra fish (D. rerio: AAH49409) is most closely related to the ERD subfamily, human glucose transporters (GLUT1-4: NP006507, NP000331, NP008862, and NP001033) group at the base of the pGlcT subfamily, and a bacterial D-xylose/proton transporter (L. brevis: AAC95127) shows high similarity to the XyloseTP homolog subfamily (fig. 2). The presence of a prokaryotic outgroup sequence for the XyloseTP homolog subfamily and outgroup sequences stemming from early eukaryotes for the STP, INT, pGlcT, and ERD subfamilies indicates a very ancient origin for transporters in these subfamilies. No nonland plant genes were found as outgroup sequences for the AZT and PLT subfamilies, perhaps because these subfamily transporters have diverged considerably in the land plants and, thus, sequence similarity matches to related transporters in other eukaryotes have lower scores.
Estimated amino acid substitution rates in the monocot and dicot lineages differed slightly among the 3 analyses. In the STP-XyloseTP and PLT-INT-AZT clades, the estimated dicot lineage substitution rate (0.010 and 0.011 substitutions/site/Myr, respectively) was faster than in the monocot lineage (0.009 and 0.010, respectively). In the ERD-pGlcT clade, the estimated substitution rate was faster in the monocot (0.009) than dicot (0.007) lineage. These differences may reflect real differences in substitution rates among members of one or more of the subfamilies.
We estimated the ages of 4 segmental MST duplications identified in Arabidopsis at 71.66 ± 4.02 (STP), 41.86 ± 3.90 (XyloseTP), 39.60 ± 3.25 (ERD), and 57.34 ± 3.39 (INT) Myr (table 2). The STP segmental duplicates belong to a paralogous block identified as belonging to an “old” large-scale duplication event by the Wolfe laboratory (http://wolfe.gen.tcd.ie/athal/dup). The old blocks in Arabidopsis are hard to date but have been estimated to have occurred after the monocot–dicot split (Blanc et al. 2003). The other MST segmental duplications (XyloseTP, ERD, and INT) all belong to blocks from a “recent” large-scale duplication event estimated by various groups to be >100 Myr (Vision et al. 2000), 75 ± 22 Myr (Simillion et al. 2002), 65 Myr (Lynch and Conery 2000), and 25.0–26.7 Myr (Blanc and Wolfe 2004). Our dates vary between ∼57 and ∼39 Myr, falling in between the estimates by Lynch and Conery (2000) and Blanc and Wolfe (2004). Of note is that 2 of the recent segmental MST paralog pairs (XyloseTP and INT) show the presence of a large number of codon sites under positive selection (table 2) and substitution rates among all duplicates in these recent blocks vary from 0.04 to 0.23 substitutions/site since duplication (table 2). We also estimated a rice segmental duplication at 84.89 ± 5.62 Myr, which is consistent with a previously estimated age of an ancient WGD in rice of 53–94 Myr (Yu et al. 2005). Age estimates of tandem duplications at terminal nodes range from 11.48 ± 1.13 to 131.39 ± 8.60 Myr in Arabidopsis and 0 ± 0.04 to 64.81 ± 3.79 Myr in rice. Of note is that our estimated dates, based on protein sequences, may appear older than they actually are if protein sequence divergence occurred among alleles of progenitor genes prior to duplication and each divergent allele was subsequently fixed at different daughter loci (Proulx and Phillips 2006).
Paralog pair | Positive selection? | Number of codon sites^a | Asymmetric divergence? | Substitutions/site since duplication^b | Divergence model? | Tissue expression overlap^c | Likely age of duplication (Myr) | Duplicate type |
At4g21480-STP12 | Yes | 5/1* | No | 0.194024* | SNF | 88% (37/42) | 71.66 ± 4.02 | Segmental |
At1g11260-STP1 | Yes | 8 | | 0.130419* | | | | |
Os01g38670 | No | | Yes | 0.028451 | NF | 0% (0/1) | 16.39 ± 1.64 | Tandem |
Os01g38680 | No | | | 0.115294 | | | | |
Os02g36414 | No | | No | 0.000004 | SF | ? | 0 ± 0.04 | Tandem |
Os02g36420 | No | | | 0.000004 | | | | |
Os02g36440 | No | | No | 0.022562* | SF | 0% (0/1) | 9.09 ± 1.06 | Tandem |
Os02g36450 | No | | | 0.054165* | | | | |
Os04g37990 | No | | No | 0.223197 | SF | 0% (0/11) | 55.42 ± 3.60 | Tandem |
Os04g38010 | No | | | 0.292604 | | | | |
Os04g37970 | Yes | 16 | No | 0.276829 | SNF | 43% (3/7) | 50.24 ± 3.19 | Tandem |
Os04g37980 | Yes | 10/1* | | 0.208236 | | | | |
Os07g03910 | No | | No | 0.010490* | SF | 69% (9/13) | 1.61 ± 0.47 | Tandem |
Os07g03960 | No | | | 0.000004* | | | | |
Os10g41190 | No | | Yes | 0.240653 | NF | 13% (2/15) | 84.89 ± 5.62 | Segmental |
Os03g01170 | No | | | 0.137437 | | | | |
At1g08890 | Yes | 1 | No | 0.099349 | SNF | ? | 36.45 ± 4.06 | Tandem |
At1g08900 | No | | | 0.131187 | | | | |
At1g08920 | No | 5 | No | 0.246047 | SNF | 88% (36/41) | 131.39 ± 8.60 | Tandem |
At1g08930-ERD6 | Yes | | | 0.297608 | | | | |
At1g19450 | No | | No | 0.036980 | SF | 100% (42/42) | 39.60 ± 3.25 | Segmental |
At1g75220 | No | | | 0.044701 | | | | |
At3g05155 | No | 12/2* | No | 0.205523 | SNF | ? | 55.68 ± 4.98 | Tandem |
At3g05400 | Yes | | | 0.163367 | | | | |
At3g05160 | No | | No | 0.045116 | SF | ? | 20.39 ± 2.10 | Tandem |
At3g05165 | No | | | 0.057165 | | | | |
At4g04750 | Yes | 464/4*/1** | No | 0.088799* | SNF (NFP?) | 0% (0/30) | 52.29 ± 4.14 | Tandem |
At4g04760 | Yes | 44/8*/9** | | 0.216593* | | | | |
At5g27350-SFP1 | No | | No | 0.094808 | SF | 98% (40/41) | 33.19 ± 2.89 | Tandem |
At5g27360-SFP2 | No | | | 0.092829 | | | | |
Os03g24860 | No | | Yes | 0.234977 | NF | 23% (4/13) | 64.81 ± 3.79 | Tandem |
Os03g24870 | No | | | 0.141323 | | | | |
Os05g49260 | Yes | 5 | No | 0.102532* | SNF | 60% (9/15) | 33.52 ± 2.68 | Tandem |
Os05g49270 | No | | | 0.037313* | | | | |
At2g16120-PLT1 | No | | No | 0.027582 | SF | ? | 11.48 ± 1.13 | Tandem |
At2g16130-PLT2 | No | | | 0.042800 | | | | |
Os03g10090 | Yes | 8 | Yes | 0.298965 | NFP | 17% (2/12) | 50.09 ± 3.86 | Tandem |
Os03g10100 | No | | | 0.145176 | | | | |
Os04g58220 | Yes | 5/1* | Yes | 0.080027 | NFP | 0% (0/2) | 38.52 ± 3.08 | Tandem |
Os04g58230 | No | | | 0.204443 | | | | |
Os07g39350 | No | | Yes | 0.102694 | NF | 13% (2/15) | 38.11 ± 2.77 | Tandem |
Os07g39360 | No | | | 0.252438 | | | | |
Os11g41830 | No | | Yes | 0.140604 | NF | 0% (0/0) | 16.76 ± 1.39 | Tandem |
Os11g41850 | No | | | 0.062094 | | | | |
At2g35740-INT3 | Yes | 64/5*/1** | Yes | 0.211635 | NFP | 0% (0/23) | 57.34 ± 3.39 | Segmental |
At4g16480-INT4 | Yes | 5/3*/1** | | 0.054267 | | | | |
At3g03090 | No | 37/4*/5** | No | 0.148393* | SNF | 100% (42/42) | 41.86 ± 3.90 | Segmental |
At5g17010 | Yes | | | 0.226755* | | | | |
^a Values without an asterisk indicate codon sites with posterior probability (PP) >50%; a single asterisk, PP >95%; double asterisks, PP >99%.
^b Asterisks indicate substitution rates that do not reach the 5% cutoff in an LRT but differ by more than their standard errors.
^c Numbers in parentheses indicate (number of tissues with overlapping expression/number of tissues with expression of at least one gene); “?” indicates that data for one duplicate are unavailable.
Our gene duplicate divergence time estimates reveal that protogenes of each MST subfamily type were present in organisms leading to the land plant lineage at least as far back as the middle Proterozoic. The protoplastidic glucose transporters (fig. 3) started to diversify earliest, at the boundary of the middle and late Proterozoic (∼967 ± 58 Myr), when trace fossils of simple multicelled eukaryotes appear. By the start of the Paleozoic, 3 ancestral pGlcT genes were present, with one additional duplication occurring in both rice and Arabidopsis since the divergence of these 2 lineages. The ERD subfamily also began to diversify in the late Proterozoic (∼866 ± 53 Myr), with at least one duplication occurring in the Ordovician (∼477 ± 31 Myr) as early land plants were evolving (fig. 3). Of note is a large clade of ERD transporters in Arabidopsis descended from a duplicate that has likely been lost in rice (bold branch lines in fig. 3). Because the monocot–dicot divergence calibration point cannot be reconstructed here and the nearest calibration point is very distant, the estimated dates on the nodes in this clade are likely too old. The STP subfamily (fig. 4) began to diversify ∼856 ± 41 Myr, with 5 STP genes likely present by the start of the Paleozoic. Two additional duplications occurred in the Ordovician, and by the time of the monocot–dicot divergence, as flowering plants began to proliferate, at least 9 ancestral STP genes were likely present. The PLT and INT subfamilies also began their diversification in the late Proterozoic, at ∼773 ± 61 and ∼689 ± 54 Myr, respectively, with the AZT subfamily first diversifying most recently, at ∼311 ± 14 Myr (fig. 5). All 3 of these subfamilies experienced additional duplications in the late Paleozoic (Permian) before radiating more fully in the late Mesozoic (Cretaceous). 
Of note is that the AZT subfamily experienced multiple gene duplication events at ∼145 Myr, resulting in very short branch lengths and poor resolution at these nodes on the phylogenetic tree (fig. 2). The XyloseTP homologs in land plants, in spite of their early origin in the prokaryotes, first duplicated relatively late at ∼478 ± 43 Myr and have remained a small subfamily of 2 genes in rice, expanding to 3 in Arabidopsis (fig. 4).
Testing Hypotheses of Gene Duplicate Divergence
To test current NF, SF, and SNF hypotheses (fig. 1), we evaluated 24 paralog pairs (5 resulting from segmental and 19 from tandem duplications; star symbols in figs. 3–5). For each pair, we tested for the presence or absence of significant positive selection on codon sites (an indication of NF), compared amino acid substitution rates within the pair to find evidence of significant asymmetric divergence (table 2), and analyzed high-throughput gene expression profiles for overlapping tissue expression, an indicator of potential SF. Because the hallmark of the NF model is conservation of original function in one of the duplicates, we regarded paralog pairs showing highly significant asymmetric divergence (i.e., one duplicate is highly conserved relative to the other) to be most consistent with the NF model. We further binned pairs with both highly significant asymmetric divergence and the presence of positive selection into the category “neofunctionalization with positive selection” (NFP). We regarded paralog pairs whose divergence is not significantly asymmetric and that show no positive selection to be most consistent with the pure SF model, and those with symmetric divergence and positive selection in at least one member of the pair to be consistent with the SNF model. Low overlapping expression values were interpreted as consistent with SF or (nearly) complete NF (Models II and III) and high values as consistent with SNF or Model I NF.
We recognize that, because SF can be asymmetric and NF may not be strongly asymmetric, symmetry of divergence cannot completely discriminate between NF and SF. However, given that both the “adaptive conflict” model of SF (Hughes 1994) and the complementary degenerative mutation model of SF (Force et al. 1999) suggest that strong asymmetries in protein sequence divergence are not expected and that most models of NF invoke strong asymmetry of divergence rates, we utilize divergence rates to bin gene duplicate pairs into either the NF or SF model of duplicate divergence. We also recognize that new function may arise through mutations occurring under relaxed purifying selection (Dykhuizen and Hartl 1980; Kimura 1983). However, new function gained through nearly neutral mutation is likely to undergo further positive selection to refine or enhance this function (Hughes 1994; Zhang et al. 1998). Thus, we use significant asymmetry of divergence as an indicator of the NF model of divergence and the presence of positive selection on codon sites as an indicator of adaptive evolution and NF, both in the NF model and the SNF model.
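The binning rule described above can be written as a small decision function. The following is only an illustrative sketch in Python, not the authors' code: the class and function names are hypothetical, and the two inputs are assumed to be the yes/no calls reported in table 2 (the LRT result for asymmetric divergence and the per-gene positive selection calls).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParalogPair:
    """Hypothetical container for the yes/no calls reported in table 2."""
    name: str
    asymmetric: bool           # "Asymmetric divergence?" column
    positive_selection: tuple  # ("Positive selection?" for gene 1, gene 2)


def divergence_model(pair: ParalogPair) -> str:
    """Bin a pair as NF, NFP, SF, or SNF following the rule in the text."""
    selected = any(pair.positive_selection)
    if pair.asymmetric:
        # Significant asymmetry: NF, or NFP when positive selection is present.
        return "NFP" if selected else "NF"
    # Rates not significantly different: SF, or SNF with positive selection.
    return "SNF" if selected else "SF"


# Four pairs transcribed from table 2, one per category.
examples = [
    ParalogPair("At2g35740-INT3 / At4g16480-INT4", True, (True, True)),
    ParalogPair("Os01g38670 / Os01g38680", True, (False, False)),
    ParalogPair("At1g19450 / At1g75220", False, (False, False)),
    ParalogPair("Os05g49260 / Os05g49270", False, (True, False)),
]

for pair in examples:
    print(f"{pair.name}: {divergence_model(pair)}")
```

The sketch makes the caveat in the text explicit: symmetry alone decides between the NF and SF families of models, and positive selection only refines the label within each family.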
Of the 24 paralog pairs, 8 (33%) showed evidence of significant asymmetric divergence, most consistent with the NF model. Of these 8, two showed evidence of positive selection in one member of the pair and one showed evidence of positive selection in both members (NFP) (table 2). In the pair showing positive selection in both members, the faster evolving gene had a large number of codon sites under probable positive selection (70). Of the 16 paralog pairs (67%) with substitution rates that were statistically indistinguishable, 8 showed no positive selection on either duplicate (SF), 5 had evidence of positive selection in one duplicate, and 3 had evidence in both (SNF). Of note, however, is that 6 duplicate pairs in which the LRT did not reject the null hypothesis did show amino acid substitution rates that differed by more than the error of their estimation (shown with asterisks in table 2). All paralog pairs consistent with the NF model range in estimated age from 16.39 ± 1.64 to 84.89 ± 5.62 Myr, with those experiencing positive selection in at least one member ranging in age from 38.52 ± 3.08 to 57.34 ± 3.39 Myr (fig. 6). All paralog pairs consistent with the SF model range in estimated age from 0 ± 0.04 to 131.39 ± 8.60 Myr, with those experiencing positive selection in at least one member ranging in age from 33.52 ± 2.68 to 131.39 ± 8.60 Myr (fig. 6).
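The counts in this paragraph can be reproduced directly from the yes/no columns of table 2. The sketch below, in Python, transcribes those columns for all 24 pairs (each pair is identified by its first gene) and tallies the four categories; the variable and function names are illustrative only.

```python
from collections import Counter

# (first gene of pair, asymmetric divergence?, positive selection in
#  gene 1?, positive selection in gene 2?), transcribed from table 2.
TABLE2 = [
    ("At4g21480-STP12", False, True, True),
    ("Os01g38670", True, False, False),
    ("Os02g36414", False, False, False),
    ("Os02g36440", False, False, False),
    ("Os04g37990", False, False, False),
    ("Os04g37970", False, True, True),
    ("Os07g03910", False, False, False),
    ("Os10g41190", True, False, False),
    ("At1g08890", False, True, False),
    ("At1g08920", False, False, True),
    ("At1g19450", False, False, False),
    ("At3g05155", False, False, True),
    ("At3g05160", False, False, False),
    ("At4g04750", False, True, True),
    ("At5g27350-SFP1", False, False, False),
    ("Os03g24860", True, False, False),
    ("Os05g49260", False, True, False),
    ("At2g16120-PLT1", False, False, False),
    ("Os03g10090", True, True, False),
    ("Os04g58220", True, True, False),
    ("Os07g39350", True, False, False),
    ("Os11g41830", True, False, False),
    ("At2g35740-INT3", True, True, True),
    ("At3g03090", False, False, True),
]


def model(asymmetric, sel1, sel2):
    """Apply the binning rule from the text to one row."""
    if asymmetric:
        return "NFP" if (sel1 or sel2) else "NF"
    return "SNF" if (sel1 or sel2) else "SF"


counts = Counter(model(a, s1, s2) for _, a, s1, s2 in TABLE2)
total = len(TABLE2)
asymmetric = counts["NF"] + counts["NFP"]
symmetric = counts["SF"] + counts["SNF"]

print(dict(counts))
print(f"asymmetric: {asymmetric}/{total} ({asymmetric / total:.0%})")
print(f"symmetric:  {symmetric}/{total} ({symmetric / total:.0%})")
```

Under this transcription the tally gives 5 NF, 3 NFP, 8 SF, and 8 SNF pairs, that is, 8 asymmetric (33%) and 16 symmetric (67%) pairs, matching the figures reported above.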

Fig. 6. Scatter plot showing the relationship between estimated time since duplication and the number of amino acid substitutions/site since duplication for paralogs with divergence patterns consistent with the NF, NFP, SF, and SFP models of gene duplicate divergence.
We evaluated overlapping tissue expression in each paralog pair for which gene expression data are available for both duplicates and compared it with symmetry of divergence as an indicator of sub- versus neofunctionalization. All 8 pairs with significantly asymmetric protein divergence (NF or NFP) have relatively low overlapping expression (0–23%), consistent with NF Models II and III (table 2). Of the 5 duplicate pairs with expression data, symmetric protein divergence, and no positive selection (SF), 2 have 0% overlapping tissue expression and 3 have high overlap (69–100%) (table 2). Because SF is hypothesized to occur at either the protein function or the tissue expression level, both the very low and very high overlapping expression values may be consistent with SF. In those pairs with high or 100% overlapping expression, SF may be occurring in the protein-coding regions such that partitioning of sugar transport function is taking place. For example, duplicate genes descended from an ancestral broad-spectrum MST may subsequently specialize in transporting different sugars at higher rates but remain expressed in the same tissues. Of the 6 duplicate pairs with expression data, symmetric divergence, and positive selection (SNF), 5 have moderate to high overlapping expression (43–100%) and 1 has 0% overlapping expression. Models of SNF suggest that, after an initial period of partitioning of tissue expression (low or no overlap), overlapping tissue expression should return as new function may be acquired in tissues where expression was previously lost. We see 5 of 6 duplicate pairs binned as SNF with overlapping expression consistent with this scenario. 
The sixth SNF pair, which has 0% tissue expression overlap (At4g04750–At4g04760), is intriguing because it has a very large number of codon sites in both duplicates that appear to be under positive selection and protein sequence divergence rates that, while not significantly different by the LRT, differ by more than their standard errors of estimation. This duplicate pair thus appears more consistent with NF Model III.
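The overlap statistic used in table 2 is the number of tissues in which both duplicates are expressed divided by the number of tissues in which at least one is expressed. A minimal Python sketch of that calculation follows; the function name and tissue labels are hypothetical, and reporting 0% when neither gene is expressed in any tissue is an assumption made here to mirror the "0% (0/0)" entry in the table.

```python
def expression_overlap(tissues_a, tissues_b):
    """Return (shared, either, percent) for two sets of expressing tissues.

    shared  = tissues where both duplicates are expressed
    either  = tissues where at least one duplicate is expressed
    percent = 100 * shared / either, rounded to the nearest integer
              (taken as 0 when either == 0; an assumption, see lead-in)
    """
    a, b = set(tissues_a), set(tissues_b)
    shared = len(a & b)
    either = len(a | b)
    percent = round(100 * shared / either) if either else 0
    return shared, either, percent


# Hypothetical tissue lists for two duplicates.
gene1 = {"root", "leaf", "flower", "pollen"}
gene2 = {"leaf", "flower", "seed"}

shared, either, percent = expression_overlap(gene1, gene2)
print(f"{percent}% ({shared}/{either})")  # same layout as the table 2 column
```

A pair whose duplicates are expressed in disjoint sets of tissues scores 0% regardless of how many tissues each occupies, which is why a 0% value alone cannot distinguish SF at the expression level from nearly complete NF.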
The above results suggest that most retained gene duplicate pairs diverge at approximately symmetric rates for a significant period of time after duplication, with positive selection occurring in one or both members much later and potentially persisting for many tens of millions of years after onset. A smaller subset of pairs appears to be consistent with the NF model; most of these pairs have a much faster evolving member that diverges through neutral processes, with some experiencing positive selection later. A caveat to these interpretations is that potential divergence among alleles prior to gene duplication, or positive selection on many sites after duplication, may make our estimated duplication dates appear older than they are.
Conclusions
The MST gene family in rice is significantly larger than that in Arabidopsis, with expansions of the STP and PLT subfamilies being responsible for most of the difference. Tandem duplications have played a significant role in the expansions of the STP, ERD, and PLT subfamilies in both Arabidopsis and rice, especially so for the STP and PLT subfamilies in rice and the ERD subfamily in Arabidopsis. An analysis of expression profiles and functional divergence among the closely related members of these tandem duplications will shed light on the adaptive significance of these lineage-specific subfamily expansions.
Gene duplicate divergence time estimates reveal that most subfamilies began to diversify in the early eukaryotes of the Proterozoic, with the STP subfamily radiating into at least 9 genes by the divergence of the basal dicot and monocot lineages ∼145 Myr. The PLT and AZT subfamilies began to diversify in the Mesozoic, with a number of duplications in the AZT subfamily happening at approximately the time of the monocot–dicot divergence. The subfamilies experiencing large expansions due to multiple tandem duplications continued to expand through the Mesozoic and Cenozoic.
Tests of gene duplicate divergence models in 24 paralog pairs reveal that two-thirds show symmetric divergence most consistent with the SF model, with one-third showing significantly asymmetric divergence most consistent with the NF model. Evidence of positive selection appears only in gene duplicate pairs estimated to be older than ∼34 Myr, extending to the oldest pair analyzed at ∼131 Myr. Among those genes with divergence rates not deemed statistically different, many show a rate difference greater than their standard errors of estimation. Given these data, the NF, SF, and SNF models of gene duplicate divergence appear to be describing different outcomes, along a continuum, from the same initial divergence process: Both gene duplicates are initially released from selective constraint after duplication and experience accelerated evolution until one or both experience mutations that change function. This may result in one duplicate experiencing significant functional change so that the other is forced back into constraint to maintain original function. Or, it may result in complementary mutations in both that result in some degree of functional partitioning, with one or both genes potentially gaining new function. In either case, as evolutionary change progresses, one or both genes may experience positive selection to refine or change their function in a new genetic and selective environment.
We thank David Liberles, Charles F. Williams, and Marjorie Matocq for helpful comments that improved an early version of the manuscript and 2 anonymous reviewers for comments that improved the final version. We thank Luobin Yang for programming and troubleshooting workflows on the Apple Xserve computing cluster in the Department of Biological Sciences at Idaho State University. M.A.T. and D.A.J. were supported by National Institutes of Health Grant number P20 RR016454 from the IDeA Network of Biomedical Research Excellence Program of the National Center for Research Resources. M.A.T. was also supported by a Pharmaceutical Research and Manufacturers of America Foundation award in informatics. Dissemination of this research was supported, in part, by Grant Number S07-D3 from the Graduate Student Research and Scholarship Committee of Idaho State University, Pocatello, Idaho.
Author notes
David Irwin, Associate Editor