Abstract

It is widely recognized that many genes and lineages do not adhere to a molecular clock, yet molecular clocks are commonly used to date divergences in comparative genomic studies. We test the application of a molecular clock across genes and lineages in a phylogenetic framework utilizing 12 genes linked in a 1-Mb region on chromosome 13 of soybean (Glycine max); homoeologous copies of these genes formed by polyploidy in Glycine; and orthologous copies in G. tomentella, Phaseolus vulgaris, and Medicago truncatula. We compare divergence dates estimated by two methods each in three frameworks: a global molecular clock with a single rate across genes and lineages using full and approximate likelihood methods based on synonymous substitutions, a gene-specific clock assuming rate constancy over lineages but allowing a different rate for each gene, and a relaxed molecular clock where rates may vary across genes and lineages estimated under penalized likelihood and Bayesian inference. We use the cumulative variance across genes as a means of quantifying precision. Our results suggest that divergence dating methods produce results that are correlated, but that older nodes are more variable and more difficult to estimate with precision and accuracy. We also find that models incorporating less rate heterogeneity estimate older dates of divergence than more complex models, as node age increases. A mixed model nested analysis of variance testing the effects of framework, method, and gene found that framework had a significant effect on the divergence date estimates but that most variation among dates is due to variation among genes, suggesting a need to further characterize and understand the evolutionary phenomena underlying rate variation within genomes, among genes, and across lineages.

In systematics, the use of a molecular clock for dating divergences has been considered problematic for some time (Graur and Li 2000). It is known that the rate of molecular evolution differs across the genome (Matassi et al. 1999), across lineages (Smith and Donoghue 2008), and over time (Fitch and Markowitz 1970, Lopez et al. 2002) due to generation time, population size, lineage effects, and shifts in the strength or direction of natural selection (Ayala 1999). In light of these phenomena, phylogeny-based “relaxed clock” methods that allow substitution rate variation over lineages were developed in maximum likelihood (ML; Sanderson 2002) and Bayesian (Thorne et al. 1998, Drummond et al. 2006) frameworks. In comparative genomic studies, relaxed methods are rarely employed, often because a well-resolved phylogeny is lacking or not used. These studies are limited to dating under the assumptions of a global or a gene-specific (genic) molecular clock, relying on pairwise sequence divergence measured with neutrally evolving characters and rate or fossil calibrations to estimate dates.

An example of molecular clock used in comparative genomics is dating a polyploidy event from a distribution of paralogue evolutionary distances. In this approach, evidence of polyploidy is determined by a higher than expected number of paralogue pairs with similar divergences at synonymous sites (Ks), termed a “polyploid peak” (Lynch and Conery 2000, Cui et al. 2006). The divergence date of the “progenitor” genome(s) of the polyploid is estimated by dividing the modal Ks value of the polyploid peak by a universal rate, often taken from the literature. Such global rates are often generalizations of questionable relevance to the taxa under study and can lead to very different interpretations of the same data. For example, Schlueter et al. (2004) and Blanc and Wolfe (2004) reported similar modal Ks values for polyploid peaks in several crop plants but estimated very different homoeologue divergence dates due to the use of different substitution rates. Fawcett et al. (2009) criticized the methodologies of these and other studies that led to such conflicting results and instead used a relaxed clock method to obtain dates for a number of polyploid events in angiosperms.

The use of relaxed methods eliminates one source of potential error, and the assumption that a global rate is appropriate to transform synonymous distances to dates. But how effective is this more rigorous, complex approach in obtaining a reliable date for the same event from multiple genes? Using a relaxed approach, Lavin et al. (2005) obtained different but correlated dates for legume divergences from two regions of the effectively nonrecombining chloroplast genome. With recombinational jumbling of evolutionary histories, the situation is likely to be worse for nuclear genes. Indeed, Zhang et al. (2002) reported synonymous substitution rate differences of up to 13.8-fold among homoeologous genes duplicated in Arabidopsis. If rate differences among genes are the rule, it is clear that the application of a global clock to estimate homoeologue divergence dates will produce wildly different results across genes, calling into question the accuracy of dates based on the mode of a polyploid peak.

To the extent that relaxed clock methods better model rate differences across lineages and genes, it would seem useful to apply them to the problem of dating homoeologue divergences and to assess how they compare with Ks-based global clock methods. Here, we do so for soybean (Glycine max L. (Merr.)) and relatives (Leguminosae). Glycine is known to be an ancient polyploid based on chromosome number (Goldblatt 1981), genetic mapping (Shoemaker et al. 1996), and sequence-based studies (Lee et al. 2001). More recent work used synonymous global clock dating via the Ks peak approach (Blanc and Wolfe 2004, Schlueter et al. 2004) and comparisons of individual homoeologous pairs (Pfeil et al. 2005, Schlueter et al. 2006, Schlueter, Lin, et al. 2007, Schlueter, Vasylenko-Sanders, et al. 2007, Innes et al. 2008) to identify and date two rounds of polyploidy in soybean: a more recent event in Glycine producing homoeologous copies estimated to have diverged 10–14 Ma and a more ancient event producing 50–60 Ma homoeologues (Shoemaker et al. 2006) shared with Medicago (Pfeil et al. 2005), which diverged from soybean ∼54 Ma (Lavin et al. 2005). We date homoeologue and orthologue divergences using multiple genes linked in a 1-Mb region of the soybean genome (Innes et al. 2008). The predominately selfing breeding system of soybean causes strong linkage among genes (Hyten et al. 2007), reducing the confounding affects of recombination. Genes in this region, however, show different synonymous distances between homoeologue pairs that trace their divergence to the same event (Innes et al. 2008), making this a useful study system for our purposes.

As genomic analyses become more prevalent, more studies will begin to incorporate phylogenetic information with comparative genomic data (e.g., Fawcett et al. 2009). It is thus important to understand the relative application and power of various divergence dating methods across scientific disciplines. Thus, we compare pairwise Ks-based molecular clock methods used in comparative genomic studies with phylogeny-based relaxed clock methods used by phylogeneticists.

A fundamental underlying difference between these methods is the modeling of rate differences across genes and lineages, differences that can have a strong impact on the outcome of divergence dating analyses. Understanding the impact of gene and/or lineage rate heterogeneity is paramount for understanding variation among divergence date estimates. Indeed, Thorne and Kishino (2005) stated that “ . the process of molecular evolution is poorly characterized and the relative contributions of gene-specific versus lineage-specific factors toward rate variation over time are unclear.” In this study, we explore the relative contributions of modeling gene and lineage rate heterogeneity to divergence date estimates and investigate the impact of time by dating divergences of various ages under three frameworks: 1) a global molecular clock framework that assumes a single rate across both genes and lineages, 2) a genic molecular clock framework that assumes rate constancy across lineages but allows an evolutionary rate specific to each gene, and 3) a relaxed clock framework that allows rate differences across both genes and lineages.

We employ a mixed model nested analysis of variance (ANOVA) approach for assessing the relative contributions of gene, method, and framework to the overall variation among date estimates by treating these different variables as categorical—a novel approach for comparing divergence dating methods—and estimating the relative impact of different types of rate heterogeneity on divergence dating. We also examine different strategies for dealing with multiple genes. Some scientists recommend concatenation as the best method (e.g., Thorne and Kishino 2005), and many studies do so without examining independent data sets, whereas other researchers disagree (e.g., Rodriquez-Trelles et al. 2002). Here, we compare the results of estimating divergence dates by using individual genes, averaging across independent gene estimates, and by concatenation of multiple genes. Finally, we address the question of whether the additional effort of employing more complex relaxed clock methods across multiple nuclear genes can produce estimates of divergence dates that are more precise or accurate than simpler global clock methods.

MATERIALS AND METHODS

To compare and contrast divergence dating methods used in comparative genomic and phylogenetic disciplines, assess the effect of differing methods, and gauge the impact of modeling different types of rate heterogeneity, here defined as rate differences allowed across genes or lineages, we used a nested scheme of two methods each within the global, genic, and relaxed clock frameworks (Fig. 1). We conduct these comparisons across varying phylogenetic depths to understand the impact of time on the accuracy of divergence dating. Our nested scheme enables statistical comparisons to determine the influence of method, framework, and gene on the variation of dates seen across nodes.

FIGURE 1.

Nested scheme for comparison of divergence dating methods within and among three frameworks, each with two methods. The global molecular clock framework assumes a universal rate of molecular evolution across both genes and lineages; the yn00 and F3x4 methods are employed therein. The genic molecular clock assumes the same rate across lineages but specific rates for each gene; methods include yn00c and F3x4c. The relaxed clock framework allows different rates across genes and lineages; PL and BA are used. A-O represent genes used herein, equating to genes A through J, M, and O from Innes et al. (2008). Gene sampling for each node and node names, H1, H2, H1/H2, H3/H4 and H1/H3, are as in Figure 2. This sampling scheme enables various sets of comparisons, such as across methods and frameworks but within node (shown by the series of dotted lines), among methods within framework (suggested by the gray boxes inclusive of two methods) and by using a mixed model nested ANOVA to compare within and among gene, method, and framework on a node by node basis (suggested by the concentric solid line groups).

FIGURE 1.

Nested scheme for comparison of divergence dating methods within and among three frameworks, each with two methods. The global molecular clock framework assumes a universal rate of molecular evolution across both genes and lineages; the yn00 and F3x4 methods are employed therein. The genic molecular clock assumes the same rate across lineages but specific rates for each gene; methods include yn00c and F3x4c. The relaxed clock framework allows different rates across genes and lineages; PL and BA are used. A-O represent genes used herein, equating to genes A through J, M, and O from Innes et al. (2008). Gene sampling for each node and node names, H1, H2, H1/H2, H3/H4 and H1/H3, are as in Figure 2. This sampling scheme enables various sets of comparisons, such as across methods and frameworks but within node (shown by the series of dotted lines), among methods within framework (suggested by the gray boxes inclusive of two methods) and by using a mixed model nested ANOVA to compare within and among gene, method, and framework on a node by node basis (suggested by the concentric solid line groups).

Study System and DNA Sequences

Our data set was constructed around 12 genes linked in a 1-Mb DNA segment on G. max chromosome 13 corresponding to genes A through J, M, and O of Innes et al. (2008). These were compared with available homologues from 1) homoeologous regions of the G. max genome duplicated by two polyploidy events, 2) regions orthologous to G. max homoeologues in a perennial homoploid relative of soybean, G. tomentella, and 3) homologous regions in Phaseolus vulgaris, a generic relative of soybean that diverged previous to the more recent soybean polyploidy event (see Fig. 2 for relationships and gene sampling across lineages). Medicago and outgroup sequences were obtained by BLAST (Altschul et al. 1997) searches of GenBank and the Arabidopsis Information Resource (TAIR, www.arabidopsis.org), with Medicago acting as outgroup for genes A, B, E, F, J, L, and O; Arabidopsis for genes C, D, and M; Populus for genes G and H; and Vitis for gene I (see Table S1, available from http://www.sysbio.oxfordjournals.org/, for GenBank numbers; available at http://www.sysbio.oxfordjournals.org). Gene by taxon sampling and phylogenetic relationships of orthologous and homoeologous sequences are outlined in Figure 2. All genes in Glycine and Phaseolus (GP) are syntenic, thereby providing evidence for orthology/paralogy. We analyzed a combined data set including genes C, D, G, H, I, K, and M. Gene K was included from Innes et al. (2008) because it was sampled in all lineages but Phaseolus.

FIGURE 2.

Phylogeny relating involved taxa and their genomes. H = homoeologue. G. = Glycine, P. = Phaseolus, and M. = Medicago. Letters in parentheses are genes sampled for each taxon. Dotted line represents lineages not sampled in all 12 genes. ▪ = recent polyploidy event preceding the radiation of Glycine, approximately 10 Ma. • = ancient polyploidy event shared by Medicago and Glycine approximately 50–60 Ma. Symbols over nodes represent polyploidy events and are approximately dated in million years ago. Particular nodes are named (see text). Modified from Innes et al. (2008).

FIGURE 2.

Phylogeny relating involved taxa and their genomes. H = homoeologue. G. = Glycine, P. = Phaseolus, and M. = Medicago. Letters in parentheses are genes sampled for each taxon. Dotted line represents lineages not sampled in all 12 genes. ▪ = recent polyploidy event preceding the radiation of Glycine, approximately 10 Ma. • = ancient polyploidy event shared by Medicago and Glycine approximately 50–60 Ma. Symbols over nodes represent polyploidy events and are approximately dated in million years ago. Particular nodes are named (see text). Modified from Innes et al. (2008).

Phylogenies and Testing the Global Molecular Clock

Introns were generally unalignable and were excised by comparison to previously annotated genes in GenBank and TAIR. DNA sequences were aligned using MUSCLE v. 3.6 (Edgar 2004) under default parameters and edited manually to ensure correct reading frame using Se-Al v. 2.0 (Rambaut 2002). Alignments for each gene, as well as the Bayesian tree topologies, are deposited in TreeBASE under submission number 10622 (www.treebase.org). Phylogenies were estimated using maximum parsimony (MP) in PAUP* 4.0b10 for 1000 random addition searches and tree-bisection-reconnection (TBR) branch swapping (Swofford 2002) and summarized in a strict consensus tree. Nodal support was assessed with 500 bootstrap replicates, each with 10 random addition sequences and TBR branch swapping. ML phylogenies were estimated using PHYML (Guindon and Gascuel 2003) on the PHYML Web interface (Guindon et al. 2005) using models of evolution determined in ModelTest v. 3.7 (Posada and Crandall 1998) under the Akaike information criterion. Five hundred ML bootstrap replicates were computed. Genes were tested for adherence to the molecular clock via the likelihood ratio test (LRT) for phylogenies estimated with and without a fixed-rate molecular clock (Felsenstein 1988) using appropriate models of evolution. Likelihood scores were estimated in PAUP* v. 4b10 (Swofford 2002). Relative rate tests (RRTs) were done on all genes using r8s v. 1.71 (Sanderson 2003) to determine if descendent lineages of each node had equal rates.

Divergence Dating: Global and Genic Clock Methods

Previous studies using Ks to date polyploidy events in this group used either the method of Yang and Nielsen (yn00 2000) or Goldman and Yang (F3x4 1994); we used both here. F3x4 accounts for transition/transversion and codon usage biases, estimating substitution rates using ML; yn00 accounts for the same biases but is an approximation of the F3x4 ML method. We conducted analyses for all pairwise comparisons for each gene using codeml (F3x4) and YN00 (yn00) in PAML (Yang 1997). Divergence dates (T) were calculated by dividing Ks by two times the synonymous substitution rate of 6.1 × 10 − 9 substitutions per synonymous site per year (s/s/year; Lynch and Conery 2000).

We also used a “genic molecular clock” that applies a rate specific to each gene calculated by dividing the Ks for the most recent common ancestor (MRCA) of Phaseolus and Glycine by twice the divergence date of 19.2 Ma and the mean age for this node estimated previously using the chloroplast matK gene based on 13 fossil calibration points (Lavin et al. 2005). We also calculated a rate using the minimum and maximum date estimates from the same study, 15.5 and 22.4 Ma, respectively, to assess calibration uncertainty. These genic clocks were then applied across the remaining nodes using the Ks values estimated by yn00 and F3x4 to calculate dates, denoted as yn00c and F3x4c, respectively.

Divergence Dating: Relaxed Clock Methods

We estimated divergence dates on individual and combined data sets in the relaxed clock framework using the two most commonly used methods in current literature: the ML method penalized likelihood (PL, Sanderson 2002) using r8s v. 1.71 (Sanderson 2003) and Bayesian inference (BA) using BEAST v1.4.7 (Drummond and Rambaut 2007). Tree topologies used in r8s were assessed as above with branch lengths estimated via ML in PAUP* v.4b10 using appropriate models of evolution. Trees were rooted using outgroup sequences (see Innes et al. 2008). The MRCA of GP was used for calibration, with separate analyses having the node fixed for the minimum (15.5 Ma), mean (19.2 Ma), and maximum (22.4 Ma) ages previously estimated by Lavin et al. (2005). We fixed the age of GP for two reasons: 1) PL requires one node to be fixed; and 2) a fixed calibration point allows a direct comparison with dates estimated under the genic framework, which required rate calibration to a single fixed date. Although a date estimate for the MRCA of Medicago and Glycine is available, some of our Medicago sequences are uncertain as to orthology, precluding the use of this node for calibration. For PL, cross-validation was used to determine the smoothing parameters, with log10 values ranging from 0 to 4 in 0.1 increments.

To estimate the degree of variance associated with PL date estimates, we created 110 bootstrapped sequence data sets for each gene using SEQBOOT in PHYLIP v3.67 (Felsenstein 2004). These pseudoreplicates were used to bootstrap branch lengths on the original tree topology (fixed) via ML in PAUP* v4b10 using the same model as the original data set. In a few instances, the bootstrap replicate was not read by r8s, likely due to zero length branches. These replicates were culled and remaining trees entered into r8s with dates summarized using the profile command.

Because topology must be held constant in r8s, phylogenetic uncertainty is not accounted for. To determine the extent to which topological uncertainty might influence the summary of dates in r8s, we estimated a branch and bound tree (BB, Hendy and Penny 1982) under MP for each pseudoreplicate using DNAPENNY in PHYLIP. BB trees were then summarized in majority rule consensus trees using CONSENSE, also in PHYLIP, allowing the determination of the type and prevalence of minor groupings suggested by individual bootstrap replicates.

BA estimates of divergence dates using Markov chain Monte Carlo (MCMC) sampling were determined in BEAST v1.4.7 (Drummond and Rambaut 2007). No topological constraints were used, allowing assessment of topological uncertainty in the divergence date posterior distribution. We used a relaxed clock model with log normally distributed uncorrelated rates between branches (Drummond et al. 2006). Each analysis was calibrated using the same node as in PL with a normally distributed prior having a mean of 19.2 Ma and a standard deviation of 1.4 Ma equating to the range of 15.5–22.4 Ma from Lavin et al. (2005). We chose to use a normally distributed date of calibration, as opposed to a fixed calibration date for two reasons: 1) to enable the assessment of uncertainty surrounding the calibration dates, as we do so by other means for other methods we use and 2) BEAST v1.4.7 does not allow fixed calibration dates—doing so would result in a prior with zero probability. Our preliminary analyses with the normally distributed calibration date produced a mean date for the GP node that was very close to the date of 19.2 Ma used for the fixed date calibration in PL. The tree prior was modeled under the coalescent with constant population size using Jeffrey's (1946) prior. We chose the coalescent over the Yule model because 1) the phylogeny under scrutiny is a gene phylogeny, not a species phylogeny and 2) the coalescence of duplicated genes is used to approximate the polyploidy event, not the divergence of any two species. In any case, we found minimal differences between dates estimated via the Yule versus coalescent models (data not shown). Each data set used appropriate models. All other priors were set as default values in the program BEAUti v1.4.7 (Drummond and Rambaut 2007). Final analyses consisted of two MCMC runs of 3–10 million generations sampled every 1000 generations. LogCombiner v1.4.7 (Drummond and Rambaut 2007) was used to combine the results of both runs. Tracer v1.4 (Rambaut and Drummond 2007) was used to determine burn-in (10%), confirm mixing and stationarity, visualize likelihood traces, and summarize the posterior distribution for date estimates across nodes. Convergence of MCMC was assessed by 1) comparison of generation by likelihood traces of independent runs to ensure similar values, 2) average standard deviation of split frequencies less than 0.01 at termination, and 3) potential scale reduction factor scores converging to 1.0.

Methodological Comparisons: Statistics

Our strategy incorporates a nested design that facilitates comparisons within and among genes, nodes, method, and framework (Fig. 1). We performed linear regressions for all pairwise comparisons between methods to visually illustrate correlations. We also calculated Pearson correlations (PC) for each comparison. To compare Ks estimates between yn00 and F3x4, we used the Wilcoxon matched pair signed rank test applicable to paired correlated samples that fall along a scale (i.e., time) in order to determine directional significant difference. We employed the sign test to determine if pairs of homoeologues differed significantly. We accounted for correlation among samples when comparing Ks estimates between methods by using the paired (two sample) t test.

Given that rates differ among genes and lineages, we expected methods that better model rate differences across lineages and genes to be more accurate. Testing this is difficult without known divergence dates or simulations. However, precision is also a useful goal and is a potential stand-in for accuracy. We argue that a decrease in the variance of date estimates across genes constitutes improvement, as models should converge on narrower intervals if they deal more successfully with rate differences. To quantify this, we used the cumulative variance across genes at each node as a measure of precision. A comparison of divergence date estimates for each node for all combinations of methods both within and among frameworks allowed assessment of the impact of different models and methods on the precision of dating divergences and how each compensates for variation among genes, lineages, and over time.

In order to understand the relative influence of framework, method, and gene on the overall variation in our data, we fit a mixed model nested ANOVA to dates estimated for each node. Date estimates for each node are not independent of each other because of the innate hierarchical structure (phylogenetic history) linking each species with its ancestor (internal node). Indeed, if this were not true, we could not estimate dates of divergence using calibration. However, this nonindependence presents a problem for comparing date estimates within a statistical framework. To overcome this, we computed a mixed model nested ANOVA separately for each node: date estimates for node H1 are not combined or compared with estimates for node H1/H2 or any other nodes (see Figs. 1 and 2). Although it is true that the dependence of dates still exists among nodes by virtue of the evolutionary history relating these taxa, those same historical evolutionary constraints are in place for each method used, and thus “cancel out” in the context of statistical independence, eliminating the problem of nonindependence for use in the statistical framework. Because the 12 genes represent a sampling of genes in the genome, GENE was set as a random effect and modeled under residual ML. “Method” is nested in “framework” (denoted METHOD[FRAMEWORK]) because each method is specific to framework. FRAMEWORK and METHOD[FRAMEWORK] were set as fixed effects. The model equation is DATE = GENE(random) + FRAMEWORK(fixed) + METHOD[FRAMEWORK](fixed). Analyses were conducted in JMP v7 (SAS Institute Inc.) and Microsoft Excel 2004 v11.2 data analysis toolkit (Microsoft Corporation).

RESULTS

Models and Rates of Molecular Evolution

Phylogenetic analyses of all genes recovered expected relationships (see Fig. 2). Selected models of molecular evolution varied among genes; rate heterogeneity across sites was detected and modeled by the Gamma distribution or proportion of invariable sites in all genes but F, where Hasegawa, Kishino, and Yano (HKY) model (Hasegawa et al. 1985) fit best (Table S2). LRT rejected adherence to a molecular clock for all genes (P < 0.05) except A, B, F, and O. RRTs found differential rates of molecular evolution across at least one node in almost every gene, corroborating LRT results (Table S3). However, LRT results are not completely congruent with RRT results, likely due to the different scope of these tests: LRT is tree wide, whereas RRT is lineage specific. Older nodes exhibited differential rates of nucleotide substitution, with high correlation between node age and the proportion of genes exhibiting significant rate differences across nodes (linear regression equation: proportion of rate differences = 0.163×node age + 0.1784; R2 = 0.97). PL smoothing parameters varied by gene (Table S4).

Ks estimates for nodes varied across physically linked genes from 1.7- to 3.6-fold for yn00 and 2.1- to 6-fold for F3x4 (Fig. 3 and Table S5). For the majority of gene–node comparisons, F3x4 Ks estimates were higher than yn00 estimates (Wilcoxon matched pair signed ranks test: P = 0.00016) as expected (Yang and Nielsen 2000). When averaged across genes, Ks estimates were consistent in being slightly higher for F3x4 than for yn00 (Table 1; Wilcoxon matched pair signed ranks test: P = 0.016), but differed significantly only for GP (paired t test: one-tailed P = 0.006). Ks values calculated from the concatenated data set were within the range of values from individual genes (Fig. 3 and Table S5) and were not statistically different from the average values across genes (paired t test: P > 0.05).

TABLE 1.

Average number of synonymous substitutions per site and dates of divergence across genes for key nodes for each method

 GP H1 H2 H1/H2 H3/H4 H1/HA H3/HB H1/H3 
Ks         
    yn00 0.25 (0.07) 0.06 (0.02) 0.06 (0.02) 0.12 (0.04) 0.10 (0.03) 0.51 (NA) 0.63 (0.20) 0.46 (0.12) 
    F3x4 0.29 (0.07) 0.06 (0.02) 0.07 (0.02) 0.13 (0.04) 0.10 (0.04) 0.46 (NA) 0.87 (0.27) 0.55 (0.19) 
Average date         
    yn00 20.75 (5.60) 4.58 (1.56) 5.24 (1.60) 9.98 (3.46) 8.42 (2.80) 41.61 (NA) 51.69 (16.14) 37.47 (9.77) 
    F3x4 23.89 (6.11) 4.63 (1.93) 5.52 (1.67) 10.48 (3.31) 8.55 (3.65) 37.82 (NA) 71.55 (22.30) 45.34 (14.80) 
    yn00c         
        Mean 19.2 (NA) 4.27 (1.19) 4.91 (1.68) 9.35 (2.58) 9.56 (4.67) 65.86 (NA) 55.89 (11.12) 42.67 (15.05) 
        Minimum 15.5 (NA) 3.45 (0.96) 3.96 (1.35) 7.55 (2.08) 7.63 (3.77) 53.17 (NA) 45.12 (8.97) 34.45 (12.15) 
        Maximum 22.4 (NA) 4.99 (1.39) 5.73 (1.96) 10.91 (3.01) 11.03 (5.45) 76.84 (NA) 65.20 (12.98) 49.78 (17.56) 
    F3x4c         
        Mean 19.2 (NA) 3.80 (1.64) 4.73 (1.56) 8.45 (1.98) 7.47 (3.16) 29.50 (NA) 46.74 (4.57) 38.84 (13.40) 
        Minimum 15.5 (NA) 3.07 (1.32) 3.82 (1.26) 6.82 (1.59) 6.03 (2.55) 23.82 (NA) 37.73 (3.69) 31.35 (10.82) 
        Maximum 22.4 (NA) 4.43 (1.91) 5.52 (1.82) 9.86 (2.30) 8.71 (3.69) 34.42 (NA) 54.53 (5.33) 45.31 (16.63) 
    PL 21.28 (1.86) 5.50 (2.21) 6.86 (2.01) 11.72 (2.84) 10.32 (4.99) 30.88 (NA) 31.54 (1.00) 52.60 (10.77) 
    BA 18.53 (0.01) 4.39 (1.04) 5.05 (0.64) 9.70 (1.62) 8.13 (1.30) 29.17 (NA) 34.08 (9.43) 40.26 (6.76) 
n 12 12 12 6 (3 for PL) 
 GP H1 H2 H1/H2 H3/H4 H1/HA H3/HB H1/H3 
Ks         
    yn00 0.25 (0.07) 0.06 (0.02) 0.06 (0.02) 0.12 (0.04) 0.10 (0.03) 0.51 (NA) 0.63 (0.20) 0.46 (0.12) 
    F3x4 0.29 (0.07) 0.06 (0.02) 0.07 (0.02) 0.13 (0.04) 0.10 (0.04) 0.46 (NA) 0.87 (0.27) 0.55 (0.19) 
Average date         
    yn00 20.75 (5.60) 4.58 (1.56) 5.24 (1.60) 9.98 (3.46) 8.42 (2.80) 41.61 (NA) 51.69 (16.14) 37.47 (9.77) 
    F3x4 23.89 (6.11) 4.63 (1.93) 5.52 (1.67) 10.48 (3.31) 8.55 (3.65) 37.82 (NA) 71.55 (22.30) 45.34 (14.80) 
    yn00c         
        Mean 19.2 (NA) 4.27 (1.19) 4.91 (1.68) 9.35 (2.58) 9.56 (4.67) 65.86 (NA) 55.89 (11.12) 42.67 (15.05) 
        Minimum 15.5 (NA) 3.45 (0.96) 3.96 (1.35) 7.55 (2.08) 7.63 (3.77) 53.17 (NA) 45.12 (8.97) 34.45 (12.15) 
        Maximum 22.4 (NA) 4.99 (1.39) 5.73 (1.96) 10.91 (3.01) 11.03 (5.45) 76.84 (NA) 65.20 (12.98) 49.78 (17.56) 
    F3x4c         
        Mean 19.2 (NA) 3.80 (1.64) 4.73 (1.56) 8.45 (1.98) 7.47 (3.16) 29.50 (NA) 46.74 (4.57) 38.84 (13.40) 
        Minimum 15.5 (NA) 3.07 (1.32) 3.82 (1.26) 6.82 (1.59) 6.03 (2.55) 23.82 (NA) 37.73 (3.69) 31.35 (10.82) 
        Maximum 22.4 (NA) 4.43 (1.91) 5.52 (1.82) 9.86 (2.30) 8.71 (3.69) 34.42 (NA) 54.53 (5.33) 45.31 (16.63) 
    PL 21.28 (1.86) 5.50 (2.21) 6.86 (2.01) 11.72 (2.84) 10.32 (4.99) 30.88 (NA) 31.54 (1.00) 52.60 (10.77) 
    BA 18.53 (0.01) 4.39 (1.04) 5.05 (0.64) 9.70 (1.62) 8.13 (1.30) 29.17 (NA) 34.08 (9.43) 40.26 (6.76) 
n 12 12 12 6 (3 for PL) 

Notes: See Figure 2 for node names. yn00 is the method of Yang and Nielsen (2000) for calculating synonymous substitution rates; F3x4 is the method of Goldman and Yang (1994); yn00c and F3x4c use the previously defined methods but calibrated to gene-specific rates of molecular evolution; PL is penalized likelihood (Sanderson 2002); BA is Bayesian inference using BEAST v1.4.7 (Drummond and Rambaut 2007). n is the number of genes included in the key node average; n changes between key nodes because not all taxa were sampled for each gene. Standard deviations are in parentheses. All dates are given in million years ago. Those deep divergences as estimating the age of the root in PL were not included in the averages here, artificially lowering its standard deviation and variance (see Table S4). F3x4c and yn00c show dates estimated based on the mean (19.2 Ma), minimum (15.5 Ma), and maximum (22.4 Ma) GP divergence dates of Lavin et al. (2005). NA, not applicable.

FIGURE 3.

Numbers of synonymous substitutions/synonymous site (Ks) for each gene across key nodes using the approximate (yn00) and full ML (F3x4) methods. Linked genes A to O are listed in order of synteny along Glycine max chromosome 13. Z represents the concatenated data set. Note that older divergences (i.e., H1/H3) have wider ranges of values, illustrating the trend of older nodes to exhibit greater variance across genes.

FIGURE 3.

Numbers of synonymous substitutions/synonymous site (Ks) for each gene across key nodes using the approximate (yn00) and full ML (F3x4) methods. Linked genes A to O are listed in order of synteny along Glycine max chromosome 13. Z represents the concatenated data set. Note that older divergences (i.e., H1/H3) have wider ranges of values, illustrating the trend of older nodes to exhibit greater variance across genes.

The nested polyploidy events in soybean provide an internal standard: for autopolyploid taxa, comparisons of H1 versus H2 and H1/H2 versus H3/H4 should produce similar rate (Ks) and date estimates for each gene, as homoeologues were created at the same time. In contrast, allopolyploid homoeologues could have inherited different rates from divergent lineages, but relaxed methods should still obtain similar date estimates. Although H2 exhibited higher Ks values than H1 on average, the differences were not significant for yn00 or F3x4 (sign test: P > 0.125; t test assuming equal variances: one-tailed P > 0.15). Ks estimates did not differ significantly between H1/H2 and H3/H4 for either method (sign test: P > 0.625; t test assuming equal variances: P > 0.15). These results suggest that homoeologous gene pairs are evolving at similar rates.

The rate of molecular evolution as calibrated to the 19.2 Ma split for node GP varied between genes 2.51-fold for yn00c and 2.15-fold for F3x4c. Genic rates ranged from 3.85 (gene M) to 9.68 (gene E) × 10 − 9 s/s/year for yn00c and from 4.95 (gene F) to 10.66 (gene C) × 10 − 9 s/s/year for F3x4c (Table S5). The more conservative approach using the low of the minimum age calibration (15.5 Ma; Lavin et al. 2005) and the high of the maximum age calibration (22.4 Ma) yielded a range of 3.30–11.99 × 10 − 9 s/s/year for yn00c and 4.24–13.20 × 10 − 9 s/s/year for F3x4c (Table S5). The genic rate of evolution averaged across genes was 6.59 and 7.59 × 10 − 9 s/s/year for yn00c and F3x4c, respectively, based on the mean age of 19.2 Ma for GP. The genic rates of evolution suggested in the combined analyses had estimates of 6.76 and 7.04 × 10 − 9 s/s/year for yn00c and F3x4c, respectively, calibrated to 19.2 Ma. The genic rates of the minimum and maximum age calibrations ranged from 5.79 to 8.37 and from 6.04 to 8.72 × 10 − 9 s/s/year for yn00c and F3x4c combined analyses, respectively (Table S5).

Divergence Dates: Ages of Key Nodes

Divergence dates varied among genes, methods, and frameworks (Tables S3 and S4) but were not statistically different between H1 and H2 or between H1/H2 and H3/H4 within method (two-sample t test assuming equal variances: all one-tailed p values > 0.05). The date for the G. max/G. tomentella split (represented by nodes H1 and H2) averaged between 3.8 and 6.86 Ma across methods (Table 1). Average dates between Glycine homoeologues since the more recent polyploidy event (represented by nodes H1/H2 and H3/H4) ranged from 7.47 to 11.72 Ma across methods (Table 1). The date estimated for the homoeologue divergence since the more ancient polyploidy event (represented by node H1/H3) averaged from 37.47 to 52.6 Ma across methods.

Measures of uncertainty for divergence date estimates are presented for all methods except F3x4. For yn00, uncertainty is given by the standard errors of Ks values (Table S5). For genic dates, measures of uncertainty were obtained by using the minimum and maximum calibration dates and associated rates from the chloroplast matK data set of Lavin et al. (2005). A 95% confidence interval (CI) for PL is provided by bootstrap analyses and a 95% credibility interval is presented for Bayesian analyses (Table S4). The width of the interval of uncertainty varied by method, with older nodes mostly having wider intervals regardless of method (Table S5).

Topological uncertainty in the bootstrapped replicates created for use in r8s varied by data set. For the combined data set and genes D, H, and J, all BB topologies were fully congruent with the original ML tree topology. Gene F had 21.67% of BB topologies that supported Phaseolus as sister to H2. All other genes had some groupings that were incongruent with the ML topology, but none were present in more than 6% of the bootstrap replicate topologies (Supplementary File S1). Most of the incongruent trees were one of a set of 2 or 3 most parsimonious trees that included the ML topology. These results suggest minimal topological uncertainty among bootstrap replicates with proportions so low that it is unlikely to significantly change the dates summarized in r8s.

Because of the diverse strategies used by the methods employed, measures of uncertainty are not readily comparable. To allow a direct comparison of uncertainty across methods, we averaged divergence dates across genes for each node by method and computed a 95% CI using the associated standard deviation (Table 1). Because we summarized across average dates for each gene, these ranges were narrower than intervals suggested by each method alone. In Figure 4, these ranges are placed on their respective nodes with branch lengths proportional to the average across genes and methods. For the most part, BA produced the smallest CI with the exception of the H3/HB node for which PL exhibited the smallest CI (which also did not overlap with the average across genes and methods). However, the CI for H3/HB from PL is artificially narrow because the estimate did not include nodes that represented the root of the phylogeny, a position for which PL is often unreliable (see r8s manual). The CIs associated with GP varied greatly due to its use for calibration in some methods.

FIGURE 4.

Average divergence date for key nodes across genes for each method. The phylogeny has branch lengths proportional to the average across the six methods and all genes for each key node. The bars represent a 95% CI for each method at each node estimated across genes. H1/HA had only a single point and so lacks error bars. The arrows attached to bars indicate that the bar, and thus 95% CI, should be elongated to the number indicated but is truncated here to allow better resolution of the phylogeny.

FIGURE 4.

Average divergence date for key nodes across genes for each method. The phylogeny has branch lengths proportional to the average across the six methods and all genes for each key node. The bars represent a 95% CI for each method at each node estimated across genes. H1/HA had only a single point and so lacks error bars. The arrows attached to bars indicate that the bar, and thus 95% CI, should be elongated to the number indicated but is truncated here to allow better resolution of the phylogeny.

Divergence Dates: Methodological Comparisons

Although divergence date estimates varied among methods for each node, not all comparisons were statistically different. Of 90 comparisons, 16 had one-tailed P values < 0.05 with 13 remaining significant using a two-tailed P value (Table S6). Half of the comparisons that were statistically different involved GP, perhaps due to its use for calibration. However, neither yn00 nor F3x4 require calibration based on a specific node and yet three GP comparisons were significantly different (F3x4 to yn00c, F3x4c, and BA). Use of a Bonferroni-corrected p value of 0.00056 (α = 0.05) left 5 statistically significant comparisons, all involving GP.

The slope of the linear regression line in methodological comparisons illustrates the variation between pairs of methods (Fig. 5). Departure from a slope of 1.0 indicates increasing disagreement in age estimates between models as node age increases. For 12 of 15 comparisons, the slope of the line was significantly different from 1.0 (Fig. 5). The amount of scatter around the line, captured in correlation coefficients, illustrates disparity of methods for dealing with dating issues at older nodes. PC and linear regression analyses showed high correlation among all comparisons. The yn00/F3x4 comparison had the highest correlation (R2 = 0.9494; PC = 0.9756), whereas the PL/yn00c comparison had the lowest (R2 = 0.5485; PC = 0.8077; Fig. 5).

FIGURE 5.

Pairwise comparisons between dates of divergence estimated by global, genic, and relaxed clock methods. Equations are linear regression trend lines through the origin with accompanying correlation coefficients (R2). Comparisons above the diagonal do not show linear regression graphs but do show trend line equations and their associated correlation coefficients. PC is Pearson correlation coefficient. X- and Y-axes are from 0 to 100 Ma. Comparisons do not include those nodes for which the root was estimated (designated by “*” in Table S4). NS designates those comparisons whose slope of the line is not significantly different from one.

FIGURE 5.

Pairwise comparisons between dates of divergence estimated by global, genic, and relaxed clock methods. Equations are linear regression trend lines through the origin with accompanying correlation coefficients (R2). Comparisons above the diagonal do not show linear regression graphs but do show trend line equations and their associated correlation coefficients. PC is Pearson correlation coefficient. X- and Y-axes are from 0 to 100 Ma. Comparisons do not include those nodes for which the root was estimated (designated by “*” in Table S4). NS designates those comparisons whose slope of the line is not significantly different from one.

The comparison within the global framework based on linear regression suggests that, as node age increased, yn00 produced lower age estimates than F3x4. Yn00 is an approximation of the full likelihood method, F3x4, and is said to be less precise (Yang and Nielsen 2000). The reverse relationship was true between F3x4c and YN00c within the genic framework in which a rate for each gene was calculated using the GP split. Given that the global and genic methods use the same Ks values (estimated under the global framework), we expect the same relationship as between yn00 and F3x4. A closer inspection suggests that nodes H1/H3, H1/HA, and H3/HB from gene M (see Fig. 2), a MtN21 nodulin protein, are within ranges of corresponding estimates from other genes, but younger nodes (including GP) are markedly younger (see Tables S3 and S4). After calibration to GP, the older nodes for gene M would be extrapolated to be older still causing the linear slope to be opposite that expected. These results may be explained through the acceptance of a change in the rate of molecular evolution through time and/or over lineages in the ancestor of GP, either through a shift in functional constraint, allowing sites to act under a covarion-type model (Fitch and Markowitz 1970), or through heterotachy, a change in the synonymous substitution rates across sites in a locus (Philippe and Lopez 2001). Previous research suggested that tribe Phaseolinae, to which Phaseolus belongs, exhibits elevated rates of molecular evolution compared with other legume genera (Lavin et al. 2005, Stefanovic et al. 2009). This may also explain our findings that the GP node differed significantly among pairwise method comparisons discussed previously.

Within-framework comparisons (yn00–F3x4, yn00c–F3x4c, and PL–BA) were more highly correlated than among-framework comparisons. Those involving PL were the least correlated of among-framework comparisons. Those involving BA were slightly more correlated across frameworks than those involving PL.

To compare accuracy across methods, we calculated the cumulative variance across genes for each method and node except GP and H1/HA (due to calibration and single point estimates, respectively; data from Table 1). The global framework had the highest cumulative variances of 380.75 and 747.1 for yn00 and F3x4, respectively. The genic framework gave cumulative variances of 382.87 for yn00c and 219.47 for F3x4c. The relaxed framework had the lowest cumulative variances of 106.22 for BA and 158.88 for PL. However, the PL variance was artificially low due to the culling of root nodes.

The mixed model nested ANOVA suggested that FRAMEWORK played an important role in modeling the variation in the data and was a significant factor for comparisons within GP (genic:global and global:relaxed) and H1 (genic:relaxed) and marginally so for H1/H2 and H2 (Table 2). FRAMEWORK was not significant for H3/H4 or H1/H3, likely due to fewer genes sampled. METHOD[FRAMEWORK] contributed significantly to the overall variation for GP only, likely due to the calibration constraints on this node. Although not a significant contributor to date variation overall, the comparison between F3x4c and PL for METHOD[FRAMEWORK] was significantly different for H1, H2, and H1/H2. After controlling for fixed effects, GENE contributed significantly to residual variability (between 13% and 56%; Table 2) and seems to be the main contributor to variation among divergence dates in this study.

TABLE 2.

Results from mixed model nested ANOVAs conducted for each key node

 n genes FRAMEWORK P value METHOD [FRAMEWORK] P value % Variation contributed by GENE Significant comparisons within FRAMEWORK Significant comparisons within METHOD [FRAMEWORK] 
H1 12 F = 4.3219 F = 2.4851 56.73 GE × RE F3x4c × PL 
  P = 0.0181 P = 0.0702    
H2 F = 2.9264 F = 2.5692 38.58  F3x4c × PL 
  P = 0.0690 P = 0.0728    
H1/H2 12 F = 3.2750 F = 1.6043 13.03 GE × RE F3x4c × PL 
  P = 0.0453 P = 0.1989    
H3/H4 F = 0.2462 F = 0.9489 53.77   
  P = 0.7848 P = 0.4420    
GP 12 F = 6.2140 F = 3.3594 13.93 GE × GL F3x4 × F3x4c 
  P = 0.0037 P = 0.0251  GL × RE yn00c × F3x4 
      F3x4 × BA 
H1/H3 F = 1.3938 F = 1.0947 49.28   
  P = 0.2924 P = 0.3958    
 n genes FRAMEWORK P value METHOD [FRAMEWORK] P value % Variation contributed by GENE Significant comparisons within FRAMEWORK Significant comparisons within METHOD [FRAMEWORK] 
H1 12 F = 4.3219 F = 2.4851 56.73 GE × RE F3x4c × PL 
  P = 0.0181 P = 0.0702    
H2 F = 2.9264 F = 2.5692 38.58  F3x4c × PL 
  P = 0.0690 P = 0.0728    
H1/H2 12 F = 3.2750 F = 1.6043 13.03 GE × RE F3x4c × PL 
  P = 0.0453 P = 0.1989    
H3/H4 F = 0.2462 F = 0.9489 53.77   
  P = 0.7848 P = 0.4420    
GP 12 F = 6.2140 F = 3.3594 13.93 GE × GL F3x4 × F3x4c 
  P = 0.0037 P = 0.0251  GL × RE yn00c × F3x4 
      F3x4 × BA 
H1/H3 F = 1.3938 F = 1.0947 49.28   
  P = 0.2924 P = 0.3958    

Notes: n genes is the number of genes sampled for each node (see Fig. 2 for node names). FRAMEWORK and “method nested within framework” (= METHOD[FRAMEWORK]) are fixed effects, whereas GENE is random. Significant comparisons are those exhibiting significant differences at the 0.05 level partitioned by effect. GE is genic, GL is global, and RE is relaxed as categories of FRAMEWORK. F is the F ratio value; P is the P value. The degrees of freedom are 2 for FRAMEWORK and 3 for METHOD[FRAMEWORK].

When analyzed in combination (as total evidence), node ages were near the median of the range of estimates from individual genes within method. Values from combined analyses were similar to (cf. Table 1 and Tables S4 and S5) and highly correlated with (Fig. S1) those obtained by averaging across genes for each node. Variation in dates exists across physically linked genes, even within method. An investigation of the random effect predictions determined by residual ML as part of the mixed model nested ANOVA suggested that some genes lie outside the distribution representing the mean across genes for each node: genes F, H, I, J, and M all had values significantly different from the distribution estimated for one or more nodes (Table 2). Overall, global and genic methods exhibited greater variation across genes than relaxed clock methods, with BA having the smallest variation of all.

DISCUSSION

Dating the divergence of molecular sequences is difficult, particularly under the assumption of a molecular clock where a global rate of nucleotide substitution must be chosen and justified. Dating divergences becomes increasingly problematic when multiple genes are introduced, as Ks and genic rates may vary due to GC content (Ticher and Graur 1989), codon bias (Sharp and Li 1987), genomic location (Matassi et al. 1999), or gene expression levels (Akashi 2001). However, linked genes may be more alike in substitutional divergence (Matassi et al. 1999), mitigating some of the variation across genes due to genomic location. Our linked genes had a narrower range in Ks for the 10 Ma soybean homoeologues compared with ∼10-fold ranges of unlinked genes in soybean (Blanc and Wolfe 2004, Schlueter et al. 2004), concurring with other clusters of soybean linked loci (Schlueter, Vasylenko-Sanders, et al. 2007, Van et al. 2008).

Dating divergences is further complicated by rate differences accumulating across lineages. Studies have illustrated lineage-specific rate heterogeneity (e.g., Hamilton et al. 2003), even at synonymous sites (Eyre-Walker and Gaut 1997). Lineage rate heterogeneity could stem from the evolution of life history traits such as generation time (Andreasen and Baldwin 2001) or habit (Kay et al. 2006). In the case of polyploidy, homoeologous genes may exhibit rate differences that were inherited from divergent progenitors (i.e., allopolyploidy; Small and Wendel 2002). Moreover, time itself can become a factor in the ability to estimate dates of divergence reliably, as evidenced by the amplitude of change from gene to gene being markedly larger for older nodes (Fig. 3). Increasing variance among dates over genes at older nodes is reminiscent of a one-dimensional random walk: as time increases, the variance among independent replicates (i.e., genes) increases (Fig. S2). The idea that greater variance at older nodes is due to stochastic changes in rate over time led to the development of standard Markov models of sequence evolution (Gilks et al. 1996), such as the JC model (Jukes and Cantor 1969), K2P model (Kimura 1980), or the general time reversible model (Tavare 1986). Such a trend could also be due to changes accumulating from ancient untraceable gene conversion events, a phenomena that may be more prevalent among duplicated regions, such as those created by polyploidy (Zhang et al. 2002).

Testing Hypotheses among Frameworks

We tested the hypothesis that increasingly complex models of rate heterogeneity should result in more precise dates of divergence. Thus, the genic clock methods used in the genic framework should outperform the strict molecular clock methods used in the global clock framework. Likewise, relaxed clock methods from the relaxed framework should produce more tightly grouped dates across genes than genic or global methods.

Our sampling of methods, frameworks (i.e., types of rate heterogeneity), and genes allowed us to use a novel statistical approach to determine the overall impact of these factors on the variance of date estimates for each node. The mixed model nested ANOVA allows us to go deeper than a simple comparison of different methods by estimating the impact of a framework—defined by different combinations of modeling genic versus lineage-specific rate heterogeneity—while taking into account variation across multiple methods (hence the inclusion of multiple methods within each framework) and genes. Our results confirm that FRAMEWORK had a significant impact on the outcome of divergence date analyses, illustrating the importance of modeling different types of rate heterogeneity among genes and lineages. However, we found GENE to be the main contributor to divergence date variation. This finding alone suggested that genic methods should be preferred over global methods, a suggestion backed by our precision criterion: genic methods had lower cumulative variances than global methods. Because each node was analyzed separately to eliminate phylogenetic nonindependence, phylogenetic lineage was not included as a factor in the mixed model nested ANOVA. However, our RRTs confirmed lineage-specific rate variation in our data, supporting the use of relaxed methods that incorporate such rate heterogeneity. Here again, our precision criterion shows that relaxed methods should be preferred over genic and global methods by having lower cumulative variances across genes.

Previous studies that compared methods across frameworks provide conflicting evidence as to which methods are preferred. Some studies found little or no difference among divergence dates when comparing relaxed clock methods to phylogenetically based global clock methods (e.g., Roger and Hug 2006), whereas others gave mixed results. Depending on nucleotide position and calibration, Yoder and Yang (2000) found local and global methods to be either remarkably similar or fairly divergent for primate speciation dates, whereas Perez-Losada et al. (2004) found local clock methods to be more precise. In spite of these mixed results, our findings verify that modeling the different types of rate heterogeneity has a positive effect on dating analyses. In addition, our use of cumulative variance as a criterion for choosing among methods suggests that the use of more complex models of rate heterogeneity in divergence dating can increase the precision of divergence date estimates from multiple genes.

The Impact of Time on Divergence Dating

Estimating precise dates of divergences at older nodes is more difficult than at younger nodes, as evidenced by comparisons both within and among frameworks. When comparing among frameworks, the framework with the less complex model tended to estimate older dates relative to the more complex model as node age increased, regardless of method (Fig. 5). This may be due to the problem of saturation. The global and genic frameworks used the HKY model of nucleotide substitution, incorporating transition/transversion rate biases, and unequal base frequencies (Hasegawa et al. 1985). Relaxed clock methods incorporated models specific to each data set, including the general time reversible model, which allows differing rates for each type of nucleotide substitution (Lanave et al. 1984). If the model used does not fully correct for multiple hits, branch length error may lead to erroneous date estimates, especially at older nodes where modeling of multiple hits is most influential (Yang 1996, Phillips 2009).

The 95% CI taken across genes for each node/method combination also widened as node age increased (Fig. 4). In addition to the problem of saturation, this may result from a systematic bias of date estimation manifested with increasing node age: regardless of method, older nodes tend to be overestimated because of the lack of bounds on their inference, whereas younger nodes are bounded by the present (Rodriquez-Trelles et al. 2002). This may also result simply from stochastic changes accumulating over time; the older the node, the more disparity in rates (Fig. S2). The trend in widening CIs as node age increases would likely be mitigated through the use of multiple calibration points, especially if scattered throughout the phylogeny (Perez-Losada et al. 2004). However, multiple points of calibration can only be used in the relaxed framework, as both the global and the genic frameworks rely on a rate for calibration instead of dates.

Dealing with Multiple Divergent Node Age Estimates

When estimating divergence dates from multiple genes using any method, one must decide whether to use a concatenated approach or to estimate dates on individual genes. If using concatenation, one can apply a single rate simultaneously over all genes (e.g., Jian et al. 2008), as we did in our PL analyses or apply mixture models in which each gene is allowed its own rate model (e.g., Egan and Crandall 2008), as implemented in our BA analyses. If genes are dealt with individually, dates can be presented as a range of estimates (e.g., Miyamoto and Fitch 1995) or as an average across genes.

The issue of simultaneous versus individual dating is an extension of the controversy in phylogenetics over whether to estimate phylogenies on a concatenated data set or on individual DNA regions followed by consensus tree methods (e.g., Gadagkar et al. 2005, Kubatko and Degnan 2007). Yang and Yoder (2003) advocated combining their two mitochondrial genes for use in relaxed clock dating, suggesting that “combining multiple genes that have different patterns of rate change will introduce internal constraints on the model, allowing extraction of information about the common parameters, i.e., the divergence times.” They did not directly compare their strategy with averaging across genes, however. In contrast, based on simulation studies, Rodriquez-Trelles et al. (2002) suggested that averaging over many genes will converge on more consistent dates. In our study, we explicitly compared results using combined and averaged approaches. All pairwise comparisons of the average age across genes versus the combined age were highly correlated and linear, but not equally so—combined PL overestimated H1/H3 relative to all other methods, including the PL average across genes (Fig. S1), perhaps due to this node being unbounded. We expected the Bayesian analysis to provide more precise estimates of dates within the relaxed framework for the combined analyses because of the use of rates specific to each individual gene. Our results supported this expectation, with BA having a lower cumulative variance across genes than PL. This agrees with the conclusion of Thorne and Kishino (2002), suggesting that divergence dates are best estimated by allowing each gene in a multigene data set to have its own rate, but with genes united under a common topology.

CONCLUSIONS

In this study, we estimated divergence dates of nodes duplicated by polyploidy in soybean and by speciation events in related genera. We compared the outcomes of six different methods of estimating divergence times, two methods each within three frameworks that differ in the types and combinations of rate heterogeneity modeled: the global framework assumed a universal rate, the genic framework allowed different rates across genes, and the relaxed framework allowed differing rates across genes and lineages. We used a mixed model nested ANOVA approach to test the impact of framework, method nested within framework, and gene on the overall variation among divergence date estimates for each node. We suggest that this statistical approach is a useful tool for researchers across many disciplines to explore the impact of modeling different types and combination of rate heterogeneity on their data.

Our results suggest that framework has a statistically significant effect on divergence dating outcomes. Our testing of the impact of multiple methods within each framework showed that, in the majority of cases, methods were not statistically different within framework and did not significantly affect the overall variation among date estimates. This suggests that the decision to model rate heterogeneity across genes and/or lineage is more important than the choice of method within that chosen model. Furthermore, our results determined that gene was the main contributor to the variation among dates, echoing previous findings that genic rates contribute heavily to variation in molecular date estimates (Smith and Eyre-Walker 2003). Ours and other studies (Welch et al. 2008) strongly suggest the need for the incorporation and modeling of factors that may effect substitution rate variation among genes through time whether selective forces or random drift in substitution rates over time and among lineages causes variation.

Even in the face of gene-to-gene rate variation, our results suggest that more precise dates can be obtained with relaxed methods (particularly those that allow gene-specific rates). Therefore, the extra effort in applying these more complex relaxed clock methods will often be worthwhile, especially when dealing with older divergences. In situations where lack of a phylogeny or known calibration points precludes the use of nonclock methods, such as in many comparative genomic studies, a global clock approach is the only option for dating divergences. Our results suggest that dates may be adequately estimated in these circumstances at younger nodes but that accuracy and precision fall off with increasing node age due to the accumulation of rate changes over time. This problem also occurs in relaxed clock methods but does not appear to be as severe. Previously estimated divergence dates are useful in calibrating rates among genes and will strengthen the estimation of divergence dates in studies lacking a phylogeny, especially at older nodes. The problem comes when trying to test whether or not the data fit a molecular clock because lineage rate heterogeneity and heterotachy cannot be assessed without a phylogeny. Fortunately, as more genomes are sequenced, the lack of phylogenetic information will be less of a problem for comparative genomic studies, allowing a wider application of relaxed clock methods (Jackson et al. 2006).

SUPPLEMENTARY MATERIAL

Supplementary material can be found at http://www.sysbio.oxfordjournals.org/.

FUNDING

This work was supported by the National Science Foundation (DBI-0321664 and DEB-0516673 to J.J.D).

We especially thank Roger Innes for the use of DNA sequence data and M.A. Saghai Maroof, Bruce Roe, Nevin Young, and Valérie Geffroy for support. We thank Sue Sherman-Broyles, Keith A. Crandall, M.A. Saghai Maroof, Milind B. Ratnaparkhe, Bernard E. Pfeil, and three anonymous reviewers for helpful comments on the manuscript. We also thank Dr. James Booth, Françoise Vermeylen, Xin Zhao, and Shamil Sadigov from the Cornell Statistical Consulting Unit for help with statistical analyses. The Computational Biology Service Unit at Cornell University provided computer support in conjunction with the Microsoft Corporation.

References

Akashi
H
Gene expression and molecular evolution
Curr. Opin. Genet. Dev.
 , 
2001
, vol. 
11
 (pg. 
660
-
666
)
Altschul
SF
Madden
TL
Schaffer
AA
Zhang
J
Zhang
Z
Miller
W
Lipman
DJ
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
Nucleic Acids Res.
 , 
1997
, vol. 
25
 (pg. 
3389
-
3402
)
Andreasen
K
Baldwin
BG
Unequal evolutionary rates between annual and perennial lineages of Checker Mallows (Sidalcea, Malvaceae): evidence from 18S-26S rDNA internal and external transcribed spacers
Mol. Biol. Evol.
 , 
2001
, vol. 
18
 (pg. 
936
-
944
)
Ayala
FJ
Molecular clock mirages
Bioessays
 , 
1999
, vol. 
21
 (pg. 
71
-
75
)
Blanc
G
Wolfe
KH
Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes
Plant Cell
 , 
2004
, vol. 
16
 (pg. 
1667
-
1678
)
Cui
LY
Wall
PK
Leebens-Mack
JH
Lindsay
BG
Soltis
DE
Doyle
JJ
Soltis
PS
Carlson
JE
Arumuganathan
K
Barakat
A
Albert
VA
Ma
H
dePamphilis
CW
Widespread genome duplications throughout the history of flowering plants
Genome Res.
 , 
2006
, vol. 
16
 (pg. 
738
-
749
)
Drummond
AJ
Ho
SYW
Phillips
MJ
Rambaut
A
Relaxed phylogenetics and dating with confidence
PLoS Biol.
 , 
2006
, vol. 
4
 pg. 
e88
 
Drummond
AJ
Rambaut
A
BEAST: Bayesian evolutionary analysis by sampling trees
BMC Evol. Biol.
 , 
2007
, vol. 
7
 pg. 
214
 
Edgar
RC
MUSCLE: multiple sequence alignment with high accuracy and high throughput
Nucleic Acids Res.
 , 
2004
, vol. 
32
 (pg. 
1792
-
1797
)
Egan
AN
Crandall
KA
Divergence and diversification in North American Psoraleeae (Fabaceae) due to climate change
BMC Biol.
 , 
2008
, vol. 
6
 pg. 
55
 
Eyre-Walker
A
Gaut
BS
Correlated rates of synonymous site evolution across plant genomes
Mol. Biol. Evol.
 , 
1997
, vol. 
14
 (pg. 
455
-
460
)
Fawcett
JA
Maere
S
Van de Peer
Y
Plants with double genomes might have had a better chance to survive the Cretaceous-Tertiary extinction event
Proc. Natl. Acad. Sci. U.S.A.
 , 
2009
, vol. 
106
 (pg. 
5737
-
5742
)
Felsenstein
J
Phylogenies from molecular sequences: inference and reliability
Annu. Rev. Genet.
 , 
1988
, vol. 
22
 (pg. 
521
-
565
)
Felsenstein
J
PHYLIP (Phylogeny Inference Package). Version 3.67. Distributed by the author
 , 
2004
Seattle (WA)
Department of Genome Sciences, University of Washington
Fitch
WM
Markowitz
E
An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution
Biochem. Genet.
 , 
1970
, vol. 
4
 (pg. 
579
-
593
)
Gadagkar
SR
Rosenberg
MS
Kumar
S
Inferring species phylogenies from multiple genes: Concatenated sequence tree versus consensus gene tree
J. Exp. Zool. (Mol. Dev. Evol.)
 , 
2005
, vol. 
304B
 (pg. 
64
-
74
)
Gilks
WR
Richardson
S
Spiegelhalter
DJ
Markov Chain Monte Carlo in practice
 , 
1996
London
Chapman & Hall/CRC
Goldblatt
P
Polhill
RM
Raven
PH
Cytology and the phylogeny of Leguminosae
Advances in legume systematics Part 2
 , 
1981
Kew
Royal Botanic Gardens
(pg. 
427
-
463
)
Goldman
N
Yang
Z
A codon-based model of nucleotide substitution for protein-coding DNA sequences
Mol. Biol. Evol.
 , 
1994
, vol. 
11
 (pg. 
725
-
736
)
Graur
D
Li
WH
Fundamentals of molecular evolution
 , 
2000
2nd ed
Sunderland (MA)
Sinauer Associates
Guindon
S
Gascuel
O
A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood
Syst. Biol.
 , 
2003
, vol. 
52
 (pg. 
696
-
704
)
Guindon
S
Lethiec
F
Duroux
P
Gascuel
O
PHYML Online— a web server for fast maximum likelihood-based phylogenetic inference
Nucleic Acids Res.
 , 
2005
, vol. 
33
 (pg. 
W557
-
W559
)
Hamilton
MB
Braverman
JM
Soria-Hernandez
DF
Patterns and relative rates of nucleotide and insertion/deletion evolution at six chloroplast intergenic regions in new world species of the Lecythidaceae
Mol. Biol. Evol.
 , 
2003
, vol. 
20
 (pg. 
1710
-
1721
)
Hasegawa
M
Kishino
H
Yano
T
Dating of the human-ape splitting by a molecular clock of mitochondrial DNA.J. Mol
Evol.
 , 
1985
, vol. 
22
 (pg. 
160
-
174
)
Hendy
MD
Penny
D
Branch and bound algorithms to determine minimal evolutionary trees
Math. Biosci
 , 
1982
, vol. 
59
 (pg. 
277
-
290
)
Hyten
DL
Choi
IY
Song
QJ
Shoemaker
RC
Nelson
RL
Costa
JM
Specht
JE
Cregan
PB
Highly variable patterns of linkage disequilibrium in multiple soybean populations
Genetics
 , 
2007
, vol. 
175
 (pg. 
1937
-
1944
)
Innes
RW
Ameline-Torregrosa
C
Ashfield
T
Cannon
E
Cannon
SB
Chacko
B
Chen
NWG
Couloux
A
Dalwani
A
Denny
R
Deshpande
S
Doyle
JJ
Egan
AN
Geffroy
V
Glover
N
Hans
CS
Howell
S
Ilut
D
Jackson
S
Lai
H
Mammodov
J
Martin del Campo
S
Metcalf
M
Nguyen
A
O-Bleness
M
Pfeil
B
Podicheti
R
Ratnaparkhe
MB
Roe
BA
Saghai Maroof
MA
Samain
S
Sanders
I
Ségurens
B
Sévignac
M
Sherman-Broyles
S
Thareau
V
Tucker
DM
Walling
J
Wawrzynski
A
Yi
J
Young
ND
Differential accumulation of retroelements and diversification of NB-LRR disease resistance genes in duplicated regions following polyploidy in the ancestor of soybean
Plant Physiol
 , 
2008
, vol. 
148
 (pg. 
1740
-
1759
)
Jackson
SA
Rounsley
S
Purugganan
MD
Comparative sequencing of plant genomes: choices to make
Plant Cell
 , 
2006
, vol. 
18
 (pg. 
1100
-
1104
)
Jeffreys
H
An invariant form for the prior probability in estimation problems
Proc. Biol. Sci.
 , 
1946
, vol. 
186
 (pg. 
453
-
461
)
Jian
SG
Soltis
PS
Gitzendanner
MA
Moore
MJ
Li
R
Hendry
TA
Qiu
YL
Dhingra
A
Bell
CD
Soltis
DE
Resolving an ancient, rapid radiation in Saxifragales
Syst. Biol.
 , 
2008
, vol. 
57
 (pg. 
38
-
57
)
Jukes
TH
Cantor
CR
Munro
MN
Evolution of protein molecules
Mammalian protein metabolism
 , 
1969
New York
Academic Press
(pg. 
21
-
132
)
Kay
KM
Whittall
JB
Hodges
SA
A survey of nuclear ribosomal internal transcribed spacer substitution rates across angiosperms: an approximate molecular clock with life history effects
BMC Evol. Biol.
 , 
2006
, vol. 
6
 pg. 
36
 
Kimura
M
Estimation of evolutionary distances between homologous nucleotide sequences
Proc. Natl. Acad. Sci. USA.
 , 
1980
, vol. 
78
 (pg. 
454
-
458
)
Kubatko
LS
Degnan
JH
Inconsistency of phylogenetic estimates from concatenated data under coalescence
Syst. Biol.
 , 
2007
, vol. 
56
 (pg. 
17
-
24
)
Lanave
C
Preparata
G
Saccone
C
Serio
G
A new method for calculating evolutionary substitution rates
J. Mol. Evol.
 , 
1984
, vol. 
20
 (pg. 
86
-
93
)
Lavin
M
Herendeen
PS
Wojciechowski
MF
Evolutionary rates analysis of Leguminosae implicates a rapid diversification of lineages during the Tertiary
Syst. Biol.
 , 
2005
, vol. 
54
 (pg. 
575
-
594
)
Lee
JM
Grant
D
Vallejos
CE
Shoemaker
RC
Genome organization in dicots. II. Arabidopsis as a `bridging species' to resolve genome evolution events among legumes
Theor. Appl. Genet.
 , 
2001
, vol. 
103
 (pg. 
765
-
773
)
Lopez
P
Casane
D
Philippe
H
Heterotachy, an important process of protein evolution
Mol. Biol. Evol.
 , 
2002
, vol. 
19
 (pg. 
1
-
7
)
Lynch
M
Conery
JS
The evolutionary fate and consequences of duplicate genes
Science
 , 
2000
, vol. 
290
 (pg. 
1151
-
1155
)
Matassi
G
Sharp
PM
Gautier
C
Chromosomal location effects on gene sequence evolution in mammals
Curr. Biol.
 , 
1999
, vol. 
9
 (pg. 
786
-
791
)
Miyamoto
MM
Fitch
WM
Testing species phylogenies and phylogenetic methods with congruence
Syst. Biol.
 , 
1995
, vol. 
44
 (pg. 
64
-
76
)
Perez-Losada
M
Hoeg
JT
Crandall
KA
Unraveling the evolutionary radiation of the Thoracican barnacles using molecular and morphological evidence: a comparison of several divergence time estimation approaches
Syst. Biol.
 , 
2004
, vol. 
53
 (pg. 
244
-
264
)
Pfeil
BE
Schlueter
JA
Shoemaker
RC
Doyle
JJ
Placing paleopolyploidy in relation to taxon divergence: a phylogenetic analysis in legumes using 39 gene families
Syst. Biol.
 , 
2005
, vol. 
54
 (pg. 
441
-
454
)
Philippe
H
Lopez
P
On the conservation of protein sequences in evolution
Trends Biochem. Sci.
 , 
2001
, vol. 
26
 (pg. 
414
-
416
)
Phillips
MJ
Branch-length estimation bias misleads molecular dating for a vertebrate mitochondrial phylogeny
Gene
 , 
2009
, vol. 
441
 (pg. 
132
-
140
)
Posada
D
Crandall
KA
Modeltest: testing the model of DNA substitution
Bioinformatics
 , 
1998
, vol. 
14
 (pg. 
817
-
818
)
Rambaut
A
Se-Al: sequence alignment editor. Version 2.0a11
2002
 
Rambaut
A
Drummond
A
2007
 
Tracer v1.4. Available from: http://beast.bio.ed.ac.uk/Tracer
Rodriquez-Trelles
F
Tarrio
R
Ayala
FJ
A methodological bias toward overestimation of molecular evolutionary time scales
Proc. Natl. Acad. Sci. USA.
 , 
2002
, vol. 
99
 (pg. 
8112
-
8115
)
Roger
AJ
Hug
LA
The origin and diversification of eukaryotes: problems with molecular phylogenetics and molecular clock estimation. Philos. Trans. R. Soc. Lond
B Biol. Sci.
 , 
2006
, vol. 
361
 (pg. 
1039
-
1054
)
Sanderson
MJ
Estimating absolute rates of molecular evolution and divergence times: a penalized likelihood approach
Mol. Biol. Evol.
 , 
2002
, vol. 
19
 (pg. 
101
-
109
)
Sanderson
MJ
r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock
Bioinformatics
 , 
2003
, vol. 
19
 (pg. 
301
-
302
)
Schlueter
JA
Dixon
P
Granger
C
Grant
D
Clark
L
Doyle
JJ
Shoemaker
RC
Mining EST databases to resolve evolutionary events in major crop species
Genome
 , 
2004
, vol. 
47
 (pg. 
868
-
876
)
Schlueter
JA
Lin
JY
Schlueter
SD
Vasylenko-Sanders
IF
Deshpande
S
Yi
J
O'Bleness
M
Roe
BA
Nelson
RT
Scheffler
BE
Jackson
SA
Shoemaker
RC
Gene duplication and paleopolyploidy in soybean and the implications for whole genome sequencing
BMC Genomics
 , 
2007
, vol. 
8
 pg. 
330
 
Schlueter
JA
Scheffler
BE
Schlueter
SD
Shoemaker
RC
Sequence conservation of homeologous bacterial artificial chromosomes and transcription of homeologous genes in soybean (Glycine max L. Merr.)
Genetics
 , 
2006
, vol. 
174
 (pg. 
1017
-
1028
)
Schlueter
JA
Vasylenko-Sanders
IF
Deshpande
S
Yi
J
Siegfried
M
Roe
BA
Schlueter
SD
Scheffler
BE
Shoemaker
RC
The FADs gene family of Soybean: insights into the structural and functional divergence of a paleopolyploid genome
Crop Sci.
 , 
2007
, vol. 
47
 (pg. 
S14
-
S26
)
Sharp
PM
Li
WH
The rate of synonymous substitution in enterobacterial genes is inversely related to codon usage bias
Mol. Biol. Evol.
 , 
1987
, vol. 
4
 (pg. 
222
-
230
)
Shoemaker
RC
Polzin
K
Labate
J
Specht
J
Brummer
EC
Olson
T
Young
N
Concibido
V
Wilcox
J
Tamulonis
JP
Kochert
G
Boerma
HR
Genome duplication in soybean (Glycine subgenus Soja)
Genetics
 , 
1996
, vol. 
144
 (pg. 
329
-
338
)
Shoemaker
RC
Schlueter
JA
Doyle
JJ
Paleopolyploidy and gene duplication in soybean and other legumes
Curr. Opin. Plant Biol.
 , 
2006
, vol. 
9
 (pg. 
104
-
109
)
Small
RL
Wendel
JF
Differential evolutionary dynamics of duplicated Paralogous Adh loci in Allotetraploid Cotton (Gossypium)
Mol. Biol. Evol.
 , 
2002
, vol. 
19
 (pg. 
597
-
607
)
Smith
NGC
Eyre-Walker
A
Partitioning the variation in mammalian substitution rates
Mol. Biol. Evol.
 , 
2003
, vol. 
20
 (pg. 
10
-
17
)
Smith
SA
Donoghue
MJ
Rates of molecular evolution are linked to life history in flowering plants
Science
 , 
2008
, vol. 
322
 (pg. 
86
-
89
)
Stefanovic
S
Pfeil
BE
Palmer
JD
Doyle
JJ
Relationships among Phaseoloid legumes based on sequences from eight chloroplast regions
Syst. Bot
 , 
2009
, vol. 
34
 (pg. 
115
-
128
)
Swofford
DL
PAUP*. Phylogenetic analysis using parsimony (* and other methods). Version 4.0b10
 , 
2002
Sunderland (MA)
Sinauer Associates
Tavare
S
Some probabilistic and statistical problems in the analysis of DNA sequences
Lectures on Mathematics in the Life Science
 , 
1986
, vol. 
17
 (pg. 
57
-
86
)
Thorne
JL
Kishino
H
Divergence time and evolutionary rate estimation with multilocus data
Syst. Biol.
 , 
2002
, vol. 
51
 (pg. 
689
-
702
)
Thorne
JL
Kishino
H
Nielsen
R
Estimation of divergence times from molecular sequence data
Statistical methods in molecular evolution. Springer
 , 
2005
(pg. 
235
-
256
)
Thorne
JL
Kishino
H
Painter
IS
Estimating the rate of evolution of the rate of molecular evolution
Mol. Biol. Evol.
 , 
1998
, vol. 
15
 (pg. 
1647
-
1657
)
Ticher
A
Graur
D
Nucleic acid composition, codon usage, and the rate of synonymous substitution in protetin-coding genes
J. Mol. Evol.
 , 
1989
, vol. 
28
 (pg. 
286
-
298
)
Van
K
Kim
DH
Cai
CM
Kim
MY
Shin
JH
Graham
MA
Shoemaker
RC
Choi
B-S
Yang
T-J
Lee
S-H
Sequence level analysis of recently duplicated regions in Soybean [ Glycine max (L.) Merr.] Genome
DNA Res.
 , 
2008
, vol. 
15
 (pg. 
93
-
102
)
Welch
JJ
Bininda-Emonds
ORP
Bromham
L
Correlates of substitution rate variation in mammalian protein-coding sequences
BMC Evol. Biol.
 , 
2008
, vol. 
8
 pg. 
53
 
Yang
Z
Maximum likelihood models for combined analyses of multiple sequence data
J. Mol. Evol.
 , 
1996
, vol. 
42
 (pg. 
587
-
596
)
Yang
Z
PAML: a program package for phylogenetic analysis by maximum likelihood
Bioinformatics
 , 
1997
, vol. 
13
 (pg. 
555
-
556
)
Yang
Z
Nielsen
R
Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models
Mol. Biol. Evol.
 , 
2000
, vol. 
17
 (pg. 
32
-
43
)
Yang
Z
Yoder
AD
Comparison of likelihood and Bayesian methods for estimating divergence times using multiple gene loci and calibration points, with application to a radiation of cute-looking mouse lemur species
Syst. Biol.
 , 
2003
, vol. 
52
 (pg. 
705
-
716
)
Yoder
AD
Yang
Z
Estimation of primate speciation dates using local molecular clocks
Mol. Biol. Evol.
 , 
2000
, vol. 
17
 (pg. 
1081
-
1090
)
Zhang
L
Vision
TJ
Gaut
BS
Patterns of nucleotide substitution among simultaneously duplicated gene pairs in Arabidopsis thaliana
Mol. Biol. Evol.
 , 
2002
, vol. 
19
 (pg. 
1464
-
1473
)

Author notes

Associate Editor: Vincent Savolainen

Supplementary data