In recent years, a number of phylogenetic methods have been developed for estimating molecular rates and divergence dates under models that relax the molecular clock constraint by allowing rate change throughout the tree. These methods are being used with increasing frequency, but there have been few studies into their accuracy. We tested the accuracy of several relaxed-clock methods (penalized likelihood and Bayesian inference using various models of rate change) using nucleotide sequences simulated on a nine-taxon tree. When the sequences evolved with a constant rate, the methods were able to infer rates accurately, but estimates were more precise when a molecular clock was assumed. When the sequences evolved under a model of autocorrelated rate change, rates were accurately estimated using penalized likelihood and by Bayesian inference using lognormal and exponential models of rate change, while other models did not perform as well. When the sequences evolved under a model of uncorrelated rate change, only Bayesian inference using an exponential rate model performed well. Collectively, the results provide a strong recommendation for using the exponential model of rate change if a conservative approach to divergence time estimation is required. A case study is presented in which we use a simulation-based approach to examine the hypothesis of elevated rates in the Cambrian period, and it is found that these high rate estimates might be an artifact of the rate estimation method. If this bias is present, then the ages of metazoan divergences would be systematically underestimated. The results of this study have implications for studies of molecular rates and divergence dates.
The rate of molecular evolution can vary considerably among different organisms, challenging the concept of the “molecular clock.” If the assumption of rate constancy does not hold across a tree, then phylogenetic inference and divergence date estimation can be seriously misled unless this rate heterogeneity among lineages is adequately taken into account (Yoder and Yang 2000). In recent years, a number of methods have been developed to relax the clock assumption to varying degrees (e.g., Takezaki, Rzhetsky, and Nei 1995; Cooper and Penny 1997; Sanderson 1997, 2002; Rambaut and Bromham 1998; Thorne, Kishino, and Painter 1998; Cutler 2000; Huelsenbeck, Larget, and Swofford 2000; Aris-Brosou and Yang 2002; Thorne and Kishino 2002). These methods have been applied to a range of evolutionary questions, and their use is becoming more widespread with increasing awareness of the effects of ignoring violation of the molecular clock assumption.
One method of addressing the problem of non-clocklike data is to dispose of the molecular clock assumption altogether and to assign a unique rate parameter to each branch in the tree, thereby allowing rates to vary in an unrestricted manner. This method has been implemented for the purpose of inferring phylogenetic trees (e.g., Yang 1997; Huelsenbeck and Ronquist 2001).
Unfortunately, such an approach cannot be taken when estimating absolute rates or divergence dates because rate and time are inextricably linked, unless the sequences are noncontemporaneous and have known ages (Rambaut 2000; Drummond et al. 2002). Instead, we can implement a method that allows the rate to vary across the tree in a relatively constrained manner. For example, some methods employ local molecular clock models, which permit a different rate in each region of the tree. These regions can be specified by the user (Yang 1997; Yoder and Yang 2000), assigned objectively by the algorithm (Foster 2004), or determined on the basis of phylogenetic relationships (Cooper and Penny 1997; Bromham et al. 1998). However, implementing local clock models for large trees can be difficult because of the number of possible rate configurations (Sanderson 1998).
An alternative to using local clock models is to allow the molecular rate to vary throughout the tree, but with closely related species sharing similar rates. This follows the idea that molecular rates are heritable because they are tied to physiological, biochemical, and life-history characteristics of the species in question. Adopting this philosophy, Sanderson (1997, 2002) proposed a nonparametric rate smoothing method and a semiparametric penalized likelihood approach, both of which operate by minimizing rate changes throughout the tree.
In a Bayesian framework, a rate can be assigned to each branch of a tree and a prior can be imposed that allows the rate to “evolve” throughout the tree in an autocorrelated manner. Thorne, Kishino, and Painter (1998) and Aris-Brosou and Yang (2002) implemented parametric models in which the rate in a given branch (the descendant rate) is a priori related to the rate in the branch immediately ancestral to it (the parent rate). The model they adopt for the prior is one in which the descendant rate is drawn from a chosen distribution, such as a lognormal, exponential, or gamma distribution, whose mean is the parent rate (Thorne, Kishino, and Painter 1998; Kishino, Thorne, and Bruno 2001; Aris-Brosou and Yang 2002). Aris-Brosou and Yang (2002) also employed a model in which the rate was assumed to evolve according to the Ornstein-Uhlenbeck process (OUP), a stationary Gaussian process that includes a rate-slowing friction term. The variance of each of these distributions corresponds to the degree of departure from the molecular clock. The hyperparameters (the parameters that govern the prior distributions) can be optimized by comparing the posterior likelihoods of each rate prior given the sequence data.
It is important to evaluate the performance of the different relaxed-clock methods that are available. In spite of their increasingly widespread application, there have been few studies into the accuracy of relaxed-clock models (for exceptions, see Kishino, Thorne, and Bruno 2001; Yang and Yoder 2003; Perez-Losada, Hoeg, and Crandall 2004). In this paper, we assess the accuracy of rate estimates produced by rate smoothing (penalized likelihood) and by Bayesian inference using various models of autocorrelated rate change. We tested the methods using sequence data generated by Monte Carlo simulation on a known tree, for which the true rate of substitution in each branch was known. The simulation study is followed by a critical simulation-based investigation of molecular rates in the Cambrian period.
Materials and Methods
The accuracy of methods for rate estimation can only be tested when the true phylogeny (including the ages of internal nodes) relating the sequences is known, which is rarely the case for real data. Instead, sequence evolution can be simulated on a known tree, and substitution rates can be recorded during the simulation process for comparison with rate estimates subsequently made from the data. We generated alignments of nine nucleotide sequences, each 1,000 nt in length, on the rooted tree in figure 1. The out-group sequence is only used to root the tree. Simulations were performed using the Java program RateEvolver v1.0 (available upon request from S.Y.W.H.), which can generate sequences under models of nucleotide substitution ranging from a Jukes-Cantor model (Jukes and Cantor 1969) to the HKY85 model (Hasegawa, Kishino, and Yano 1985). The program can also implement various processes of autocorrelated rate change, including rates following lognormal, exponential, and gamma distributions, as well as independently sampled rates, where rates are drawn from a uniform distribution (fig. 2). Under these rate change models, with the exception of the independently sampled rates, each new rate is sampled from the chosen distribution with the parent rate as the mean. The user can also set the probability of a new rate being drawn at the beginning of each branch, which is set to 1.0 for the models studied in this paper.
Whenever it was allowed by the computer program, rate estimates were calibrated by constraining two internal nodes to the range 19–21 time units, which is an uncertainty range of 5% on either side of their actual age of 20 time units (fig. 1). In analyses using PhyBayes, calibration uncertainty could not be specified, so point values of 20 time units were used instead. We chose to calibrate rate estimates using internal nodes rather than the root node because in most applications of the relaxed clock, the ages can be ascribed to internal nodes with greater confidence than to the root node.
Case 1—Constant Rates Among Lineages
Ten sequence alignments were generated on the tree in figure 1 under a molecular clock model, in which the rate was held constant at 0.01 average substitutions per site per time unit throughout the tree. These data sets were analyzed using several methods: maximum likelihood, implemented by the program baseml (Yang 1997); penalized likelihood, implemented by r8s (Sanderson 2003); and Bayesian inference under various relaxed-clock models, implemented by multidivtime (Thorne, Kishino, and Painter 1998) and PhyBayes (Aris-Brosou and Yang 2002, 2003). In the Bayesian analyses, posterior distributions were approximated by sampling values at every 100 steps over 1,000,000 Markov chain Monte Carlo (MCMC) steps, after discarding a burn-in of 100,000 steps. Preliminary investigations showed that this burn-in was sufficient for the chain to reach the stationary distribution, from which subsequent samples were drawn. This was confirmed by a posteriori inspection of the MCMC samples from each analysis using the program Tracer (Rambaut and Drummond 2004), which also showed that the set of samples for each of the estimated parameters had an effective sample size of over 100. The effective sample size is influenced not only by the number of samples that are drawn from the MCMC but also by the degree of autocorrelation among the samples.
In all analyses using penalized likelihood, the (fixed) input tree was inferred using maximum likelihood, as implemented by the program baseml (Yang 1997), without the assumption of a molecular clock. In the process of tree inference, rate homogeneity among sites was assumed, matching the simulation conditions, but the transition/transversion ratio was estimated. In the estimation of rates using penalized likelihood, the smoothing parameter was optimized for each data set using the cross-validation procedure available in the program by comparing sums of squared errors for a wide range of smoothing values of 10x (x = 0.0, 0.2, …, 4.0). Each data set was also analyzed using a smoothing parameter of 1.0, but the resulting rate estimates were extremely poor and are not presented here.
In the first set of analyses, a molecular clock constraint was enforced in each of three methods used to estimate rates from the data (maximum likelihood, the Langley-Fitch method used in r8s, and Bayesian inference). The data were then analyzed a second time, without a molecular clock constraint, using penalized likelihood and relaxed-clock Bayesian inference. In the Bayesian analyses, four different inference models of rate change were tested: lognormally distributed (LND) rates (multidivtime and PhyBayes), gamma-distributed (GD) rates (PhyBayes), exponentially distributed (ED) rates (PhyBayes), and OUP (PhyBayes). The hyperparameters of the lognormal distribution were assumed to be 0.01 (mean) and 0.025 (variance parameter). The drift parameter for the OUP prior was optimized by comparing the posterior likelihoods when it was set at each of four different values (10−2, 10−1, 100, and 101).
For all nonclock analyses performed with PhyBayes, the default values were used for the hyperparameters of the birth-death process prior on divergence times: the birth (speciation) rate, death (extinction) rate, and sampling fraction were drawn from uniform distributions over [0,15], [0,5], and [0,0.001], respectively. The ages of internal nodes are specified relative to the age of the root node, which is arbitrarily set at 1.0. This prior is described in detail by Aris-Brosou and Yang (2002). In all nonclock analyses using multidivtime, the default value of 1.0 was used for the hyperparameter minab, which governs the prior distribution of divergence times. The distribution of internal node ages is conditional on the age of the root node, which is in turn described by its own gamma distribution prior. The generalized Dirichlet prior on divergence times used in multidivtime is described in detail by Kishino, Thorne, and Bruno (2001). In contrast to the two Bayesian relaxed-clock programs, penalized likelihood does not require the specification of any priors on divergence times or rates.
Case 2—Lognormally Autocorrelated Rates Among Lineages
Ten sequence alignments were generated on the tree in figure 1 under a model in which each descendant rate was autocorrelated with its parent rate (see fig. 2a). The log of each descendant rate, ri, was drawn from a normal distribution with a mean equal to the log of the parent rate,
Rates were estimated from the simulated sequences using the same methods as those described in Case 1, but without a molecular clock constraint.
Case 3—Independent Rates Among Lineages
Ten sequence alignments were generated on the tree in figure 1 under a model in which a descendant rate was independent of its parent rate (see fig. 2b). Each rate was drawn from a uniform distribution, ri ∼ U(0.005,0.015).
Rates were estimated from the simulated sequences using the same methods as those described in Case 1, but without a molecular clock constraint.
Three of the four methods performed well when the sequence data evolved according to a molecular clock constraint and were analyzed under a clock assumption (fig. 3a). The Langley-Fitch method implemented in r8s overestimated the rate by about 5% on average, which suggests that there might be an intrinsic bias (in either the method itself or its current implementation). In practice, it is probably better to use penalized likelihood with a high value for the smoothing parameter to analyze clocklike data.
Rates estimated by the other methods were close to the true rate of 0.01 average substitutions per site per time unit, with the observed variation probably resulting from the stochastic nature of Poisson-distributed nucleotide substitutions. There were small discrepancies in the spread of the estimates made using different programs, which are likely to be due to differences in priors, algorithms, and software architecture and because of stochastic variation in the MCMC realizations. Most importantly, multidivtime and PhyBayes employ different priors for divergence times, while r8s does not incorporate prior information on divergence times (unless the user specifies constraints on node ages). The multidivtime program uses the Dirichlet distribution for internal nodes and a gamma distribution for the age of the root, whereas PhyBayes uses a generalized birth-death process with species sampling (Yang and Rannala 1997). In view of these differences, it is not surprising that the programs produce different rate estimates, even in the most basic comparisons.
When the clock assumption was relaxed, penalized likelihood tended to underestimate rates. Penalized likelihood will tend to underestimate the smoothing parameter when there are random fluctuations in the realized substitution rate, which would be more pronounced when sequences are short and substitution rates are low. An underestimation bias was also seen, to a much lesser extent, in the rate estimates produced under an LND model of rate change (multidivtime and PhyBayes) (fig. 3b). Under other models of rate change (GD, ED, and OUP), the rate of substitution was often overestimated (fig. 3c). In all cases, the spread of rate estimates was much greater than under a molecular clock assumption. This suggests that if sequence data have evolved under clocklike (or approximately clocklike) conditions, it may be better to assume a molecular clock during phylogenetic analysis of the sequences to improve the precision, and perhaps accuracy, of substitution rate estimates (and, consequently, of inferred divergence dates). The particular conditions leading to a balance between obtaining better models of the data and the loss of precision will vary among different data sets, and so it is important to make an informed choice in this respect.
There was a large discrepancy between the levels of estimated uncertainty associated with rate estimates made by multidivtime and PhyBayes using a lognormal model of rate change. The narrower spread of rate estimates made by multidivtime is probably because the program requires the user to specify a prior for the tree height. Another difference between the two programs, which could affect the levels of estimated uncertainty, is that PhyBayes calculates the full likelihood, whereas multidivtime uses a multivariate normal approximation based on maximum likelihood estimates of branch lengths in the tree.
The average (across data sets) of the sampled marginal probabilities of the data given the model was the highest when the OUP prior was used (table 1), which might simply reflect the fact that the use of this prior involves two hyperparameters. However, the OUP model still yields the highest marginal posterior probability even when each method is penalized 1 log unit for each hyperparameter. Nevertheless, there does not appear to be a clear relationship between the marginal posterior probability and the accuracy of the rate estimates. Unfortunately, a formal comparison using Bayes factors is difficult to perform without reversible-jump MCMC.
|Model of Rate Change||Case 1||Case 2||Case 3|
|Model of Rate Change||Case 1||Case 2||Case 3|
Marginal posterior probability of the data given the model. Numbers in bold denote absolute values, with other numbers giving values relative to these.
Rates estimated from data that had evolved under a process of LND rate change were recovered with reasonable accuracy when an LND model of rate change was assumed (fig. 4). This is not surprising because the model used for generating the data matched that used to analyze the data. Use of the exponential model also yielded rate estimates that were quite accurate. However, rate estimates produced by penalized likelihood were inaccurate when the actual rate was high, due to the effect of the smoothing parameter. Even if rate estimates were calibrated by placing constraints on the age of the root node, rather than on other internal nodes, it is unlikely that the rate estimates would show considerable change because they are already close to the average simulation rate.
Rates were frequently overestimated under the GD model of rate change, with some rates being overestimated by a factor of three, while the performance of the OUP model was slightly better. The rate estimates produced under an ED model of rate change were good, even though the standard deviation is equal to the mean (by definition) and cannot be set to the value used in generating the data. This result is unusual because the true rates are homoscedastic because the variance of the rates depends on the temporal branch lengths and the variance parameter, both of which were held at constant values for all simulations. An additional point of note is that the estimated uncertainty under the ED rate model was smaller than that under either the GD or OUP models but comparable to that under the LND model.
In all the analyses performed using the program PhyBayes, rates in internal branches tended to be overestimated and rates in terminal branches tended to be underestimated. Estimates of terminal rates are expected to be more accurate because they are constrained by the calibration points at the internal nodes. Conversely, the age of the root node and rates in internal branches are relatively unconstrained, resulting in rate estimates that are less accurate and less precise. Estimates of rates deep in the tree are probably influenced by the birth-death process prior on the divergence times; a small sampling fraction will lead to short internal branches, which inflates the estimates of rates in these branches (Yang and Rannala 1997). Aris-Brosou and Yang (2002) found that increasing the upper boundary on the prior sampling interval for the sampling fraction to 0.5 caused divergence date estimates to increase by 36%, with a doubling of the associated credibility interval. The large difference between the prior range of the birth rate and the prior range of the extinction rate is also liable to produce trees with short internal branches.
When sequences evolved under a model where the rate in each branch was independently drawn from a uniform distribution U(0.005,0.015), few methods performed well (fig. 5). In particular, penalized likelihood produced rate estimates of around 0.01 regardless of the true rate in the branch, such that it overestimated lower rates and underestimated higher rates. Penalized likelihood was not expected to accurately recover the true rates because there was a considerable violation of its fundamental assumption of minimal rate changes between branches. Use of the GD model also yielded poor rate estimates, with the majority of rates being considerably overestimated, sometimes by a factor of four. Furthermore, many rates were substantially overestimated under the OUP model of rate change, despite the hyperparameters being optimized by comparison of marginal posterior probabilities. This may be a consequence of the constraints on the friction term, a possibility requiring further investigation.
In this case, as with the previous two cases, the OUP model of rate change yielded the highest marginal posterior probabilities of the data given the model (table 1), even when penalties were imposed for the number of hyperparameters involved in each method.
The rate estimates made under an ED model of rate change appeared to be the most accurate of all the methods that were tested. The overall performance of the ED model, even in the analysis of data generated under independent rates, suggests that it should be used for estimating rates and divergence dates from sequences that show considerable departure from clocklike evolution. An additional benefit of using the ED model of rate change is that the variance of the prior distribution does not need to be specified, thus minimizing the number of hyperparameters in the analysis. If a conservative approach to divergence time estimation is desired, then the ED model seems to be a good choice.
Case Study—Elevated Rates in the Cambrian
Studies of rates in branches of the metazoan tree have revealed that molecular rates may have been elevated during the Cambrian period, which would partly or wholly explain the discrepancy between molecular and paleontologic estimates of metazoan divergence dates (Bromham and Hendy 2000; Aris-Brosou and Yang 2002, 2003). It has been suggested that rates of molecular change in the Cambrian may have been at least twice as high as in the periods preceding and following it. However, the consistent overestimation of rates seen in the cases above suggests that the high rate estimates may be a consequence of the method used. In order to examine this possibility, we reanalyzed the 18S metazoan data set used by Aris-Brosou and Yang (2002, 2003) using a simulation-based approach.
Ten sequence alignments were generated using the program Seq-Gen 1.3.1 (Rambaut and Grassly 1997), using model parameters estimated from the 18S alignment: a transition/transversion ratio of 1.73, gamma shape parameter of 0.373, and empirically derived base frequencies (A: 0.264, C: 0.216, G: 0.275, T: 0.245). Ungapped alignments of 40 sequences, each with a length of 1,710 nt, were generated on the metazoan tree used by Aris-Brosou and Yang (2003), using the relative divergence times that they estimated under an ED model of rate change.
The original data set, along with the 10 simulated data sets, was analyzed using PhyBayes with an ED model of rate change, using the same substitution model parameter values as those used to generate the data. The resulting rate estimates followed a pattern similar to that seen in the study by Aris-Brosou and Yang (2003) (fig. 6a), with elevated rates in certain branches, despite the sequence data having been generated under a molecular clock constraint (fig. 6b). Aris-Brosou and Yang (2003) found that the average relative rate in the period from 640 to 420 MYA was 1.37 per site per MYA, compared with 0.62 in the period from 420 MYA to the present. The values obtained from data simulated under a molecular clock, for these time intervals, were 1.24 and 0.87, respectively. This implies that the evidence for elevated rates in the Cambrian was at least partly, if not entirely, an artifact of the estimation method. Rate estimates from nodes more recent than the Cambrian also fluctuated considerably (average standard deviation of 0.62 for the 10 simulated data sets), while rates in terminal branches were estimated with much better precision (average standard deviation of 0.18).
The ranges for the values of hyperparameters governing the birth-death process prior were varied to investigate their effects on rate estimates. Aris-Brosou and Yang (2002) only tested a small number of different prior ranges for the sampling fraction and failed to test for the effect of varying the birth and death rates of the birth-death prior on divergence times. High rates in internal branches were only produced for three of the parameter value sets tested, with the rates being fairly uniform when other sets were tested (figs. 6c and 7). However, the parameter values that produced the most accurate rate estimates were not the ones yielding the highest average (over each group of data sets) marginal probabilities of the data given the model (results not shown).
As seen in the case study, as well as in the simulation studies throughout this paper, popular methods for inferring rates from sequence data tend to overestimate rates in the internal branches close to the root of the tree. This has wide-ranging implications for studies of divergence dates: if the rate is overestimated in internal branches, then the ages of nodes older than the calibration points will be systematically underestimated. In view of this, previously proposed timescales for evolutionary events may need to be reassessed.
Rates can be difficult to recover accurately, even when the tree topology and calibration points are accurately known, which is rarely the case. In practice, the inference of rates is even more difficult than is demonstrated in the present study, being further confounded by uncertainties in calibration points, by tree topology, and by asymmetric tree shapes. The distribution of node heights in the true tree can also affect the accuracy of rate estimates, particularly if they digress markedly from those generated under the prior model.
Other factors that are known to affect phylogenetic analysis, such as rate heterogeneity among sites, compositional heterogeneity, invariant sites, and covariation of mutations at different sites, can also adversely affect the process of rate estimation. It is difficult to predict the impact of these factors because they are generally not well understood even in much simpler cases.
The consistent overestimation of rates in internal branches has important consequences for the estimation of divergence times. Its impact is likely to be dependent on whether rate and date estimates are calibrated using the root or using other internal nodes. In the case of the former, the ages of internal nodes would be overestimated because the overestimation of rates in branches between internal nodes and the root leads to a reduction in the temporal length of these branches. If nodes other than the root are used for calibration, however, then the age of the root might be underestimated. This is particularly relevant for studies of deep divergences, such as the evolution of metazoan phyla.
It is relatively straightforward to test whether data conform to a molecular clock (but see Bromham et al. 2000), even if the data set contains noncontemporaneous sequences (Rambaut 2000). If the clock is rejected, however, there is currently no method available to determine whether rates are sufficiently autocorrelated to warrant the application of relaxed-clock methods. It is unclear whether the currently available models can satisfactorily deal with such data sets by taking high values for the variance parameter.
In the simulation studies described here, the choice of rate prior has a considerable effect on posterior estimates of substitution rates. The effect of the prior was evident in spite of the relatively high levels of variation in the data, suggesting that the prior has a strong influence on the analysis. The present study, however, is based on the analysis of alignments comprising only a small number of relatively short sequences. Further investigation is required to assess whether or not the amount of information about rate variation contained in real sequence data is sufficient to overcome any biases caused by prior rate models.
Our results suggest that careful selection of rate priors is important, especially when the information content of the data set being analyzed is low. However, our results also question the reliability of simplistic (and often inappropriate) model comparison methods, such as using the Akaike Information Criterion, for choosing the model that will most accurately infer molecular rates and divergence dates. It would be important to develop valid formal procedures to compare the models, perhaps incorporating the Bayesian model comparison criteria described by Kass and Raftery (1995) and Spiegelhalter et al. (2002).
One of the major drawbacks of current relaxed-clock methods is their inability to simultaneously infer topology and implement relaxed-clock models. The development of such a method would offer a preferable alternative to the undesirable practice of having to specify a tree in order to use more sophisticated models of rate variation. Moreover, being able to infer trees in a relaxed-clock framework would represent a significant advance in methods for phylogenetic analysis.
Relaxed-clock models present a biologically plausible and computationally tractable approach to relaxing the assumption of rate constancy among lineages. They are able to perform well under conditions of limited rate variation, but models of rate change must be carefully chosen when the data have evolved under more complex processes of rate change.
The authors thank R. A. Fortey for reading a draft of the manuscript. S.Y.W.H. was supported by a Commonwealth (Oxford) Scholarship from the Commonwealth Scholarship Commission and a Domus Research Studentship and Edward Penley Abraham Cephalosporin Scholarship from Linacre College, Oxford. A.J.D. was supported by the Wellcome Trust. A.C. was supported by the Biotechnology and Biological Sciences Research Council and the Wellcome and Leverhulme Trusts.
*Henry Wellcome Ancient Biomolecules Centre and †Evolutionary Biology Group, Department of Zoology, University of Oxford, Oxford, United Kingdom