Characterization of the Uncertainty of Divergence Time Estimation under Relaxed Molecular Clock Models Using Multiple Loci

Genetic sequence data provide information about the distances between species or branch lengths in a phylogeny, but not about the absolute divergence times or the evolutionary rates directly. Bayesian methods for dating species divergences estimate times and rates by assigning priors on them. In particular, the prior on times (node ages on the phylogeny) incorporates information in the fossil record to calibrate the molecular tree. Because times and rates are confounded, our posterior time estimates will not approach point values even if an infinite amount of sequence data are used in the analysis. In a previous study we developed a finite-sites theory to characterize the uncertainty in Bayesian divergence time estimation in analysis of large but finite sequence data sets under a strict molecular clock. As most modern clock dating analyses use more than one locus and are conducted under relaxed clock models, here we extend the theory to the case of relaxed clock analysis of data from multiple loci (site partitions). Uncertainty in posterior time estimates is partitioned into three sources: Sampling errors in the estimates of branch lengths in the tree for each locus due to limited sequence length, variation of substitution rates among lineages and among loci, and uncertainty in fossil calibrations. Using a simple but analogous estimation problem involving the multivariate normal distribution, we predict that as the number of loci (L) goes to infinity, the variance in posterior time estimates decreases and approaches the infinite-data limit at the rate of 1/L, and the limit is independent of the number of sites in the sequence alignment. We then confirmed the predictions by using computer simulation on phylogenies of two or three species, and by analyzing a real genomic data set for six primate species. Our results suggest that with the fossil calibrations fixed, analyzing multiple loci or site partitions is the most effective way for improving the precision of posterior time estimation. However, even if a huge amount of sequence data is analyzed, considerable uncertainty will persist in time estimates.

Bayesian estimation of species divergence times under the clock and relaxed clock models incorporating uncertain fossil calibrations has attracted much attention lately. Several computer programs have been developed that implement different priors of rates and times and different strategies for incorporating fossil calibrations, such as MULTIDIVTIME (Thorne et al. 1998;Kishino et al. 2001), MCMCTREE (Yang and Rannala 2006;Rannala and Yang 2007;Inoue et al. 2010), BEAST (Drummond and Rambaut 2007), MRBAYES (Ronquist et al. 2012b) PHYLOBAYES (Lartillot et al. 2009), and DPPDiv (Heath et al. 2012). An important common feature of those new generation dating programs is that they accommodate the uncertainties in the fossil record to some extent. For example, unlike earlier dating analyses which assume that the ages of certain nodes on the phylogeny are known without error (Graur and Martin 2004), the new methods may use minimum-and maximum-age bounds to calibrate the molecular tree (e.g., Benton et al. 2009).
Because molecular sequence data provide information about distances only, but not about times and rates individually, this confounding effect between times and rates means that with uncertain calibrations, Bayesian estimation of times (and rates) will not converge to a point even if a huge amount of sequence data is analyzed. Yang and Rannala (2006) and Rannala and Yang (2007) developed the infinite-sites theory, which provides analytically the limiting posterior distribution when the number of sites in the sequence alignment approaches N →∞. Moreover, the prior will always exert an impact on the posterior, even in the analysis of large genome-scale data sets, and seemingly small differences between program implementations may translate to large differences in posterior time estimates (Inoue et al. 2010). Recently, dos Reis and Yang (2013) developed the finite-sites theory, which extends the infinite-sites theory of Yang and Rannala (2006) to the finite-sites case. Assuming the molecular clock, they were able to partition the uncertainty in posterior time estimates into two sources: That due to the uncertain fossil calibrations and that due to the limited sequence data (limited number of sites in the sequence alignment). Furthermore, when the sequence length N increases, the posterior variance in the time estimate approaches the infinite-data limit of Yang and Rannala (2006) at the rate 1/N.
The theory of dos Reis and Yang (2013) works for one single locus (site partition) and assumes the molecular clock. However, most modern molecular clock dating analyses are conducted under the relaxed clock models, as the strict clock is often violated, especially if the species are distantly related, and also use multiple loci (e.g., Bracken-Grissom et al. 2014;Christin et al. 2014). Thus, in this article, we consider the case of relaxed clock analyses of sequence data from multiple loci. Here the term locus refers to a site partition. The work will be an extension of the infinite-sites theory of Rannala and Yang (2007) to large but finite sequence data sets. 268 SYSTEMATIC BIOLOGY VOL. 64 We are interested in how the uncertainty in the posterior time estimates is reduced 1) when the number of sites at each locus N →∞, 2) when the number of loci L →∞, and 3) when both N and L →∞. The problem is not tractable analytically. Thus, we take the same strategy as in dos Reis and Yang (2013) and study a simple analogous case based on the normal distribution, and then use computer simulation and analysis of a real data set for a primate phylogeny to confirm predictions based on the analogy. We also use the primate data set to illustrate the application of the finite-sites theory we develop here to real data analysis.
Our theory applies to both the correlated-rate (Thorne et al. 1998;Rannala and Yang 2007) and independentrate models (Drummond et al. 2006;Rannala and Yang 2007) for relaxing the molecular clock. In the analysis of sequence data from multiple loci (or site partitions), we assume the compound Dirichlet prior of dos Reis et al. (2014) for the locus rates (i.e., the rates appropriate for the loci). This prior is implemented in the MCMCTREE program. Most current Bayesian dating programs assume an independent and identical distribution of locus rates for multiple loci, which leads to overconfident posterior time estimates (dos Reis et al. 2014). The case of the i.i.d. prior for locus rates is discussed later.

Theoretical Analysis of a Normal Distribution Example
Following Rannala and Yang (2007) and dos Reis and Yang (2013), we consider a simple case of Bayesian estimation of confounded parameters involving the multivariate normal distribution. The example will help us to gain insights into the posterior uncertainty in the estimates of divergence times and substitution rates as the size of the sequence data set increases.
Suppose the data consist of a matrix X = {x ij }, i = 1, 2, …, L; j = 1, 2, …, N; where each observation x ij is normally distributed. The model is where e ij ∼ N(0, 2 e ) and i ∼ N(0, 2 ), with both 2 e and 2 given. This is a mixed linear model with 1 and 2 to be the fixed effects (parameters) and i to be the random effects. However, the data or likelihood always depends on the sum 1 + 2 , but not on 1 and 2 separately, so that there is a problem of nonidentifiability in the construction of the model. We assign priors 1 ∼ N(−1, v 1 ) and 2 ∼ N(1, v 2 ) to estimate 1 and 2 , with v 1 and v 2 known. We are interested in the posterior distribution (and in particular, the posterior means and variances) of 1 and 2 (and possibly of i ) when the size of the data (N or L or both) is large.
x ij be the sample means. The likelihood is given byx This is a (L+2)-variant normal distribution. In the Appendix, we show that 1 and 2 have a bivariate normal posterior distribution with means, variances, and correlation to be where v 4 = v 3 /L and v 3 = 2 e /N + 2 . The posterior variance of 1 can be written as the sum of three terms where ∞ = v 1 v 2 v 1 +v 2 is the infinite-data limit and where The approximations apply when L is large, which we assume here. Based on equation (4), the following observations can be made. 1) When L is large and fixed, the posterior variance s 2 1 approaches the limit v ∞ +b/L at the rate 1/N when N increases. 2) When N is fixed, s 2 1 approaches ∞ when L →∞, independently of N. The rate of approaching the limit is 1/L. The same limit is reached (a) when L →∞ with N fixed at a finite value and (b) when both N and L →∞. at each locus. The times (ages of nodes on the tree) are shared across all loci and are assigned a prior, often based on the birth-death process (e.g., Yang and Rannala 2006;Lepage et al. 2007). This prior also incorporates fossil calibrations, which come in the form of probability distributions. We assume that none of the node ages is known with certainty. When the data are analyzed, each locus i has an overall (mean) rate i , for i = 1,2,…,L, and the i are assigned a gamma-Dirichlet prior (dos Reis et al. 2014). The rates for branches or for nodes on the tree at the locus are then generated conditional on the locus rate i . Under the independent-rate model, the rates for branches in the tree at the locus are i.i.d. variables from the lognormal or gamma distribution with mean i (Drummond et al. 2006;Rannala and Yang 2007). Under the correlated-rate model, i is the rate at the root of the tree at the locus, from which rates for other nodes or branches evolve according to a stochastic process such as the geometric Brownian motion (Thorne et al. 1998;Thorne and Kishino 2002;Rannala and Yang 2007).
With this setup, the results for the normal distribution example (in particular, equation (4)) lead us to the following predictions concerning the uncertainty in the posterior estimates of divergence times. 1) When the number of loci L is large and fixed, the posterior variance of any node age has a nonzero limit when N →∞, and it approaches this limit at the rate 1/N. 2) When the number of sites at each locus N is large and fixed, the posterior variance of any node age has a nonzero limit when the number of loci L →∞, and it approaches this limit at the rate 1/L. Furthermore, the limiting variance for L →∞ is the same when N is fixed at different large values. The limiting variance when L → ∞ reflects the uncertainties in the fossil calibrations, which cannot be reduced by further increase of the sequence data. This prediction suggests that to improve the precision of posterior time estimates, adding more loci (or site partitions) may be far more effective than increasing the length of sequence at each locus.
Using an analogy to equation (4), we may also partition the uncertainty in the posterior estimates of divergence times into three sources: The part due to the limited number of sites at each locus (that is, the term a LN in equation (4)), which disappears when N → ∞, the part due to the limited number of loci (the term b L in equation (4)), which disappears when L → ∞, and finally the part due to uncertainties in the fossil calibrations ( ∞ in equation (4)), which cannot be reduced by further increase of sequence data. We illustrate this calculation in simulated and real data sets later.

Design of the Simulation Experiment
We simulated sequence alignments on two phylogenies, with two or three species, respectively, to examine how the number of loci (L) and the sequence length at each locus (N) affect the uncertainty in posterior time estimates. For computational efficiency, we use the Jukes-Cantor (1969) model both to simulate data and to analyze them. Our focus in this study is the asymptotic behavior of posterior time estimation, and the precise nature of the assumed substitution model is unimportant (as long as the correct model is used). For example, if we use GTR+ (Tavaré 1986;Yang 1994aYang , 1994b both to simulate and to analyze the data, the dynamics of posterior time estimation will be the same as under JC69, with parameters in the GTR+ model approaching the true values. Similarly, while we use small trees of only two or three sequences in the simulation (to reduce the computational cost) we expect the asymptotic dynamics of posterior time estimates to be independent of the dimension of the problem and to apply to larger trees with many species.
We describe the simulation for the three-species case and then comment on the two-species case. The true age of the root is fixed at t 1 = 1 and the age of the internal node is t 2 = 0.5. We consider one time unit to be 100 Myrs, so the two node ages are 100 and 50 Myrs, respectively. To simulate sequence alignments at L loci, we fix the overall rate at locus i at i = 0.1 (meaning 0.1 substitutions per time unit or 10 −9 substitutions per site per year). The rates for the four branches on the three-species tree are then generated as independent random variables from the lognormal distribution: r ij~L N(log i − 2 /2, 2 ), for j = 1,…, 4 (Rannala and Yang 2007; equation (A.1)). We fix 2 = 0.01, which means that the molecular clock is slightly violated. Each branch length is then calculated by multiplying the time duration of the branch and the rate for the branch. Sequences at the locus are then simulated by evolving sequences along branches of the tree, using the EVOLVER program in the PAML package (Yang 2007). Simple R code is written to sample rates for branches and to generate the control files for EVOLVER. The alignments for all L loci are then merged into one data file and constitute one data set, to be analyzed by the MCMCTREE program, also from the PAML package. For the case of infinite sites (N =∞), branch lengths in the unrooted tree constitute the data. The number of replicates is 200.
In the two-species case, the true divergence time (or the age of the root) is t = 1, and there are only two branches on the tree. All other settings are the same as in the case of three species. For example, the rates for the two branches are generated as independent lognormal variables with the mean i = 0.1 and variance parameter 2 i = 0.01. The sequence data sets (each consisting of L alignments, each of N sites) are analyzed using MCMCTREE, whereas data sets of infinite sites (each consisting of L sets of branch lengths on the unrooted tree) are analyzed using the program INFINITESITES. Both are Markov chain Monte Carlo (MCMC) programs from PAML ver. 4.8. The age of the root is assigned the prior t 1 ∼ G(100, 100), with mean 1 and 95% interval (0.814, 1.205). This mimics the use of a soft-bound calibration at the root node of 81-121 Myrs. In the case of three species, the age of the internal node has a uniform prior between 0 and t 1 , that is, t 2 |t 1 ∼ U(0,t 1 ). The prior on t 2 is 270 SYSTEMATIC BIOLOGY VOL. 64 thus very diffuse, whereas the prior on t 1 is informative. This mimics the situation in which a fossil calibration is placed on t 1 but not on t 2 . We assume a gamma-Dirichlet prior for rates at loci, { i }∼gammaDir(100, 1000, 1) (dos Reis et al. 2014). The average rate over all loci is assigned a gamma prior, = 1 L L i=1 i ∼ G(100, 1000), with mean 0.1 and variance 10 −4 , and with the prior 95% interval 0.1±0.02 or (0.08, 0.12)×10 −8 substitutions per site per year. Then the total rate for all loci, L, is partitioned into the locus rates using a uniform Dirichlet distribution (dos Reis et al. 2014). Given the locus rate i , the rates for branches have the i.i.d. prior from the lognormal distribution with mean i and variance parameter 2 i . This is the so-called independent-rate model for relaxing the clock. A gamma-Dirichlet prior is assigned on parameters { 2 i }∼gammaDir(2, 200, 1) as well: The mean across loci is assigned the gamma G(2, 200), whereas the total is partitioned among loci using a uniform Dirichlet.
In the MCMC analysis, the likelihood is calculated using Felsenstein's (1981) pruning algorithm. The length of the MCMC run is determined by running the program multiple times on the same data set to assess consistency between runs. Note that the amount of computation increases far more quickly with the increase of the number of loci than with the increase of the number of sites per locus. The analysis of each replicate data set (with L loci each of N sites) leads to a posterior sample of the divergence times, from which the posterior means and 95% equal-tail credibility intervals (CIs) are generated. Those are then averaged over the 200 replicates and reported.

Estimation of Divergence Times on a Primate Phylogeny
We use the genomic data for six primate species of dos Reis and Yang (2013) to examine the uncertainty of posterior time estimates as a function of the loci used. The phylogeny with fossil calibrations is given later in Figure 3 The genomic data consist of six-species alignments of 9,992 protein coding genes. We removed ambiguous codons and alignment gaps and excluded genes with fewer than 200 codons, resulting in a data set of 7947 genes. We used only the third codon positions. Among those 7947 genes, the number of codons (or the number of third-position sites) ranged from 200 to 5055, with the median at 399.
To study the effect of the number of loci on the uncertainty of posterior time estimates, we sampled genes (third codon positions only) without replacement from the 7947 genes, to generate data sets with L = 1, 5, 10, 20, 50, 100, 200, and 500 genes. For each L, 100 replicate data sets were generated. The data were then analyzed similarly to the simulated data, using MCMCTREE ver. 4.8. The sequence likelihood was calculated under the HKY+ 5 substitution model (Hasegawa et al. 1985;Yang 1994b). The prior of the divergence times was constructed using the calibrations of Figure 3 and the birth-death process, with the birth and death parameters BD = BD = 1, and sampling fraction BD = 0. Those parameter values give a uniform kernel density (Yang and Rannala 2006, equation (5)). We used the gamma-Dirichlet prior gammaDir(2, 20, 1) for rates at loci ( i ): The average rate over all loci is assigned a gamma prior G(2, 20), with mean 0.1 and variance 0.005, whereas the total rate is partitioned among loci using a Dirichlet distribution (dos Reis et al. 2014). Here the time unit is 100 myr, so that the prior mean rate is 10 −9 substitutions per site per year. Given the locus rate i , the rates for branches at the locus have the i.i.d. prior from the lognormal distribution with mean i and variance parameter 2 i . Parameters 2 i were assigned the prior gammaDir(2, 20, 1), to allow for the violation of the clock.
Note that in the normal distribution example N is the same for different i, whereas here N varies among loci. However, one may envisage an extra sampling step in which the number of sites N for each locus is sampled as a random variable from a common distribution and then the prediction based on the normal example should remain valid.
Besides the independent sampling scheme of generating data sets of L loci, we also used an alternative scheme, to partition the 7947 genes (or 3,982,327 thirdcodon-position sites) into L partitions. For each gene, we estimated the branch lengths by maximum likelihood, using RAxML (Stamatakis et al. 2012) and ranked the genes by tree length (sum of branch lengths). We then partitioned the genes into L = 5, 10, 20, 50, 100, 200, and 500 partitions, with approximately the same number of genes in each partition, placing genes with similar tree lengths into the same partition. Here the tree length was used as a proxy for relative evolutionary rate. Note that the total number of sites analyzed (3,982,327) is always the same whatever the number of partitions L is. Each data set was analyzed exactly in the same way as above, with the L partitions treated as L independent loci. This way of analyzing the data is carried out here as it is common in molecular clock dating analysis, in which the genetic data are fixed and different partitioning strategies are evaluated (e.g., dos Reis et al. 2012). However, our predictions based on the normal distribution example are not expected to apply to this case.

Simulation in the Case of Two Species
For the tree of two species, the true root age is t = 1, and this is the only time parameter to be estimated. The posterior mean, the 95% CI, and the CI width (w) of time t are calculated for a number of values of L and N, with the results shown in Table 1, averaged over 200 simulated replicates. The posterior means are all very close to the true value. Here we focus on the uncertainty in the posterior time estimates, measured by w 2 . Note that when the data set is large, w 2 should be approximately proportional to the posterior variance. Notes: The true age is t = 1. Results for N =∞ are calculated using the infinite-sites theory of Rannala and Yang (2007). In the last two columns, (w/w 0 ) 2 is the ratio of the posterior to the prior variance, and 1 − (w ∞ /w) 2 is the percentage of the posterior variance that is due to limited sequence data. The results are averages over 200 replicate data sets.
In Figure 1a, we plot w 2 against 1/N with L fixed at either 10 or 100, and in Figure 1b, we plot w 2 against 1/L with N fixed at either 100 or 1000. According to our predictions based on the analogous normal distribution example, both plots should be linear when N and L are large, with positive (nonzero) intercepts. This is indeed the case.
In Figure 1a, the lines of best fit are w 2 = 1.8092/N + 0.0796 for L = 10 and w 2 = 0.3387/N +0.0773 for L = 100. Both the slope and the intercept are smaller for the larger L, indicating more information in data of more loci.
In Figure 1b, the lines of best fit are w 2 = 0.2317/L+ 0.0777 for N = 100, and w 2 = 0.0481/L+0.0773 for N = 1000. The slope is smaller for the larger N (0.0418 for N = 1000 compared with 0.2317 for N = 100). The intercepts for N = 100 and 1000 are nearly identical (0.0777 for N = 100 and 0.0773 for N = 1000). Based on the analogy with the normal distribution example, we expect the uncertainty to be the same for different N, when L → ∞, so that those intercepts should be equal. Note that the intercept in the w 2 versus 1/L plot represents the limiting value when L →∞ and reflects the uncertainty in fossil calibrations and in the rate prior, uncertainty that cannot be reduced by increasing the amount of sequence data. We take w 2 = 0.0773 as the limiting uncertainty in time estimates for infinite sequence data. Before any molecular data are analyzed, w 2 = 0.391 2 = 0.153, given by the prior, and this is reduced by a half (1 − 0.0773/0.153 = 0.495) when an infinite amount of sequence data is analyzed. The other half is due to the uncertainty in the prior and in the fossil calibration, and cannot be reduced by further increase of sequence data.
It is noteworthy that the posterior uncertainty does not approach zero when the amount of sequence data increases without bound (that is, when N →∞, when L →∞ or when both N and L →∞) (Table 1) intercepts of the w 2 versus 1/L plots in Figure 1b and from the results of Table 1, we estimate the limiting posterior 95% CI to be (0.871, 1.148), with w = 0.278 or w 2 = 0.0773. In other words, even with an infinite amount of sequence data, the 95% CI (87.1-114.8Ma) spans 28 Myrs. Compared with the prior, which has the 95% CI width to be 39.1 Myrs (Table 1), the posterior interval at the infinite-data limit is 30% narrower, and the posterior variance is ∼ 50% smaller.
It is also interesting to examine the posterior uncertainty when the data set is small, before the asymptotics are reliable. With only L = 2 loci, increasing the sequence length N from 100 to 1000 reduces the interval width from 0.354 to 0.306. This is a reduction of 14% (= 1 − 0.306/0.354) in the CI width or a reduction of 25% (= 1 − 0.306 2 /0.354 2 ) in the posterior variance. However, increasing N further to 10 4 reduces the posterior CI width to 0.294, with much less effect. This trend of diminishing returns is because very quickly most of the posterior variance is due to the uncertain fossil calibrations, which cannot be reduced by adding sequence data. Increasing the sequence length has even less effect when a larger number of loci is used. It seems that with L ≤ 5 loci analyzed, 10 4 sites per locus are nearly as good as an infinite number of sites, and with L ≥ 10 loci analyzed, 200 or 1000 sites are not much worse than an infinite number of sites.

Simulation in the Case of Three Species
In the tree for three species, there are two node ages, with the true ages to be t 1 = 1 and t 2 = 0.5. The posterior means, the 95% CIs, and the CI widths for t 1 and t 2 are shown in Table 2, averaged over 200 replicates. Again, the posterior means are close to the true values, so we focus on the posterior uncertainty measured by w 2 .
We plot w 2 against 1/N with L fixed at 10 or 100 in Figure 2a and b, and against 1/L with N fixed at 100 or 1000 in Figure 2c and d. Both kinds of plots show near perfect linear relationships when N and L are large, confirming our predictions based on the normal distribution example. The asymptotic linear relationship holds well for N ≥ 40 and for L ≥ 10.
For both t 1 and t 2 , the intercepts in the plots of w 2 against 1/L are nearly identical for N = 100 and N = 1000 (Fig. 2c, d). For example, the intercept for t 1 is 0.0780 at N = 100 and 0.0773 at N = 1000, whereas for t 2 it is 0.0191 at N =100 and 0.0193 at N =1000. Those results confirm our prediction that the limiting posterior uncertainty when L →∞ does not depend on N. The minor differences may be attributed to small sampling errors due to limited number of replicates and to the fact that our L and N values are not very large. The results suggest that with an infinite amount of sequence data, the limiting values are w 2 1 ≈ 0.0773 = 0.278 2 for t 1 and w 2 2 ≈ 0.0193 = 0.139 2 for t 2 , with w 1 /w 2 = 2 = t 1 /t 2 . The one-dimensional limiting posterior is almost entirely dominated by the prior for t 1 .
We also examined posterior uncertainties in small data sets before the asymptotic theory is reliable. Note that the prior and data information for t 1 are nearly the same as in the simulation for two species. The posterior uncertainty for t 1 is also very similar to and slightly smaller than that for the two-species case. The reduced interval or improved precision is due to the information coming from the third sequence. At the infinite-data limit, the posterior for t 1 is exactly the same as in the two-species case, with the posterior CI to be (0.871, 1.148), and w = 0.278 or w 2 = 0.0773. Overall the pattern for t 1 is the same as in the two-species case.
The results for t 2 are very different from those for t 1 . There is much weaker information in the prior about Notes: The true node ages are t 1 = 1 and t 2 = 0.5. Note that the limiting CI width for t 2 is w ∞ = 0.139 = √ 0.0193 (Fig. 2). See legend to Table 1.
t 2 than about t 1 . As a result, when molecular data are analyzed, the uncertainty in t 2 is reduced far more dramatically than that for t 1 . Increase of both the sequence length N and the number of loci L leads to reduction in the posterior uncertainty of t 2 , and improvement is seen even at large values of L or N (Table 2). Even so, the percentage of posterior uncertainty that is due to sequence data goes down quickly. For example, for the sequence length N = 1000, this percentage is 59, 36, and 6% for L = 2, 5, and 50 loci, respectively. Furthermore, we note that the linear prediction becomes reliable for t 2 earlier than for t 1 ; in other words, the linear relationship applies for smaller values of N and L for t 2 than for t 1 .

Estimation of Divergence Times on a Primate Phylogeny
The posterior means, the 95% CIs, and the CI widths (w) of divergence times on the primate phylogeny of Figure 3 (t 7 -t 11 ) are shown in upper panel of Table 3, when L loci are sampled at random from the 7949 genes are analyzed. Here a gene or locus means all the third codon positions of the protein coding gene. The prior is shown as well (L = 0 in Table 3).
In Figure 4 we plot the posterior uncertainty (measured by w 2 ) for the five divergence times in the phylogeny (t 7 -t 11 ) against 1/L. In all cases except t 8 , w 2 shows a strong linear relationship with 1/L as long as L ≥ 10, consistent with our predictions. For t 8 , the linear relationship holds well only for much larger values of L, that is, only if L ≥ 50. For small L before the asymptotics become reliable, the posterior CI width is smaller than the predicted value from the straight line (see plot for t 8 in Fig. 4). As in the simulation for three species, the asymptotic theory starts to become reliable for smaller values of L if the node has a less informative prior calibration. dos Reis and Yang (2013) suggested the use of the 95% prior interval width divided by the prior mean as a measure of prior or calibration precision. This is 0.91, 0.37, 0.90, 1.02, and 0.56 for t 7 , t 8 , t 9 , t 10 , and t 11 , respectively, with t 8 having the most precise calibration.
The intercept of the regression line, which is an estimate of the limiting uncertainty (w 2 ) with an infinite amount of sequence data, is 0.0080, 0.0019, 0.0006, 0.0001, and 0.00005, for the five nodes t 7 , t 8 , t 9 , t 10 , and t 11 , respectively. These are very close to the w 2 values for L = 500 in upper panel of Table 3: 0.0090, 0.0021, 0.00068, 0.00014, 0.000064 for t 7 -t 11 , indicating that 500 loci may be close to an infinite amount of sequence data.
We then consider the alternative strategy of partitioning the 7947 loci (third codon positions) into L site partitions. In this case, the same total number of sites are always used in the analysis whatever L is. The results are summarized in lower panel of Table 3. Note that our asymptotic theory based on the normal distribution example does not apply to this case, as the setups of the statistical estimation problems are very different. The case of L = 1 (lower panel of Table 3 w 2 FIGURE 2. The finite-sites theory for three sequences. The square of the posterior 95% CI widths (w 2 ) for the two divergence times (t 1 and t 2 ) is plotted against 1/N, with the number of loci L fixed at 10 a) or 100 b); and against 1/L, with the number of sites fixed at N = 100 c) or 1000 d). In (a) for L = 10, the point N = 20 is not used to fit the line for t 1 because those values of L and N appear too small for the linear asymptotic trend to apply. all sites are analyzed as one partition, with the same substitution model and the same set of branch rates for all sites. This is still commonly used in molecular dating analyses (e.g., Christin et al. 2014). The posterior CIs from this concatenation analysis are slightly wider than those in the independent sampling scheme for L = 5 loci for t 7 and t 8 , and are slightly smaller for t 9 -t 11 (upper panel of Table 3), but they are larger for all node ages than in the independent sampling scheme for L = 10 loci, even though 10 independent loci constitute only about 0.13% (= 10/7947) of the data used in the concatenation analysis.
At the other end, when L = 500 loci, the posterior 95% CI widths are similar between the independent sampling scheme (upper panel of Table 3) and the partition analysis (lower panel of Table 3), even though the former used only about 6.3% (= 500/7947) of the data as in the latter. Indeed the independent sampling scheme had very slightly smaller CI widths for t 7 -t 11 .
Those results suggest that to increase the precision of posterior time estimates, it is far more effective to increase the number of loci than to increase the sequence length at each locus, at least when the sequence length at each locus is not too small (in our case, N ≥ 200 at every locus).

Unconventional Nature of the Estimation Problem
The confounding effect of times and rates makes Bayesian estimation of species divergence times Phylogeny of six primate species. Fossil calibrations are shown next to the nodes on the tree. The joint bounds (for nodes 7, 8, 9, and 11) are implemented as soft uniform bounds with a sharp minimum (1% of tail probability on the left) and a soft maximum (5% of tail probability on the right) (Yang and Rannala 2006, figure 2). The minimum bound (on node 10) is implemented using the truncated Cauchy distribution (Inoue et al. 2010).
using uncertain fossil calibrations an unconventional estimation problem. Here we highlight two important differences. First, in a conventional Bayesian estimation problem, the estimate (posterior mean, say) will converge to the true parameter value when the size of the data set (N) increases without bound, with the posterior variance decreasing to zero. In large data sets, the posterior variance is typically proportional to 1/N. The Bayesian method of parameter estimation is known to be statistically consistent. However, such convergence to truth does not occur in Bayesian divergence time estimation, when the amount of sequence data increases without bound and when the fossil calibrations involve uncertainties and are fixed. The model is not fully identifiable, and the posterior variance will not approach zero even if an infinite amount of sequence data is analyzed (Britton 2005;Yang and Rannala 2006). Second, in a conventional Bayesian estimation problem, the impact of the prior will become unimportant when the data size increases, and in large data sets, the posterior will be dominated by the data or likelihood. This is not the case when we use genetic sequence data to estimate absolute times and absolute rates. Even with an infinite amount of sequence data, the posterior will remain sensitive to the prior on times (which includes the calibration information) and the prior on rates, as highlighted in the infinite-sites theory (Yang and Rannala 2006;Rannala and Yang 2007).
In this study, we characterized the posterior uncertainty, measured by the posterior variance or the square of the posterior CI width (w 2 ), in Bayesian relaxed clock dating analysis using sequence data from multiple loci. We used an analogous normal distribution example to make qualitative predictions about the posterior variance of times when the number of loci L and the sequence length N are large but finite. The predictions are then confirmed in specific cases using computer simulation and analysis of a primate data set. The posterior variances of divergence times have three components. The first is due to sampling errors in the branch lengths due to the limited sequence length N. If L is large, this component goes to zero at the rate 1/N. The second component is due to evolutionary rate variation among loci and among branches at each locus according to the relaxed clock model. This decreases to zero in proportion to 1/L. The third component is due to uncertainty in fossil calibrations. This cannot be reduced by increasing the amount of sequence data. For most phylogenetic analysis, the locus is not very short (with N > 1000 sites, say). Then use of multiple loci (or site partitions) will be the most effective approach to improving the precision of posterior time estimation under relaxed clock models.
The importance of analyzing multiple loci or site partitions to reduce posterior uncertainty does not seem to be well appreciated in the literature. Many dating analyses commonly use the concatenation method, by which multiple genes are concatenated into one "supergene," with one set of rates for branches used in the model (e.g., Christin et al. 2014). Also empirical biologists appear to be surprised by the lack of big improvement in estimation precision with the addition of molecular data (e.g., Mulcahy et al. 2012), even though this is expected from theory (Yang and Rannala 2006;Rannala and Yang 2007). Given the persistent uncertainty in the posterior even when large genomescale data sets are analyzed, the posterior means or medians of divergence times may not represent the posterior distribution well. Thus, we suggest that in molecular dating analyses, posterior CIs or standard deviations of divergence times be reported in addition to the posterior means or medians.

Assumptions and Validity of the Theory
Our theory is constructed not through mathematical proofs but by statistical intuition through an analogy with a toy example that is analytically tractable. However, there are a number of differences between the toy example and the divergence time estimation problem. Here we discuss or speculate on which of those differences matter to our qualitative results. First, the toy example assumes an additive model (equation (1)), but the model in the Bayesian dating problem is multiplicative (i.e., a branch length or sequence distance is the product of time and rate). However, we consider this difference to be unimportant. If we take the exponential on both sides of equation (1), the additive model will become a multiplicative one, with where X ij = e x ij , M 1 = e 1 , and so on. By assuming that the random effect i and the error E ij have lognormal distributions and by assigning lognormal . The finite-sites theory applied to the analysis of genomic sequence data from six primate species (Fig. 3). The square of the posterior 95% CI widths (w 2 ) for the 5 node ages (t 7 , t 8 , t 9 , t 10 , and t 11 ) is plotted against the reciprocal of the number of loci, sampled at random from 7947 protein coding genes (with only the third codon positions used).
priors on parameters M 1 and M 2 , equation (6) specifies exactly the same model as equation (1), so that exactly the same results will be obtained in the Bayesian analysis. Furthermore, in large data sets, the variances for a parameter x and for its function y = y(x) are approximately related as var(y) = var(x)× dy dx 2 , where the derivative is evaluated at the true parameter values and is a constant scale factor. Thus, if var(x) is proportional to 1/N, so is var(y). Such transforms or reparametrizations typically affect the sample size needed for the asymptotics to work well but not the asymptotic behavior itself. In other words, var(x) may become proportional to 1/N sooner (for smaller N) than var(y), but with sufficiently large N, both var(x) and var(y) should be proportional to 1/N. Similarly we do not consider the distributional forms (normal in the toy example vs. multinomial for sequence alignments in the dating problem) to be important.
In the simulation we fixed the locus rate i to be constant among loci (equal to the prior mean). We suggest that the asymptotic behavior should be the same if one samples i from the prior instead: In both cases, the data are independent among loci given the parameters of interest ( 1 and 2 in the toy example vs. times and the average rate across loci in the dating problem). Another assumption we made both in the model and in the simulation is that the species phylogeny and species divergence times are shared among all loci. A number of biological processes can cause the gene trees at individual loci to differ from the species tree. For example, the coalescent process in ancestral species may cause the gene tree topology to differ from the species tree and also the gene divergences to be older than the species divergences. This source of uncertainty is ignored here. As a result, our theory may apply only to distantly related species so that the coalescent times are negligible compared with the species divergence times.
Although many factors affect divergence time estimation, we consider two of them to be particularly important to the qualitative results of this article: The fossil calibrations and the prior model for variable rates among loci and among branches. Here we discuss the first factor and treat the second below in the next subsection.
We have assumed that the fossil calibrations are uncertain (i.e., they are not fixed constants) but correct. We have ignored the challenges of summarizing the fossil evidence to generate uniform bounds or other statistical distributions for use as fossil calibrations in the dating analysis. Erroneous calibrations may have major effects on divergence time estimation. In some situations, they may produce extremely precise but grossly wrong time estimates whereas in others the conflicts between fossils and between fossils and molecules may lead to multimodal posteriors. The impact of incorrect calibrations on the asymptotic behavior of divergence time estimation is not well understood (dos Reis and Yang 2013). Also our theory assumes that the fossil calibrations used in the dating analysis are fixed, and does not apply if more and more fossil calibrations are added into the dating analysis. We speculate that our asymptotic results will also apply to the joint analysis of morphological characters and molecular sequence data, as conducted by Ronquist et al. (2012a), or in the joint analysis of fossil occurrence data and molecular sequence data, as in (Wilkinson et al. 2011; see also Bracken-Grissom et al. 2014), if the morphological data or fossil occurrence data are fixed, whereas the amount of molecular data increases. Further research is needed to test those predictions. , the data consist of L independent loci, sampled at random from the 7947 genes ( upper panel of Table 3) The results are averages over 100 replicates. In (a')-(b'), the 7947 genes are grouped into L partitions according to the relative rate of substitution. In (a) and (a') the posterior mean and 95% CI for each node age t i are plotted. In (b) and (b'), the 95% CI width (w) is plotted.

Prior Models for Substitution Rates Among Loci
In this study, we have assumed the compound Dirichlet prior of dos Reis et al. (2014) for the locus rates (averages rates for the loci). Most current Bayesian dating programs implement the i.i.d. prior for locus rates, assuming that the rates for loci have an independent and identical distribution. Suppose i has a distribution with variance v. With L loci, the average of i over all L loci, = 1 L L i=1 i , will have the variance v/L. For large L, this variance will be very small. In analysis of large molecular data sets, the posterior of times and rates is nearly one dimensional (e.g., dos Reis and Yang 2013, figure 3), and the near certain knowledge of one parameter, such as the age of a single node on the tree or the average rate over all loci, will be sufficient to resolve the nonidentifiability problem and make all parameters be estimated with near certainty. This high precision is an artefact of the unreasonable prior. One consequence of this i.i.d. prior is that if the rate prior is mis-specified (with a wrong prior mean, say), the posterior time estimates may be grossly wrong and yet very precise (e.g., dos Reis et al. 2014, figure 1).
Here we mention two priors that avoid this problem. The gamma-Dirichlet prior implemented by dos Reis et al. (2014) assigns a gamma distribution for the average rate across all loci: = 1 L L i=1 i ∼ G( , ), and then partitions the total rate for all L loci (L) using a Dirichlet distribution with concentration parameter . The means, variance, and correlation are given as  (6-7)).
The second prior may be referred to as the conditional i.i.d. prior, and is not yet implemented. We assume that the rate i for locus i has the distribution i | ∼ G(, /), where is the shape parameter and measures how variable the rates are among loci, and is the mean of the prior distribution. Then we assign a hyperprior ∼ G( , ), with shape parameter and . Under this prior, we have It does not seem possible to integrate out to generate the joint prior distribution f ( 1 , 2 , …, L | , , ) analytically. However, one may use MCMC to integrate over.
A drawback of the gamma-Dirichlet prior is that it depends on the size of the data (the number of loci L), but it has the advantage that the joint prior distribution f ( 1 , 2 , …, L | , , ) is tractable analytically (dos Reis et al. 2014, equation (3)). The conditional i.i.d. prior appears to match the toy example and our simulation more closely than the gamma-Dirichlet prior. Nevertheless, the two priors are extremely similar for large L, with the same means, variances, and correlations. They appear to have the same asymptotic dynamics.
One aspect of the prior for rates that may be unrealistic biologically is the assumption that given the rates for loci ( 1 , …, L ), the rates for branches are independent among loci. In other words, the different loci will provide independent realizations of the ratedrift process. Analysis of real data has highlighted the existence of both the gene effect, which is described by the prior model f ( i |), and the lineage effect of substitution rates (Ho 2014). The lineage effect here means that some branches on the tree have high rates in all or most loci whereas some other branches have low rates. Current prior models ignore such rate correlation among loci, and may be expected to exaggerate the information content in the sequence data (Thorne et al. 1998). An even more extreme case may be as follows. Imagine one has five genes but analyzes them as 10 partitions, separating each gene into two arbitrary partitions. Such an analysis will generate more precise time estimates, but the high precision is spurious, because the two halves of the same gene may have very similar evolutionary rate trajectories and do not constitute two independent realizations of the rate-drift process.

FUNDING
This work was supported by a grant from the Biotechnological and Biological Sciences Research Council (BB/J009709/1) to Z.Y., and grants from Natural Science Foundation of China (31301093, 11301294 and 11201224) to T.Z.

ACKNOWLEDGMENTS
The authors thank Jeff Thorne and another anonymous referee for many insightful comments.