## Abstract

It is frequently true that molecular sequences do not evolve in a strictly clocklike manner. Instead, substitution rate may vary for a number of reasons, including changes in selection pressure and effective population size, as well as changes in mean generation time. Here we present two new methods for estimating stepwise changes in substitution rates when serially sampled molecular sequences are available. These methods are based on multiple rates with dated tips (MRDT) models and allow different rates to be estimated for different intervals of time. These intervals may correspond to the sampling intervals or to a priori–defined intervals that are not coincident with the times the serial samples are obtained. Two methods for obtaining estimates of multiple rates are described. The first is an extension of the phylogeny-based maximum-likelihood estimation procedure introduced by Rambaut. The second is a new parameterization of the pairwise distance least-squares procedure used by Drummond and Rodrigo. The utility of these methods is demonstrated on a genealogy of HIV sequences obtained at five different sampling times from a single patient over a period of 34 months.

## Introduction

Although molecular sequences accumulate substitutions over time, the rate at which this occurs may not be constant. The rate of substitution is dependent on biological processes, including the intensity of selection, changes in effective size (when selection is present), and changes in the dynamics of the population, say, a shift in mean generation time. These effects can change substitution rate (1) over time, (2) in different lineages, and (3) at different positions along the sequence. In this paper, we present methods that model the substitution rate of molecular sequences obtained serially from individuals within a population or between species (and higher taxa) by allowing the rate to change over time in a stepwise fashion.

Population genetics studies that utilize molecular sequences typically rely on samples of sequences that have been obtained contemporaneously (Felsenstein 1992<$REFLINK> ; Fu 1994<$REFLINK> ; Nee et al. 1995<$REFLINK> ; Pybus, Rambaut, and Harvey 2000<$REFLINK> ). However, there has recently been increased interest in the analysis of samples that are gathered serially, each at a different time. This includes samples from rapidly evolving viral populations such as HIV (Leitner and Albert 1999<$REFLINK> ; Rodrigo et al. 1999<$REFLINK> ) and samples of ancient DNA from fossilized remains (Leonard, Wayne, and Cooper 2000)<$REFLINK> . It is our aim to derive estimates of substitutional parameters from this type of data using biologically relevant models.

Recently, two papers have independently described methods to estimate substitution rate, μ, from serial samples under the assumption of a molecular clock. Rambaut (2000)<$REFLINK> shows how a phylogeny-based maximum-likelihood estimate (MLE) of the constant substitution rate, μ, expressing the divergence between dated sequences as a product of μ and the time interval, can be obtained (fig. 1*a* ). Drummond and Rodrigo (2000)<$REFLINK> , using a distance-matrix least-squares (LS) approach, parameterize intersample divergence in two ways. First, analogous to Rambaut's single rate with dated tips (SRDT) model, ω-parameterization estimates only a single substitution rate, ω, using ω*t _{i}* as the intersample divergence for the

*i*th interval with elapsed time

*t*(ω is the number of substitutions per unit time, but since time may be measured in chronological units rather than in generations, Drummond and Rodrigo use ω instead of μ). Second, with δ-parameterization, each intersample interval is allowed to have its own substitution rate, ω

_{i}_{i}; i.e., for the

*i*th interval with elapsed time

*t*, ω

_{i}_{i}

*t*= δ

_{i}_{i}(fig. 1

*b*). In keeping with Rambaut's terminology, we will refer to this as the “multiple rates with dated tips” (MRDT) model. Drummond and Rodrigo (2000)<$REFLINK> go on to use these substitution rate estimates in a phylogenetic reconstruction procedure called serial-sample unweighted pair grouping method with arithmetic means (sUPGMA) which recovers a tree with lineages that terminate in the order of sampling.

In this paper, we extend Rambaut's (2000)<$REFLINK> tree-based SRDT likelihood estimation procedure to include the MRDT model. In addition, we show that there are two forms of the MRDT model, one where the rates are estimated differently for each sampling interval (corresponding to Drummond and Rodrigo's [2000] δ parameterization, above), and another where the rates are different for different a priori–defined intervals that do not necessarily coincide with sampling intervals (fig. 1*c* ). Maximum-likelihood (ML) and LS estimators can be constructed for both forms of the MRDT model. Finally, we illustrate the use of these methods on an example data set of HIV-1 partial envelope (*env*) sequences obtained serially from an individual who was treated with Zidovudine midway through the sampling program.

## Likelihood Model

Let us consider the case of sequence data for which there is exact time information and a known phylogeny. Our generalization allows for the rate of substitution to have stepwise changes over time and gives rise to a multiple-rate model. The MRDT model is constructed by dividing the one substitution rate of the SRDT model (Rambaut 2000)<$REFLINK> into a vector **Ω** = {ω_{1}, ω_{2}, … , ω* _{k}*}, where ω

*is the*

_{i}*i*th substitution rate in the model (fig. 1 ). Hence, this ω-parameterization allows the substitution rate to have a number of stepwise changes between the most recent and the most ancient sampling times. As in the SRDT model, branch lengths from the root of the tree are no longer required to be equal. Instead, branch lengths must sum to values determined by the temporal spacing of the tip in question and the different substitution rates of the time periods that the tip traverses. Since the information about substitution rates comes from the relative positioning of tips in the tree, it is evident that rate parameters can only be estimated for time intervals for which there exists at least one sequence sample. Hence, the maximum number of ω parameters is given by the number of sampling points minus one, as one time point is needed for reference. However, this maximum number of rate parameters cannot be estimated for every tree topology. Take, for example, the simplest case of two sequences sampled at different times. In this situation, the uncertainty of the root confounds rate and time parameters, and the sequence data only hold information about the upper limit of the rate (set by the branch length between the two sequences).

The parameters of the tree are thus the substitution rates **Ω** and the vector of times corresponding to the dated tips and the (*n* − 1 for a bifurcating tree) internal node heights (*h*), measured in units of substitutions (following Rambaut 2000; note that the tip times may be measured either in generations or in some calendar unit, and a simple rescaling allows one to move between the two). Our framework estimates a series of substitution rates only within the interval bounded by the first and last samples. Specifically, no assumptions are made with regard to the rate between the earliest sampling time and the root of the tree. Over this interval, there is no chronological information, and the branch lengths are free to be optimized in the standard manner as composite parameters of time and substitution rate. This rate may, of course, be of interest, for example, in dating the most recent common ancestor (MRCA). In this case, additional assumptions must be made: a natural assumption in the case of stepwise changes is that the earliest estimated rate remains constant when extrapolated back in time to the root.

For a given tree *T,* the likelihood of **Ω** is the conditional probability of obtaining the sequence data **S** given **Ω**, *T,* and **τ**, the vector of times, as well as the instantaneous substitution rate matrix **Μ** (also assumed to be known):

**Ω**and

**τ**enter the calculations as constraints on the branch tip positions (fig. 1

*b*and

*c*). The MLEs of the rates, ω̂

_{i}, are jointly chosen such that

*L*(

**Ω̂**) is maximized. The only remaining constraint in place is that each estimated rate cannot be less than zero.

Confidence interval estimation in the case of multiple ω's is less straightforward. There are at least two ways of computing confidence intervals for multiple rates. First, multivariate upper and lower (1 − α)% confidence limits may be obtained by locating rates that correspond to log likelihood values differing from the maximum log likelihood value by χ^{2}_{k,α}/2, where *k* is the number of rates estimated. If unbiased, these confidence intervals have a probability of (1 − α) of enclosing the true **Ω**. Second, a profile confidence likelihood interval may be obtained for each ω as follows. Over a range of ω* _{i}*, locate the upper and lower values of ω

*such that*

_{i}_{j}is the MLE of the

*j*th rate, and ω̂

^{*}

_{j}is the ML estimate of the

*j*th rate when ω

*is fixed at a given value.*

_{i}In the case where all elements of **Ω** are equal, the MRDT model collapses to the SRDT model of a uniform molecular clock. If all ω parameters are set to 0, the MRDT model reduces to the standard contemporaneous tips clock model (the single-rate [SR] model; Goldman 1993<$REFLINK> ; Rambaut 2000<$REFLINK> ). In fact, under the likelihood framework, one is able to test whether the MRDT model is a significantly better model for the data than the SRDT model. Since the SRDT model is simply a constrained MRDT model, the standard asymptotic likelihood ratio test can be applied. In this case, the test statistic,

^{2}with

*k*− 1 degrees of freedom under the null hypothesis that the two models are not significantly different, where

*k*is the number of ω parameters.

When testing the SRDT model against the SR model, the null and alternative hypotheses are of the form

The test is a one-tailed test. If α is chosen as the level of significance, then the null hypothesis should be rejected if## Least-Squares Model

With the distance matrix LS estimate of **Ω** described by Drummond and Rodrigo (2000)<$REFLINK> , the expected evolutionary distance *d*(*m _{i}*,

*n*) between a pair of sequences

_{j}*m*(of the

_{i}*i*th sample; assume this is the earlier time point) and

*n*, is equal to the expected pairwise distance Θ

_{j}*for sequences from sample*

_{i}*i*plus the added substitutions accruing between sequences from sample

*i*and sample

*j*in the interval τ

_{j}− τ

_{i}. If there exist times τ

_{i+1}, τ

_{i+2}, … , τ

_{j−1}in this interval that correspond to changes in substitution rate, then

**P̂**= {

**Θ̂**,

**Ω̂**} are obtained by the standard LS solution:

**d**is the vector of the pairwise distances, and

**t**is the matrix of time intervals and [0, 1] values signifying the absence or presence, respectively, of the Θ's associated with each of the samples. Unlike the MLE, LS rate estimates obtained using equation (6) are not constrained to be nonnegative. Such a constraint can be added with appropriate linear programming strategies.

The standard error of the LS estimates of ω cannot be calculated easily because of the nonindependence of the pairwise distances. Drummond and Rodrigo (2000)<$REFLINK> advocate the use of the parametric bootstrap (Efron and Tibshirani 1993<$REFLINK> ; Goldman 1993<$REFLINK> ) to generate confidence intervals of the estimates. Parametric bootstrapping requires specification of a model and subsequent simulation of pseudoreplicate data sets with the same number of sequences and sites as the original data, assuming that the estimates recovered using the observed data are the “true” values of the parameters. With the SRDT model and an assumption of a constant Θ over time, it is easy to generate pseudoreplicate data sets under a coalescent model in which population size is held constant (Drummond and Rodrigo 2000)<$REFLINK> . However, under the MRDT model, parametric bootstrapping is not simple, since any resampling procedure must accommodate changing substitution rates and Θ's. This is a drawback of the distance-based LS method—procedures for variance estimation are often elusive.

## Example

In this section, we illustrate the use of the MRDT model on an HIV data set previously published by Rodrigo et al. (1999), where the onset of drug therapy is shown to coincide with a significant reduction in substitution rate.

Before the advent of potent combination therapy against HIV, drugs were less effective in lowering viral load and hindering progression toward AIDS. To investigate the effect of a one-drug therapy regime on the evolutionary progression of HIV, we analyzed previously published data consisting of serially sampled partial HIV-1 envelope (*env*) sequences from an infected individual who began Zidovudine treatment partway through the sampling period (Rodrigo et al. 1999<$REFLINK> ). Complete details of the data set are given in Rodrigo et al. (1999); briefly, the data set contains an initial sample followed by additional samples after 7 (day 214), 22 (day 671), 23 (day 699), and 34 months (day 1005). Monotherapy with Zidovudine was initiated after 13 months (day 409; J. Mullins, University of Washington, personal communication) and continued during the remaining time of study. Therefore, the data set contains two samples from before and three samples from after treatment began.

It has been suggested that highly active combination antiretroviral therapy leads to a cessation of viral replication (Finzi et al. 1997<$REFLINK> ; Wong et al. 1997<$REFLINK> ). A natural question is whether monotherapy with Zidovudine had the effect of slowing or halting viral replication in the particular individual from whom samples were available. If viral replication does, in fact, cease (or slow down), this will be reflected in the rate at which substitutions accumulate, since it is during the process of viral replication that this occurs. This corresponds to testing whether a, MRDT model, which allows for a change in substitution rate after the onset of therapy, provides a better fit to the data than an SRDT model and, if so, whether the estimated substitution rate after drug therapy is significantly different from zero.

The data set consisted of 60 sequences from five time points. The length of the alignment was 660 nt. Gapped columns were included in the analysis. To begin with, the data set was first split into two subsets, one containing all sequences before therapy (28 sequences), and the other containing all sequences after therapy commenced (32 sequences). For each of these data sets, a neighbor-joining tree was built and an ML general time-reversible (REV) model was estimated using PAUP* 4.0b4 (Swofford 1999<$REFLINK> ).

Each tree was used to estimate a uniform substitution rate using the SRDT likelihood model as implemented in the computer program TIPDATE (Rambaut 2000)<$REFLINK> . TIPDATE was also used to find the ML roots for the two trees. This was done by rooting the tree at every branch on the unrooted topology and optimizing the branch lengths in accordance with the dated tips. The rooted topology that maximizes the likelihood was used to estimate the substitution rate. All estimated rates are reported in table 1 . A rate (ω_{before}) of 5.034 × 10^{−5} substitutions per site per day (1.84% per year, 95% confidence limit = [1.02%, 2.73%]) was obtained for the sequences before therapy, and a rate (ω_{after}) of 5.8 × 10^{−7} substitutions per site per day (0.021% per year, 95% confidence limit = [0.0%, 0.77%]) for the sequences after therapy. As ω_{after} has a confidence interval that encloses 0, we cannot show that significant substitutions have occurred since therapy commenced.

The complete data set, consisting of sequences obtained before and after therapy, was then used to obtain an unconstrained and unrooted neighbor-joining tree, once again using the REV substitution model. Once again, an SRDT model was fitted to the tree (after the ML root was found), and a uniform substitution rate of 1.346 × 10^{−5} substitutions per site per day (0.49% per year, 95% confidence limit = [0.26%, 0.76%]) was estimated. An MRDT model was then fitted to the full data set, allowing two substitution rates, the first up to the time of therapy (i.e., 409 days from the first sample), and the second after this time. Rates of 4.145 × 10^{−5} substitutions per site per day (1.51%) and 0.0 substitutions per site per day were estimated simultaneously for ω_{before} and ω_{after}, respectively. Trees reconstructed using MRDT and SRDT models are shown in figure 2 . To obtain the 95% confidence intervals for both substitution rates, a grid search of the two parameters was undertaken. The rate ω_{before} was allowed to vary from 0 to 10^{−4} substitutions per site per day, while ω_{after} was allowed to vary from 0 to 5 × 10^{−5} substitutions per site per day, both in steps of 10^{−6} substitutions per site per day. The likelihood surface resulting from this search is shown in figure 3 as a contour plot. The resulting 95% profile confidence intervals were obtained by taking the maximum and minimum values of ω_{before} and ω_{after} on the contour demarcating χ^{2}_{1,0.05} (=1.92) log likelihood units from the maximum log likelihood. For ω_{before}, the profile likelihood confidence interval is [2.6 × 10^{−5}, 5.8 × 10^{−5}], whereas for ω_{after}, it is [0, 0.8 × 10^{−5}]. The bivariate confidence interval for **Ω̂** = {ω̂_{before}, ω̂_{after}} is also outlined on the likelihood surface contour plot by the contour demarcating χ^{2}_{2,0.05}/2(=2.99) log likelihood units from the maximum log likelihood. The upper and lower values of ω_{before} and ω_{after} on this bivariate confidence interval contour are [2.1 × 10^{−5}, 6.1 × 10^{−5}] for ω_{before} and [0, 1.1 × 10^{−5}] for ω_{after}. Of course, these intervals are larger than the profile likelihood confidence intervals, but only marginally so.

Table 1 gives the log likelihood scores obtained using the different models described above. For the complete data set with samples before and after therapy included, the most general clocklike model is the MRDT model. As explained above, the SRDT model is constrained so that all ω's are equal. The SR model with contemporaneous tips is a further constraint on the SRDT with all ω's equal and set to zero. In Table 1 , likelihood ratio test statistics have been computed for MRDT versus SRDT and for SRDT versus SR models. The SRDT model is significantly better than the SR model (*P* < 0.05), and the MRDT model is significantly better than the SRDT model (*P* < 0.01).

Similar analyses were performed for before- and after-therapy samples, except that in these instances, the only comparison made was between the SRDT and SR models. For the before-therapy samples, the SRDT model has a statistically better fit to the data than the SR model (*P* < 0.01). However, for the after-therapy sequence subset, the SRDT model cannot be distinguished statistically from the SR model. Taken on its own, this suggests that there is little or no accumulation of substitutions over this period. (Caution must be taken with this interpretation: as we discuss in the next section, the MRDT model is significantly worse than a model that assumes no consistent clocklike pattern of evolution among the sequences).

Equivalent estimates were also derived with the LS method. Table 1 summarizes the results. Both the likelihood and the least-squares procedures consistently estimated a higher rate of substitution before therapy, about an order of magnitude greater than the estimated rate after therapy.

## Discussion

The framework presented allows for the modeling of complex evolutionary scenarios such as the evolution of HIV sequences undergoing drug therapy. Application of the MRDT model to samples obtained from an individual treated with zidovudine appears to indicate a reduction in substitution rate after the commencement of therapy. Our results are consistent with those obtained elsewhere (Chun et al. 1997<$REFLINK> ; Wong et al. 1997<$REFLINK> ). Independent rate estimates from samples before and after therapy have nonoverlapping 95% confidence intervals and are therefore significantly different at α = 0.05. This poses a problem for any SRDT estimation procedure. TIPDATE, for instance, returns a rate of 0.5% per year for the entire genealogy. This rate is lower than previously published rates of HIV evolution (0.93% per year; Shankarappa 1999<$REFLINK> ). However, it is similar to other published estimates for this data set that assume a single rate (0.3% per year; Drummond and Rodrigo 2000)<$REFLINK> . In this paper, we outlined a likelihood framework that addresses this discrepancy, in addition to providing a pairwise distance least-squares estimation approach. There are, nonetheless, several features of these analyses that bear mention and indicate that more work in this area is required.

First, for any analysis that involves the inference of some kind of clocklike behavior, whether it be a constant clock or a changing clock, a first step should be a test of whether such a model is significantly worse than an unconstrained nonclock model (also called a different-rate [DR] model). The DR model is the standard used in phylogenetic tree reconstruction and effectively allows every branch to have its own substitution rate. Thus, the length of the *i*th branch is an estimate of the composite parameter ω* _{i}t_{i}*. If an SRDT model or an MRDT model is significantly worse than the DR model, it means that at least some lineages are not evolving in a clocklike manner. In fact, where appropriate, we recommend a hierarchy of nested likelihood ratio tests: DR versus MRDT, MRDT versus SRDT, and SRDT versus SR. For our example data set, the DR model was always significantly better than the SRDT and MRDT models (data not shown). Our primary intention with the use of the data set was simply to illustrate the methods described in this paper, rather than to make substantive statements about the effects of monotherapy on substitution rates. It is important, however, to note this here for completeness.

Also, the likelihood estimation procedures presented here (and by Rambaut [2000]) assumes that the evolutionary history of the sequences, i.e., the topology of the genealogy, is known or can be reconstructed correctly. The bias introduced into parameter estimation and hypothesis-testing procedures by using incorrect genealogies is largely unknown. On the other hand, the least-squares estimation procedure is not based on a reconstructed topology and therefore may not suffer from this possible source of bias. For example, for a single-rate model, the least-squares estimator has been shown to be an unbiased estimator (Drummond and Rodrigo 2000)<$REFLINK> . However, distance-based LS methods do not take into account the correlations induced by shared history, thus making variance estimation difficult.

Ultimately, the best approach would be to incorporate the uncertainty of the genealogy explicitly into a probabilistic framework. One way of taking the uncertainty of the topology into consideration in the likelihood model is to integrate over a number of topologies. A natural way to do this is to use a Markov chain Monte Carlo (MCMC) sampling procedure to sample tree space in proportion to the likelihood of the data (Kuhner, Yamato, and Felsenstein 1995<$REFLINK> ). This approach has been used, for example, to incorporate the uncertainty in the tree topology into estimates of population size and growth rates (Kuhner, Yamato, and Felsenstein 1995, 1998<$REFLINK> ). This method has a natural extension to the estimation of substitution rates and can also be used to find confidence intervals in topology space under the SRDT and MRDT models of evolution.

One of the interesting observations of this study is that different models (SR, SRDT, MRDT) can have different ML tree topologies. This may turn out to be a common occurrence. For example, for a 45-sequence subset of the data, 729 strictly bifurcating ML tree topologies were found. Although these trees had identical likelihoods under an unconstrained (nonclock) model, they had a range of likelihoods under the SR, SRDT, and MRDT models. Furthermore, no single strictly bifurcating topology represented the ML topology under all three models. If one chooses to use different topologies for each model, then the asymptotic approximation to the likelihood ratio test cannot be used. Instead, some alternative procedure (say, a parametric simulation procedure [Goldman and Whelan 2000]) should be used. A sampling method such as MCMC would also be useful in this case, as the sampling procedure integrates over tree space in proportion to the likelihood of the data. Thus, for two competing models, a null and alternative distribution can be compared.

In the previous section, we also alluded to the fact that different rootings of an unrooted tree can have different likelihoods under a given model of substitution. By extension, this also means that different models may require the tree to be rooted differently. This does not change the mechanics of any likelihood ratio test, since no new free parameters are added to the model. However, if the root of the tree is not known, an extra step needs to be added to any analysis to find the appropriate root.

Serial molecular samples add a new dimension to population genetics studies. Since it is possible to estimate substitution (or mutation) rate independently of other parameters, it is also possible to decouple composite population parameters like Θ = 2*N*_{e}μ (where *N*_{e} is the effective population size) into their component parts. The models we introduce here go one step further and allow these parameters to be expressed as functions of time. Although we have only spoken of stepwise changes in substitution rates, these models can be generalized to allow substitution rate to vary as any predefined function of time. With viral populations such as HIV, this becomes especially interesting, since it allows us to study changes in average generation time and substitution rate during disease progression or under different therapeutic regimes. In conjunction with the estimation of demographic functions of time (Pybus<$REFLINK> , Rambaut, and Harvey 2000)<$REFLINK> , it also means that we can decompose 𝛉(*t*) = 2*N*_{e}(*t*)μ(*t*) into the component functions of *N*_{e}(*t*) and μ(*t*), where μ(*t*) is a stepwise function of time.

In this paper, we have assumed that the times corresponding to changes in substitution rates are fixed either to the sampling times or to some time point known a priori. Similarly, we have also assumed that the phylogeny is known. However, since these times and the phylogeny are parameters embedded in the model, they can also be jointly estimated within the likelihood framework.

The models we have described in this paper apply to any set of molecular sequences of sufficient length, or obtained sufficiently far apart in time, that an appreciable amount of substitutions has accumulated. These include ancient DNA sequences, as well as rapidly evolving viral sequences. In conjunction with efforts to model lineage-specific rates (Thorne, Kishino, and Painter 1998<$REFLINK> ; Huelsenbeck, Larget, and Swofford 2000<$REFLINK> ) and other time- or lineage- dependent processes, the models presented here go some way toward a more realistic description of the evolution of molecular sequences.

MLE and LS estimates under the SR, SRDT, and MRDT models can be obtained using the computer program PEBBLE, available from the website http://www.cebl.auckland.ac.nz or from the authors.

Edward Holmes, Reviewing Editor

Keywords: serial samples stepwise rate changes maximum likelihood least squares substitution rate

Address for correspondence and reprints: Allen G. Rodrigo, School of Biological Sciences, University of Auckland, Private Bag 92019, Auckland, New Zealand. E-mail: a.rodrigo@auckland.ac.nz.

We thank Matthew Goode and Greg Ewing for assistance in developing the PEBBLE software package and the distributed programming techniques used for likelihood surface calculations. We thank Jim Mullins for providing information on the timing of therapy and other aspects of the example data set. For helpful comments and discussion, we thank David Nickle, Andrew Rambaut, and another anonymous reviewer. This research was funded by NIH grant GM59174. A.D. was supported by a New Zealand FRST Bright Futures Scholarship.