## Abstract

Reconstruction of evolutionary relationships from noncontemporaneous molecular samples provides a new challenge for phylogenetic reconstruction methods. With recent biotechnological advances there has been an increase in molecular sequencing throughput, and the potential to obtain serial samples of sequences from populations, including rapidly evolving pathogens, is fast being realized. A new method called the serial-sample unweighted pair grouping method with arithmetic means (sUPGMA) is presented that reconstructs a genealogy or phylogeny of sequences sampled serially in time using a matrix of pairwise distances. The resulting tree depicts the terminal lineages of each sample ending at a different level consistent with the sample's temporal order. Since sUPGMA is a variant of UPGMA, it will perform best when sequences have evolved at a constant rate (i.e., according to a molecular clock). On simulated data, this new method performs better than standard cluster analysis under a variety of longitudinal sampling strategies. Serial-sample UPGMA is particularly useful for analysis of longitudinal samples of viruses and bacteria, as well as ancient DNA samples, with the minimal requirement that samples of sequences be ordered in time.

## Introduction

It is well known that some of the more pernicious human viral pathogens evolve rapidly. Indeed, it is their evolution that stymies attempts to battle infection with antiviral drugs—resistance evolves too quickly. With HIV-1, for instance, 10^{−5}–10^{−4} substitutions accumulate at each site in each generation, and there are an estimated 140–300 generations per year (Perelson et al. 1996<$REFLINK> ; Rodrigo et al. 1999<$REFLINK> ). Parts of the HIV genome have been shown to accumulate substitutions at a rate of 0.92% per year (Shankarappa et al. 1999<$REFLINK> ). There is some thought in the research community that understanding how these viruses evolve is the key to understanding how one may control disease. Recent results give us cause to think that this may be true: a study by Shankarappa et al. (1999)<$REFLINK> found that in nine individuals infected with HIV, the pattern of viral evolution within each patient was strikingly similar, with certain features that appeared predictive of progression to AIDS. If such commonality of pattern is universal, then generalizations can be made about the process of evolution that such patterns suggest, and this, in turn, may lead to a strategy to control progression.

The study by Shankarappa et al. (1999)<$REFLINK> involved repeated sampling of the viral population from each individual over several years, but such sampling schemes are not uncommon for such rapidly evolving pathogens (Holmes et al. 1992<$REFLINK> ; Wolinsky et al. 1996<$REFLINK> ; Rodrigo et al. 1999<$REFLINK> ). A starting point for many evolutionary and population genetic methods is a reconstructed phylogeny of sampled sequences (Felsenstein 1992<$REFLINK> ; Fu 1994<$REFLINK> ; Nee et al. 1995<$REFLINK> ; Pybus, Rambaut, and Harvey 2000<$REFLINK> ), often under the assumption of a molecular clock, but until now, there has been no method for reconstructing evolutionary trees of serially sampled sequences under this assumption. In this paper, we present such a method. The serial-sample unweighted pair grouping method with arithmetic means (sUPGMA) is a fast, flexible phylogenetic reconstruction method that can be used whenever samples have been obtained at different times. These samples may be of sequences from a rapidly evolving viral population obtained from within a patient over the course of infection or from cohorts of individuals sampled over time. We demonstrate the efficiency of sUPGMA at recovering the true topology and describe accessory analyses that allow the estimation of population parameters and mutation rate. Finally, we discuss various extensions of sUPGMA and its associated analyses.

## Serial-Sample UPGMA

Consider the following sampling scheme. A population is sampled several times over the course of a study period, and at each sampling time a number of sequences are obtained. If these sequences have evolved so that all lineages accumulate substitutions at the same rate over the same period of time (i.e., according to a molecular clock), then the best representation or model of their phylogeny will look something like that shown in figure 1*E.* Here, six sequences were sampled, two at each of three time points. One would expect, if clocklike evolution were occurring, that sequences from the same time point would terminate at identical times. One method for reconstructing phylogenies of sequences according to a molecular clock is the unweighted paired group method with arithmetic means (UPGMA; see Sneath and Sokal 1973<$REFLINK> ). However, with UPGMA, all tips on the tree terminate at the same time (i.e., the tree is ultrametric). What is required to reconstruct the phylogeny shown in figure 1*E* is a method that allows the tips to terminate at different times but constrains tips sampled at the same time to terminate at identical distances from the root. Serial-sample UPGMA allows for this. The method consists of four sequential steps.

### Step 1. Estimation of δ's

Simply, step one involves estimating the expected number of substitutions per site accumulating between sampling times. It has been shown how this may be done for pairs of samples (Y.-X. Fu, personal communication). The expected distance between a pair of sequences, one from a later time point and the other from an earlier time point, is

The first term on the right-hand side is simply the expected average distance between any two sequences from the earlier time point. To obtain an estimate of δ, we substitute the average pairwise distance between early and late sequences calculated from our sample for the term on the left and the average pairwise distance between pairs of early sequences for the first term on the right and solve. The problem becomes tricky when there are more than two time points, because now it becomes possible to calculate δ's for every possible pair of sampling times. The problem with this approach is that it may happen that, for any three time points *A, B,* and *C* (where *C* is earlier than *B,* which is earlier than *A*), δ̂_{CA} ≠ δ̂_{CB} + δ̂_{BA} (where δ̂ is the estimated value), when, in fact, under any reasonable model, the equivalent equality must be true. To overcome this problem, we adopted a general regression approach to estimate δ, as follows. Consider a data set of *p* samples, with sample *i* obtained more recently than sample *i* + 1 (*i* ∈ 1, …, *p*). Let *d*(*m _{i}*,

*n*) be the evolutionary distance between the

_{j}*i*th sequence of the

*m*th sample and the

*j*th sequence of the

*n*th sample; by convention, we will assume that

*m*≥

*n*; i.e., we will only consider elements in the diagonal and lower triangular matrix of pairwise distances.

We can model each *d*(*m _{i}*,

*n*) by its expectation

_{j}*E*[

*d*(

*m*,

_{i}*n*)], and from equation (1) , we obtain

_{j}For reasons that will become obvious below, we will designate *E*[*d*(*m*^{(1)}_{i}, *m*^{(2)}_{i})] = Θ_{m}. In addition, δ_{m→n} can be written as the sum of δ_{m→m−1}, δ_{m−1→m−2}, …, δ_{n+1→n}. Thus, we can write the linear equation relating the distances to the parameters as

_{k→k−1}is the expected number of substitutions that have accumulated between the

*k*th and the (

*k*− 1)th samples, and ϵ

_{mi,nj}is the error due to natural variation, measurement, and sampling.

The vector of estimated parameters **â** = {Θ̂_{1}, Θ̂_{2}, …, Θ̂_{p}, δ̂_{2→1}, …, δ̂_{p−1→p}} is obtained by the standard least-squares solution:

**d**is a vector of pairwise distances. With this approach, the estimate of the δ's satisfies the condition δ̂

_{CA}= δ̂

_{CB}+ δ̂

_{BA}. One additional constraint that we make to the δ's is to set any value of δ that has been estimated as a negative value to 0.

For the estimation approach above, it is not essential to know the actual sampling times, only the order in which the samples were drawn. If the actual sampling times are known, then an alternative approach to estimating δ is to estimate a single constant, ω, effectively the number of substitutions per unit time, and multiply this by the time interval between two sampling occasions, i.e., ω(*t*_{1} − *t*_{2}).

Once again, we estimate ω using a regression procedure. In this case,

where*t*is the time at which the

_{k}*k*th sample was obtained. Note that ω is not the mutation rate per generation unless time is expressed in generation units. However, ω can be converted to the mutation rate (i.e., the number of substitutions per site per generation) if the generation time is known.

### Step 2. Correction of Pairwise Distances

Each pairwise distance *d _{ij}* in the distance matrix is now transformed to a corrected distance,

*c*(

*m*,

_{i}*n*), as follows:

_{j}_{m→1}and δ̂

_{n→1}are the δ's associated with the divergence between samples

*m*and

*n*and the most recent sampling occasion (labeled “1”). What this does, in effect, is extend the distances of sequences sampled earlier to a value that approximates the expected divergences of sequences obtained most recently.

### Step 3. Cluster Using UPGMA

In step 3, UPGMA or WPGMA (weighted PGMA; Sneath and Sokal 1973<$REFLINK> ) is applied to the corrected distance matrix.

### Step 4. Trim Back branches

Once the UPGMA tree has been constructed, for any terminal lineages which extend to sequences in sample *m,* δ̂_{t(i)→0} is subtracted from the branch length. The sUPGMA tree has the topology recovered by UPGMA (on corrected distances), with tips terminating in the appropriate order of sampling.

## Estimation of Population Parameters and Mutation Rate

As described in step 1 above, a vector of parameters is estimated as part of the tree-building algorithm. This vector takes the form **â** = {Θ̂_{1}, Θ̂_{2}, …, Θ̂_{p}, δ̂_{2→1}, …, δ̂_{p−1→p}} when the order of samples is known and **â** = {Θ̂_{1}, …, Θ̂_{p}, ω̂} when exact times are known. Of course, within this framework, there is no need to specify a model with different values of Θ; instead, we could estimate a single parameter, Θ_{0}, such that **â** = {Θ̂_{0}, ω̂}. In this case, the average pairwise diversity at each time point is effectively a random variable with expectation Θ_{0}. Setting Θ_{0} as a constant is equivalent to assuming a population model with constant effective size: under such a model, Θ_{0} = 2*N*_{e}μ, where *N*_{e} is the effective population size and μ is the mutation rate per site per generation (Tajima 1983<$REFLINK> ).

Although the interpretation of a single Θ_{0} is easily accommodated within a simple constant-sized population model, this is not the case when multiple Θ's are estimated. Multiple Θ's should not be taken as (independent) estimates of different 2*N*_{e}μ values, because the overlap in genealogies from one sample to the next affects the pairwise distances of the sequences in a complex way. The simple assignment of different Θ's in our model does not incorporate these complexities.

However we choose to define our model, the variance of the estimates cannot be easily calculated analytically. However, for a constant-sized population model at least, a parametric bootstrap method for obtaining the variance of these estimates can be implemented. For a given set of parameter estimates, a large number (typically >1,000) of serially sampled genealogies can be simulated using the estimated parameters (and assuming a constant population size) to generate pseudoreplicate data sets. For each generated pseudoreplicate, the sUPGMA procedure is then repeated, resulting in a range of estimates for Θ's and δ's or ω. For a 95% confidence interval and 1,000 replicates, the 26th and 975th estimates (when ranked) are taken as the upper and lower 95% confidence limits of the original estimate.

## Efficiency of Tree Reconstruction

To test the efficiency of sUPGMA, simulated data sets were created for which the real phylogenetic tree was known. Rodrigo and Felsenstein (1999)<$REFLINK> described how Kingman's (1982*a*, 1982*b*) *n*-coalescent, essentially a diffusion approximation of the times of *n* − 1 coalescent events on an *n*-taxon tree, could be extended to coalescent trees with noncontemporaneous tips. One of the novel properties of coalescent trees of serial samples is that sampling a direct descendant of a sequence sampled at an earlier time point becomes possible (although unlikely when *N*_{e} is very large). The probability of a single lineage from a later time point having a direct ancestor in an earlier sample is equal to the fraction of the total population size sampled at the earlier time (*n*_{t(earlier)}/*N*_{e}) (Epperson 1999<$REFLINK> ). This possibility was also permitted in the simulations performed, representing an extension of the original description of the serial-sample coalescent of Rodrigo and Felsenstein (1999)<$REFLINK> . It should be noted that this inclusion results in the possibility of multiple coalescent events occurring at the same time point when more than one direct ancestor is sampled at one time point. However, this happens at an appreciable rate only when the assumption of a very large population size is broken (i.e., when *n*^{2} ≥ *N*_{e}). At this point, the diffusion approximation of coalescent intervals itself is no longer valid. To avoid this problem, values of *N*_{e} were selected so that *n*^{2} was always smaller than *N*_{e}. Therefore, the simulations were performed under the assumption of a constant population size, and *N*_{e} was set to 10,000, which is large enough to fulfill the requirement that *n*^{2} ≪ *N*_{e}. The mutation rate was set to 5 × 10^{−6} mutations per site per generation. This resulted in an overall Θ value of 0.1 (for a haploid population), comparable to published values for HIV evolution (Leigh Brown 1997; Rodrigo et al. 1999<$REFLINK> ). The model of evolution used in the simulations was a simple Jukes-Cantor substitution model (Jukes and Cantor 1969<$REFLINK> ). The simulated genealogies were drawn from populations with no selection, recombination, or subdivision.

The serial-sample coalescent algorithm was implemented in a small Java program for the purpose of generating coalescent trees under a variety of different sampling strategies (table 1 ). This allowed an appraisal of the effect of different sampling strategies on the accuracy of tree-building algorithms. For each sampling strategy tested, a range of 0.5% to 10% intersample divergence was tested, with an increment of 0.5%. For each sampling strategy and each divergence, 1,000 simulated genealogies were constructed. All simulations resulted in time-ordered DNA sequences of 1,000 nt in length. This length was comparable with lengths of many gene loci available for phylogenetic study and is not so long that assuming no recombination is untenable. For each simulation, a pairwise Jukes-Cantor distance matrix was constructed. The ability of sUPGMA and UPGMA to correctly reconstruct the simulated genealogies using the pairwise distances was evaluated. The reconstructed trees of each method were compared with the real tree using the symmetric difference index (SDI) tree comparison metric (Robinson and Foulds 1981<$REFLINK> ). This metric counts the number of clades in each tree that are not present in the other tree.

Figure 2 shows the performance of sUPGMA and UPGMA on serially sampled data sets with four serial samples. Essentially the same pattern was seen for all sampling strategies. The performance of sUPGMA generally increases with divergence, while the performance of UPGMA generally decreases. The graphs in figure 2 indicate that once some low threshold of intersample divergence is exceeded, sUPGMA reconstructs the genealogy more accurately than UPGMA. Table 2 shows the approximate threshold values for a variety of sampling strategies. Each threshold value was found by picking the lowest divergence for which sUPGMA performed better on average than UPGMA. In general, our simulations indicated that the divergence threshold decreases with an increase in the size of each sample. Therefore, collecting more sequences within each time point improves the ability of the least-squares procedure to detect small divergences.

## Efficiency of Parameter Estimation

The efficiency of parameter estimation of Θ, ω, and δ's was measured by simulating two sets of 1,000 serially sampled genealogies, one parameterized in accordance with equation (3a) , and the other in accordance with equation (5) . One thousand genealogies of four samples, each with five sequences, were simulated under the Jukes-Cantor model of substitution, resulting in time-ordered sequences of 1,000 nucleotides. Figure 3 shows the distribution of estimates of Θ (true value = 0.1) for the 1,000 simulations with a total divergence over the four samples of 6%. The mean estimate of Θ was 0.0986, with a skewness statistic of 1.753, showing that the least-squares procedure produces estimates of Θ that are unbiased but have a positively skewed distribution (tables 3 and 4 ). The least-squares estimators of δ_{1}, δ_{2}, δ_{3}, and ω are also unbiased, although once again, the distributions of the estimates are skewed. Figures 4–7 show frequency distributions for estimates of δ_{1}, δ_{2}, δ_{3}, and ω, respectively.

## An Example Data Set

In this section, we illustrate the use of sUPGMA with a data set of serially sampled partial envelope (*env*) gene sequences of cell-associated HIV DNA obtained from a long-term asymptomatic individual over five sampling occasions. These samples and the patient history have previously been described (Rodrigo et al. 1999<$REFLINK> ). In total, there were 60 sequences in this data set. Pairwise distances were constructed using a general time-reversible model allowing for unequal nucleotide frequencies and relative rates of substitutions. Substitution and frequency parameters of the substitution model were estimated with PAUP*, version 4.0b4 (D. Swofford, Smithsonian Institution). sUPGMA was applied to the pairwise distance matrix to reconstruct the serial genealogy of the sequences, allowing different values of Θ and δ's. We also reconstructed the genealogy by assuming a constant Θ and mutation rate, ω, and used parametric bootstrapping of 1,000 simulated trees to obtain 95% confidence intervals for these parameters. The reconstructed trees are shown in figure 8 , and the associated parameter estimates are given in table 5 .

It is instructive to consider some of the main points of these results. When Θ and δ are allowed to vary, sUPGMA is unable to distinguish between samples 2 and 3; i.e., for this interval, δ = 0. In fact, these two samples were obtained only 1 month apart, so this result is reasonable. When Θ is held constant and ω is estimated, the values obtained are Θ = 0.0446 (95% confidence interval [0.0184, 0.1016]) and ω = 7.8 × 10^{−6} substitutions per site per day (95% confidence interval [−3.47 × 10^{−6}, 3.87 × 10^{−5}]). This estimate of ω translates into an annual substitution rate of 0.3%. This is certainly lower than other HIV-1 *env* gene substitution rate estimates that have previously been obtained, which are on the order of 1% per year (Shankarappa et al. 1999<$REFLINK> ). It is not clear why our estimate of substitution rate is three times as low as other estimates. It is pertinent to note that with the patient from whom the samples were obtained, antiretroviral therapy was initiated at an early stage of the study, and this, in turn, may have lengthened the average generation time of infected cells (see below) and consequently lowered the substitution rate. When a varying substitution rate was allowed, the average rate obtained over the entire 1,005 days of the study was 1.53 × 10^{−5} substitutions per site per day (0.6% per year), which is closer to previously obtained results. However, this mean rate is still deflated by the very slow substitution rate observed in the last 306 days of the study (see table 5 ).

Interestingly, the 95% confidence interval of our estimate of mutation rate encloses 0. While this can mean that there is no evidence that there has been a detectable substitution accumulation over time, it can also mean that there were some sequences obtained at a later time point that appear more closely related to those from an earlier time point. In fact, in the original tree published by Rodrigo et al. (1999)<$REFLINK> , this appears to be the case.

## Discussion

Serial-sample UPGMA is a variant of UPGMA which constructs genealogies of samples of sequences obtained at different times under the assumption of a molecular clock. Serial-sample UPGMA is a two-step procedure. The first step involves estimating the expected sequence divergence between samples obtained at different times. The second step requires the construction of a corrected distance matrix adjusted to take account of these expected divergences, and subsequent clustering using UPGMA. Given a more accurate estimation procedure for the divergences, the accuracy of sUPGMA tree reconstruction can be improved. For example, given a perfect estimate of divergences, the sUPGMA procedure will perform better than UPGMA under all sampling strategies and divergences (simulations not shown). Therefore, the threshold divergences required for sUPGMA to outperform UPGMA will be reduced by the use of better estimators of δ's and/or ω.

When a molecular clock does not apply, UPGMA is known to perform poorly as a tree reconstruction method. However, in the case of clocklike data that have experienced large amounts of evolution, the accuracy of UPGMA in reconstructing clocklike genealogies has been favorably compared with methods such as maximum-likelihood phylogenetic reconstruction (Pybus, Rambaut, and Harvey 2000<$REFLINK> ). Our results demonstrate that the accuracy of UPGMA for phylogenetic reconstruction can be improved by modifying the distances between longitudinally sampled sequences to correct for the extra divergence expected between earlier time points and the most recent time point. The rationale behind using sUPGMA as a basis for a tree reconstruction procedure for serial samples is to provide an (1) accurate and (2) rapid estimation of a serially sampled genealogy. Both the criteria for large divergences and clocklike evolution are fulfilled in at least some virus populations (Gojobori, Moriyama, and Kimura 1990<$REFLINK> ; Leitner and Albert 1999<$REFLINK> ; Shankarappa et al. 1999<$REFLINK> ). However, perhaps most importantly, the speed of sUPGMA allows very large data sets (with hundreds or thousands of sequences) to be analyzed with relative ease. This is an important feature when taking into account the sizes of genealogies already under consideration (e.g., Shankarappa et al. 1999<$REFLINK> ). The distance-corrected matrix that is constructed as part of sUPGMA can also be used with other members of the family of hierarchical algorithmic clustering methods such as WPGMA, complete-linkage and single-linkage clustering.

As part of our parameter estimation procedures, we also introduce two parameterizations of expected intersample sequence divergence. In one case—ω parameterization—divergence is expressed as a product of the sampling interval and mutation rate (with the latter scaled to the same units of time as the sampling interval). A second parameterization that we use, δ parameterization, is less constrained. With δ parameterization, the *i*th interval between two sampling occasions is effectively allowed to have its own mutation rate, ω_{i}, so that δ_{i} = ω* _{i}t_{i}*, where

*t*is the length of the interval. In a sense, δ parameterization provides a new intermediate model of evolution between the two extremes of a strict molecular clock and the absence of a molecular clock. We call this intermediate model the varying-clock model. With HIV, for instance, the application of antiretroviral therapy leads to changes in the relative frequencies of different infected cell types (Perelson et al. 1996<$REFLINK> ). Since each cell type has a different mean generation time, a change in population structure will lead to a change in mean generation time and, consequently, a change in the average mutation rate. This was already alluded to above when we analyzed our example data set. Under such conditions, a varying-clock model may be appropriate. (Note that the varying-clock model we propose is different from lineage-specific models of variable mutation rates. In the latter, the mutation rate is assumed to change independently along different branches of the tree [Thorne, Kishino, and Painter 1998<$REFLINK> ; Huelsenbeck, Larget, and Swofford 2000]<$REFLINK> .)

_{i}Although we focused on rapidly evolving viral populations in this paper, it should be obvious that sUPGMA and its associated procedures of parameter estimation apply equally well to eukaryotic populations from which ancient and/or archival DNA is available. We anticipate that the search for better methods to analyze such populations will only become more important with the increasing frequency of longitudinal sampling strategies and the acquisition of DNA samples from ancient or archival material.

A computer program called PEBBLE that implements sUPGMA and other related methods, written in the Java programming language, can be obtained from *www.cebl.auckland.ac.nz.* This software will run on all computer platforms that support the Java Virtual Machine version 1.1 (JVM 1.1). This includes Microsoft Windows, Linux, and MacOS.

Edward Holmes, Reviewing Editor

Keywords: longitudinal samples serial samples ancient DNA phylogenetic reconstruction UPGMA sUPGMA

Address for correspondence and reprints: Allen G. Rodrigo, School of Biological Sciences, University of Auckland, Private Bag 92019, Auckland, New Zealand. E-mail: a.rodrigo@auckland.ac.nz.

We thank Matthew Goode and Greg Ewing for assistance in developing the PEBBLE software package and the distributed programming techniques used for the simulations. For helpful comments and discussion, we thank Roald Forsberg and David Nickle. This work was supported by NIH grant 59174.

## literature cited

**152**:797–806

**136**:685–692

**87**:10015–10018

**89**:4835–4839

**154**:1879–1892

*in*H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York

*a.*On the genealogy of large populations. J. Appl. Probability

**19A**:27–43

*b.*The coalescent. Stochastic Processes Appl.

**13**:235–248

**94**:1862–1865

**96**:10752–10757

**271**:1582–1586

**155**:1429–1437

*in*K. Crandall, ed. The evolution of HIV. Johns Hopkins University Press, Baltimore, Md

**96**:2187–2191

**105**:437–460

**272**:537–542