## Abstract

Reconstruction of evolutionary relationships from noncontemporaneous molecular samples provides a new challenge for phylogenetic reconstruction methods. With recent biotechnological advances there has been an increase in molecular sequencing throughput, and the potential to obtain serial samples of sequences from populations, including rapidly evolving pathogens, is fast being realized. A new method called the serial-sample unweighted pair grouping method with arithmetic means (sUPGMA) is presented that reconstructs a genealogy or phylogeny of sequences sampled serially in time using a matrix of pairwise distances. The resulting tree depicts the terminal lineages of each sample ending at a different level consistent with the sample's temporal order. Since sUPGMA is a variant of UPGMA, it will perform best when sequences have evolved at a constant rate (i.e., according to a molecular clock). On simulated data, this new method performs better than standard cluster analysis under a variety of longitudinal sampling strategies. Serial-sample UPGMA is particularly useful for analysis of longitudinal samples of viruses and bacteria, as well as ancient DNA samples, with the minimal requirement that samples of sequences be ordered in time.

## Introduction

It is well known that some of the more pernicious human viral pathogens evolve rapidly. Indeed, it is their evolution that stymies attempts to battle infection with antiviral drugs—resistance evolves too quickly. With HIV-1, for instance, 10−5–10−4 substitutions accumulate at each site in each generation, and there are an estimated 140–300 generations per year (Perelson et al. 1996<$REFLINK> ; Rodrigo et al. 1999<$REFLINK> ). Parts of the HIV genome have been shown to accumulate substitutions at a rate of 0.92% per year (Shankarappa et al. 1999<$REFLINK> ). There is some thought in the research community that understanding how these viruses evolve is the key to understanding how one may control disease. Recent results give us cause to think that this may be true: a study by Shankarappa et al. (1999)<$REFLINK> found that in nine individuals infected with HIV, the pattern of viral evolution within each patient was strikingly similar, with certain features that appeared predictive of progression to AIDS. If such commonality of pattern is universal, then generalizations can be made about the process of evolution that such patterns suggest, and this, in turn, may lead to a strategy to control progression.

The study by Shankarappa et al. (1999)<$REFLINK> involved repeated sampling of the viral population from each individual over several years, but such sampling schemes are not uncommon for such rapidly evolving pathogens (Holmes et al. 1992<$REFLINK> ; Wolinsky et al. 1996<$REFLINK> ; Rodrigo et al. 1999<$REFLINK> ). A starting point for many evolutionary and population genetic methods is a reconstructed phylogeny of sampled sequences (Felsenstein 1992<$REFLINK> ; Fu 1994<$REFLINK> ; Nee et al. 1995<$REFLINK> ; Pybus, Rambaut, and Harvey 2000<$REFLINK> ), often under the assumption of a molecular clock, but until now, there has been no method for reconstructing evolutionary trees of serially sampled sequences under this assumption. In this paper, we present such a method. The serial-sample unweighted pair grouping method with arithmetic means (sUPGMA) is a fast, flexible phylogenetic reconstruction method that can be used whenever samples have been obtained at different times. These samples may be of sequences from a rapidly evolving viral population obtained from within a patient over the course of infection or from cohorts of individuals sampled over time. We demonstrate the efficiency of sUPGMA at recovering the true topology and describe accessory analyses that allow the estimation of population parameters and mutation rate. Finally, we discuss various extensions of sUPGMA and its associated analyses.

## Serial-Sample UPGMA

### Step 4. Trim Back branches

Once the UPGMA tree has been constructed, for any terminal lineages which extend to sequences in sample m, δ̂t(i)→0 is subtracted from the branch length. The sUPGMA tree has the topology recovered by UPGMA (on corrected distances), with tips terminating in the appropriate order of sampling.

The serial-sample coalescent algorithm was implemented in a small Java program for the purpose of generating coalescent trees under a variety of different sampling strategies (table 1 ). This allowed an appraisal of the effect of different sampling strategies on the accuracy of tree-building algorithms. For each sampling strategy tested, a range of 0.5% to 10% intersample divergence was tested, with an increment of 0.5%. For each sampling strategy and each divergence, 1,000 simulated genealogies were constructed. All simulations resulted in time-ordered DNA sequences of 1,000 nt in length. This length was comparable with lengths of many gene loci available for phylogenetic study and is not so long that assuming no recombination is untenable. For each simulation, a pairwise Jukes-Cantor distance matrix was constructed. The ability of sUPGMA and UPGMA to correctly reconstruct the simulated genealogies using the pairwise distances was evaluated. The reconstructed trees of each method were compared with the real tree using the symmetric difference index (SDI) tree comparison metric (Robinson and Foulds 1981<$REFLINK> ). This metric counts the number of clades in each tree that are not present in the other tree. Figure 2 shows the performance of sUPGMA and UPGMA on serially sampled data sets with four serial samples. Essentially the same pattern was seen for all sampling strategies. The performance of sUPGMA generally increases with divergence, while the performance of UPGMA generally decreases. The graphs in figure 2 indicate that once some low threshold of intersample divergence is exceeded, sUPGMA reconstructs the genealogy more accurately than UPGMA. Table 2 shows the approximate threshold values for a variety of sampling strategies. Each threshold value was found by picking the lowest divergence for which sUPGMA performed better on average than UPGMA. In general, our simulations indicated that the divergence threshold decreases with an increase in the size of each sample. Therefore, collecting more sequences within each time point improves the ability of the least-squares procedure to detect small divergences. ## Efficiency of Parameter Estimation The efficiency of parameter estimation of Θ, ω, and δ's was measured by simulating two sets of 1,000 serially sampled genealogies, one parameterized in accordance with equation (3a) , and the other in accordance with equation (5) . One thousand genealogies of four samples, each with five sequences, were simulated under the Jukes-Cantor model of substitution, resulting in time-ordered sequences of 1,000 nucleotides. Figure 3 shows the distribution of estimates of Θ (true value = 0.1) for the 1,000 simulations with a total divergence over the four samples of 6%. The mean estimate of Θ was 0.0986, with a skewness statistic of 1.753, showing that the least-squares procedure produces estimates of Θ that are unbiased but have a positively skewed distribution (tables 3 and 4 ). The least-squares estimators of δ1, δ2, δ3, and ω are also unbiased, although once again, the distributions of the estimates are skewed. Figures 4–7 show frequency distributions for estimates of δ1, δ2, δ3, and ω, respectively. ## An Example Data Set In this section, we illustrate the use of sUPGMA with a data set of serially sampled partial envelope (env) gene sequences of cell-associated HIV DNA obtained from a long-term asymptomatic individual over five sampling occasions. These samples and the patient history have previously been described (Rodrigo et al. 1999<$REFLINK> ). In total, there were 60 sequences in this data set. Pairwise distances were constructed using a general time-reversible model allowing for unequal nucleotide frequencies and relative rates of substitutions. Substitution and frequency parameters of the substitution model were estimated with PAUP*, version 4.0b4 (D. Swofford, Smithsonian Institution). sUPGMA was applied to the pairwise distance matrix to reconstruct the serial genealogy of the sequences, allowing different values of Θ and δ's. We also reconstructed the genealogy by assuming a constant Θ and mutation rate, ω, and used parametric bootstrapping of 1,000 simulated trees to obtain 95% confidence intervals for these parameters. The reconstructed trees are shown in figure 8 , and the associated parameter estimates are given in table 5 .

It is instructive to consider some of the main points of these results. When Θ and δ are allowed to vary, sUPGMA is unable to distinguish between samples 2 and 3; i.e., for this interval, δ = 0. In fact, these two samples were obtained only 1 month apart, so this result is reasonable. When Θ is held constant and ω is estimated, the values obtained are Θ = 0.0446 (95% confidence interval [0.0184, 0.1016]) and ω = 7.8 × 10−6 substitutions per site per day (95% confidence interval [−3.47 × 10−6, 3.87 × 10−5]). This estimate of ω translates into an annual substitution rate of 0.3%. This is certainly lower than other HIV-1 env gene substitution rate estimates that have previously been obtained, which are on the order of 1% per year (Shankarappa et al. 1999<$REFLINK> ). It is not clear why our estimate of substitution rate is three times as low as other estimates. It is pertinent to note that with the patient from whom the samples were obtained, antiretroviral therapy was initiated at an early stage of the study, and this, in turn, may have lengthened the average generation time of infected cells (see below) and consequently lowered the substitution rate. When a varying substitution rate was allowed, the average rate obtained over the entire 1,005 days of the study was 1.53 × 10−5 substitutions per site per day (0.6% per year), which is closer to previously obtained results. However, this mean rate is still deflated by the very slow substitution rate observed in the last 306 days of the study (see table 5 ). Interestingly, the 95% confidence interval of our estimate of mutation rate encloses 0. While this can mean that there is no evidence that there has been a detectable substitution accumulation over time, it can also mean that there were some sequences obtained at a later time point that appear more closely related to those from an earlier time point. In fact, in the original tree published by Rodrigo et al. (1999)<$REFLINK> , this appears to be the case.

## Discussion

Serial-sample UPGMA is a variant of UPGMA which constructs genealogies of samples of sequences obtained at different times under the assumption of a molecular clock. Serial-sample UPGMA is a two-step procedure. The first step involves estimating the expected sequence divergence between samples obtained at different times. The second step requires the construction of a corrected distance matrix adjusted to take account of these expected divergences, and subsequent clustering using UPGMA. Given a more accurate estimation procedure for the divergences, the accuracy of sUPGMA tree reconstruction can be improved. For example, given a perfect estimate of divergences, the sUPGMA procedure will perform better than UPGMA under all sampling strategies and divergences (simulations not shown). Therefore, the threshold divergences required for sUPGMA to outperform UPGMA will be reduced by the use of better estimators of δ's and/or ω.

When a molecular clock does not apply, UPGMA is known to perform poorly as a tree reconstruction method. However, in the case of clocklike data that have experienced large amounts of evolution, the accuracy of UPGMA in reconstructing clocklike genealogies has been favorably compared with methods such as maximum-likelihood phylogenetic reconstruction (Pybus, Rambaut, and Harvey 2000<$REFLINK> ). Our results demonstrate that the accuracy of UPGMA for phylogenetic reconstruction can be improved by modifying the distances between longitudinally sampled sequences to correct for the extra divergence expected between earlier time points and the most recent time point. The rationale behind using sUPGMA as a basis for a tree reconstruction procedure for serial samples is to provide an (1) accurate and (2) rapid estimation of a serially sampled genealogy. Both the criteria for large divergences and clocklike evolution are fulfilled in at least some virus populations (Gojobori, Moriyama, and Kimura 1990<$REFLINK> ; Leitner and Albert 1999<$REFLINK> ; Shankarappa et al. 1999<$REFLINK> ). However, perhaps most importantly, the speed of sUPGMA allows very large data sets (with hundreds or thousands of sequences) to be analyzed with relative ease. This is an important feature when taking into account the sizes of genealogies already under consideration (e.g., Shankarappa et al. 1999<$REFLINK> ). The distance-corrected matrix that is constructed as part of sUPGMA can also be used with other members of the family of hierarchical algorithmic clustering methods such as WPGMA, complete-linkage and single-linkage clustering. As part of our parameter estimation procedures, we also introduce two parameterizations of expected intersample sequence divergence. In one case—ω parameterization—divergence is expressed as a product of the sampling interval and mutation rate (with the latter scaled to the same units of time as the sampling interval). A second parameterization that we use, δ parameterization, is less constrained. With δ parameterization, the ith interval between two sampling occasions is effectively allowed to have its own mutation rate, ωi, so that δi = ωiti, where ti is the length of the interval. In a sense, δ parameterization provides a new intermediate model of evolution between the two extremes of a strict molecular clock and the absence of a molecular clock. We call this intermediate model the varying-clock model. With HIV, for instance, the application of antiretroviral therapy leads to changes in the relative frequencies of different infected cell types (Perelson et al. 1996<$REFLINK> ). Since each cell type has a different mean generation time, a change in population structure will lead to a change in mean generation time and, consequently, a change in the average mutation rate. This was already alluded to above when we analyzed our example data set. Under such conditions, a varying-clock model may be appropriate. (Note that the varying-clock model we propose is different from lineage-specific models of variable mutation rates. In the latter, the mutation rate is assumed to change independently along different branches of the tree [Thorne, Kishino, and Painter 1998<$REFLINK> ; Huelsenbeck, Larget, and Swofford 2000]<$REFLINK> .)

Although we focused on rapidly evolving viral populations in this paper, it should be obvious that sUPGMA and its associated procedures of parameter estimation apply equally well to eukaryotic populations from which ancient and/or archival DNA is available. We anticipate that the search for better methods to analyze such populations will only become more important with the increasing frequency of longitudinal sampling strategies and the acquisition of DNA samples from ancient or archival material.

A computer program called PEBBLE that implements sUPGMA and other related methods, written in the Java programming language, can be obtained from www.cebl.auckland.ac.nz. This software will run on all computer platforms that support the Java Virtual Machine version 1.1 (JVM 1.1). This includes Microsoft Windows, Linux, and MacOS.

Edward Holmes, Reviewing Editor

1

Keywords: longitudinal samples serial samples ancient DNA phylogenetic reconstruction UPGMA sUPGMA

2

Address for correspondence and reprints: Allen G. Rodrigo, School of Biological Sciences, University of Auckland, Private Bag 92019, Auckland, New Zealand. E-mail: a.rodrigo@auckland.ac.nz.

Table 1 Sampling Strategies Under Which Phylogenetic Reconstruction Was Tested

Table 2 Threshold Values for Total Divergence (expected substitutions per site) Over Which sUPGMA Outperforms UPGMA

Table 3 Statistics of Estimated {ϒ} and {δ}'s for 1,000 Simulated Data Sets of Four Samples of Five Sequences

Table 4 Statistics of estimated {ϒ} and {ω} for 1,000 Simulated Data Sets of Four Samples of Five Sequences

Table 5 Estimated {ϒ} and {ω} for 1,000 Simulated Data Sets of Four Samples of Five Sequences

Fig. 1.—The main steps involved in the sUPGMA procedure. A, First, a distance matrix of the sequences sampled must be collected. B, A matrix is constructed that relates each observed distance to the parameters to be estimated. Each row in B corresponds to an instance of equation (2) , and the binary values in the columns correspond to the X's in equation (3). For convenience, only a single Θ is estimated in this example. Once this matrix is constructed, the least-squares solution (eq. 4 ) can be used to estimate the parameters. C, The estimated values of δ are then used to correct the original distance matrix (eq. 6 ). D, A standard UPGMA tree is constructed from these corrected distances. E, The branches in the UPGMA tree are then trimmed using the estimated δ's to produce the serially sampled genealogy

Fig. 1.—The main steps involved in the sUPGMA procedure. A, First, a distance matrix of the sequences sampled must be collected. B, A matrix is constructed that relates each observed distance to the parameters to be estimated. Each row in B corresponds to an instance of equation (2) , and the binary values in the columns correspond to the X's in equation (3). For convenience, only a single Θ is estimated in this example. Once this matrix is constructed, the least-squares solution (eq. 4 ) can be used to estimate the parameters. C, The estimated values of δ are then used to correct the original distance matrix (eq. 6 ). D, A standard UPGMA tree is constructed from these corrected distances. E, The branches in the UPGMA tree are then trimmed using the estimated δ's to produce the serially sampled genealogy

Fig. 2.—Phylogenetic reconstruction performances of sUPGMA and UPGMA on four samples

Fig. 2.—Phylogenetic reconstruction performances of sUPGMA and UPGMA on four samples

Fig. 3.—Θ estimates for four samples of five sequences for 1,000 simulated trees (real value of Θ = 0.1, total divergence = 0.06 expected substitutions)

Fig. 3.—Θ estimates for four samples of five sequences for 1,000 simulated trees (real value of Θ = 0.1, total divergence = 0.06 expected substitutions)

Fig. 4.—δ1 estimates for four samples of five sequences for 1,000 simulated trees (real value of δ1 = 0.02)

Fig. 4.—δ1 estimates for four samples of five sequences for 1,000 simulated trees (real value of δ1 = 0.02)

Fig. 5.—δ2 estimates for four samples of five sequences for 1,000 simulated trees (real value of δ2 = 0.02)

Fig. 5.—δ2 estimates for four samples of five sequences for 1,000 simulated trees (real value of δ2 = 0.02)

Fig. 6.—δ3 estimates of four samples of five sequences for 1,000 simulated trees (real value of δ3 = 0.02)

Fig. 6.—δ3 estimates of four samples of five sequences for 1,000 simulated trees (real value of δ3 = 0.02)

Fig. 7.—Estimated mutation rate, ω, from 1,000 simulations of four samples of five sequences (real value = 5 × 10−6 substitutions per site per generation)

Fig. 7.—Estimated mutation rate, ω, from 1,000 simulations of four samples of five sequences (real value = 5 × 10−6 substitutions per site per generation)

Fig. 8.—Two sUPGMA trees constructed from an example data set. Tree A was constructed under the assumption of a constant population size and a constant mutation rate. Tree B was constructed allowing a different population size at each sampling point and allowing the varying-rate model, in which each time interval has a different mutation rate

Fig. 8.—Two sUPGMA trees constructed from an example data set. Tree A was constructed under the assumption of a constant population size and a constant mutation rate. Tree B was constructed allowing a different population size at each sampling point and allowing the varying-rate model, in which each time interval has a different mutation rate

We thank Matthew Goode and Greg Ewing for assistance in developing the PEBBLE software package and the distributed programming techniques used for the simulations. For helpful comments and discussion, we thank Roald Forsberg and David Nickle. This work was supported by NIH grant 59174.

## literature cited

Epperson, B. K.
1999
. Gene genealogies in geographically structured populations. Genetics 152:797–806
Felsenstein, J.
1992
. Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates.
Genet. Res.

59
:
139
–147
Fu, Y. X.
1994
. A phylogenetic estimator of effective population size or mutation rate. Genetics 136:685–692
Gojobori, T., E. N. Moriyama, and M. Kimura.
1990
. Molecular clock of viral evolution, and the neutral theory. Proc. Natl. Acad. Sci. USA 87:10015–10018
Holmes, E. C., L. Q. Zhang, P. Simmonds, C. A. Ludlam, and A. J. Leigh Brown.
1992
. Convergent and divergent sequence evolution in the surface envelope glycoprotein of HIV-1 within a single infected patient. Proc. Natl. Acad. Sci. USA 89:4835–4839
Huelsenbeck, J. P., B. Larget, and D. Swofford.
2000
. A compound Poisson process for relaxing the molecular clock. Genetics 154:1879–1892
Jukes, T. H., and C. R. Cantor.
1969
. Evolution of protein molecules. Pp. 21–132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York
Kingman, J. F. C. 1982a. On the genealogy of large populations. J. Appl. Probability 19A:27–43
———. 1982b. The coalescent. Stochastic Processes Appl. 13:235–248
Leigh Brown, A. J.
1997
. Analysis of HIV-1 env gene sequences reveals evidence for a low effective number in the viral population. Proc. Natl. Acad. Sci. USA 94:1862–1865
Leitner, T., and J. Albert.
1999
. The molecular clock of HIV-1 unveiled through analysis of a known transmission history. Proc. Natl. Acad. Sci. USA 96:10752–10757
Nee, S., E. C. Holmes, A. Rambaut, and P. H. Harvey.
1995
. Inferring population history from molecular phylogenies.
Philos. Trans. R. Soc. Lond. B Biol Sci.

349
:
25
–31
Perelson, A. S., A. U. Neumann, M. Markowitz, J. M. Leonard, and D. D. Ho.
1996
. HIV-1 dynamics in vivo: virion clearance rate, infected cell life-span, and viral generation time. Science 271:1582–1586
Pybus, O. G., A. Rambaut, and P. H. Harvey.
2000
. An integrated framework for the inference of viral population history from reconstructed genealogies. Genetics 155:1429–1437
Robinson, D. F., and L. R. Foulds.
1981
. Comparison of phylogenetic trees.
Math. Biosci.

53
:
131
–148
Rodrigo, A. G., and J. Felsenstein.
1999
. Coalescent approaches to HIV population genetics. Pp. 233–272 in K. Crandall, ed. The evolution of HIV. Johns Hopkins University Press, Baltimore, Md
Rodrigo, A. G., E. G. Shpaer, E. L. Delwart, A. K. N. Iversen, M. V. Gallo, J. Brojatsch, M. S. Hirsch, B. D. Walker, and J. I. Mullins.
1999
. Coalescent estimates of HIV-1 generation time in vivo. Proc. Natl. Acad. Sci. USA 96:2187–2191
Shankarappa, R., J. B. Margolick, S. J. Gange et al. (12 co-authors).
1999
. Consistent viral evolutionary dynamics associated with the progression of HIV-1 infection.
J. Virol.

73
:
10489
–10502
Sneath, P. H. A., and R. R. Sokal.
1973
. Numerical taxonomy. W. H. Freeman, San Francisco
Tajima, F.
1983
. Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437–460
Thorne, J. L., H. Kishino, and I. S. Painter.
1998
. Estimating the rate of evolution of the rate of molecular evolution.
Mol. Biol. Evol.

15
:
1647
–1657
Wolinsky, S. M., B. T. M. Korber, A. U. Neumann, M. Daniels, K. J. Kuntsman, A. J. Whetsell, M. R. Furtado, Y. Cao, D. D. Ho, and J. T. Safrit.
1996
. Adaptive evolution of HIV-1 during the natural course of infection. Science 272:537–542