## Abstract

We introduce the Bayesian skyline plot, a new method for estimating past population dynamics through time from a sample of molecular sequences without dependence on a prespecified parametric model of demographic history. We describe a Markov chain Monte Carlo sampling procedure that efficiently samples a variant of the generalized skyline plot, given sequence data, and combines these plots to generate a posterior distribution of effective population size through time. We apply the Bayesian skyline plot to simulated data sets and show that it correctly reconstructs demographic history under canonical scenarios. Finally, we compare the Bayesian skyline plot model to previous coalescent approaches by analyzing two real data sets (hepatitis C virus in Egypt and mitochondrial DNA of Beringian bison) that have been previously investigated using alternative coalescent methods. In the bison analysis, we detect a severe but previously unrecognized bottleneck, estimated to have occurred 10,000 radiocarbon years ago, which coincides with both the earliest undisputed record of large numbers of humans in Alaska and the megafaunal extinctions in North America at the beginning of the Holocene.

## Introduction

Considerable progress in the field of population genetic inference has been made during the past decade, following parallel increases in computer processing speed and available gene sequence data. Most current methods are based on coalescent theory, a stochastic process that describes how population genetic processes determine the shape of the genealogy of sampled gene sequences. Coalescent-based inference methods enable population genetic parameters to be estimated directly from gene sequence data under a variety of scenarios, including recombination (Griffiths and Marjoram 1996; Kuhner, Yamato, and Felsenstein 2000; Fearnhead and Donnelly 2001), population subdivision (Bahlo and Griffiths 2000; Beerli and Felsenstein 2001), and variable population size (Kuhner, Yamato, and Felsenstein 1998; Beaumont 1999; Drummond et al. 2002).

The variable population size coalescent model, introduced by Griffiths and Tavare (1994) and Donnelly and Tavare (1995), allows the inference of past population dynamics from contemporary gene sequences. The model was subsequently extended by Rodrigo and Felsenstein (1999) to sequences that have been sampled at significantly different points in time (e.g., rapidly evolving pathogen sequences or ancient DNA sequences). The inference of demographic history from genetic data has proved invaluable for testing hypotheses in a variety of biological disciplines, for example, anthropology (Reich and Goldstein 1998), epidemiology (Pybus et al. 2001; Joy et al. 2003), conservation biology (Roman and Palumbi 2003), and ecology (Storz, Beaumont, and Alberts 2002; Flanagan et al. 2004).

Coalescent methods for inferring demographic histories require a “demographic model,” which is simply a mathematical function used to describe the change in effective population size through time. Each demographic model has one or more “demographic parameters.” Commonly used demographic models are constant size (one parameter), exponential growth (constant growth rate through time; two parameters), logistic growth (decreasing growth rate through time; three parameters), and expansion growth (increasing growth rate through time; three parameters) (see Pybus and Rambaut [2002] for more details). In addition, combining the models above in a piecewise manner enables an array of more complex models to be constructed. Past population dynamics are reconstructed by estimating the demographic parameters, typically by maximum likelihood or Bayesian methods (Kuhner, Yamato, and Felsenstein 1998; Pybus, Rambaut, and Harvey 2000; Drummond et al. 2002).

It is not usually known in advance which demographic model will fit the gene sequences under investigation. Although it is possible to compare the fit of different demographic models using standard model selection techniques (Pybus, Rambaut, and Harvey 2000), this is a time-consuming process and there is no guarantee that any of the compared models will fit the data adequately. The use of an incorrect demographic model will lead to biased and invalid estimates of demographic history. Furthermore, an incorrect demographic model could lead to bias in other estimated parameters of the evolutionary model, such as recombination rate or overall mutation rate, when the demographic history is being treated as a nuisance parameter to be averaged over. To address the first of these concerns, a flexible model called the skyline plot was developed (Pybus, Rambaut, and Harvey 2000). The “skyline plot” is a piecewise-constant model of population size that can fit a wide range of demographic scenarios. It has proved very useful as a model selection tool that is used to indicate the most appropriate demographic model for any given data set (Pybus and Rambaut 2002). The skyline plot typically produces “noisy” plots that display the stochastic variability inherent in the coalescent process. To reduce this noise, the “generalized skyline plot” was developed (Strimmer and Pybus 2001), which uses the Akaike Information Criterion (Akaike 1974) to reduce the number of parameters employed and thus produces smoother estimated population size plots.

Like early genealogical estimators of population size (Felsenstein 1992; Fu 1994), both of the skyline plot methods have a fundamental drawback—they infer demographic history from an estimated genealogy, rather than from the sampled gene sequences, and thus ignore the error associated with phylogenetic reconstruction. Although this error may be small for highly variable sequences, such as those from rapidly evolving viruses (Drummond, Pybus, and Rambaut 2003; Drummond et al. 2003), it is impractical to ignore phylogenetic error in the majority of less variable data sets. Here we resolve this problem by introducing the “Bayesian skyline plot.” The Bayesian skyline plot model uses standard Markov chain Monte Carlo (MCMC) sampling procedures to estimate a posterior distribution of effective population size through time directly from a sample of gene sequences, given any specified nucleotide-substitution model. Unlike previous methods, the Bayesian skyline plot includes credibility intervals for the estimated effective population size at every point in time, back to the most recent common ancestor of the gene sequences. These credibility intervals represent both phylogenetic and coalescent uncertainty. In addition, the “averaging” effect of MCMC sampling naturally produces smoother estimates than previous skyline plots.

First, we generalize previous descriptions of the generalized skyline plot and show how it can be extended to trees relating sequences sampled at different points in time (termed heterochronous trees). Second, we describe how the generalized skyline plot model can be embedded in a Bayesian MCMC analysis. Third, we introduce an MCMC sampling procedure that efficiently samples the distribution of generalized skyline plots given the sequence data and combines these plots to generate a posterior distribution of effective population size through time. By doing this, we are able to coestimate the evolutionary rate, substitution model parameters, phylogeny, and ancestral population dynamics within a single analysis. Fourth, we apply the Bayesian skyline plot to simulated data sets and show that it correctly reconstructs demographic history under canonical scenarios. Last, we compare the Bayesian skyline plot model to previous approaches by analyzing two real data sets that have been investigated using other coalescent-based methods.

## Methods

### Background: Classic and Generalized Plots

The classic and generalized skyline plots were introduced by Pybus, Rambaut, and Harvey (2000) and Strimmer and Pybus (2001), respectively. The raw data for the classic and generalized skyline plots are a genealogy with specified branch lengths, denoted *g*, estimated from *n* contemporaneous sequences (see fig. 1*a*). The internal nodes of the input genealogy must be dated according to a given timescale (genetic distance or time) and thus define *n* − 1 times at which coalescent events occur, **u** = {*u*_{1}, *u*_{2}, …, *u _{n}*

_{−1}} (see fig. 1

*a*). The times

**u**are measured from the tips, such that

*u*

_{0}= 0 at the tips. The waiting times between coalescent events (coalescent intervals) are defined Δ

*u*=

_{i}*u*−

_{i}*u*

_{i}_{−1}. The number of lineages present during each Δ

*u*defines a corresponding series of values, denoted

_{i}*k*

_{1},

*k*

_{2}, …,

*k*

_{n}_{−1}(see fig. 1

*a*).

Strimmer and Pybus (2001) demonstrated that the skyline plot is, in fact, a method-of-moments estimate of a piecewise model of effective population size through time and therefore has a definable likelihood function. The piecewise model supposes that effective population size is constant between coalescent events but may change at the coalescent event times, **u**. The classic skyline plot assumes that each coalescent interval has a different effective population size, whereas the generalized skyline plot allows adjacent coalescent intervals to be grouped and has the same effective population size (fig. 1*a*).

The generalized skyline plot therefore requires an ordered subset of group sizes

*a*> 0 and

_{i}*m*is the number of grouped intervals

**u**(

*w*

_{0}= 0). The time spanned by each grouped interval is defined

*a*).

The log likelihood of the piecewise demographic model is given by:

**u**to the indices of

**w**and is defined:

### Likelihoods of Skyline Plots for Heterochronous Trees

The generalized skyline plot approach defined above is only applicable to gene sequences sampled at the same point in time. Here we show how the skyline plot approach can be extended to heterochronous sequences, using the serial-sample coalescent model (Rodrigo and Felsenstein 1999).

Figure 1*b* shows a genealogy of heterochronous sequences with internal nodes and sampling times dated according to a given timescale. In contrast to a genealogy of contemporaneous sequences, heterochronous genealogies have two types of intervals: coalescent intervals (which, going back in time, end with a coalescent event) and sample intervals (which end with the appearance of one or more sampled sequences; a sample event). If there are *n* sequences sampled at *s* different times, then there are

*b*). The ordered interval times, starting at the tips and ending at the root, are denoted

*i*th event is a coalescent event.

*b*). Note also that the number of lineages,

Coalescent-ended intervals in heterochronous trees can be grouped in a manner similar to that described above. As before, the values

**w**=

**u**, where

*m*is the number of grouped intervals

**w**, only contain coalescent events, not sample events. The time spanned by each grouped interval is defined

*b*).

As before,

**u**to indices in

**w**and is defined:

### The Bayesian Skyline Plot Model

The vectors

*g*), define a piecewise demographic history with 2

*m*− 1 demographic parameters and

*n*− 1 coalescent time parameters. The probability

*g*) given the demographic parameters is given in equation (3). The hyperparameter

*m*is either chosen a priori or is sampled in a hierarchical model via reversible-jump MCMC (rjMCMC).

In addition to this model, we also introduce a simple smoothing on Θ which represents our belief that effective population size is autocorrelated through time. The prior distribution we assume in all subsequent simulations and analyses is that, going back in time, each new population size is drawn from an exponential distribution with a mean equal to the previous population size:

### MCMC Implementation

Here we extend a previously published MCMC method (Drummond et al. 2002) to sample the parameters of the Bayesian skyline plot model described above. Our implementation samples both Θ and A, but for computational reasons, we do not use rjMCMC to sample the hyperparameter *m*. Instead, we condition all our runs on a fixed value of *m* because the resulting posterior demographic function is highly consistent for a range of a priori values of *m* (data not shown).

In the case of contemporaneous sequences, the posterior distribution sampled in our scheme is:

*p*]), and the parameter μ is a mutation rate that scales the genealogy from units of mutations per site to units of time. In the case of heterochronous sequences, the posterior distribution sampled in our scheme is:

_{inv}*f*

_{μ}(μ) is chosen to be a probability density function (i.e., a proper prior), then this second scheme could also be used for contemporaneous sequences to allow uncertainty in the scaling between mutations per site and time.

The resulting output of the Bayesian MCMC analysis is a sample from the posterior distribution described in equation (8), (Θ,*A*,Ω,*g*,μ) ∼ *f _{het}*. For the purposes of reconstructing the demographic history, the sampled substitution model parameters and mutation rates can be thought of as uninteresting “missing data.” This leaves us with a list of

*j*states, each with an associated genealogy and demographic parameters: (Θ

_{j},

*A*,

_{j}*g*). We can therefore reconstitute the demographic history θ(

_{j}*t*) as a piecewise function of time, for each of the

*j*states. To display this information, we calculate the marginal posterior distribution of θ(

*t*) at a series of times of interest,

*t*. For each

_{i}*t*, the values θ(

_{i}*t*) for all

_{i}*j*form a marginal posterior distribution of population size at time

*t*. From these marginal distributions, the mean, median, and the 95% highest posterior density (HPD) intervals can be calculated for θ(

_{i}*t*) at each time of interest,

*t*, leading to an estimated plot of population size through time with an associated measure of uncertainty.

_{i}## Results

### Simulated Data Sets

To investigate the behavior of the Bayesian skyline plot model, we analyzed two simulated data sets using our MCMC method. We followed Strimmer and Pybus (2001) and performed the following simulations: (1) Coalescent trees were simulated under two demographic models,

*m*) was set to 12, and MCMC chains were run for 10,000,000 iterations, of which the first 1% was discarded to allow for burn-in. The substitution model used was HKY, and the transition/transversion ratio was coestimated along with the parameters of the Bayesian skyline plot and the ancestral genealogy. Genealogies and model parameters were sampled every 1,000 iterations, and the results are summarized as a Bayesian skyline plot, shown in figure 2.

As noted by Strimmer and Pybus (2001), many of the sequences simulated under the constant-size model are not unique and many of the coalescent intervals in the estimated tree are very small. The effect of this is most apparent at the tipward (most recent) end of the Bayesian skyline plot (fig. 2*a*) where the variance in the estimate of population size is larger than at other times in the plot. Under the exponential model, the simulation resulted in only two identical sequences. The Bayesian skyline plot appears to slightly overestimate population size in this data set; however, this is well within the uncertainty admitted by the 95% HPD limits (fig. 2*b*) and may well be due to the stochastic error associated with this particular simulation.

To compare the Bayesian skyline plot with previous approaches, we also analyzed two published data sets that have been investigated previously using other coalescent inference methods.

### Hepatitis C Virus in Egypt

The hepatitis C virus (HCV) is a genetically diverse RNA virus and is a leading global cause of liver disease. Egypt has the highest prevalence of HCV in the world, significantly higher than those in neighboring countries, even though HCV appears to have been present in the Middle East for several centuries (Smith et al. 1997). The past dynamics of HCV in Egypt are therefore of considerable interest. Here we analyze a data set of 63 partial E1 gene sequences obtained from a comprehensive study of Egyptian HCV genetic diversity (Ray et al. 2000). The demographic history of these data was analyzed by Pybus et al. (2003) using a four-parameter coalescent model, estimated within a Bayesian inference framework. As previously noted, these sequences are suitable for coalescent analysis because (1) they represent a geographically diverse and approximately random sample, (2) they show no obvious population subdivision, (3) they contain ample phylogenetic information, and (4) an independent estimate of nucleotide-substitution rate for the sequences is available (see Pybus et al. [2003] for details). Most importantly, there is substantial nongenetic evidence concerning the history of HCV spread in Egypt (Frank et al. 2000; Strickland et al. 2002), so this data set provides a unique opportunity to test the reliability and accuracy of coalescent demographic methods.

The analysis of Pybus et al. (2003) demonstrated that the effective number of HCV infections underwent a dramatic period of growth in the middle of the 20th century, followed by a more recent slowdown in the rate of spread. Figure 3 shows the results of the Bayesian skyline plot (*m* = 24) on this data set, the HPD limits of which entirely contain the previously published estimate. The Bayesian skyline plot is also in complete agreement with known epidemiological data—the Egyptian HCV epidemic was likely caused by viral contamination of injectable antischistosomiasis treatment. This treatment was extensively used in Egypt from the 1920s to the 1980s but was phased out gradually during the latter part of this period. The Bayesian skyline plot shows a period of rapid HCV population increase between 1920 and 1950. The growth phase is preceded by a dip in the effective number of infections. This dip was not observed in the previous coalescent analysis but is not statistically significant given the size of the estimated confidence limits.

### Bison in Beringia

Bison were one of the most abundant and widely distributed large mammals during the Late Pleistocene (ca. 500–10,000 years ago [1000 years ago, ka B.P.]). A recent coalescent analysis (Shapiro et al. 2004) of mtDNA control region sequences from ancient bison (*Bison* cf. *priscus*) in Beringia (Siberia, Alaska, and northwestern Canada) and central North America revealed a demographic history over the last 150 ka B.P. of sustained population growth followed by rapid decline. The estimated time of the transition from growth to decline (mean: 37 ka B.P.) led the authors to conclude that this change in demographic trend was more likely the consequence of climate changes associated with the onset of the Last Glacial Maximum than human overhunting. The authors used a four-parameter, two-epoch model to describe this demographic history, in which an initial phase of exponential growth was followed by a period of exponential decline. We reanalyzed the Shapiro et al. (2004) data using a Bayesian skyline plot with 15 groups (*m* = 15). The analysis was run for 30 million iterations, with the first 10% discarded as burn-in. Genealogies and model parameters were sampled every 1,000 iterations thereafter. Despite the increased number of parameters, a comparison of log marginal posterior scores between the two analyses shows that the Bayesian skyline plot is a significantly better fit to the data than the simpler model (data not shown).

Figure 4 shows the Bayesian skyline plot for Beringian bison, along with the demographic history estimated in Shapiro et al. (2004). Although the two analyses are very similar, the sensitivity of the Bayesian skyline plot allows more complex demographic trends to be identified. Most striking is the population bottleneck following the rapid decline of bison population size, between its maximum 30–45 ka B.P. and its minimum ∼10 ka B.P. The postbottleneck recovery, which the simpler four-parameter model did not identify, is seen in the Bayesian skyline plot as a second transition to population growth around 10 ka B.P. This bottleneck is coincident with the earliest undisputed evidence of human settlement in Alaska (Yesner 2001), the period most often associated with the megafaunal mass extinctions in North America (Alroy 1999). In agreement with the Shapiro et al. (2004) analysis, the Bayesian skyline plot describes the initial transition from growth to decline as occurring prior to the first evidence of humans in North America; however, figure 4 suggests that humans could have played a larger role in the decline of bison populations than previously suggested.

## Conclusions

The Bayesian skyline plot is a powerful new method for estimating ancestral population dynamics from a sample of molecular sequences. In common with the classic and generalized skyline plots, the Bayesian skyline plot allows us to discover novel demographic signatures that are not readily described by simple demographic models. Unlike previous skyline plots, our new method takes into account both the error inherent in phylogenetic reconstruction and the stochastic error intrinsic to the coalescent process and thus produces more correct estimates of statistical uncertainty. In addition, the MCMC method coestimates the ancestral genealogy and parameters of the substitution process as well as the demographic parameters.

In both the HCV and bison examples presented above, the Bayesian skyline plot reveals previously undetected demographic signatures, demonstrating its ability to uncover changes in demographic trends over ecological, paleontological, and evolutionary time spans. The Bayesian skyline plot may also prove useful when the aim of inference is a population genetic parameter such as recombination or mutation rate, and the demographic history of the sequences is a nuisance parameter that must be “averaged out.” In such circumstances, an unrealistic or misspecified model of population size change may lead to strong biases in the estimated parameter of primary interest. Ascertaining the direction and magnitude of such a bias is outside the scope of this paper because it will depend on the data set and sampling strategy. However, given the nontrivial demographic histories estimated from real data sets in this paper, estimates of mutation rate or recombination rate obtained by assuming a simple demographic history should probably be supported with simulations that verify the robustness of the focal parameter estimates to the simplifying demographic assumptions.

The Bayesian skyline plot takes a similar amount of time to compute as other demographic models (such as constant size or exponential growth) estimated under the same MCMC genealogy sampling method (Drummond et al. 2002). For example, the HCV data set took a few hours to calculate on a desktop computer. However, this is substantially longer than the classic and generalized skyline plots estimators, which typically take a few seconds to compute (although obviously this does not include the time necessary to estimate the tree).

During preparation of the manuscript, we became aware of a related method (Opgen-Rhein, Fahrmeir, and Strimmer 2005) that uses rjMCMC to estimate smooth demographic functions directly from a single reconstructed genealogy. This approach, unlike the Bayesian skyline plot, does not take phylogenetic error into account and is therefore less appropriate for data sets containing limited genetic variation. However, the use of rjMCMC by Strimmer and coworkers is preferable to our method of fixing the amount of smoothing a priori (i.e., fixing the number of groupings, *m*). The two methods are therefore complementary especially as they use different information as starting data.

Future improvements to the Bayesian skyline plot method should further extend its utility. Future research directions include (1) implementation of an rjMCMC method so that the degree of grouping can be automatically chosen from the data, (2) use of a smoother function, such as piecewise linear or truncated regression spline, as the underlying parametric model, and (3) extension of the Bayesian skyline plot to sequence data from multiple unlinked loci.

The MCMC method described in this paper is implemented in the latest version of the BEAST (v. 1.2) software package (available from http://evolve.zoo.ox.ac.uk/beast/). The Bayesian skyline plot figures presented in this paper were generated using Tracer (v. 1.2) (available from http://evolve.zoo.ox.ac.uk/software/).

We are grateful to Korbinian Strimmer for providing us early access to a stimulating manuscript describing related research. In addition, A.J.D. thanks Chris Holmes and Alexandre Pintore for helpful discussions. A.J.D. and B.S. are funded by the Wellcome Trust. A.R. and O.G.P. are funded by the Royal Society.

## References

*in*R. D. E. MacPhee, ed. Extinctions in near time: causes, contexts, and consequences. Kluwer Academic, New York.

*in*K. Crandall, ed. Molecular evolution of HIV. Johns Hopkins University Press, Baltimore, Md.