## Abstract

Epidemiological processes leave a fingerprint in the pattern of genetic structure of virus populations. Here, we provide a new method to infer epidemiological parameters directly from viral sequence data. The method is based on phylogenetic analysis using a birth–death model (BDM) rather than the commonly used coalescent as the model for the epidemiological transmission of the pathogen. Using the BDM has the advantage that transmission and death rates are estimated independently and therefore enables for the first time the estimation of the basic reproductive number of the pathogen using only sequence data, without further assumptions like the average duration of infection. We apply the method to genetic data of the HIV-1 epidemic in Switzerland.

## Introduction

RNA viruses are characterized by short generation time and high mutation rates. Therefore, even over relatively short time spans epidemiological processes (i.e., transmission, recovery, and death) are expected to leave signals in the genetic structure of viral sequences sampled from the host population. Bayesian phylogenetic methods are commonly used for viruses to infer epidemiological processes from genetic data (Pybus et al. 2001, Drummond et al. 2003, Grenfell et al. 2004, Pomeroy et al. 2008). These methods require the specification of a process that generates the phylogenetic trees, which is commonly the coalescent (Kingman 1982, Griffiths and Tavare 1994, Drummond et al. 2002).

The coalescent is an appropriate choice for many applications. However, in the context of virus transmission, it is primarily used for its mathematical convenience rather than the accurate reflection of the underlying transmission process. In particular, there are two main shortcomings of the coalescent as a model of epidemiological transmission. First, the coalescent can detect changes over time in the number of infected people (i.e., population size changes) but not whether such changes are due to a change in transmission rates or a change in death or recovery rates. However, the separate estimation of transmission and death/recovery is key in order to determine central epidemiological quantities such as the basic reproductive number (Anderson and May 1979, 1992). Thus, if the coalescent is used in a phylogenetic study, an independent estimate of the death/recovery rate (or the transmission rate) is required for estimating quantities such as the basic reproductive number. Second, the coalescent is appropriate only if the number of sampled infected hosts is small compared with the total infected host population size. This is a problematic assumption for important diseases such as HIV, where we have particularly dense sampling.

A more appropriate choice for the tree-generating process is the birth–death model (BDM; Kendall 1948, Feller1968). This model has neither of the two shortcomings of the coalescent. It explicitly assumes a separate transmission (i.e., birth) and death rate. Moreover, it can be applied to situations of sparse or dense sampling because sampling proportion is treated in the model as a separate parameter.

The BDM has been used in phylogenetic analysis for inferring species phylogenies (Patané et al. 2009, Couvreur et al. 2010, Fernández-Mazuecos and Vargas 2010, Weksler et al. 2010), including scenarios of incomplete sequence sampling. However, the sampled sequences were assumed to be from one point in time (Maddison et al. 2007, FitzJohn et al. 2009, Stadler 2009). In viral epidemics, sequences are typically sampled over a long time span. Hence, after an initial application of the BDM to viral sequences from one single time point (Holmes et al. 1995), focus shifted towards assuming the coalescent (Nee et al. 1995), as coalescent-based methodology became available allowing for sequences being sampled sequentially through time (Rambaut 2000), allowing for changing population sizes through time (Pybus et al. 2000, Drummond et al. 2005), and accounting for phylogenetic uncertainty by using a Bayesian framework (Drummond et al. 2002).

Here, we develop a Bayesian method of phylogenetic analysis for sequentially sampled viral sequence data based on the BDM in order to infer key epidemiological parameters directly from viral sequence data. To this end, we extend previous work (Stadler 2010) to solve the BDM for sequentially sampled viral sequences. We generalize the BDM such that it specifically reflects aspects of viral transmission, determine the accuracy in parameter inference using simulations, and in particular, show that the BDM estimates are more accurate than the coalescent estimates. The basic reproductive number can be estimated very accurately using the BDM, whereas the transmission rate correlates with the death rate and its estimate is thus less accurate. We apply the new BDM method to data from the HIV-1 epidemic in Switzerland and further validate the accuracy of estimates on the basis of other epidemiological estimates.

## Materials and Methods

### Framework for the Inference of EpidemiologicalParameters

In its most general form, a birth–death process is a stochastic description of populations, in which individuals can be born or die at any time point. We use this framework to describe the process of epidemiological transmission. A birth event corresponds to the infection of an individual. A death event corresponds to an individual becoming noninfectious, which can be due to several events such as death, treatment, or behavioral changes of an individual.

We model an epidemic as follows: An infected individual starts a new epidemic at time *t*_{or} in the past. Each infected individual transmits with a constant rate *λ* (transmission rate) and becomes noninfectious with a constant rate *μ* (becoming noninfectious rate). To capture the process of infected individuals being sampled (i.e., included into the data set), we introduce a sampling rate *ψ*. As sampling in human infectious diseases is typically linked to treatment or behavioral changes, we assume that a sampled individual becomes noninfectious immediately after the sampling. Thus, the term *ψ* is formally equivalent to a death term.

Our BDM is a forward in time description of the epidemiological process, and the sampling rate *ψ* allows to specify the sampling intensity; thus, the number and time of sampling points is a random outcome of the process. In contrast, the coalescent is a model backward in time where the number of samples is assumed to be very small compared with the population size (however, the precise sampling fraction cannot be specified), and the analysis is conditioned on the number and time of the sampling points.

#### Bayesian BDM Method

In Bayesian phylogenetic inference using a Markov chain Monte Carlo (MCMC) approach, the idea is to sample trees and parameters from the posterior distribution *f*[*𝒯*,*η*,*θ*|data] where “data” are alignment of sequences sampled through time, *θ* are the parameters of the sequence evolution model, *η* are the parameters of the tree-generating model, and *𝒯* is the transmission tree describing the epidemiological relationships of the sampled sequences in the alignment. By Bayes' theorem, the posterior distribution is equivalent to

The quantity *f*[data|*𝒯*,*θ*] is the probability density of the sequences having evolved on a tree *𝒯*. This likelihood can be computed efficiently with Felsenstein's pruning algorithm (Felsenstein 2004). *f*[*𝒯*|*η*] is the probability density of the tree given the tree-generating model parameters. This density is known if the coalescent is assumed as the tree-generating model. A prior distribution is assumed for the probability densities of the parameters, *f*[*η*] and *f*[*θ*]. The quantity *f*[data] is the normalizing constant and can be disregarded when sampling from the posterior (it is constant for all trees and parameters as the data are fixed).

In our approach, we assume the BDM instead of the coalescent as a tree-generating model, that is, *η* = (*λ*,*μ*,*ψ*,*t*_{or}). The BDM generates a transmission tree. The “sampled tree” *𝒯* is obtained from the transmission tree by suppressing all edges without sampled descendants, see figure 1 and Stadler (2010) for more details. In order to do Bayesian phylogenetic inference using an MCMC approach, the probability density of a sampled tree under the BDM, *f*[*𝒯*|*λ*,*μ*,*ψ*,*t*_{or}], has to be derived.

#### Calculation of *f*[*𝒯*|*λ*,*μ*,*ψ*,*t*_{or}]

For calculating the probability density of a sampled tree, we need some notation (see also fig. 1). Let the sampled tree *𝒯* have *m* sampled leaves. Let *x*_{0} = *t*_{or} be the time of origin of the process. Let *x*_{1} > ⋯ > *x*_{m − 1} be the *m* − 1 bifurcation times in the sampled tree where time is measured as the distance to the present. Let *y*_{1} > ⋯ > *y*_{m} be the *m* sampling times (i.e., times of the leaves).

In order to calculate *f*[*𝒯*|*λ*,*μ*,*ψ*,*t*_{or}], we define *g*_{e}(*t*) to be the probability density that the infectious individual corresponding to edge *e* at time *t* in the past gives rise to the transmission tree between *t* and the present as observed in *𝒯*. Then, *g*_{e}(*t*_{or}) = *f*[*𝒯*|*λ*,*μ*,*ψ*,*t*_{or}] (see fig. 2). In order to derive a formula for *g*_{e}(*t*), we need the probability that an infectious individual has no sampled descendants for a time span of length *t*, *p*_{0}(*t*). The probability *p*_{0}(*t*) is calculated in Stadler (2010):

We derive the probability density for *g*_{e}(*t*) with a Master equation approach. Let Δ*t* be a very small time step. An event (in our case, transmission, becoming noninfectious, or sampling) happening with a rate *α* means that in a small time interval Δ*t*, the probability of the event happening once is *α*Δ*t*; the event happening several times has probability of order *O*(Δ*t*^{2}) (here, we do not calculate the exact probability of several events happening during one time interval, as this probability will tend to 0 and thus cancel out, as shown below).

We calculate the probability density *g*_{e}(*t* + Δ*t*), that is, the probability density that the individual *e* at time *t* + Δ*t* gives rise to the the transmission tree between *t* + Δ*t* and the present as observed in *𝒯*, assuming we know *g*_{e}(*t*) (see fig. 2). Recall that time is measured as a distance to the present, so with an additional time step Δ*t*, we move further into the past, and the tree likelihood is thus calculated going backward in time. The infected individual at time *t* + Δ*t* can either undergo no event or transmit during time Δ*t* (becoming noninfectious would not yield the observed transmission tree). The equation for *g*_{e}(*t* + Δ*t*) is therefore

where 1) 1 − (*λ* + *μ* + *ψ*)Δ*t* − *O*(Δ*t*^{2}) is the probability that the individual corresponding to edge *e* at time *t* + Δ*t* does not undergo an event during time Δ*t*, 2) *λ*Δ*t* is the probability that the individual corresponding to edge *e* infects one individual during time Δ*t*, and 2*p*_{0}(*t*) is the probability that one of the two infected individuals has no sampled descendants between time *t* and time 0, and 3) *O*(Δ*t*^{2}) summarizes the terms when more than one transmission event happens during time Δ*t*. We transform this equation to

which yields in the limit Δ*t*→0 the Master equation for *g*_{e}(*t*),

Let the edge *e* last between time *s*_{0} and *s*_{1}, with *s*_{0} ≥ *t* ≥ *s*_{1} (see also fig. 2). Note that at time *s*_{1}, there are two descending branches (transmission event with rate *λ*) or no descending branches (sampling event with rate *ψ*). Thus, we have the initial value at *t* = *s*_{1}:

Following Stadler (2010), we can solve the differential equation for *g*_{e}(*t*) and obtain

We implemented the Bayesian inference procedure to estimate the parameters of the BDM (using eq. 1) in the software package BEAST (Drummond and Rambaut 2007), replacing the previously used coalescent. We assume that the process is stopped after the last sampled leave, that is, *y*_{m} = 0, this has the advantage that we reduce the number of parameters by one.

Our BDM is a modification of the framework in Stadler (2010): The previous model in Stadler (2010) allows sampling through time and sampling of present-day individuals (this is relevant when the considered individuals are extinct and extant species, rather than infected hosts). Further, a sampled individual remained infectious. Modifying the previous model such that there is no sampling of present-day individuals, and assuming that infected individuals become noninfectious when being sampled, yields the model in this paper.

A natural generalization of the model presented here, and the model presented in Stadler (2010) is to assume that a sampled individual becomes noninfectious immediately after the sampling with probability *r* and remains infectious after sampling with probability 1 − *r*. Under this general model, *m* sampled individuals may have no sampled descendants and *k* sampled individuals may have sampled descendants. The likelihood of the tree becomes

The current version of Beast cannot use the generalized method, as it cannot account for the *k* sampled individuals having sampled descendants but requires instead all the sampled individuals being leaves. Thus, we cannot do data analysis yet with this general model.

### Inferring Parameters from Simulated Data Sets

We simulated trees under the BDM and then analyzed these trees using the Bayesian BDM method implemented in BEAST (Drummond and Rambaut 2007) in order to validate whether we can accurately re-estimate parameters of the BDM. There are two stages in the simulation process:

Stage 1:

In Stage 1, we simulate trees under the BDM model for fixed

λ= 8.23×10^{ − 4},μ/λ= 0.11, andψ= 2.78×10^{ − 4}(which were the mean estimates obtained from empirical data, see below). We ran an MCMC in Beast without data, which samples trees from the prior (i.e., the BDM model). We used 100 tips without sequence data to simulate the tip times (except the most recent tip whose time is 0), the tree (including all internal nodes and topology), and the time of origint_{or}. MCMC chain length was set up to 100,000,000.Stage 2:

We picked up ten trees from the sample produced from Stage 1 respectively at the state 100,000,000, 90,000,000, …, 10,000,000 and created ten new MCMC simulations by fixing the tree. In the ten new MCMC simulations, we estimated

λ,μ,ψ, andt_{or}and calculatedR_{0}=λ/(μ+ψ) for the ten chosen trees. MCMC chain length was set up to 100,000,000.

To increase the reliability of our tests, we looped 10 times from Stage 1 to Stage 2 and thus re-estimated *λ*,*μ*,*ψ*,*t*_{or} based on 100 random trees.

We also simulated sets of ten trees with ten tips to compare the accuracy of the method when having several small trees instead of one big tree.

### Inferring Parameters from Swiss HIV-1 Sequence Data

We use sequence data of the HIV polymerase gene from the Swiss HIV Cohort study (SHCS, The Swiss HIV Cohort Study 2010), a cohort where more than 16,000 Swiss HIV-infected persons are enrolled covering at least 45% of all Swiss HIV-infected individuals. In order to exclude biases due to migration, we determined Swiss transmission clusters in which no migration occurred. We built a maximum likelihood tree using all available Swiss HIV pol sequences plus the same number of randomly selected foreign sequences. We defined a cluster in the tree to be a Swiss cluster if it contains at least 80% Swiss sequences and has bootstrap support of at least 70%. This procedure follows the analysis in Kouyos et al. (2010), with the only two differences that we 1) considered only clusters with a bootstrap support of > 70%, and that 2) the phylogenetic tree was inferred on the basis of the general time reversible + GAMMA model (instead of the general time reversible with per site rate categories [GTRCAT] model).

The ten largest clusters (subepidemics) contain between 12 and 34 individuals, which have been sampled between 1995 and 2008. The sampling times of the sequences are summarized in figure 3.

From the subepidemics, we estimated transmission and sequence evolution parameters based on the pol sequences, using our Bayesian BDM method. To model the sequence evolution, we partitioned the alignment into two classes of sites: The first class consisted of all first and second codon positions and the second class was made up of third codon positions. Each class was modeled with an independent Hasegawa-Kishino-Yano + Γ model, allowing for a class-specific transition/transversion bias (*κ*), a class-specific shape parameter (*α*) describing rate heterogeneity across sites, and a class-specific mean rate of evolution. The prior distribution on the two shape parameters was an exponential distribution with a mean of 1. The prior on the *κ* parameters was a Gamma distribution with a shape of 0.05 and a scale of 40. In addition, a lognormally distributed uncorrelated relaxed clock model (Drummond et al. 2006) was used to model rate variation across lineages. The *S* parameter of the lognormal relaxed clock had an exponential prior with a mean of 1/3. We ran the MCMC chain for 200 million generations and neglected the first 10% of output as the burn-in.

We analyzed the subepidemics under two different assumptions: 1) We assume that the epidemiological parameters *λ*,*μ*, and *ψ* are the same in all ten subepidemics but leave the time of origin *t*_{or} variable and 2) we assume that all subepidemic can have different parameters *λ*,*μ*,*ψ*, and *t*_{or}. The runs converged in both Case (i) and Case (ii). The effective sampling size (ESS) was usually several thousands and the minimum ESS was 282.

## Results

We implemented the Bayesian inference procedure to estimate the parameters of the BDM in the software package BEAST (Drummond and Rambaut 2007). From the estimates obtained with this new method, we can determine the total number of infections caused by an individual over the time during which it is infectious. This number is equivalent to the basic reproductive number *R*_{0} and is given by the ratio of the birth and death rates:

Thus, our Bayesian BDM method offers a possibility to estimate a key epidemiological parameter using only sequence data.

### Estimates Obtained for Simulated Data

The estimated values for *R*_{0},*λ*,*μ*,*ψ* from the 100 simulated trees are shown in table 1. The relative error for a parameter *p* was defined as

True Value | 1 Tree (100 Tips) | 10 Trees (10 tTips) | |

R_{0} = 2.25 | Mean | 1.81 | 3.02 |

Relative error | 0.20 | 0.35 | |

Relative bias | –0.20 | 0.35 | |

HPD interval width | 1.75 | 2.75 | |

95% HPD accuracy | 94% | 100% | |

λ = 8.23 × 10^{−4} | Mean | 15.73 × 10^{−4} | 9.38 × 10^{−4} |

Relative error | 0.91 | 0.14 | |

Relative bias | 0.91 | 0.14 | |

HPD interval width | 29.08 × 10^{−4} | 4.69 × 10^{−4} | |

95% HPD accuracy | 100% | 91% | |

μ = 0.88 × 10^{−4} | Mean | 8.94 × 10^{−4} | 1.06 × 10^{−4} |

Relative error | 9.15 | 0.21 | |

Relative bias | 9.15 | 0.21 | |

HPD interval width | 29.73 × 10^{−4} | 3.28 × 10^{−4} | |

95% HPD accuracy | 100% | 100% | |

ψ = 2.78 × 10^{−4} | Mean | 2.04 × 10^{−4} | 2.30 × 10^{−4} |

Relative error | 0.27 | 0.19 | |

Relative bias | –0.27 | –0.17 | |

HPD interval width | 3.03 × 10^{−4} | 1.72 × 10^{−4} | |

95% HPD accuracy | 92% | 79% | |

λ–μ–ψ=4.57 × 10^{−4} | Mean | 4.75 × 10^{−4} | 6.02 × 10^{−4} |

Relative error | 0.14 | 0.33 | |

Relative bias | 0.04 | 0.32 | |

HPD interval width | 3.67 × 10^{−4} | 3.93 × 10^{−4} | |

95% HPD accuracy | 97% | 68% | |

λ–μ–ψ = 4.57 × 10^{−4} | Mean | 5.27 × 10^{−4} | 6.24 × 10^{−4} |

exponential growth tree prior | Relative error | 0.25 | 0.41 |

Relative bias | 0.15 | 0.36 | |

HPD interval width | 2.13 × 10^{−4} | 3.37 × 10^{−4} | |

95% HPD accuracy | 55% | 52% |

True Value | 1 Tree (100 Tips) | 10 Trees (10 tTips) | |

R_{0} = 2.25 | Mean | 1.81 | 3.02 |

Relative error | 0.20 | 0.35 | |

Relative bias | –0.20 | 0.35 | |

HPD interval width | 1.75 | 2.75 | |

95% HPD accuracy | 94% | 100% | |

λ = 8.23 × 10^{−4} | Mean | 15.73 × 10^{−4} | 9.38 × 10^{−4} |

Relative error | 0.91 | 0.14 | |

Relative bias | 0.91 | 0.14 | |

HPD interval width | 29.08 × 10^{−4} | 4.69 × 10^{−4} | |

95% HPD accuracy | 100% | 91% | |

μ = 0.88 × 10^{−4} | Mean | 8.94 × 10^{−4} | 1.06 × 10^{−4} |

Relative error | 9.15 | 0.21 | |

Relative bias | 9.15 | 0.21 | |

HPD interval width | 29.73 × 10^{−4} | 3.28 × 10^{−4} | |

95% HPD accuracy | 100% | 100% | |

ψ = 2.78 × 10^{−4} | Mean | 2.04 × 10^{−4} | 2.30 × 10^{−4} |

Relative error | 0.27 | 0.19 | |

Relative bias | –0.27 | –0.17 | |

HPD interval width | 3.03 × 10^{−4} | 1.72 × 10^{−4} | |

95% HPD accuracy | 92% | 79% | |

λ–μ–ψ=4.57 × 10^{−4} | Mean | 4.75 × 10^{−4} | 6.02 × 10^{−4} |

Relative error | 0.14 | 0.33 | |

Relative bias | 0.04 | 0.32 | |

HPD interval width | 3.67 × 10^{−4} | 3.93 × 10^{−4} | |

95% HPD accuracy | 97% | 68% | |

λ–μ–ψ = 4.57 × 10^{−4} | Mean | 5.27 × 10^{−4} | 6.24 × 10^{−4} |

exponential growth tree prior | Relative error | 0.25 | 0.41 |

Relative bias | 0.15 | 0.36 | |

HPD interval width | 2.13 × 10^{−4} | 3.37 × 10^{−4} | |

95% HPD accuracy | 55% | 52% |

where $p\u2212$ was the true value of *p* and $p^$ was the mean of *p*. The relative bias was defined as

The 95% highest probability density (HPD) accuracy is the percentage of trees with 95% HPD intervals containing the true value (where a 95 % HPD interval is defined as the shortest interval, which contains 95% of the posterior probability).

The estimation of *R*_{0} was accurate; the true *R*_{0} was contained in the 95% HPD in 94 of the 100 trees. The HPDs for each tree are plotted in figure 4. The accuracy of all estimates (*R*_{0}, *λ*, *μ*, *ψ*, and *λ* − *μ* − *ψ*) is given in table 1. Further, the table contains the result of the re-estimation of the net growth *λ* − *μ* − *ψ* based on the same 100 simulated trees but using the coalescent with an exponential growth prior instead of the BDM method for re-estimation.

The parameter estimates for *R*_{0}, *λ*, *μ*, *ψ*, and *λ* − *μ* − *ψ* using the Bayesian BDM method were reliable: 95% HPD intervals contained the true parameters in more than 90% of the cases. In particular, the true *λ* − *μ* − *ψ* was contained in the 95% HPD interval in 97% of the cases. However, when using the coalescent and estimating *λ* − *μ* − *ψ*, the 95% HPD intervals contained the true parameter in only 55% of the cases.

Using 10 trees with 10 tips instead of 1 tree with 100 tips yielded similar results. The improvement obtained when using ten trees is that the HPD interval width for *λ*,*μ*, and *ψ* become significantly smaller; however, the 95% accuracy for the net growth *λ* − *μ* − *ψ* drops to 68% (from 97%).

We further investigated correlations of parameters and recovered that *λ* is positively correlated with *μ* + *ψ*, see figure 5. Thus, the method can estimate the ratio *λ*/(*μ* + *ψ*) accurately but cannot determine well *λ* and *μ* + *ψ* separately. This explains the large HPD intervals for *λ*,*μ*,*ψ*, while having a confined HPD interval for .

### Estimates Obtained for HIV-1 in Switzerland

We applied our Bayesian BDM method to HIV-1 sequence data from the SHCS (The Swiss HIV Cohort Study 2010). As explained in the Materials and Methods section, we focused our analysis on ten Swiss HIV-1 subepidemics in order to exclude biases due to migration from outside of Switzerland. In Case (i), we assumed the same epidemiological parameters *λ*,*μ*,*ψ* in the ten subepidemics, and in Case (ii), we allowed for different parameters in the ten subepidemics.

In Case (i), we obtain the following mean estimates: transmission rate *λ* = 8.23×10^{ − 4}/day (or 0.30/year); becoming noninfectious rate *μ* = 9.46×10^{ − 5}/day; sampling rate *ψ* = 2.78×10^{ − 4}/day. The 95% HPD intervals are given in table 2. As the average time until becoming noninfectious is $1\mu +\psi $, we observe an average time of infectiousness of 7.74[4.39,10.99] years. From the estimates of *λ*,*μ*,*ψ*, we calculate the mean posterior *R*_{0} = 2.29 with a 95% HPD interval [1.61,3.05] (see table 2). The times of origin *t*_{or} of the ten different subepidemics are between 1987 and 1994 (see also fig. 3).

R_{0} | λ | ||||||

Mean | 95% HPD | Mean | 95% HPD | ||||

Case (i) | 2.29 | [1.61, 3.05] | Case (i) | 8.23 | [6.62, 10.17] | ||

1 | 2.95 | [1.01, 6.82] | 1 | 10.93 | [3.58, 24.52] | ||

2 | 1.74 | [0.92, 3.03] | 2 | 13.00 | [4.12, 31.18] | ||

3 | 1:06 | [0.64, 1.50] | 3 | 18.47 | [5.38, 41.70] | ||

4 | 1.57 | [0.83, 2.67] | 4 | 31.55 | [9.72, 72.86] | ||

Case (ii) | 5 | 2.92 | [0.93, 7.46] | Case (ii) | 5 | 11.56 | [3.69, 27.90] |

6 | 1.88 | [0.90, 3.68] | 6 | 21.32 | [5.72, 48.83] | ||

7 | 2.85 | [1.01, 6.81] | 7 | 17.18 | [4.45, 41.34] | ||

8 | 1.26 | [0.52, 2.21] | 8 | 16.34 | [3.57, 37.46] | ||

9 | 3.00 | [0,77, 7.82] | 9 | 22.06 | [5.20, 51.37] | ||

10 | 1.70 | [0.60, 3.28] | 10 | 20.71 | [4.04, 49.71] | ||

μ | ψ | ||||||

Mean | 95% HPD | Mean | 95% HPD | ||||

Case (i) | 0.95 | [0.00, 2.80] | Case (i) | 2.78 | [1.84, 3.78] | ||

1 | 5.88 | [0.00, 19.93] | 1 | 0.67 | [0.01, 1.71] | ||

2 | 7.33 | [0.00, 26.19] | 2 | 1.95 | [0.17, 3.87] | ||

3 | 11.63 | [0.00, 37.37] | 3 | 6.16 | [1.03, 1128] | ||

4 | 18.26 | [0.00, 61.33] | 4 | 5.65 | [0.72, 11.48] | ||

Case (ii) | 5 | 6.19 | [0.00, 22.97] | Case (ii) | 5 | 0.95 | [0.00, 2.59] |

6 | 11.93 | [0.00, 39.13] | 6 | 2.93 | [0.08, 6.29] | ||

7 | 8.96 | [0.00, 33.20] | 7 | 1.15 | [0.04, 2.80] | ||

8 | 10.14 | [0.00, 32.79] | 8 | 4.46 | [0.22, 9.49] | ||

9 | 11.83 | [0.00, 40.59] | 9 | 1.73 | [0.01, 5.33] | ||

10 | 11.99 | [0.00, 41.10] | 10 | 3.33 | [0.19, 7.19] |

R_{0} | λ | ||||||

Mean | 95% HPD | Mean | 95% HPD | ||||

Case (i) | 2.29 | [1.61, 3.05] | Case (i) | 8.23 | [6.62, 10.17] | ||

1 | 2.95 | [1.01, 6.82] | 1 | 10.93 | [3.58, 24.52] | ||

2 | 1.74 | [0.92, 3.03] | 2 | 13.00 | [4.12, 31.18] | ||

3 | 1:06 | [0.64, 1.50] | 3 | 18.47 | [5.38, 41.70] | ||

4 | 1.57 | [0.83, 2.67] | 4 | 31.55 | [9.72, 72.86] | ||

Case (ii) | 5 | 2.92 | [0.93, 7.46] | Case (ii) | 5 | 11.56 | [3.69, 27.90] |

6 | 1.88 | [0.90, 3.68] | 6 | 21.32 | [5.72, 48.83] | ||

7 | 2.85 | [1.01, 6.81] | 7 | 17.18 | [4.45, 41.34] | ||

8 | 1.26 | [0.52, 2.21] | 8 | 16.34 | [3.57, 37.46] | ||

9 | 3.00 | [0,77, 7.82] | 9 | 22.06 | [5.20, 51.37] | ||

10 | 1.70 | [0.60, 3.28] | 10 | 20.71 | [4.04, 49.71] | ||

μ | ψ | ||||||

Mean | 95% HPD | Mean | 95% HPD | ||||

Case (i) | 0.95 | [0.00, 2.80] | Case (i) | 2.78 | [1.84, 3.78] | ||

1 | 5.88 | [0.00, 19.93] | 1 | 0.67 | [0.01, 1.71] | ||

2 | 7.33 | [0.00, 26.19] | 2 | 1.95 | [0.17, 3.87] | ||

3 | 11.63 | [0.00, 37.37] | 3 | 6.16 | [1.03, 1128] | ||

4 | 18.26 | [0.00, 61.33] | 4 | 5.65 | [0.72, 11.48] | ||

Case (ii) | 5 | 6.19 | [0.00, 22.97] | Case (ii) | 5 | 0.95 | [0.00, 2.59] |

6 | 11.93 | [0.00, 39.13] | 6 | 2.93 | [0.08, 6.29] | ||

7 | 8.96 | [0.00, 33.20] | 7 | 1.15 | [0.04, 2.80] | ||

8 | 10.14 | [0.00, 32.79] | 8 | 4.46 | [0.22, 9.49] | ||

9 | 11.83 | [0.00, 40.59] | 9 | 1.73 | [0.01, 5.33] | ||

10 | 11.99 | [0.00, 41.10] | 10 | 3.33 | [0.19, 7.19] |

Note.—Case (i) assumes the same epidemiological parameters *λ*, *μ*, *ψ* in all subepidemics; Case (ii) allows the parameters to vary between the subepidemics 1, …, 10. The rates *λ*, *μ*, *ψ* are stated in units 10^{−4}/day.

In Case (ii), the estimates lie in the same range as for the analysis of Case (i) (i.e., the 95% HPD intervals largely overlap). The estimated mean rates and *R*_{0} for all analyses together with the 95% HPD intervals are given in table 2. The mean times of origin of the ten subepidemics are shown in figure 3.

The 95% HPD intervals for Case (ii) are much wider than for Case (i), due to less data being available for the estimation of the epidemiological parameters. Further, we note that in Case (ii) compared with Case (i), the mean of *λ* and *μ* is always larger. This bias can partially be explained through correlations in parameter estimates. We observe from the simulations and the data that *λ* correlates linearly with *μ* + *ψ*, see figure 5. Determining why larger *λ* and *μ* values are estimated for smaller data sizes needs to be investigated in future simulation studies. For the present work, it is important that the biases vanish when considering *R*_{0}.

In Case (ii), we obtain different *R*_{0} estimates for the different subepidemics. These differences could be due to a variety of factors such as transmission group, the size of the epidemic, the time of origin of the subepidemic, or stochastic fluctuations.

The subepidemics studied here are dominated by different transmission groups (Kouyos et al. 2010). In particular, the subepidemics here are characterized either by predominant transmission between men having sex with men (MSM) or by predominant transmission between mixed groups of heterosexuals (HET) and intravenous drug users (IDU). The composition of the subepidemics according to transmission group are shown in table 3. The subepidemics 3,6, and 9 are dominated by HET and IDU. The other subepidemics are dominated by MSM. The mean *R*_{0} in the HET/IDU subepidemics shows no trend to be lower or higher than the mean *R*_{0} in the MSM subepidemics: The three HET/IDU subepidemics have 1st, 5th, and 10th largest mean *R*_{0} of the ten mean *R*_{0}. This is confirmed by statistical analysis: A nonparametric test (runs statistic, Hogg and Craig 1994) does not reject the null hypothesis, that *R*_{0} is the same in the HET/IDU and MSM transmission groups, with a *P* value of 0.58.

Transmission Group | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |

MSM | 31 | 24 | 0 | 25 | 19 | 0 | 15 | 9 | 0 | 9 |

HET | 3 | 4 | 6 | 1 | 3 | 6 | 2 | 5 | 8 | 3 |

IDU | 0 | 0 | 21 | 0 | 2 | 12 | 0 | 0 | 0 | 0 |

Other | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 0 |

Unknown | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 |

Foreign | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |

Transmission Group | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |

MSM | 31 | 24 | 0 | 25 | 19 | 0 | 15 | 9 | 0 | 9 |

HET | 3 | 4 | 6 | 1 | 3 | 6 | 2 | 5 | 8 | 3 |

IDU | 0 | 0 | 21 | 0 | 2 | 12 | 0 | 0 | 0 | 0 |

Other | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 0 |

Unknown | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 |

Foreign | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |

NOte.—The individuals are sorted with respect to transmission group: men having MSM, HET, IDU, other, unknown, and non-Swiss individual (foreign). Note that the foreign individuals were included to detect Swiss clusters, but they are excluded when estimating the rates in the subepidemics.

We test whether the size of the subepidemic correlates with the *R*_{0} by regressing the size of the subepidemic against *R*_{0}. The absence of a significant correlation (Pearson correlation 0.07, *P* value 0.84) suggests that size is not a major determinant of *R*_{0} (see also fig. 6). Moreover, this also suggests that our estimate of *R*_{0} is not affected by the fact that we only include subepidemics with sample size of 12 and larger in our analysis.

In order to investigate if the *R*_{0} is different in old and new subepidemics, we plot the estimated mean time of origin of each subepidemic against the estimated mean *R*_{0} (see fig. 7). Although the *R*_{0} decreases for younger clusters, the correlation is nonsignificant (Pearson correlation − 0.49, p-value 0.15). (Pearson correlation − 0.49, *P* value 0.15).

Clearly, the variation in *R*_{0} in the different subepidemics could also be explained by further factors such as differences in behavior within different subepidemics independent of transmission group or the founder strains of the subepidemics having varying virulence (Alizon et al. 2010). However, in the absence of any such data, we cannot test for the role of these factors. Finally, it should be noted that the differences in *R*_{0} could also be simply due to chance given that the 95% HPD intervals overlap in most cases.

### Validation of HIV-1 Estimates

To verify that our estimates are compatible with other data regarding the Swiss HIV epidemic, we compared the estimates obtained from our analysis with estimates obtained through other methods.

First, the mean probability of an infected individual having been sampled before dying, $\psi \psi +\mu $, is estimated with our method to be 77.6% (95% HPD [44.7,100.0]). This is in very good agreement with the recent Swiss HIV Cohort report (The Swiss HIV Cohort Study 2010), where it is estimated that 75% of all Swiss HIV infected which developed AIDS are enrolled in the Swiss Cohort.

The expected number of secondary infections caused by an infected individual over a year is 365×*λ* = 0.30 with 95 % HPD [0.24,0.37] (table 2). This parameter has been estimated from transmission clusters using the SHCS (The Swiss HIV Cohort Study 2010) together with the Zurich primary HIV infection study (Metzner et al. 2010) to be 1.8 (95% confidence interval [0.5,5.8]) for people infecting in the chronic phase (Rieder et al. 2010). Their confidence interval is not overlapping with our confidence interval, which might be due to real transmission differences in the different data sets (recall that simulations show a 100% accuracy for our confidence interval, table 1).

Thus, we applied our Bayesian BDM method to the largest transmission cluster found in Rieder et al. (2010). For this cluster, we obtained a mean estimate of 365×*λ* = 1.14 with 95 % HPD [0.31,2.66], which largely overlaps with but is more confined than the previously estimated confidence interval [0.5,5.8] (Rieder et al. 2010).

### Comparison of BDM Analysis to Classical Analysis Using the Coalescent

Under the coalescent with exponential growth, the net growth parameter *λ* − *μ* − *ψ* can be estimated. Recall that based on the simulated trees, the net growth is estimated more accurate under the BDM than under the coalescent (97% vs. 55% HPD accuracy).

To investigate the impact of model choice on empirical data results, we analyzed the ten Swiss HIV subepidemics assuming the coalescent with exponential growth and compared these results with the BDM analysis (Case [i]). Under the coalescent, the mean exponential growth parameter was estimated to be 6.89×10^{ − 4}/day. The 95% HPD interval is [3.01×10^{ − 4},11.09×10^{ − 4}]. Under the BDM, we estimated for the exponential growth parameter *λ* − *μ* − *ψ* a mean of 4.51×10^{ − 4}/day with 95% HPD interval [3.20×10^{ − 4},5.72×10^{ − 4}]. The 95% HPD interval of the BDM analysis is fully contained within the HPD interval of the coalescent analysis meaning that the BDM analysis has the power to provide more confined HPD intervals. Further, recall that the BDM method is able to provide the parameters *λ*,*μ*,*ψ* independently such that *R*_{0} can be calculated; the coalescent only provides the net growth *λ* − *μ* − *ψ*.

## Discussion

### Estimation of Key Epidemiological Parameters

Our study presents and applies a method to infer key epidemiological parameters directly from viral sequence data. Various attempts have been made to use genetic sequences in order to estimate epidemiological parameters. However, none of these studies have been able to infer the basic reproductive number *R*_{0} only from sequence data. *R*_{0} of an epidemic (Hepatitis C Virus, HCV) was estimated from viral sequence data for the first time in Pybus et al. (2001). Because these authors used the coalescent as a transmission model, they could not directly infer transmission and death rates but instead required an independent estimate of average duration of infectiousness. In Volz et al. (2009) and Frost and Volz (2010), transmission rates are introduced to the coalescent framework, but an independent estimate is still required for the duration of infectiousness in order to calculate *R*_{0}.

Although assuming an estimate of the duration of infectiousness may be appropriate for HCV, the estimation of the time of infectiousness is fraught with difficulties for many infections. In particular in HIV, the duration of infection is highly variable between patients as the time until AIDS can vary between 2 and 20 years. Moreover, the time span over which a patient is (highly) infectious is debated (Yerly et al. 2001, Yerly et al. 2007, Brenner et al. 2008, Hollingsworth et al. 2008, Rieder et al. 2010). Given the uncertainties in estimating the duration of infectiousness together with its variability and given the observation that fixing the time of infectiousness to different values yields different *R*_{0} estimates (Pybus et al. 2001, Wallinga and Lipsitch 2007), such coalescent-based methods to infer *R*_{0} for HIV have to be used with great care.

Our Bayesian BDM method overcomes these problems essentially by independently estimating transmission and death rates together with the transmission chain given sequentially sampled sequence data. Although our method thus represents an advance over coalescent-based methods, it also has certain shortcomings. In particular, being based on a BDM, our approach assumes exponential growth of the epidemic. To fully account for the population dynamics of the epidemic would require a tree-generating model based on more explicit epidemiological models (such as SI, SIR, or SEIR models [S: susceptible, E: exposed, I: infectious, R: recovered]; Keeling and Rohani 2008). Such models would not only account for an initial exponential increase but also a saturation phase as susceptible hosts are becoming limited.

For the application of our method to the HIV data from Switzerland, we believe that saturation is not a major concern. If the considered subepidemics had progressed beyond the exponential phase, late edges in the trees would be expected to be long compared with early edges. Long late edges should result in vanishing estimates for the becoming noninfectious rate (Nee et al. 1994). The fact that we obtain nonzero estimates for the becoming noninfectious rates indicates that the subepidemics did not yet reach the postexponential phase. We emphasize that the considered subepidemics being in the exponential phase does not contradict the Swiss epidemic as a whole being in the postexponential phase.

One study has been published that estimates *R*_{0} in HIV from sequence data (Volz et al. 2009) based on a modified coalescent model. Using a transmission tree inferred from sequence data of HIV-infected individuals sampled at one time point (1993) in the United States, the estimate *R*_{0} = 2.29 is obtained, assuming a time of infectiousness of 10 years. A few studies have been published that estimate *R*_{0} in HIV from epidemiological data (Bezemer et al. 2010, Nishiura 2010). These estimates were obtained from the temporal changes of the incidence of the infection and ad hoc estimates for the time of infectiousness. For example, using data for the early HIV epidemic until 1984, the *R*_{0} of HIV in Western Europe was estimated to be between 3.5 and 4.1 (Nishiura 2010) using published estimates for the time span of infectiousness and time-dependent infection intensity (Hollingsworth et al. 2008). In Bezemer et al. (2010), the *R*_{0} for the MSM transmission group in the Netherlands was estimated with a likelihood approach for different time periods showing temporal fluctuations (1980–1983: *R*_{0} = 2.39[2.17,2.76]; 1984–1995: *R*_{0} = 0.89[0.85,0.93]; 1996–1999: *R*_{0} = 0.76[0.70,0.86]; 2000–2003: *R*_{0} = 1.04[0.98,1.09]).

The actual reproductive number, *R*_{a}, is defined as the number of secondary infections caused by a single infected individual for the current frequency of susceptibles (Amundsen et al. 2004), whereas *R*_{0} is defined as the number of secondary infections when the entire population is still susceptible. Provided an epidemic is far from saturation, then the estimates of *R*_{a} can be used as a good approximation for *R*_{0}.

Assuming an average infectiousness period of 10 years, *R*_{a} for HIV in the United Kingdom was estimated to be lower than 1 between 1995 and 2004, *R*_{a} remained lower than 1 for HET and above 1 for MSM (White et al. 2006). An independent study estimated for MSM in 1995 an *R*_{a} of 0.55 in Denmark, 0.85 in Norway, and 0.58 in Sweden (Amundsen et al. 2004). For IDU, *R*_{a} was estimated to be 3.5 in Latvia and 21.7 in Lithuania in 2002 (assuming an average duration of infectiousness of 11 years).

Taken together, the estimates of *R*_{0} and *R*_{a} vary considerably. This may be in part due to methodological difference but likely also reflects differences in the epidemics in different countries or different transmission groups. Our estimate of *R*_{0} = 2.29 is thus broadly in agreement with these earlier estimates. Note, in particular, that our estimates represent time averages over the epidemic with some of the considered subepidemics ranging back to the 1980.

Comparing our parameter estimates to quantities estimated by independent means from the Swiss HIV Cohort in previous studies (Metzner et al. 2010, Rieder et al. 2010, The Swiss HIV Cohort Study 2010) reinforces our confidence in the method. Specifically, our estimated sampling probability is in good agreement with previous estimates.

Analyzing the transmission in distinct subepidemics in Switzerland, we did not find any significant correlation between *R*_{0} and size of the subepidemic, or age of the subepidemic, or transmission group (HET/IDU versus MSM). The absence of such correlation may be due in part to the limited number of subepidemics that could be studied here. For example, the association between age of the subepidemic and *R*_{0} shows a trend towards decreasing *R*_{0} in younger subepidemics. However, the overall absence of significant associations suggests that neither size, age, nor transmission group of the subepidemic are major explanatory factors of *R*_{0}.

A legitimate concern is that our method requires subepidemics that are large enough such that a substantial number of patients are sampled. Therefore, our analysis is biased towards larger subepidemics, which may in turn result from larger *R*_{0}. However, as there is no correlation between the size of the subepidemic and *R*_{0} in our study, there is no evidence that our estimate *R*_{0} overestimates the true *R*_{0} in the entire Swiss epidemic. Note, moreover, that a general bias towards larger subepidemics is not a limitation just of our method but more generally applies to all studies mentioned here.

### Tree-Generating Models in Bayesian Analyses

The Bayesian BDM method developed and applied here has the advantage over previous methods (which are based on the coalescent) that the underlying tree-generating model reflects more accurately the epidemiological process of disease transmission. This implies that the BDM method provides estimates of key epidemiological parameters, whereas the coalescent only provides an estimate of the net growth of an epidemic. Furthermore, our simulations reveal an improved HPD accuracy for the net growth when using the BDM method instead of the classic coalescent method (97% vs. 55%). The Swiss HIV data analysis reveals three times more confined HPD intervals for the net growth when using the BDM method. Therefore, we consider the Bayesian BDM method to be more accurate and appropriate not only in cases where epidemiological parameters are being inferred but also generally when Bayesian methods are used for phylogenetic analysis of epidemiological sequence data.

Our BDM can only account for the exponential growth phase of an epidemic. Thus, the coalescent is still the only framework under which postexponential epidemic dynamics can be investigated. Having shown the advantages of the BDM over the coalescent, this paper will hopefully stimulate research to also use BDMs in order to describe the postexponential phase of the epidemic.

The method was applied here specifically to HIV but can be used to infer the epidemiology of other viral epidemics. Moreover, it could be adjusted to infer a within-host basic reproductive number *R*_{0} based on sequence samples that are obtained over the course of a viral infection in an individual patient; again, this would circumvent the requirement of an independent estimate of the expected generation time. Hence, our Bayesian BDM method represent a versatile tool for phylogenetic analysis of viral sequence data.

We thank the patients for participation in the SHCS and the clinical trials, the physicians and study nurses for excellent patient care, the laboratory technicians of the Swiss resistance laboratories for the quality of the data, SmartGene, Zug, Switzerland for providing excellent technical support, and Brigitte Remy, Martin Rickenbach, and Yannick Vallet from the SHCS data center in Lausanne for the data management. Funding sources: This study has been financed in the framework of the SHCS and supported by the Swiss National Science Foundation (SNF grant number 3345-062041). Further support was provided by the SNF (grants numbers 3247B0-112594 to H.F.G., S.Y., B.L., S.B.; 324730 − 130865 to H.F.G.; 3100*A*0 − 116408 to S.B.); SHCS project 470, 528 and 569; the SHCS Research Foundation, by a further research grant of the Union Bank of Switzerland in the name of a donor to H.F.G. and by the European Community's Seventh Framework Programme (grant FP7/2007 − 2013), under the Collaborative HIV and Anti-HIV Drug Resistance Network (grant 223131 to H.F.G.). The funding agencies had no role in conducting the study or in preparing the manuscript.

## References

## Author notes

**Associate editor:**Daniel Falush