Quantifying Transmission Heterogeneity Using Both Pathogen Phylogenies and Incidence Time Series

Abstract Heterogeneity in individual-level transmissibility can be quantified by the dispersion parameter k of the offspring distribution. Quantifying heterogeneity is important as it affects other parameter estimates, it modulates the degree of unpredictability of an epidemic, and it needs to be accounted for in models of infection control. Aggregated data such as incidence time series are often not sufficiently informative to estimate k. Incorporating phylogenetic analysis can help to estimate k concurrently with other epidemiological parameters. We have developed an inference framework that uses particle Markov Chain Monte Carlo to estimate k and other epidemiological parameters using both incidence time series and the pathogen phylogeny. Using the framework to fit a modified compartmental transmission model that includes the parameter k to simulated data, we found that more accurate and less biased estimates of the reproductive number were obtained by combining epidemiological and phylogenetic analyses. However, k was most accurately estimated using pathogen phylogeny alone. Accurately estimating k was necessary for unbiased estimates of the reproductive number, but it did not affect the accuracy of reporting probability and epidemic start date estimates. We further demonstrated that inference was possible in the presence of phylogenetic uncertainty by sampling from the posterior distribution of phylogenies. Finally, we used the inference framework to estimate transmission parameters from epidemiological and genetic data collected during a poliovirus outbreak. Despite the large degree of phylogenetic uncertainty, we demonstrated that incorporating phylogenetic data in parameter inference improved the accuracy and precision of estimates.


Introduction
The intensity of epidemics is often summarized by the reproductive number R, the average number of secondary infections caused by a typical infectious individual over the course of their infectious period. This statistic is useful for determining whether an epidemic can take off and if so the final size of the epidemic. However, large variation between individuals is frequently observed in outbreaks of directly transmitted acute infections leading to superspreading events such that a few individuals cause most of the infections (Lloyd-Smith et al. 2005). The offspring distribution captures the distribution of secondary infections per infectious individual, and can be parameterized by a negative binomial with mean R and dispersion k. The presence of superspreading as indicated by small values of k can affect the effectiveness of control strategies (Garske and Rhodes 2008).
Inferring the value of k from data is not straightforward, even in the presence of contact tracing data as many infections may be asymptomatic or not reported. The offspring distribution fit to incomplete transmission chain data has to be corrected for biased and under-reporting (International Ebola Response Team 2016). Obtaining precise estimates of k from just incidence time series is usually not possible because k only affects the noisiness of the incidence time series at low numbers.
Besides epidemiological data, pathogen population genetics are playing an increasingly important role in inferring epidemiological parameters (Volz et al. 2009;Koelle and Rasmussen 2012;Volz 2012;Stadler et al. 2013;Kühnert et al. 2014). For coalescent-based approaches, the offspring distribution is integral to the inference process as it affects the relationship between the underlying epidemic and the observed distribution of coalescent (branching) events in the pathogen phylogeny. When the offspring distribution is overdispersed, shorter intervals between coalescent times in the pathogen phylogeny are observed. This is expected as coalescent events correspond to transmission events during the epidemic and superspreaders can cause the aggregation of many transmission events within a short period of time.
Given that epidemiological parameters could be estimated either from epidemiological data or from phylogenetic data, combining the analysis of both types of data should provide more accurate and precise estimates. Rasmussen et al. (2011) found that estimating parameters jointly from both incidence time series and pathogen phylogeny reduced uncertainties in estimates of parameters and the prevalence over time. However, this work did not allow for uncertainty in the pathogen phylogeny and was limited to simple SIR models. While Rasmussen et al. (2014) showed that parameter estimates did not significantly change for phylogenies of sequences collected over many years, the uncertainty in pathogen phylogeny during outbreaks is generally greater and needs to be accounted for to ensure accurate estimation of transmission parameters.
Here we develop a statistical inference framework to fit a stochastic compartmental model with an explicit offspring distribution to enable the estimation of k and other epidemiological parameters from outbreak data when there is uncertainty in the pathogen phylogeny. A major challenge is that small numbers of infections at the start of epidemics combined with highly heterogeneous transmission lead to substantial stochasticity in the initial dynamics of epidemics. We used the Particle Monte Carlo Markov Chain (PMCMC) method to infer parameters while integrating over stochastic outcomes (Andrieu et al. 2010;Rasmussen et al. 2011).
We simulated data to assess the coverage, precision, and bias of estimates of k and other epidemiological parameters using our method. We also applied our inference framework to genetic sequence and epidemiological data collected during a poliovirus outbreak (Yakovenko et al. 2014). Although polio sequences have provided interesting insight into pathogen diversity and geographic transmission routes, phylodynamic methods such as those presented here have not previously been used to infer epidemiological parameters.

New Approaches
The inference framework that we developed here uses both incidence time series and pathogen phylogenies to infer the value of k and other epidemiological parameters. It is based on integrating results from coalescent theory and epidemiological modeling into a combined inference framework and using particle filtering to estimate the marginal likelihood of the data under the model in an MCMC approach. The main novelty of our method is the ability to estimate k by fitting stochastic compartmental models to both epidemiological and phylogenetic data. To do this we needed a transmission model with an explicit offspring distribution, a formulation of the coalescent parameterized in terms of the offspring distribution parameters, and a method to integrate over the phylogenetic uncertainty that often arises in rapid outbreaks.

Compartmental Model with Explicit Offspring Distribution
The offspring distribution describes the distribution of secondary infections Z i caused by each infectious individual i. The offspring distribution for unstructured compartmental models such as the SIR is geometric due to the Poisson transmission process and exponentially distributed duration of infection. However, the geometric is not able to capture the variation in infectiousness for several directly transmitted diseases for which the negative binomial provides a better fit (Lloyd-Smith et al. 2005). We use a negative binomial distribution with mean R t and variance R t 1 þ 1 k À Á to characterize the offspring distribution of individuals infected at time step t ¼ f1; . . . ; n T g, where the size of the simulation time step is Dt. The dispersion parameter k determines the level of overdispersion in the offspring distribution. A small k means that a few superspreading individuals cause most of the infections.
Without specifying an offspring distribution, the implicit assumption of the stochastic SIR model is that the offspring distribution is geometric. Throughout this paper, we parameterize the negative binomial with the mean and dispersion as they are the most relevant statistics for epidemiological studies. For a given individual i infected at time step H i , the number of secondary infections they cause is distributed according to Z i $ NBinðR t¼H i ; kÞ. The reproductive number R t is the average number of secondary infections caused by an individual infected at time step t: R t ¼ EðZ i jH i ¼ tÞ.
We simulated from a modified susceptible-infectedremoved (SIR) model using the binomial distribution based tau-leap method (Chatterjee et al. 2005). As we focused on the analysis of acute infectious diseases with short generation times, we approximated the transmission process by assuming all secondary infections occur at the end of the infectious period. This approximation was used so that we could simulate according an arbitrary offspring distribution without having to keep track of infection times of infected individuals. In practice, this meant drawing a random number from i $ NBinðR t¼H i Y t ; kY t Þ, where Y t was the number of infectious individuals recovering at time step t.

Coalescent Likelihood
The original coalescent provided the statistical distribution of coalescent times for a given effective population size Ne (Kingman 1982). In the context of epidemiology, Ne is related to the number of infectious individuals N via the variance of the offspring distribution r 2 at endemic equilibrium Ne ¼ N r 2 (Koelle and Rasmussen 2012). The discrepancy between the total number of infectious individuals and the effective number of individuals contributing to infections increases with the variance of the offspring distribution. For infectious diseases, N corresponds with the number of infectious individuals Y (time index t has been dropped for clarity).
Going backward in time, the probability density function of the time to coalescence of a pair of lineages U is f ðUÞ ¼ ke ÀkU , where rate k ¼ 1 NeÁT g and T g is the generation time (duration of infection).
In this work, we use a formulation of the coalescent parameterized by an arbitrary offspring distribution that is timevarying (Fraser and Li 2017). Assuming a negative binomial offspring distribution with mean R and variance R 1 þ R k À Á , the expected rate of coalescence for a pair of lineages is given by: where T g is the mean generation time, that is, the duration of infectiousness in an SIR model. We assume that k; R; and N change over time in a step-wise fashion at

Phylogenetic Uncertainty
To avoid the computationally intensive process of phylogenetic reconstruction on top of particle filtering, we separated the process of parameter estimation and phylogeny reconstruction. In Rasmussen et al. (2014), the authors used the same approach but with fewer (ten) phylogenies and they did not pool together posterior estimates to get an overall posterior distribution. An alternative approach was taken by Volz and Pond (2014) where they calculated the average likelihood across all phylogenies at each iteration of MCMC. Because we are adopting a PMCMC approach, it was more computationally efficient to run parallel PMCMC for each phylogeny than using the particle filter multiple times at each iteration of the MCMC.
Phylogenetic reconstruction programs such as MrBayes (Ronquist et al. 2012) produce a posterior distribution PðPhyjSÞ of phylogenies given the sequences S. To estimate the marginal posterior probability of the parameters PðhjSÞ, we can integrate over the phylogenies: P hS ð Þ / PðhÞ Ð Phy PðPhyjhÞPðPhyjSÞdPhy. Taking M samples from the posterior distribution PðPhyjSÞ, we can estimate the marginal posterior density PðhjSÞ using equation (2).
PðPhy m ð Þ jhÞ: However, as we reconstruct the phylogeny independently from estimation of epidemiological parameters, equation (2) is only an approximation of the marginal posterior density. The number of phylogenies M needed to accurately estimate epidemiological parameters depends on uncertainties in branching times and the molecular clock rate.

Implementation
Although programs that implement PMCMC do exist, they are mostly tailored for analyses of incidence time series (Dureau et al. 2013;King et al. 2016). Existing code for concurrently analyzing incidence time series and phylogenetic data is not parallelized and does not allow for phylogenetic uncertainty (Rasmussen et al. 2014). Parallelization is not needed if the number of particles is small (a few hundred), for example for analyzing data generated through largely deterministic processes. The number of particles needed to obtain stable estimates of likelihood scales with the length of the time series (Andrieu et al. 2010) as well as the stochasticity of the model. Without parallelization, the MCMC would not converge in a reasonable amount of time for outbreak data.
We implemented a parallelized version of particle filtering implemented in C þþ (code available at github.com/ lucymli/EpiGenMCMC; last accessed June 23, 2017) with an accompanying R package (github.com/lucymli/EpiGenR; last accessed June 23, 2017) to interface with the C þþ program.
More details on the implementation are provided in the "Materials and Methods" section.

Results
An overview of the simulated data analyzed in the following sections is provided in the "Materials and Methods" section.
Phylogeny Is More Informative than Incidence Time Series for Estimating k Based on data simulated from an SIR model, pathogen phylogeny was needed to accurately estimate the dispersion parameter k of the offspring distribution when k was small ( fig. 1). This suggested that superspreading events left sufficient signal to allow inference of k in the phylogeny but not the incidence time series. There was insufficient signal in the incidence time series to determine the value of k. Although using both epidemiological and phylogenetic data produced the least biased and most precise estimates of k, the coverage was lower than using each set of data alone as the true value did not fall within the credible intervals in many instances. In fact, using phylogenetic data alone seemed to produce the most accurate estimates that were only slightly less precise and slightly more biased compared with estimates that used both types of data (table 1).
We also estimated the basic reproductive number R 0 , time of the first infection T 0 , and the probability of sampling an infectious individual in the incidence time series q at the same time as estimating k. There were no noticeable k = 0.  fig. S1, Supplementary Material online). We did not estimate q when just using the phylogenetic data, as the reporting rate q referred to the probability that an infection appeared in the incidence time series.
Estimates of the R 0 and k were closer to the true value when both genetic and epidemiological data were used in inference, compared with fitting to each set of data individually (table 1). The coverage of estimates for T 0 and q while fitting to both sets of data was similar to fitting to epidemiological or phylogenetic data alone.

Estimates of R 0 Were Biased If k Was Fixed at the Incorrect Value
The dispersion parameter k of the offspring distribution is usually not estimated when fitting compartmental models. We investigated the effects of assuming the wrong value of k on parameter estimates, especially on R 0 estimates. The implicit assumption of an unstructured SIR compartmental transmission model is that the offspring distribution is geometrically distributed, which is equivalent to fixing k ¼ 1 in the negative binomial. For a subset of simulated outbreaks (those where we sampled 1% of individuals), we re-estimated parameters with a fixed k ¼ 1.
We compared these results ( fig. 2) to those obtained when k was also estimated ( fig. 1), and found significant differences in R 0 estimates when the true value of k 6 ¼ 1. This was evidenced by an increase in the Kolmogorov-Smirnov distances (Massey 1951), a measure of distance between two distributions when the true value of k 6 ¼ 1 compared with when the true value of k ¼ 1 (table 2). Inference using just incidence time series was less affected by assumptions of k, given the small K-S distances.
Estimates of the reporting rate and the epidemic start date were not affected by assumptions of k, regardless of the data used during inference.

Estimation from Multiple Phylogenies
For a subset of data (those generated using R 0 ¼ 2 and k ¼ 0:1), we re-estimated the parameters for each phylogeny inferred from the simulated sequences ( fig. 3). All 95% HPD intervals obtained using inferred phylogenies included the true parameter value. However, estimates of R 0 and k obtained from inferred phylogenies instead of the true phylogeny reduced precision and increased bias, although estimates of k were still more precise than those estimated from epidemiological data. Interestingly, estimates of the epidemic start date were less biased when using inferred phylogenies than the true phylogeny, although this might simply be due to the sample phylogeny randomly having a tree height further away from the epidemic start date.

Phylodynamic Analysis of a Wild Poliovirus Type 1 Outbreak
Although most poliovirus infections are asymptomatic, temporary or permanent paralysis can occasionally occur. Those symptomatic cases are reported to the World Health Organization. Accurately estimating the reporting rate (i.e., case-to-infection ratio) is especially important as the eradication of polio approaches completion. The consensus value often used in epidemiological modeling of wild poliovirus type 1 (WPV1) is 0.5% (Grassly et al. 2006;Blake et al. 2014). If the reporting rate is lower than expected, then a much longer period of no reported infections must pass before eradication can be certain (Eichner and Dietz 1996;Kalkowska et al. 2015). We wanted to apply our inference method to poliovirus phylogenies to see if we can obtain more accurate estimates of the reporting rate compared with using incidence time series alone.
In 2010, a large outbreak of wild poliovirus type 1 occurred in Tajikistan resulting in 518 reported cases of poliomyelitis (Yakovenko et al. 2014). Fitting an SEIR model structured by three age groups to incidence time series, Blake et al. (2014) used iterated filtering to obtain maximum likelihood estimates of various epidemiological parameters. We fit the same model to pathogen phylogenies, incidence time series, or both using a Bayesian approach. The posterior distributions of parameters are presented in figure 4, and the median and 95% HPD intervals are in supplementary table S1 in the Supplementary Material online.
The rate of substitution estimated using MrBayes was 0.010 (0.006, 0.014) substitutions per site per year, which was in line with previous estimates (Jorba et al. 2008).
We obtained more precise estimates of the reporting rate when using both epidemiological and phylogenetic data. These estimates are dependent on the initial number of Table 1. Precision (measured by the root mean squared deviation), Bias, and Coverage (% of simulations in which the true value is found in the 95% highest posterior density intervals) of Parameter Estimates When Fitting Models to Either Epidemiological, Phylogenetic, or Both Types of Data. NOTE.-These statistics were evaluated across all simulations presented in figure 1. The estimates of epidemic start dates T 0 were converted to the number of days after an arbitrary date. For bias and precision, we normalized the statistics by the true parameter value.
Quantifying Transmission Heterogeneity . doi:10.1093/molbev/msx195 MBE susceptible individuals, which we fixed to their maximum likelihood estimates obtained by fitting to epidemiological data only (Blake et al. 2014).
The basic reproductive number of children aged 0-5 years R c was estimated to be 2.58 (2.23-2.98) when both epidemiological and phylogenetic data were used in inference. These values were more similar to the results obtained from just the incidence time series than those estimated from pathogen phylogeny, indicating that the epidemiological data were more informative of the reproductive number than pathogen phylogeny. The maximum likelihood estimate of 2.18 from Blake et al. (2014) were included within the 95% highest posterior density (HPD) interval except when only phylogenetic data were used for inference. Parameter estimates when reporting rate was 1 in 100 and k was fixed to 1. The horizontal dashed lines denote the true parameter value for that set of parameters that is, the parameter value used to simulate the data. The boxes indicate the median and 95% HPD interval of parameter estimates pooled from replicate simulations. The vertical lines with a single dot denote the median and 95% HPD interval of each individual simulation. Simulations where the MCMC chain did not converge were left out of the plot. Estimates of the reporting rate did not include inference from phylogenetic data, as the reporting rate refers to the probability that an infection appears in the incidence time series.  NOTE.-K-S values closer to 1 reflect larger discrepancies between the posterior distributions, whereas those close to 0 suggest no difference in posterior distributions. The numbers in the brackets denote the range (maximum-minimum) of K-S distances from different sets of simulated data, and the number preceding the brackets denotes the median K-S distance.

MBE
The posterior distributions of the reproductive number of older children and adults R a all included the maximum likelihood estimate of 0.46. Adding the phylogenetic data did not significantly alter parameter estimates. This was not surprising as the credible interval surrounding estimates using just phylogenetic data was much wider.
The estimated value of k was high (>1), indicating the lack of superspreading dynamics. The estimated values were 6.7 (0.1-349.7) when only genetic data were used for inference. These values were much higher when epidemiological data were used: 69.0 (2:6 Â 10 À3 -729.8), and when both data sets were used at the same time: 64.0 (1.8-518.1).
Given the large credible intervals around estimates of vaccine effectiveness per campaign using phylogenetic data, only epidemiological data were informative of this parameter. The credible intervals using just epidemiological data included the maximum likelihood estimate from Blake et al. (2014) at 69% (55-80%).
Finally, the estimated start date of the epidemic for analysis using both data sets, epidemiological data only, and phylogenetic data only all overlapped with each other, as well as with estimates from Blake et al. (2014).
Overall, it seems that incidence time series were more informative for estimates of reproductive number, k, and vaccine effectiveness than pathogen phylogeny. However, in all these cases, including the pathogen phylogeny improved the precision of estimates.

Discussion
Building on methods that enable parameter inference for stochastic models and phylodynamic approaches integrating both epidemiological and phylogenetic data, we presented a framework for quantifying the offspring distribution dispersion k while inferring key epidemiological parameters from both types of data. The addition of pathogen phylogeny to epidemiological inference was necessary to accurately estimate the dispersion of the offspring distribution k. This would be useful for detecting superspreading dynamics in infectious disease outbreaks where data from densely sampled transmission networks are not available.
The phylogenetic data were not useful for all estimated parameters, however. In the poliovirus analysis, phylogenetic data alone were not informative of vaccine effectiveness per campaign or the basic reproductive number of older children and adults.
The use of a single representative phylogeny to infer epidemiological parameters is sufficient for well-resolved phylogenies. Rasmussen et al. (2014) found that parameters of an HIV transmission model were broadly consistent amongst 10 phylogenies sampled from Bayesian phylogeny reconstruction in BEAST. As these sequences were collected over a number of years, there was sufficient confidence in the branching times that different phylogenies sampled from the posterior distribution produced similar estimates. In an outbreak setting when transmission happens over a short time compared with viral evolution, greater uncertainty in branching times meant that we needed to use a larger number of phylogenies to be confident of the posterior distribution of parameters. For the analysis of the 2010 polio outbreak in Tajikistan, integrating over a large number of phylogenies was necessary given the extent of uncertainty in branching times (supplementary fig. S4, Supplementary Material online).
Ideally, an inference framework would concurrently estimate epidemiological parameters and reconstruct the phylogeny. This could be implemented in existing phylogenetic reconstruction packages such as MrBayes (Ronquist et al. 2012) and BEAST (Drummond et al. 2012;Bouckaert et al. 2014) by incorporating a particle filter. However, such a framework would be even more computationally intensive than PMCMC due to the additional parameters that need to be estimated.
Existing approaches to epidemiological inference from pathogen phylogeny do not usually account for overdispersion in the offspring distribution (Volz et al. 2009). While the variance of the offspring distribution can be increased by dividing the population into a limited number of infectious categories (Volz and  Quantifying Transmission Heterogeneity . doi:10.1093/molbev/msx195 MBE discretization of infectiousness requires a structured coalescent approach whereas estimating the offspring distribution parameters assumes homogeneous mixing. In previous studies on the relationship between the effective population size N e and the prevalence N have assumed that N e ¼ N However, this relationship is only accurate when N is constant. The formulation used in this paper is valid for any arbitrary time-varying offspring distribution and changing prevalence N (Fraser and Li 2017). Although the particle filter produces an unbiased estimate of marginal likelihood, it is very computationally intensive.

MBE
The number of particles required scales with the length of simulations, the number of transitions in the model, and the reporting probability of cases. As we simulated from the index case, the epidemic trajectories at the beginning of simulations were highly unpredictable. Data sets with overdispersed offspring distribution further increased the stochasticity of simulations, necessitating a large number of particles to obtain a stable estimate of the marginal likelihood. In our implementation, we needed 10,000 particles for k ¼ 0:1 and at least 1,000 for k ¼ 1. For simpler models, approximations such as the Kalman filter can be used. The strength of PMCMC, however, is the applicability to a wide range of models including high-dimensional ones (Sheinson et al. 2014). We provide a parallelized C þþ implementation of the PMCMC algorithm with an accompanying R package to process input and output for the C þþ program. We decided to use our own implementation of PMCMC as existing programs and libraries did not provide all the necessary features we needed for inference from both epidemiological and phylogenetic data. The R package POMP provides an extensive array of inference and simulation methods for Markovian processes including PMCMC (King et al. 2016), however the program is not easily parallelizable and is not well suitable for inference from phylogenetic data. On the other hand, the BEAST packages are well optimized for inference of epidemiological and evolutionary parameters from pathogen sequences but have not yet implemented particle filtering and have a limited number of epidemiological models (Drummond et al. 2012;Bouckaert et al. 2014).
In addition to methodological contributions, we demonstrated the value of a phylodynamic approach for poliovirus research. While phylodynamic analyses have been used to characterize the epidemiological dynamics of other viral diseases such as influenza and HIV, such methods are not widely used for poliovirus analysis. Molecular surveillance through sequencing of poliovirus isolates has mainly been used for tracking the geographic spread of poliovirus in endemic countries (Angez et al. 2012), detecting orphan lineages which are indicative of long-term silent transmission , and reconstructing the history of pathogen diversity (Burns et al. 2013). Although model-based parameter inference has been used to analyze epidemiological data for polio (Grassly et al. 2006;Mangal et al. 2013;Blake et al. 2014), it has not been used to analyze viral sequence data.
The gold standard of polio surveillance has traditionally been through Acute Flaccid Paralysis (AFP) surveillance, in which stool samples from patients with AFP symptoms are tested for the presence of poliovirus. As the number of poliovirus infections decreases, there might be too few symptomatic cases reported through AFP surveillance to provide sufficient data in terms of incidence time series and viral sequences. Environmental sampling of poliovirus shed by asymptomatically infected individuals will thus play an increasingly important role in monitoring poliovirus and quantifying its epidemiology as eradication gets closer.
The inference framework presented here integrates the analyses of epidemiological and phylogenetic data. It can account for demographic stochasticity and phylogenetic uncertainty to quantify heterogeneity between individuals at the same time as estimating other epidemiological parameters. This inference method can be further applied to other rapidly evolving viral infections especially those with superspreading dynamics.

Coalescent Likelihood Calculation
The likelihood given a phylogeny (Phy) is calculated in a piecewise fashion for small time intervals. The small time intervals are unequal in size, and bounded by the times for one of three "events": end of a simulation time step (total of n T steps), coalescence (total of n tips À 1 events), or sampling (total of n tips events). The length of each time interval between events is denoted by U s where s ¼ f1; . . .; n T þ2n tips À1g and the number of lineages at the end of each interval is A s . Let gðtÞ return a vector of indices of time intervals the phylogeny corresponding to simulation time step t. The phylogenetic data at simulation time step t are summarized by Phy t ¼ fU gðtÞ ; A gðtÞ g.
At each simulation time step t, we calculated the coalescent likelihood sequentially for each time interval s 2 gðtÞ, where k t is calculated from the simulated epidemic trajectory using equation (1). The overall likelihood for all time intervals in simulation time step t is the product of probabilities calculated using equation (3) P Phy t k t ð Þ¼ Q s PðPhy t;s jk t;s Þ for s 2 gðtÞ.
For the poliovirus analysis we used an SEIR model instead, which included a short latent compartment E after an individual became infected. Because we still simulated secondary infections at the end of an individual's infectious period, this was equivalent to an SIR model with a gammadistributed instead of exponentially distributed generation time. Because the latent period was very short compared with the infectious period, we approximated the coalescent likelihood with the same formula as above, replacing I t with E t þ I t and setting T g as the duration of infection.
Quantifying Transmission Heterogeneity . doi:10.1093/molbev/msx195 MBE We ignored age-structure in the likelihood calculation based on the phylogeny because mixing between age groups occurred at a sufficiently high rate that we could ignore population structure. For the purposes of inference, we calculated the likelihood for the phylogeny using the reproductive number for the whole population. However, we included agestructure in the epidemic model because different age groups were vaccinated at each of the three immunization campaigns.
When data are sparsely sampled, the epidemiological and phylogenetic data can be considered to be independent. Thus, when inferring from both types of data, the overall likelihood is calculated as the product of epidemiological and phylogenetic likelihoods.

Particle MCMC Procedure
The aim of the statistical inference framework presented here is to obtain the Bayesian posterior distribution PðhjDÞ / PðDjhÞPðhÞ, where h ¼ ðh 1 ; . . . ; h n h Þ is a vector of n h parameters with parameter space H ¼ ðH 1 ; . . . ; H n h Þ. The prior probability PðhÞ is updated with the likelihood of parameters given the data D. For compartmental transmission models, it is not usually possible to analytically solve the likelihood function. To solve this problem, particle filtering has been implemented within an Markov chain Monte Carlo (MCMC) framework to estimate the likelihood by integrating over stochastic epidemic trajectories (Andrieu et al. 2010). A stochastic model simulation generates an epidemic trajectory X 0:n T from time step 0 to n T describing the temporal changes in incidence, prevalence, and reproductive number. Initial model conditions are given by X 0 . Data comprise incidence time series and phylogenetic data D 1:T ¼ fEpi 1:n T ; Phy 1:n T g. Epi t refers to the total number of reported cases during the time interval [(tÀ1) Dt; t Dt].
The overall marginal likelihood is calculated sequentially for each discrete time step indexed by t ¼ f1; :; n T g (eq. 4; fig. 5).
Pseudocode of the inference procedure is provided below. We assume that both epidemiological and phylogenetic data are used. FOR iteration i in 1 to MCMC iterations 3. Propose new parameter values h Ã :¼ qðh Ã jhÞ where q is the proposal distribution. 4. Calculate the marginal likelihood L Ã ¼ Pðh Ã jDÞ using the particle filtering algorithm below. The particle filtering algorithm used to calculate the marginal likelihood is given below. J is the number of particles, where each particle is associated with an epidemic trajectory X ðjÞ 0:n T , and a particle weight x ðjÞ . The epidemic trajectory comprises two state variables that vary with time: incidence and pairwise coalescent rate.

END LOOP
For incidence time series, the likelihood calculation uses the probability mass function of the binomial: where q is the probability of a case being reported. If the simulation time step is smaller than the reporting period for incidence data, then we only calculate the likelihood every x number of simulation time steps, such that xDt is equal to or greater than the reporting period.
At the start of an MCMC chain, the initial parameter values were set to their true parameter values in the case of simulated data to reduce convergence time. We simulated an additional 100 data sets from the stochastic SIR model with true parameter values R 0 ¼ 2, k ¼ 1, and q ¼ 1%, but starting the MCMC chains at parameter values far from their true values. Using a heated chain at the beginning of the MCMC (multiplying p a by a factor), we found that the MCMC chain Simulation: J particles sampled from P(X t |X t−1 ,θ) For the poliovirus analysis, the initial parameter values were set to those obtained through by fitting a stochastic SEIR model to incidence time series (Blake et al. 2014).
For the simulated data, up to 500,000 MCMC iterations were carried out, sampling parameter values every 100 iterations. Convergence was determined by calculating the effective sample size after removing the first 50% of samples as burn-in. Samples with an effective sample size <200 were removed from the final result plot.
At each iteration of the MCMC, we used a Gaussian distribution qðh Ã i jh i Þ to propose a new parameter value h Ã i centered around the old parameter value h i , where i ¼ f1; . . . ; n h g and n h is the total number of parameters to be estimated. For k, we estimated its reciprocal 1 k so the proposal distributions were centered around 1 k instead. We did not include a covariance matrix because we proposed a new value for only one parameter at each iteration of MCMC.
The standard deviation r i of proposal distribution qðh Ã i jh i Þ was adjusted to optimize the acceptance probability of parameter h i . During the first 20,000 proposals of a parameter h i , the acceptance probability of the parameter a i was calculated every 200 proposals. If a i < 0:15 or a i > 0:75, the standard deviation of the proposal distribution r i was reduced or increased, respectively. Assuming an optimal acceptance probability a opt ¼ 0:234 (Roberts and Rosenthal 2001), the standard deviation was adjusted using: The time per MCMC iteration depends on the length of the time series data, the number of particles, and the number of random number draws per simulation time step. For a simulated data set with around 130 time steps, it took 0.77 seconds per MCMC iteration on a Linux cluster with 20 cores (Imperial College High Performance Computing Service) using 20,000 particles and both incidence time series and pathogen phylogeny for inference. The number of particles depended on the length of the time series, the number of compartments in the model, and the stochasticity of the model. For example, simulations are more stochastic when k is small so more particles are needed to prevent particle depletion. An example is provided in supplementary figure S5 in the Supplementary Material online to illustrate the issue of particle depletion when simulating outbreaks.

Overview of Simulation Study
We tested the PMCMC inference framework on simulated data first to determine the accuracy of parameter estimates, assess the value of phylogenetic data in epidemiological inference, and to demonstrate the importance of estimating k. For the simulation study, we generated 60 simulated data sets using a stochastic SIR model under various combinations of R 0 , k and reporting probability q (see supplementary figs. S2 and S3, table 3, and Supplementary Information in the Supplementary Material online for more details on the simulations). Each data set comprised a phylogeny and an incidence time series for a sample of infected individuals. The phylogeny is a dated phylogeny representing the genealogy of the sampled individuals. For each data set, we performed three sets of inference: using incidence time series; using phylogenetic data; or using both. The following parameters were concurrently estimated: R 0 , T g , k, T 0 , and q when incidence time series were used during inference. The results were shown in figure 1, supplementary figure S1 in the Supplementary Material online, and table 1.
To assess the consequences of not estimating k, we reestimated all other parameter values except k for 30 of the simulated data sets (those with q ¼ 1% sampling) while fixing k ¼ 1. Again, we conducted statistical inference 3 times using one or both sets of data. The posterior estimates were shown in figure 2.
For one of the simulated data sets (R 0 ¼ 2, k ¼ 0:1), we simulated the evolution of pathogen sequences down the true phylogeny to obtain a sample of pathogen sequences. Using MrBayes (Ronquist et al. 2012) to reconstruct the phylogeny from the pathogen sequences, we obtained a posterior distribution of phylogenies given the simulated pathogen sequences. We then sampled 100 phylogenies from this posterior distribution and re-estimated the parameter values using each sampled phylogeny. The posterior estimates using all phylogenies were shown in figure 3.
For all the parameter estimation above, we set the initial parameter values to be very close to the true parameter values (those used to simulate the data). To test that we can obtain the true parameter values in the absence of prior information, we generated 100 extra simulated data sets using R 0 ¼ 2, k ¼ 1 and q ¼ 1%. For each of these, the initial parameter values were randomly sampled from the prior distributions for the parameters. The coverage, precision, and bias of estimates are presented in supplementary table S2 in the Supplementary Material online.

Incorporation of Phylogenetic Uncertainty
Using a fixed phylogeny to infer parameters would not cause problems if confidence in the branching times was high. However, low diversity among pathogen sequences increases the uncertainties in parameter estimates due to uncertainties in topology and branching times.
For one of the outbreaks simulated using R 0 ¼ 2 and k ¼ 0:1 and sampled with 1% probability, we simulated sequence evolution down the sampled phylogeny using seqgen (Rambaut and Grassly 1997). In addition to the sampled sequences, we also simulated the sequence evolution of an outgroup so that the tree could be rooted. Each sequence was 1,000 nucleotides in length with equal equilibrium frequencies of A, C, T, and G. We used the JC69þC model of substitution (Jukes and Cantor 1969) with a rate of substitution of 0.15 per site per year. This ensured that sufficient evolution would take place during the outbreak.
We used MrBayes (Ronquist et al. 2012) to estimate the phylogeny from the simulated sequences assuming a strict Li et al. . doi:10.1093/molbev/msx195 MBE molecular clock and a JC69þC model. We used the outgroup sequence to root the phylogenies, and the tip sampling dates to estimate the rate of nucleotide substitution. From the resulting posterior distribution, we sampled 100 dated phylogenies. We divided the branch lengths of phylogenies measured in substitutions per site by the estimated molecular clock rates to obtain dated phylogenies. For each dated phylogeny, we re-estimated the epidemiological parameters. An overall posterior distribution was obtained by concatenating samples of parameter values obtained for each phylogeny.

Assessing the Coverage, Bias, and Precision of Estimates
The coverage was determined by the percentage of simulations for which the true parameter value was within the 95% HPD interval of estimates. Bias was the distance between the median parameter estimate and the true parameter value. Finally, the precision was determined by the Root Mean Squared Deviation (RMSD) using the formula ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 n P n i¼1 ðh i ÀĥÞ 2 Þ q , where h i for i ¼ f1; :; ng were the n values of parameter h sampled from the posterior distribution of the parameter andĥ was the true parameter value.

Poliovirus Analysis
Data used for poliovirus outbreak analysis were collected during the 2010 outbreak of WPV1 in Tajikistan, which resulted in 518 confirmed and polio-compatible cases (CDC 2010). Poliovirus genomes in stool samples collected from patients were sequenced in the 960-nucleotide VP1 region (Yakovenko et al. 2014). A total of 116 sequences were obtained from the stool samples in Tajikistan (GenBank: KC880365-KC880521). Each sequence was associated with the date of collection.
The posterior distribution of phylogenies was estimated using MrBayes (Ronquist et al. 2012), assuming a K80þC model of substitution (Kimura 1980) and a strict molecular clock. A uniform prior was placed on the branch lengths to avoid specifying a population model when inferring the phylogeny, given that the phylogeny would then be used to infer population dynamics. The K80þC model was selected by as it returned the lowest Bayesian Information Criterion score in jModelTest2 (Guindon and Gascuel 2003;Darriba et al. 2012). Phylogenies were rooted using an outgroup sequence sampled in India in 2009 (GenBank: KC800662). The tip dates were fixed to the date of sampling that is, the date of first stool collection. This was usually within 48 h of a patient arriving with symptoms of acute flaccid paralysis (AFP). As with the simulated data, dated phylogenies were obtained by dividing the branch lengths of reconstructed phylogenies by the molecular clock rate. We sampled 100 dated phylogenies from the MrBayes posterior and estimated parameter values based on each phylogeny.
A time series of daily reported cases was constructed from the line list and formed the epidemiological data. This data set was divided into three age groups: 0-5, 6-14, and 15þ years, which corresponded to the target age groups of supplementary immunization activities (CDC 2010).
We fit the SEIR model used in Blake et al. (2014) to the polio data. The model is divided into three age groups indexed by i. S t;i ; E t;i ; I t;i and R t;i are the number of susceptible, exposed, infectious and recovered individuals in age group i at time t. Details of the model can be found in the Supplementary Material online. Unlike the usual SEIR model, all infections occurred at the end of an individual's infectious Initial numbers of infected I 0;1 ; I 0;2 ; I 0;3 1, 0, 0 Mean duration of latency T L ¼ 1 NOTE.-Values of fixed parameters are given in the column "Value." For parameters that are estimated, the prior distribution on the parameter is given in the "Prior" column. The population was divided into three age groups: 0-5, 6-14 and 15þ years. The initial numbers of susceptibles were fixed to the maximum likelihood estimates used in Blake et al (2014). Vaccinations took place on the following dates: 06 May, 20 May, 03 Jun, 17 Jun and 17 Jun 2010. On these dates, individuals were moved from the susceptible to the recovered compartment with probability t. Gamma distributions are parameterized by the shape and scale parameters. *The reproductive numbers R c and R a were calculated from the estimated transmission rate amongst young children b, the relative transmission rate between all other groups b p , the duration of infectiousness, and numbers of susceptibles.
Quantifying Transmission Heterogeneity . doi:10.1093/molbev/msx195 MBE period. Thus the model used in this paper can be considered a SIR model with gamma-distributed generation time. We fixed the initial susceptible population sizes to those used in Blake et al. (2014), and also placed a strong prior on the duration of infectiousness based on likelihood profile obtained in Blake et al. (2014). Fixed parameter values and prior distributions on estimated parameters are outlined in table 4.
We used 10,000 particles and up to 150,000 MCMC iterations sampling every 20 iterations. The Markov chains were terminated earlier than 150,000 iterations if estimates of the marginal posterior density had an ESS of at least 100.

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.