- Split View
-
Views
-
Cite
Cite
Nicola F Müller, David Rasmussen, Tanja Stadler, MASCOT: parameter and state inference under the marginal structured coalescent approximation, Bioinformatics, Volume 34, Issue 22, November 2018, Pages 3843–3848, https://doi.org/10.1093/bioinformatics/bty406
- Share Icon Share
Abstract
The structured coalescent is widely applied to study demography within and migration between sub-populations from genetic sequence data. Current methods are either exact but too computationally inefficient to analyse large datasets with many sub-populations, or make strong approximations leading to severe biases in inference. We recently introduced an approximation based on weaker assumptions to the structured coalescent enabling the analysis of larger datasets with many different states. We showed that our approximation provides unbiased migration rate and population size estimates across a wide parameter range.
We extend this approach by providing a new algorithm to calculate the probability of the state of internal nodes that includes the information from the full phylogenetic tree. We show that this algorithm is able to increase the probability attributed to the true sub-population of a node. Furthermore we use improved integration techniques, such that our method is now able to analyse larger datasets, including a H3N2 dataset with 433 sequences sampled from five different locations.
The presented methods are part of the BEAST2 package MASCOT, the Marginal Approximation of the Structured COalescenT. This package can be downloaded via the BEAUti package manager. The source code is available at https://github.com/nicfel/Mascot.git.
Supplementary data are available at Bioinformatics online.
1 Introduction
Phylogenies contain information regarding the history of a population and can be used to quantify demographic parameters. This has been widely done to study the spread of pathogens (Pybus et al., 2001; Russell et al., 2008), the speciation dynamics of extant species or the migration pattern of humans to name but a few. Forwards in time birth-death and backwards in time-coalescent models allow us to elucidate population dynamics from trees by calculating the probability of a phylogeny T given a set of demographic parameters Θ. To do so they classically rely on the assumption of well mixed populations, meaning that the rate at which any two pairs of lineages coalesce is the same. In most empirical applications this assumption of well mixed populations is however violated.
To address this model violation, so-called structured methods have been developed that consider birth-death processes in heterogeneous populations (Stadler and Bonhoeffer, 2013). In the backward in-time coalescent framework, the structured coalescent (Hudson, 1990; Notohara, 1990; Takahata, 1988) describes a coalescent process in sub-populations between which individuals can migrate. Such coalescent methods however typically require the state (or location) of any ancestral lineage in the phylogeny at any time to be inferred (Beerli and Felsenstein, 2001; Ewing et al., 2004; Vaughan et al., 2014). Inferring lineage states is computationally expensive, as it normally requires MCMC-based sampling, and limits the complexity of scenarios that can be analysed. As the number of different states is increased, convergence of the MCMC chains becomes a severe issue for inference under the structured coalescent when the sampling of migration histories is needed (De Maio et al., 2015; Vaughan et al., 2014). This essentially limits the number of different states that can be accounted for to three or four.
We addressed this limitation recently by introducing a new approximation of the structured coalescent that avoids this MCMC sampling of lineage states by integrating over all possible migration histories using a set of ordinary differential equations (Müller et al., 2017). In contrast to previous approximations that treat the movement of one lineage completely independently of all other lineages (De Maio et al., 2015; Volz, 2012), we explicitly include information about the location of other lineages and their probability of coalescing when modelling the movement of a lineage. This avoids the strong biases resulting from this independence assumption (Müller et al., 2017). We showed that this approximation is able to infer coalescent and migration rates well in various scenarios. However, this approach currently lacks the possibility to estimate the ancestral state of any internal nodes except the root.
Here, we introduce a new algorithm (Fig. 1) to calculate the probability of internal nodes being in any state that incorporates information from the entire tree using a forwards/backwards approach (Pearl, 1982). We additionally make improvements of the current BEAST2 (Bouckaert et al., 2014) implementation of Müller et al. (2017) in terms of calculation speed, allowing us to analyse datasets with more states and lineages. We show first on simulated datasets how this new implementation performs in inferring migration rates and effective population sizes in high dimensional parameter space. Next, we show how our new algorithm can dramatically improve ancestral state inference. We then apply our new approach to a geographically distributed samples of human Influenza A/H3N2 virus to demonstrate its applicability to large datasets.
2 Materials and methods
2.1 The approximate structured coalescent
2.2 The probability of a lineage being in a state
Integrating these equations from the present to the root of a phylogeny, allows us to calculate , that is the probability density of a phylogeny T under the MASCO approximations of the structured coalescent.
2.3 The probability of a node being in a state given the whole phylogeny
2.3.1 Backwards calculation of node states conditional on the Sub-trees
For convenience, we now change to vector notation. We define as the vector for the parent lineage p with entries in position a that only includes information from time 0 up to time t. is the vector with entries .
2.3.2 Calculation of transition probabilities
2.3.3 Forwards calculation of node states including all information in the phylogeny
2.4 Integration of the differential equations
2.5 Software
The method above is implemented into our BEAST 2 package MASCOT (Marginal Approximation of the Structured COalsescenT) and the analyses were done using version 1.0.0 and BEAST v2.5.0 (Bouckaert et al., 2014). Simulations were performed using a backwards in time stochastic simulation algorithm of the structured coalescent process using MASTER 5.0.2 (Vaughan and Drummond, 2013) and BEAST 2.4.7. MultiTypeTree analyses were performed using version 6.3.1 (Vaughan et al., 2014) and BEAST 2.4.7. Script generation and post-processing were performed in Matlab R2015b. Plotting was done in R 3.2.3 using ggplot2 (Wickham, 2009) and igraph 1.1.2 (Csardi and Nepusz, 2006). Tree plotting and tree height analyses were done using ape 3.4 (Paradis et al., 2004) and phytools 0.5-10 (Revell, 2012). Effective sample sizes (ESS) for MCMC runs were calculated for the posterior probability after a burn-in of 10% using coda 0.18-1 (Plummer et al., 2006). Parameter and state inference of the simulated data were only used if the posterior had an ESS of at least 100 and discarded elsewise.
2.6 Data availability
The source code for MASCOT is available at https://github.com/nicfel/Mascot.git. All scripts for performing the simulations and analyses presented in this article are available at https://github.com/nicfel/Mascot-Material.git, including the MASCOT xml file of the H3N2 analysis. Output files from these analyses, which are not on the github folder, are available upon request from the authors. A tutorial is available through the Taming the BEAST project (Barido-Sottani et al., 2017) on how to use MASCOT and its BEAUti interface is available at https://github.com/nicfel/Mascot-Tutorial.git.
3 Results
3.1 Inference of migration rates, effective population sizes and internal node states
First, we tested how well effective population sizes and migration rates are inferred using MASCOT. We simulated 1000 trees with MASTER (Vaughan and Drummond, 2013) using randomly sampled effective population sizes from LogNormal Distribution(μ = −0.125, σ = 0.5) and migration rates from an exponential distribution with mean = 0.5. We used 1000 tips and 6 different states. In order to have scenarios of under- and over-sampling of states, we randomly sampled the number of tips in each state. The number of samples per state was randomly drawn to be a value between 20 and 1000 in increments of 20, conditional on the overall number of samples in each state being exactly 1000. We then inferred the effective population size of every state and the migration rates between each state using MASCOT from fixed phylogenies. The results of these simulations are summarized in Figure 2.
Both effective population sizes and migration rates are inferred well. Population size estimates are however much more precise than estimates of migration rates, see Figure 2. This is expected since there are typically many fewer migration events in a phylogeny than coalescent events. Additionally, the number of migration rate parameters estimated (30) is much larger than the number of effective population size parameters (6). The estimates are well correlated with the truth, only at lower migration rates do estimates become worse. This is also to be expected since a low migration rate automatically means less events which will put the estimates closer to the prior (exponential with mean 1). The coverage is 95% for both migration rate estimates and effective population size estimates. We further inferred the state of each internal node with and without the backwards/forwards algorithm. Using the backwards/forwards algorithm reduces the probability mass that is attributed to the wrong node states in this scenario (Fig. 2C).
To estimate how the CPU time varies with the number of lineages and states, we used the same framework but with varying numbers of states and lineages. The CPU time per million samples depends approximately linearly on the number of sampled sequences. With an increasing number of states, the calculation time per million sample increases approximately quadratically.
We next compared the convergence properties of MASCOT with MultiTypeTree version 6.3.1 (Vaughan et al., 2014), which is based on the exact structured coalescent model without approximation, but requires MCMC sampling of lineage states. To do so, we compared the ESS per hour to MultiTypeTree (Vaughan et al., 2014) using fixed trees. MASCOT shows much higher ESS values per hour in each scenario, demonstrating the drastically improved convergence properties originating from not having to sample migration histories. While these estimates can vary based on the parameters under which simulations were performed and based on the MCMC operator setup used, they show the benefits of integrating over migration histories.
3.2 Application to H3N2
We then applied MASCOT to 433 Influenza A/H3N2 sequences sampled between 2000 and 2003 from Australia, Hong Kong, New York, New Zealand and Japan. We ran five independent chains each for 120 Million iterations using an site model with a fixed clock rate of substitutions per site and year. We fixed the clock rate due to a lack of temporal information from the sequences collected for this short amount of time. We then inferred the phylogenetic tree as well as the effective population sizes of every location, the migration rates between them, as well as the additional parameters from the model.
Figure 3 shows the maximum clade credibility tree with the different colours indicating the maximum posterior location estimate of each node. The pie charts indicate the probability of the marked nodes being in any possible location inferred with and without the backwards/forwards algorithm. These probabilities are the average over all the node state probabilities for each tree in the posterior containing that clade. We inferred New York to be a source location mainly for strains in Australia and New Zealand. Strains from Japan were inferred to originate mainly from Hong Kong and New York. The root of the phylogeny was inferred to be most likely in New York. The lack of samples near the root however makes the inference of its location unreliable.
4 Discussion
We provide a new algorithm to calculate the state of any node in a phylogeny under the marginal approximation of the structured coalescent (Müller et al., 2017). This algorithm entirely avoids the sampling of migration histories. Additionally, we improve the calculation time of our previously introduced approximation to allow for the analysis of phylogenies with more samples and more states.
We have shown on simulated data that our approach is able to infer migration rates and effective population sizes reliably even when many different states (6) are present. This is a case where exact methods that sample migration histories are currently not able to reach convergence. Even though MASCOT is an approximation, we reach a coverage of 95% for migration rates and effective population size estimates.
We also showed on simulated data that adding a backwards/forwards approach for the calculation of node states improves the inference of internal nodes. We use the backwards/forwards to calculate the state of every internal nodes in a way that is consistent with the complete phylogeny, which is not given by the backwards step alone. To estimate the probability of a node being in any possible state given a set of parameters we therefore do not need to average over many MCMC samples of migration histories. Whereas for some nodes the difference between with and without backwards/forwards is small, it is especially large for nodes where the difference in where the node is inferred to be compared with the parent node is large. This is due to conflicting information of the state of a node from the backwards and forwards step. This can for example be the result of a sampling event that adds information of where a lineage was further in the past. Future extensions could include an explicit sampling of migration histories or of the number of state changes. To do so, an algorithm similar to that of Minin and Suchard (2007, 2008) could be deployed, but lineage states would need to be sampled in a way that is probabilistically consistent with our equations for the forward line state probabilities. Finally, we applied MASCOT to a globally sampled H3N2 dataset where we inferred the phylogenetic tree and associated parameters. Our approach is able to reach convergence, even when a large number of sequences and different locations is present. The calculation time still causes challenges in the analysis of very large datasets. These could be circumvented by a further approximation of in Equation (1) when many lineages are present. This would allow every lineage to have the same transition probabilities and would therefore reduce the number of ODEs that have to be solved.
MASCOT still requires all migration rates and effective population sizes to be inferred. Especially the number of migration rates () can become problematic relatively fast. Future additions could however reduce the parameter space by for example deploying Bayesian Search Variable Selection (Lemey et al., 2009) or by making use of generalized linear models (Lemey et al., 2014) to describe migration rates as a combination of different covariates and hence only require the parameters of the GLM model to be inferred.
Acknowledgements
We thank Remco Bouckaert and Tim Vaughan for helpful comments on the implementation of MASCOT. We also thank three anonymous reviewers for the helpful comments on the manuscript.
Funding
N.M. and T.S. are funded in part by the Swiss National Science Foundation (SNF; grant number CR32I3_166258). D.R. is funded by the ETH Zürich Postdoctoral Fellowship Program and the Marie Curie Actions for People COFUND Program. T.S. is supported in part by the European Research Council under the Seventh Framework Programme of the European Commission (PhyPD: grant agreement number 335529).
Conflict of Interest: none declared.
References