Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10

Abstract The Bayesian Evolutionary Analysis by Sampling Trees (BEAST) software package has become a primary tool for Bayesian phylogenetic and phylodynamic inference from genetic sequence data. BEAST unifies molecular phylogenetic reconstruction with complex discrete and continuous trait evolution, divergence-time dating, and coalescent demographic models in an efficient statistical inference engine using Markov chain Monte Carlo integration. A convenient, cross-platform, graphical user interface allows the flexible construction of complex evolutionary analyses.


Introduction
First released over 14 years ago, the Bayesian Evolutionary Analysis by Sampling Trees (BEAST) software package has become firmly established in a broad diversity of biological fields from phylogenetics and paleontology, population dynamics, ancient DNA, and the phylodynamics and molecular epidemiology of infectious disease (Drummond et al. 2012). BEAST's specific focus on time-scaled trees, and the evolutionary analyses dependent on them, has given it a unique place in the toolbox of molecular evolution and phylogenetic researchers. Since inception, a strong motivation for BEAST development has been the rapid growth of pathogen genome sequencing as part of public health responses to infectious diseases (Grenfell et al. 2004). In particular, fast evolving viruses can now be tracked in near real-time (see, e.g. Quick et al. 2016) to understand their epidemiology and evolutionary dynamics.
In BEAST version 1.10, we have introduced a series of advances with a particular focus on delivering accurate and informative insights for infectious disease research through the integration of diverse data sources, including phenotypic and epidemiological information, with molecular evolutionary models. These advances fall into three broad themes-the integration of diverse sources of extrinsic information as covariates of evolutionary processes, the increased flexibility and modularization of the model design process with robust and accurate model testing methods, and substantial improvements on the speed and efficiency of the statistical inference.

Data integration
Many traits in phylogenetics are represented as or partitioned into a finite number of discrete values, with geographical location standing out as a popular example. Because BEAST is dedicated to sampling time-scaled phylogenies, new developments of discrete character mapping enable the reconstruction of timed viral dispersal patterns while accommodating phylogenetic uncertainty. By extending the discrete diffusion models to incorporate empirical data as covariates or predictors of transition rates, BEAST can simultaneously test and quantify a range of potential predictive variables of the diffusion process . Further, realizations of the trait transition process can also be efficiently produced, to pinpoint the nature and timing of changes in evolutionary history beyond ancestral node state reconstruction (termed Markov jumps), or to infer the time spent in a particular state (Markov rewards) (Minin and Suchard 2008). For molecular data, fast stochastic mapping approaches are also employed to obtain site-specific d N =d S estimates, integrating over the posterior distribution of phylogenies and ancestral reconstructions to quantify uncertainty on these measures of the selective forces on individual codons .
Multivariate continuous traits are incorporated using phylogenetic Brownian diffusion processes, modelling the shared ancestral dependence across taxa and the correlations between these variables. Such continuous models have most frequently been applied to diffusion on a geographical landscape with the traits representing coordinates and the phylogeny reconstructing the epidemiological process within the host population (Lemey et al. 2010). The landscapes can also represent other spaces, and integration of antibody binding assay data have extended 'antigenic cartography' (Smith et al. 2004) approaches to model simultaneous antigenic and genetic evolution and infer the viral trajectories in the immunological space generated by the host population .
Standard Brownian diffusion processes that assume a zeromean displacement along each branch may however be unrealistic for many evolutionary problems (including geographical reconstruction). A recently developed relaxed directional random walk allows the diffusion processes to take on different directional trends in different parts of the phylogeny while preserving model identifiability (Gill et al. 2017) and opens up these processes for a wide range of applications. BEAST 1.10 also extends multivariate phylogenetic diffusion to latent liability model formulations in order to assess correlations between traits of different data types, including (various combinations of) continuous, binary and discrete traits (Cybis et al. 2015), as demonstrated by applications to flower morphology, antibiotic resistance, and viral epitope evolution. To infer correlations between high-dimensional traits computationally efficiently, a novel phylogenetic factor analysis approach assumes that a small unknown number of independent evolutionary factors evolve along the phylogeny and generate clusters of dependent traits at the tips (Tolkoff et al. 2018).
Further extending the data integration approach, BEAST 1.10 includes a flexible framework for incorporating time-varying covariates of the effective population size over time. This uses Gaussian Markov random fields to reconstruct smoothed effective population size trajectories while simultaneously estimating to what extent predictor variables (e.g. fluctuations in climatic factors, host mobility, or vector density) may have driven the dynamics (Gill et al. 2016). Using a similar generalized linear modeling (GLM) approach, classical epidemiological time-series data such as case counts (Gill et al. 2016) can be integrated with pathogen genome sequence data to provide joint inference of important epidemiological parameters.
Finally, recent host-transmission models allow the integration of complete or partial knowledge of a pathogen's transmission history, enabling the simultaneous inference of withinhost population dynamics, viral evolutionary processes, and transmission times and bottlenecks (Vrancken et al. 2014). Likewise, other priors enable the reconstruction of transmission trees of infectious disease epidemics and outbreaks, while accommodating phylogenetic uncertainty and employ a newly designed set of phylogenetic tree proposals that respect node partitions (Hall et al. 2015).

Flexible model design
BEAST's companion graphical user interface program, BEAUti, allows the user to import data, select models, choose prior distributions, and specify the settings for both Bayesian inference and marginal likelihood estimation. Our efforts on BEAUti 1.10 have focused on allowing the user to easily link or unlink substitution, clock and tree models across multiple partitions as well as linking individual parameters to provide considerable adaptability in model design. Additionally, BEAUti can also group various parameters in a hierarchical phylogenetic model prior (Suchard et al. 2003), which allows parameters to take different values but be linked by a common distribution, the parameters of which can then be inferred. For example, flexible codon model parameterizations, using hierarchical phylogenetic models ) and incorporating a range of potential predictive variables for substitution behaviour , provide insight into the tempo and mode of pathogen evolution.
Marginal likelihood estimation to compare models using Bayes factors has become common practice in Bayesian phylogenetic inference. BEAST 1.10 now features marginal likelihood estimation ), using path sampling (Gelman and Meng 1998;Lartillot and Philippe 2006) and stepping-stone sampling (Xie et al. 2011), as well as the recently developed generalized stepping-stone sampling Baele et al. 2016a) that offers increased accuracy and improved numerical stability by employing the concept of 'working distributions', i.e. distributions with known normalizing constants and parameterized using samples from the posterior distribution.

Performance and efficiency
Increasing model complexity and sequence availability in modern-day analyses have stretched the computational demands of Bayesian phylogenetic inference. To improve efficiency for large-scale sequence data, BEAST 1.10 uses the BEAGLE library (Ayres et al. 2012) that provides access to massive parallelization on a range of computing architectures. In particular, the combination of BEAST 1.10 with BEAGLE 3.0 (Ayres et al., under review) allows multiple data partitions to be parallelized across a single high-performance device (i.e. a GPGPU graphics board) allowing for the utilization of the full capacity of these devices, reducing the computational overheads. As the complexity of phylogenetic model designs increase, concomitant with the surge in scale of genomic data, updating only a parameter associated with a single data partition limits the occupation of the massively multicore devices. To address this we have developed an adaptive multivariate transition kernel that simultaneously updates parameters across all the partitioned data, making more efficient use of available hardware . Through a combination of these two advances, BEAST 1.10 can yield a sizeable increase in effectively independent posterior samples per unit-time over previous software versions. For the example data described below, we see a 5-to 25-fold improvement depending on the model parameter, using an NVIDIA Titan V. Figure 1 presents a spatiotemporal reconstruction of Ebola virus evolution and spread during the 2013-2016 West African epidemic, highlighting several aspects of phylodynamic data integration. The estimates are based on a large data set of 1,610 genomes that represent over 5 per cent of the known cases (Dudas et al. 2017). Administrative regions (n ¼ 56) are included as discrete sampling locations to estimate viral dispersal through time while testing the contribution of a set of potential covariates to the pattern of spread using a GLM parameterization of phylogeographic diffusion ). This indicates, for example, the importance of population sizes and geographic distance to explain viral dispersal intensities. trait data with a GLM fitted to the discrete trait model in order to establish potential predictors of viral transition between locations. Plotted are a snapshot of geographic spread using SpreaD3 , the maximum clade credibility tree, the posterior estimates of the GLM coefficients for seven possible predictors for Ebola virus spread (Bayes Factor support values of 3, 20, and 150 are indicated by vertical lines) and the effective population size through time, estimated by incorporating case counts.

Relationship to BEAST2 and other software
Distinct from BEAST 1.10 described here, BEAST2 is an independent project (Bouckaert et al. 2014) intended as a platform that more readily facilitates the development of packages of models and analyses by other researchers. Although both projects share many of the same models and the underlying inference framework, BEAST has increasingly focused on the analysis of rapidly evolving pathogens and their evolution and epidemiology. We affirm that BEAST will continue to be developed in parallel to the BEAST2. While these projects share a recent common origin, each now aims to foster complementary research domains.
A range of other software focusing on phylodynamic analyses of fast-evolving pathogens has been described since the last version of BEAST was published. Of particular note are LSD , TreeDater (Volz and Frost 2017), and TreeTime (Sagulenko et al. 2018). These programs use least-squares algorithms (LSD) or maximum likelihood inference (TreeDater, TreeTime) and provide rapid analysis on large data sets for a subset of the models that BEAST provides. However, the former program implements very limited phylodynamic models and the latter two programs require a phylogenetic tree, inferred using other software, as input data, conditioning parameter estimates on this single tree.

Availability
BEAST 1.10 is open source under the GNU lesser general public license and available at https://beast-dev.github.io/beast-mcmc for cross-platform compiled programs and https://github.com/ beast-dev/beast-mcmc for software development and source code. It requires Java version 1.6 or greater. Documentation, tutorials, and help are available at http://beast.community and many users actively discuss BEAST usage and development in the 'beast-users' GoogleGroup discussion group (http://groups. google.com/group/beast-users). We also host an expanding suite of R tools-designed for posterior analyses using BEAST (https://github.com/beast-dev/RBeast).