Stochastic Variational Inference of Mixture Models in Phylogenetics

The pattern of molecular evolution varies among gene sites and genes in a genome. By taking into account the complex heterogeneity of evolutionary processes among sites in a genome, Bayesian inﬁnite mixture models of genomic evolution enable robust phylogenetic inference. With large modern data sets, however, the computational burden of Markov chain Monte Carlo sampling techniques becomes prohibitive. Here, we have developed a variational Bayesian procedure to speed up the widely used PhyloBayes MPI program, which deals with the heterogeneity of amino acid propensity. Rather than sampling from the posterior distribution, the procedure approximates the (unknown) posterior distribution using a manageable distribution called the variational distribution. The parameters in the variational distribution are estimated by minimizing Kullback-Leibler divergence. To examine performance, we analyzed three large data sets consisting of mitochondrial, plastid-encoded, and nuclear proteins. Our variational method accurately approximated the Bayesian phylogenetic tree, mixture proportions, and the amino acid propensity of each component of the mixture while using orders of magnitude less computational time.


Introduction
Understanding the evolutionary variation of phenotypic characters and testing hypotheses about the underlying mechanism are some of the main concerns of evolutionary biology. Because this variation needs be interpreted as an evolutionary history, accurately inferring the phylogenetic tree is important. Otherwise, the uncertainty of phylogenetic inference must be taken into account to obtain an unbiased picture of evolutionary variation.
The increasing amount of available genomic data enables reliable inference of phylogenetic trees. Because molecular evolution is largely driven by nearly neutral or slightly deleterious mutations Ohta (1973), this process is less prone to convergent evolution compared with the evolution of phenotypic traits. The pattern of molecular evolution is statistically formulated by Markov processes. The pattern and rate of molecular evolution is complex, however, depending on various factors affecting mutation rates and functional constraints. To model protein evolution, Thorne, Goldman, and Jones (1996) introduced the concept of hidden states of secondary structure to describe sites of heterogeneity Goldman et al. (1996); Thorne et al. (1996); Jones et al. (1996). Koshi and Goldstein (1998) developed a model of physico-chemical properties of amino acids, while Halpern and Bruno (1998) introduced a more advanced model with position-specific amino acid frequencies.
Equilibrium amino acid frequencies, which reflect structural and functional constraints, vary among sites within and among proteins. Inter-species comparative genomics approaches can analyze a huge number of alignment columns, but the number of taxa is often insufficient to estimate individual position-specific amino acid frequencies. To achieve a balance between variance and bias, Lartillot and Philippe (2004) proposed a Bayesian non-parametric approach based on a countable infinite mixture model, referred to as the CAT model. This model specifies K of distinct processes (or classes), each characterized by a particular set of equilibrium frequencies, and sites are distributed according to a mixture of these K distinct processes. By proposing a truncated stick-breaking representation of the Dirichlet process prior on the space of equilibrium frequencies (Ferguson 1973;Green and Richardson 2001;Ishwaran and James 2001), the total number of classes can be treated as free variables of the model. A hybrid framework between Gibbs-sampling and Metropolis-Hastings algorithm have been developed to estimate all parameters of the model Papaspiliopoulos and Roberts (2008).
Existing approaches cannot take full advantage of the CAT model (Lartillot and Philippe 2004;Lartillot 2006), because the computational burden is prohibitive for inference based on large data sets. Even well-designed sampling schemes need to generate a large number of posterior samples through the entire data set to resolve convergence, and their convergence can be difficult to diagnose. To provide faster estimation, Lartillot et al. (2013) developed a message passing interface (MPI) for parallelization of the PhyloBayes MPI program. By implementing Markov chain Monte Carlo (MCMC) samplers in a parallel environment, PhyloBayes MPI allows for faster phylogenetic reconstruction under complex mixture models.
Here, we propose an alternative approach, a variational inference method (Jordan et al. 1999;Bishop 2006;Blei et al. 2006;Hoffman et al. 2013). The basic idea of variational inference is the formulation of the estimation of marginal or conditional probabilities as an optimization problem rather than sampling-based inference. Variational methods, originally used in statistical physics to approximate intractable integrals, have been successfully used in wide variety of applications related to complex networks (Gopalan and Blei 2013) and population genetics (Gopalan et al. 2016;Raj et al. 2014). In this article, we demonstrate that our algorithms are considerably faster than PhyloBayes MPI while achieving comparable accuracies.

New Approaches
The CAT model formulates the substitutional heterogeneity across sites of protein sequences as a mixture of different equilibrium amino acid frequencies, called profiles. By introducing a Dirichlet process prior on these profiles, the number of categories, the profile of each category, and the resultant phylogenetic tree are estimated from the data in a Bayesian framework. The standard Markov chain Monte Carlo (MCMC) approach facilitates parameter estimation of this parameter-rich model and enables robust inference of phylogenetic trees while allowing for the complexity of protein evolution.
The rapid growth of genomic databases theoretically enables accurate classification of amino acid sites in protein sequences, but the Monte Carlo integration becomes computationally more challenging. To allow the CAT model to extract the maximum amount of relevant information from the data, we have developed a variational Bayesian procedure. The core of the variational framework is a mean-field approximation of the posterior distribution. We approximate the posterior distribution with a mean field representation of the variational distribution, which is much easier to work with computationally. In this approximation, the parameters and hidden variables are assumed to be independent of one another. The parameters of the variational distribution are obtained by minimizing the Kullback-Leibler (KL) divergence between the true conditional distributions of the hidden variables given the observations and their variational distributions. Inference becomes a single optimization problem that gives us approximate analytical forms for the posterior distributions over unknown variables of the CAT model as well as an approximate estimate of the intractable marginal likelihood. To deal with the uncertainty of tree topologies, we have preserved the Gibbs sampling algorithm of tree topologies (Lartillot et al. 2013). 3 Results

Runtime Performance
To compare the performance of our version of variational inference with that of the MCMC algorithm of PhyloBayes MPI, we estimated the CAT model with both algorithms using real data sets. This portion of the study was carried out using three real data sets, the largest consisting of 38,330 amino acid positions from 66 species. The goals of this data analysis were to demonstrate the numerical feasibility of our implementations and to ascertain the accuracy of our variational inference approach. In our comparisons, all algorithms were timed under equivalent computational conditions. Because of the intensive nature of the estimations, further computational experiments will be required to test the performance of variational inference on much more massive data sets. First, we explored whether our new approximation approach could significantly reduce the computational burden required to estimate all parameters of the CAT model. We focused our analysis on the three real data sets described in detailed Data Sets5.4. Table 1 illustrates the computational time required for estimation of all parameters in the CAT-Poisson model when optimized using variational inference compared with sampling under the MCMC algorithm across three data sets. These data sets contained drastically different numbers of taxa and sites. For example, the number of taxa and sites in data set C were approximately three times larger than those in data set B. The time complexity of each of the above algorithms was found to increase regularly with the number of genes, species and total aligned amino acid positions. Run times were significantly reduced in the variational inference framework compared with those in the MCMC approach. While our procedure uses the variational inference procedure to estimate parameters of the evolutionary process, we note that we have retained the algorithm for Gibbs sampling of tree topologies. If this partial MCMC algorithm can also be replaced by some other optimization, the computation burden will be greatly reduced.

Accuracy of Estimated Topologies, Tree Lengths, and Profiles
The tree topology and branch lengths estimated by variational inference were almost the same as those obtained by the MCMC algorithm ( Figure 1).
strategy guarantees that the samplers leave the posterior distribution invariant. Our approach, variational inference, proposes variational distributions for allocation variables and weights of the mixture and profiles. The choice among alternative allocations of sites to categories is driven by updating parameters of these variational distributions and computing the expected values of these variables under variational distribution. Table 2 compares some major categories estimated by MCMC and variational inference. The size of each category was approximated by the number of sites assigned to that class. The number of distinct categories was estimated for data set A representing 6,622 amino acid positions. As can be seen in the table, variational inference accurately approximated posterior mean sizes of these categories, their profiles were accurately estimated as well (Figure 2).
Taken together, these results demonstrate that the estimation time required by the variational inference framework compares favorably with that used by sampling algorithms such as MCMC, while a sufficient level of accuracy under the CAT model is still guaranteed. 4 author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/358747 doi: bioRxiv preprint Figure 2: Equilibrium frequency profiles of categories determined by MCMC and variational inference algorithms based on a mitochondrial data set (13 taxa and 6,622 amino acid positions (Rodríguez-Ezpeleta et al. 2006)). In the case of the first few categories, the two algorithms performed similarly. A detailed comparison of the remaining categories between MCMC and variational inference can be found in Table 2.

Discussion
We have developed a new framework for estimating all parameters of the CAT model, namely, stochastic variational inference, that can considerably improve runtime performance as well as significantly reduce the computational burden. In contrast to existing approaches designed for the same purpose that rely on simulation framework, such as Gibbs-sampling and Metropolis-Hastings algorithms (Lartillot and Philippe 2004;Lartillot 2006;Lartillot et al. 2013), stochastic variational inference recasts the problem of inference as an optimization problem, thus allowing us to design powerful tools for convex optimization. In this way, our approach proposes a feasible family of variational distributions and then selects the family member closest to the true intractable posterior distribution of interest by optimizing Kullback-Leibler (KL) divergence.
We have demonstrated through analysis of actual data sets that our method accurately approximates the posterior distribution of the CAT model with improved speed. This substantial runtime enhancement with no loss of accuracy allows our method to be applied to the large data sets that are steadily becoming the norm in phylogenetic and biological evolutionary studies. Finally, our results were obtained on a modest computing platform. The implementation of a variational inference version of PhyloBayes MPI to exploit advanced computing architectures holds the promise of analyzing even larger data sets than the examples in our paper.
Bayesian models of sequence evolution allow substitutional heterogeneity across protein sequence sites to be taken into account. In particular, the CAT model treats the number of substitutional categories as a free parameter and is able to uncover a level of heterogeneity 5 author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/358747 doi: bioRxiv preprint much higher than that assumed by other mixture models. Under the variational inference approach, all of these special features of the CAT model are guaranteed. Given the increasing size of studied data sets, ensuring statistical algorithms scale to a large number of species with massive numbers of nucleotide positions is critical. We have shown that such analyses are time consuming when undertaken with MCMC algorithms, which perform a large number of multiple simulated iterations over the entire data set. With their more efficient optimization framework, stochastic variational inference algorithms overcome this limitation without compromising all of the principles and statistical assumptions behind the model. For improved estimations of site-heterogeneous Bayesian mixture models with the massive data set, we recommend implementation of a variational inference version of PhyloBayes MPI.

CAT Model
We use an infinite mixture model that describes site heterogeneity with respect to the substitution process. This model is similar to one proposed by Lartillot and Philippe (2004), but, instead of sampling-based inference, we have developed a new approach that allows for efficient inference of ancestral sequences. Our model, which does not assume that all sites of a protein evolve under the same substitution process, is characterized by a 20x20 substitution matrix. In addition, the model does not assume a fixed number of distinct substitution processes (or classes) and respective amino-acid profiles; instead, these are treated as free variables of the model. Poisson (or F81) Felsenstein (1981) Markov processes are considered to apply to all substitution processes along the branches of a tree (Lartillot and Philippe 2004;Lartillot 2006). Each Markov process is characterized by a rate matrix Q = [Q ab ], which can be expressed in terms of a vector of stationary probabilities, or equilibrium frequencies π a , 1 ≤ a ≤ 20 such that 20 a=1 π a = 1 and a set of relative rates, or exchangeability parameters, (ρ ab ) , 1 ≤ a, b ≤ 20. We determine the size of the segments adaptively as described below. This approach allows us to work in the framework of a Bayesian mixture model with parameters representing the mixture of distinct classes, the rates at each site. and branch lengths.
Given an amino-acid database including N aligned positions (columns) and P taxa, we label the data matrix D ip as simply the possible states of the process operating at site i for i = 1, . . . , N at the leaf indexed by p (1 < p < P ). We consider l j (1 < j < 2P − 3); r i (1 ≤ i ≤ N ) to be random variables that denote the branch j and the relative rate of substitution at each site i. Formally, the CAT-Poisson model assumes that (i) a gamma distribution of shape 1 and scale β > 0 is the prior distribution of branch lengths, (ii) a gamma distribution of shape α and scale α is the prior distribution of rates, and (iii) the prior distribution on profile π is a flat Dirichlet distribution. Furthermore, Lartillot et al. (2013) has developed a Dirichlet process mixture model formulated in terms of a stick-breaking construction over the equilibrium frequency profile to generate an infinite number of mixtures of Poisson processes for describing sites, with each mixture characterized by its own substitution matrix Q k , k = 1, ..., ∞ and only the stationary probabilities π k a , a ∈ [1, ..., 20], k ∈ [1, ..., ∞] differing. By proposing a new random variable V k which is the unit length of the k th stick, the stick-breaking representation allows the construction of an infinite mixture structure. Moreover, each site i in an amino-acid sequence belongs to a category k that is specified by the allocation variable z i ∈ [1, ..., ∞]. The vector z = (z i ) where i ∈ [1, ..., N ], is called the allocation vector. The allocations z = (z i ) are drawn i.i.d from a multinomial of the infinite vector of mixing proportions, namely, ϕ = (ϕ k ) , k ∈ [1, ..., ∞]. In addition, we use a data augmentation algorithm, that is proposed (Nielsen 2002), to obtain the substitution mapping in the case of Poisson processes. The 6 author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/358747 doi: bioRxiv preprint substitution mapping is described by the formula Ξ ij = n ij , σ h ij h=1,...,n ij −1 , n ij denotes the number of substitutions on branch j and at site i, σ h ij h=1,...,n ij −1 denotes the successive states of the process and random variable w k a is the total number of substitutions to state a at sites which are assigned in cluster k, plus one if a is the state at the root of the tree. This algorithm is used to simulate the mutational history for a single site. The probability distribution of n ij is defined by the Poisson distribution of rate parameter r i l j and σ h ij h=1,...,n ij −1 is drawn from π k a , a ∈ [1, ..., 20], k ∈ [1, ..., ∞]. Given a data set of amino-acid sequences, Markov chain Monte Carlo (MCMC) sampling methods have been proposed to approximate full parameters of this model (Lartillot and Philippe 2004;Lartillot 2006). A parallel computing version has been developed to speed up the estimation process, thus allowing faster inference the phylogenetic reconstruction under a Dirichlet Mixture Process (Lartillot et al. 2013). Basically, however, MCMC methods, even parallel MCMC, solve this problem based on sampling schemes from a Markov chain whose stationary distribution is the posterior of interest and by updating an estimate of the model parameters. When a database becomes too large for memory or iterative computation, these approaches significantly increase the time complexity of inference.

Variational Inference
Variational inference is a class of methods that reformulate the problem of approximating the posterior inference for complex probabilistic models as an optimization problem. The central purpose of the variational inference algorithm is to approximate the true intractable posterior distribution p(Ξ, V, z, π, l, r|D) by finding an element of a tractable family of probability distributions q(Ξ, V, z, π, l, r|Θ), called the variational distribution. These distributions are parameterized by free parameters, called variational parameters Θ. Variational inference fits these parameters to find a distribution close to the true intractable posterior distribution of interest. The distance on probability space for a pair of probability distribution q(Ξ, V, z, π, l, r|Θ) and p (Ξ, V, z, π, l, r|D) is measured with Kullback-Leibler (KL) divergence. As in Jordan et al. (1999), we can decompose the log marginal probability using log p (D) = L [q(Ξ, V, z, π, l, r|Θ)] + KL [q(Ξ, V, z, π, l, r|Θ)|p (Ξ, V, z, π, l, r|D)] (1) where we have defined An approximation to the true posterior distribution can be computed by minimizing KL divergence which is equivalent to maximizing the lower bound L [q(Ξ, V, z, π, l, r)] in equation [2] on the logarithm of the marginal probability of the observation by optimization with respect to the variational distribution q(Ξ, V, z, π, l, r). Using results from the simplest family of distributions as mean-field variational approximations (Blei et al. 2006;Hoffman et al. 2013), each variable in the CAT-Poisson model is independent and governed by its own variational parametric distribution. Moreover, we consider truncated stick-breaking representations, proposed previously by Blei et al. (2006), for only the variational distributions. The truncated level or the largest number of categories K max can be freely chosen. The family of variational distributions in the CAT-Poisson model can be written as follows: are free variational parameters. To guarantee the tractability of computing the expectations of variational distributions, we choose variational distributions from exponential families (Wainwright et al. 2008).
To estimate each variational parameter in the CAT-Poisson model, we use the stochastic variational inference (SVI) algorithm to optimize the lower bound in Equation [2] (Hoffman et al. 2013). The main idea behind stochastic variational algorithm is the use of a stochastic gradient ascent, which is based on the noisy realization of the gradient to find the optimum of an objective function (Robbins and Monro 1951). Based on the subsampling techniques, noisy gradients are computed by SVI. This procedure reduces the computational burden by avoiding the expensive sums in the above lower bound. The SVI algorithm thus significantly accelerates the variational objective analysis of the large database. Applying the previously proposed SVI framework (Hoffman et al. 2013), we can separate the computational cycle into the following steps: 1. Sample amino acid data from the whole set of input data.
2. Estimate how each site is assigned to a category, on the basis of observational data and the current approximation of variational parameters.

Update variational parameters
-Local parameters are assignment variables, and breaking proportions.
-Global parameters are equilibrium frequency profile, branch length, and rate across sites.
The lower bound of the data in terms of the variational parameters is specifically described in the Supplementary Material. Mathematical details of the variational objective function and computational methods of noisy derivatives and updating of variational parameters are also explained in that section.

Parallelization and Tree Topology
To parallelize the algorithm at the single machine level and thus reduce runtimes, we adopted the MPI parallelization of the PhyloBayes MPI program (Lartillot et al. 2013). Specifically, we use one master process for dispatching computational tasks and collecting and summing results, and with multiple slave processes executing the orders and returning all essential information to the master. This parallel strategy helps to equally divide the computational burden among slaves.
In addition, a partial Gibbs sampling algorithm for subtree pruning and regrafting (SPR) is adopted to update the tree topology (Lartillot et al. 2013). In a parallel environment, the task of the master process is to randomly select a subtree for pruning and send this information to all slaves. The task of each slave process is to update the conditional likelihood vectors of each resulting topology and the complete scan of all possible regrafting points. One single log likelihood for each regrafting point is arranged into an array and sent back to the master process. All arrays are collected and summed and lastly the Gibbs-sampling decision rule is finally applied to select regrafting position.

Data Sets
Three real data sets were used for our computational experiments. Data set A was a mitochondrial data set which consisting of 33 proteins, 6,622 amino acid positions from 13 species. Data set B was a plastid data set which composed of 50 plastidencoded proteins, 10,137 amino acid positions from 28 species. A total of 13% and 5% amino acid positions were missing from the mitochondrial and plastid data sets, respectively (Rodríguez-Ezpeleta et al. 2006;Lartillot et al. 2013). Finally, data set C was a more challenging and larger complete set of mitochondrial protein sequences derived from, a large alignment of EST and genome data, which consists of 197 genes, a total of 38,330 amino-acid positions from 66 species and with 30% missing data, is constructed by (Philippe et al. 2011).
C++ code for the variational inference version of the CAT model to perform computational experiments with these data sets is available at https://github.com/tungtokyo1108/. author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/358747 doi: bioRxiv preprint author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/358747 doi: bioRxiv preprint