Stochastic Variational Inference for Bayesian Phylogenetics: A Case of CAT Model

Abstract The pattern of molecular evolution varies among gene sites and genes in a genome. By taking into account the complex heterogeneity of evolutionary processes among sites in a genome, Bayesian infinite mixture models of genomic evolution enable robust phylogenetic inference. With large modern data sets, however, the computational burden of Markov chain Monte Carlo sampling techniques becomes prohibitive. Here, we have developed a variational Bayesian procedure to speed up the widely used PhyloBayes MPI program, which deals with the heterogeneity of amino acid profiles. Rather than sampling from the posterior distribution, the procedure approximates the (unknown) posterior distribution using a manageable distribution called the variational distribution. The parameters in the variational distribution are estimated by minimizing Kullback–Leibler divergence. To examine performance, we analyzed three empirical data sets consisting of mitochondrial, plastid-encoded, and nuclear proteins. Our variational method accurately approximated the Bayesian inference of phylogenetic tree, mixture proportions, and the amino acid propensity of each component of the mixture while using orders of magnitude less computational time.


Introduction
Understanding the evolutionary variation of phenotypic characters and testing hypotheses about the underlying mechanism are some of the main concerns of evolutionary biology. Because this variation needs to be interpreted as an evolutionary history, accurately inferring the phylogenetic tree is important. Otherwise, the uncertainty of phylogenetic inference must be taken into account to obtain an unbiased picture of evolutionary variation.
The increasing amount of available genomic data enables reliable inference of phylogenetic trees. Because molecular evolution is largely driven by nearly neutral or slightly deleterious mutations (Ohta 1973), this process is less prone to convergent evolution than the evolution of phenotypic traits. The pattern of molecular evolution is statistically formulated by Markov processes. The pattern and rate of molecular evolution are complex, however, depending on various factors affecting mutation rates and functional constraints. To model protein evolution, Thorne et al. (1996) introduced the concept of hidden states of secondary structure to describe sites of heterogeneity Jones et al. 1996; Thorne et al. 1996). Koshi and Goldstein (1998) developed a model of the physico-chemical properties of amino acids, while Halpern and Bruno (1998) introduced a more advanced model with position-specific amino acid frequencies.
Equilibrium amino acid frequencies, which reflect structural and functional constraints, vary among sites within and among proteins. Interspecies comparative genomics approaches can analyze a huge number of alignment columns, but the number of taxa is often insufficient to estimate individual position-specific amino acid frequencies. To achieve a balance between variance and bias, Lartillot and Philippe (2004) proposed a Bayesian nonparametric approach based on a countable infinite mixture model, referred to as the CAT model. This model specifies K distinct processes (or classes), each characterized by a particular set of equilibrium frequencies, and sites are distributed according to a mixture of these K distinct processes. By proposing a truncated stickbreaking representation of the Dirichlet process prior on the space of equilibrium frequencies (Ferguson 1973;Green and Richardson 2001;Ishwaran and James 2001), the total number of classes can be treated as free variables of the model. A hybrid framework combining Gibbs-sampling and the Metropolis-Hastings algorithm has been developed to estimate all parameters of the model (Papaspiliopoulos and Roberts 2008).
Existing approaches cannot take full advantage of the CAT model (Lartillot and Philippe 2004;Lartillot 2006), because the computational burden is prohibitive for inference based on large data sets. Even well-designed sampling schemes need to generate a large number of posterior samples through the entire data set to resolve convergence, and their convergence can be difficult to diagnose. To provide faster estimation, Lartillot et al. (2013) developed a message passing interface (MPI) for parallelization of the PhyloBayes MPI program. By implementing Markov chain Monte Carlo (MCMC) samplers in a parallel environment, PhyloBayes MPI allows for faster phylogenetic reconstruction under complex mixture models.

New Approaches
Here, we propose an alternative approach, a variational inference method (Jordan et al. 1999;Bishop 2006;Blei et al. 2006;Hoffman et al. 2013). Variational methods, originally used in statistical physics to approximate intractable integrals, have been successfully used in a wide variety of applications related to complex networks (Gopalan and Blei 2013) and population genetics (Raj et al. 2014;Gopalan et al. 2016). The basic idea of variational inference in the Bayesian framework is to approximate the posterior distribution by a computationally tractable function, called the variational distribution. The variational parameter, which specifies the variational distribution, is estimated by minimizing the Kullback-Leibler (KL) divergence of the posterior distribution to the variational distribution. As a result, the posterior distribution is estimated by numerical optimization without invoking Monte Carlo simulation. To deal with the uncertainty of tree topologies, we preserved the Gibbs sampling algorithm of tree topologies (Lartillot et al. 2013). In this article, we demonstrate that our algorithms are considerably faster than PhyloBayes MPI while achieving comparable accuracy.

Variational Inference of CAT-Poisson Model
In the CAT model, each site category has its own amino acid replacement rate matrix. Instead of dealing with the general time reversible Markov process, in this article, we focus on the most popular CAT-Poisson model. This model takes account of rate heterogeneity among sites, and also allows the preferred amino acids to vary among sites. It assigns the alignment columns to the categories of amino acid profiles, taking account of uncertainty. Given the assignment to the category, the process of molecular evolution follows the amino acid version of the F81 model (Felsenstein 1981).
We denote the sequence data set by D. The CAT model has parameters ðU; NÞ. U consists of branch lengths (l), sitespecific relative rates (r), the amino acid profile (equilibrium frequency, p), the unit length of the stick (V), and the allocation variable (z) of the Dirichlet process prior on these profiles. The parameter N is the substitution mapping parameter. Variational inference approximates the true intractable posterior distribution pðU; NjDÞ by an element of a tractable family of probability distributions qðU; NjHÞ, called the variational distribution. As a variational distribution for the CAT-Poisson model, we adopt Gamma distributions for the branch lengths and the site-specific evolutionary rates, and a Dirichlet distribution for the amino acid profiles (see Materials and Methods for details).
It should be noted that, in the likelihood framework, a maximum likelihood approach minimizes the KL divergence from the true distribution to the model distribution (Kullback and Leibler 1951;Akaike 1974). In contrast, a variational inference minimizes the KL divergence from the model variational distribution to the true posterior distribution. Because of asymmetry of KL divergence, the maximum value of ELBO cannot be used for comparing candidate models of variational distributions. Currently, the standard model checking process is to compare the important aspects of q Ã ðU; NjHÞ with those of MCMC runs by example data at the developmental stage of the program.

An Illustrative Example in Phylogenetics
As an illustrative example, we estimate the posterior distribution of the distance d between a pair of aligned sequences D with the JC69 model (Jukes and Cantor 1969). Out of n sites, the sequences differ at x sites. We assign a gamma prior with a ¼ 1 and b ¼ 1 for the distance d: The likelihood of the JC69 model is given as: Given the prior and the likelihood, the posterior distribution is obtained as: Dang and Kishino . doi:10.1093/molbev/msz020 MBE Because this illustrative model includes only a single free parameter, the denominator can be accurately calculated by numerical integration.
As the variational distribution for the posterior distribution of d, we adopt a gamma distribution: The ELBO is written as follows: Therefore, variational parameters, c and c 0 , are estimated by optimizing the value of the following: Here, Wð:Þ is the digamma function, the first derivative of the log gamma function. The first term and the second term of equation (3) Figure 1 shows the estimated posterior distribution of d for the case of n ¼ 1,000, x ¼ 100, a ¼ 1, b ¼ 1. The distribution with the estimated parametersĉ andĉ 0 approximates the true posterior distribution accurately. Table 1 compares the computational time of variational inference of the CAT-Poisson model with that of MCMC. Three empirical data sets were analyzed (see Materials and Methods). Here, the number of iterations was set to 30,000 for MCMC sampling from the posterior distribution (default value of phyloBayes). As for variational inference, we could not implement a stopping rule based on convergence criteria because we partially preserved MCMC for tree topology. The trace of ELBO value implied sufficient convergence with far less than 1,000 iterations for data set A ( fig. 2a). However, we note that the value of ELBO expresses the goodness of fit of the variational parameters, but does not measure the consistency of the topology. Figure 2b-d shows that the posterior consensus tree by variational inference mostly reached convergence at 1,000. Tentatively, we set the same number of iterations as an MCMC case for comparing CPU times. We confirmed that the result of variational inference with 5,000 iterations was unchanged for data set A (data not shown).

Runtime Performance
The time complexity of each of the above algorithms was found to increase regularly with the numbers of genes, species, and total aligned amino acid positions. Even with the same number of iterations, run times were significantly reduced in the variational inference framework compared with those in the MCMC approach. This may be partly because variational inference does not include the step of generating random numbers (except for the one for sampling topologies) and the calculation of acceptance probabilities. Since our stopping rule was not thoughtfully designed but rather ad hoc, we need to perform any interpretations with caution. Once we can replace the step of Gibbs sampling of topology with some Stochastic Variational Inference for Bayesian Phylogenetics . doi:10.1093/molbev/msz020 MBE deterministic procedure of variational inference, the computational burden will be markedly reduced.

Posterior Independence between the Phylogenetic Parameters
Our variational distribution for the CAT model assumed independence among the branch lengths, the site-specific relative rates, and the amino acid profiles. To examine its validity, we checked the MCMC sample of the total branch length and the entropy of the amino acid profile of the largest cluster as an example. The scatter plot supports independence between these two characters (r ¼ À0:024, fig. 3). As a result, the variational inference approximated the distribution of the MCMC sample accurately ( fig. 4a and b).

Accuracy of Estimated Profiles
By introducing a Dirichlet process prior, the CAT model provides a posterior distribution of K, the number of separate categories, and the size of each category. The PhyloBayes MPI program, which is based on a hybrid strategy combining Gibbs sampling and Metropolis-Hastings algorithm, first proposes allocation variables and amino-acid profiles. The site to category allocation are sampled with the posterior weights of the mixture and profiles associated with each component of the mixture. Metropolis-Hastings algorithms are then used to sample the classes for sites. In contrast, our variational inference estimates the posterior distributions of the allocation variable for each site, weight, and amino acid profile of the categories. Table 2 compares some major categories estimated by MCMC and variational inference. The size of each category was approximated by the number of sites assigned to that class. The number of distinct categories was estimated for data set A representing 6,622 amino acid positions. As can be seen in the table, variational inference accurately approximated the posterior means of these category sizes. The posterior distributions of the number of site categories and the amino acid profiles are also well approximated by the variational inference (fig. 5).
Taken together, these results demonstrate that the estimation time required by the variational inference framework compares favorably with that used by sampling algorithms such as MCMC, while a sufficient level of accuracy under the CAT model is still guaranteed.

Discussion
The variational distribution for the CAT model approximated the posterior distribution accurately. This is largely because the branch lengths, site-specific evolutionary rates, and amino acid profiles were mostly independent in the posterior distribution. When the parameters of a model are mutually dependent in the joint posterior distribution, the variational inference may underestimate the posterior variance, even though the estimated posterior means may be unbiased. It is recommended to check the posterior correlations carefully at the stage of developing new programs, and to transform the parameters when the correlation is observed.
One of the most important steps in the variational framework is the calculations of expectations for the latent variables in the general ELBO. Specifically, the variational inference can achieve the best performances for the conjugate models. Because the likelihood of a CAT model is composed of the distributions of exponential family, most of the expectations could be obtained in the closed form.
The approximations of the posterior distributions of the transition probabilities in the Markov models of nucleotide substitution can still be a challenge for the Bayesian computation. There are some proposals that can deal with intractable integrations and provide a convenient way to obtain an analytically tractable solution, such as the first-order Taylor expansion (Ma and Leijon 2011;Ma et al. 2014) and the Delta method (Braun and McAuliffe 2010;Wang and Blei 2013), however, the mathematical expansions are still a challenge for the Bayesian phylogenetic inference. In many cases, phylogenetic inference includes many parameters, some of which are not of major concern. It may thus be worthwhile considering a practical approach to estimate these nuisance parameters by maximum likelihood and performing a Bayesian inference for the parameters of major interest.

CAT-Poisson Model
We briefly review the CAT-Poisson model that describes site heterogeneity of the substitution process (Lartillot and Philippe 2004). This model allows rate variation among sites and also allows variation of the rate matrix among sites. Here, we explain the basic default model, called the CAT-Poisson model. Given an amino acid sequence data set consisting of N alignment columns and P taxa, we denote the observed amino acid at site i for taxon p by D ip (i ¼ 1; . . . ; N; 1 p P). The CAT-Poisson model regards the branch lengths l j ð1 < j < 2P À 3Þ; the sitespecific relative rates r i ð1 i NÞ as random variables. Each site has its specific amino acid profile, or equilibrium frequencies, p a ; 1 a 20, such that P 20 a¼1 p a ¼ 1. The substitution process at each site follows the F81-type model (Felsenstein 1981). In other words, the probability of amino acid replacement by amino-acid a is proportional to p a . Sites are clustered into the categories of amino acid profiles. The CAT model describes the probabilistic allocation of a site to the categories by a mixture model. Given the allocation, the amino acid profile of a site has a prior of uniform Dirichlet Dang and Kishino . doi:10.1093/molbev/msz020 MBE distribution. A Dirichlet process treats the number of categories as an unknown variable. The stick-breaking representation considers two infinite collections of independent random variables; the unit length of sticks that correspond to the categories, V k , and the amino acid profiles of the categories, p k a (1 k < 1). They follow: where u k is the mixing proportions of an infinite number of successively broken sticks and stands for the total mass parameter of the Dirichlet process (Ferguson 1973;Green and Richardson 2001;Ishwaran and James 2001). Lartillot et al. (2013) introduced the allocation variable of a site i to a category, z i 2 ½1; . . .; 1 (1 i N). The allocation variables are drawn i.i.d from a multinomial of the infinite vector Stochastic Variational Inference for Bayesian Phylogenetics . doi:10.1093/molbev/msz020 MBE of mixing proportions. Given that the site i belongs to the category k, the likelihood of the data at this site, pðD i jp k Þ, is described by the transition probabilities along branches (Felsenstein 1981). p k is the amino acid profile of the kth category. Lartillot et al. (2013) applied a data augmentation algorithm of substitution mapping (Nielsen 2002). Along branch j and at site i, the substitution mapping, N ij , is the combination of the number of substitutions, n ij , and the successive states of the process ðr h ij Þ h¼1;...;n ij À1 . The random variable w k a is the total number of substitutions to state a at sites that are assigned in category k, plus one if a is the state at the root of the tree. The prior distributions of the branch lengths and site-specific relative rates follow independent gamma distributions with shape 1 and scale b > 0 and independent gamma distributions with shape a and scale a, respectively. n ij follows the Poisson distribution with the rate parameter r i l j and ðr h ij Þ h¼1;...;n ij À1 is drawn from ðp k a Þ; a 2 ½1; . . .; 20; k 2 ½1; . . .; 1.

Variational Inference of CAT Model
With mean-field variational approximations (Blei et al. 2006;Hoffman et al. 2013), each variable of the variational distribution is assumed to be independent. For practical implementation, we consider truncated stick-breaking representations (Blei et al. 2006) by setting the limit on the possible largest number of categories K max . The family of variational distributions in the CAT-Poisson model can be written as follows: qðN; z; V; p; l; rjHÞ where is the set of the free variational parameters. Note that equation (4) assumes independence among the sets of parameters describing phylogeny. This model may underestimate the posterior variance, if the true posterior joint distribution includes large correlations. We will see in the Result section that branch lengths, evolutionary rates, and amino acid profiles are almost independent in the joint distribution from MCMC. To guarantee the tractability of computing the expectations of variational distributions, we choose variational distributions from exponential families (Wainwright et al. 2007).
To estimate each variational parameter in the CAT-Poisson model (4, 5), we consider dividing the set of  The kth local variable V k is the unit length of kth stick in the stick-breaking representation which is used to make the infinite vector of mixing proportions. The ith local variable z k i of the mixture component represents the allocation situation of site i of alignment of amino acid sequences. Each local variable ðV k ; z k i Þ is governed by "local variational parameters" ½H l ¼ ð# k ; # 0 k ; / k i Þ. Bishop (2006) has proposed a coordinate ascent algorithm for solving the optimization problem of these variables. The coordinate ascent algorithm attempts to find the local optimum of the ELBO by optimizing each factor of the mean field variational distribution, while fixing the others. The optimal qðzÞ and qðVÞ are then proportional to the exponentiated expected log of the joint distribution, q Ã ðzÞ / exp E jz ½log pðN; V; z; p; l; rÞ þ const q Ã ðVÞ / exp E jV ½log pðN; V; z; p; l; rÞ þ const: Here, E jz and E jV denote expectations with respect to the variational distributions of all the variables except for z or V. The global variables U g potentially control any of the data. These variables are governed by the "global variational parameters" ½H g ¼ ðc; c 0 ; f; f 0 ; k; x; iÞ. The coordinate ascent algorithm iterates t times to update local variational parameters based on mapping data, where gð:Þ are the natural parameters.
To estimate each global variational parameter in the CAT-Poisson model, we use the stochastic variational inference Stochastic Variational Inference for Bayesian Phylogenetics . doi:10.1093/molbev/msz020 MBE (SVI) algorithm to optimize the lower bound in equation (2) (Hoffman et al. 2013). The stochastic variational algorithm is based on stochastic gradient ascent, the noisy realization of the gradient. In our study, we adopted natural gradients (Amari 1982) to account for the geometric structure of probability parameters (Robbins and Monro 1951). Importantly, natural gradients are easy to compute and give faster convergence than standard gradients. The SVI repeatedly subsamples the data, updates the values of the local parameters based on the subsampled data, and adjusts the global parameters in an appropriate way. Such estimates can guarantee algorithms to avoid shallow local optima of complex objective functions.
In our setting, we sample a mapping data point N n at each iteration, and compute the conditional natural parameters for the global variational parameters given N replicates of N n . Then, the noisy natural gradients are obtained. By using these gradients, we update H g at each of t iterations (with step size q t ): where tð:Þ denote the sufficient statistics. Based on the subsampling techniques, this procedure reduces the computational burden by avoiding the expensive sums in the above lower bound. The SVI algorithm thus significantly accelerates the variational objective analysis of the large database. Applying the previously proposed SVI framework (Hoffman et al. 2013), we can separate the computational cycle into the following steps: (1) Sample amino acid data from the whole set of input data.
(2) Estimate how each site is assigned to a category, based on observational data and the current approximation of variational parameters. (3) Update variational parameters -Local parameters are assignment variables, and breaking proportions.
-Global parameters are equilibrium frequency profile, branch length, and rate across sites.
The lower bound of the data in terms of the variational parameters is specifically described in the Supplementary Material online. Mathematical details of the variational objective function and computational methods of noisy derivatives and updating of variational parameters are also explained in that section.

Parallelization and Tree Topology
To parallelize the algorithm at the single machine level and thus reduce runtimes, we adopted the MPI parallelization of the PhyloBayes MPI program (Lartillot et al. 2013). Specifically, we used one master process for dispatching computational tasks and collecting and summing results, and with multiple slave processes executing the orders and returning all essential information to the master. This parallel strategy helps to equally divide the computational burden among slaves.
In addition, a partial Gibbs sampling algorithm for pruning and regrafting (SPR) is adopted to update the tree topology (Lartillot et al. 2013). In a parallel environment, the task of the master process is to randomly select a subtree for pruning and send this information to all slaves. The task of each slave process is to update the conditional likelihood vectors of each resulting topology and the complete scan of all possible regrafting points. One single log likelihood for each regrafting point is arranged into an array and sent back to the master process. All arrays are collected and summed and lastly the Gibbs sampling decision rule is finally applied to select the regrafting position.

Data Sets
Three real data sets were used for our computational experiments. Data set A was a mitochondrial data set consisting of 33 proteins and 6,622 amino acid positions from 13 species. Data set B was a plastid data set composed of 50 plastidencoded proteins and 10,137 amino acid positions from 28 species. In total, 13% and 5% amino acid positions were missing from the mitochondrial and plastid data sets, respectively (Rodr ıguez- Ezpeleta et al. 2006;Lartillot et al. 2013). Finally, data set C was a more challenging and larger complete set of mitochondrial protein sequences derived from a large alignment of EST and genome data, which consists of 197 genes and a total of 38,330 amino acid positions from 66 species and with 30% missing data, constructed by (Philippe et al. 2011).
Cþþ code for the variational inference version of the CAT model to perform computational experiments with these data sets is available at https://github.com/tungtokyo1108/; last accessed January 21, 2019.

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.