: Software for Bayesian Phylogenetic Analysis

.—Phycas is open source, freely available Bayesian phylogenetics software written primarily in C ++ but with a Python interface. Phycas specializes in Bayesian model selection for nucleotide sequence data, particularly the estimation of marginal likelihoods, central to computing Bayes Factors. Marginal likelihoods can be estimated using newer methods (Thermodynamic Integration and Generalized Steppingstone) that are more accurate than the widely used Harmonic Mean estimator. In addition, Phycas supports two posterior predictive approaches to model selection: Gelfand–Ghosh and Conditional Predictive Ordinates. The General Time Reversible family of substitution models, as well as a codon model, are available, and data can be partitioned with all parameters unlinked except tree topology and edge lengths. Phycas provides for analyses in which the prior on tree topologies allows polytomous trees as well as fully resolved trees, and provides for several choices for edge length priors, including a hierarchical model as well as the recently described compound Dirichlet prior, which helps avoid overly informative induced priors on tree length. [Bayes Factor; Bayesian phylogenetics; conditional predictive ordinates; data partitioning; marginal likelihood; posterior predictive model selection; steppingstone method.]

The diversity and complexity of models blossomed following the introduction of Bayesian statistics into phylogenetics in the mid-1990s (Huelsenbeck et al. 2000;Lartillot and Philippe 2004;Huelsenbeck et al. 2006;Ané et al. 2007;Redelings and Suchard 2007;Alekseyenko et al. 2008;Drummond and Suchard 2010;Heled and Drummond 2010;Evans and Sullivan 2012). Maximum likelihood approaches to model comparison-for example, likelihood ratio tests (LRTS; Wilks 1938), Akaike Information Criterion (AIC; Akaike 1974), Bayesian Information Criterion (BIC; Schwarz 1978), and decision-theoretic model selection (DTModSel; Minin et al. 2003)-do not account for the potentially important influence of the prior on model fit. This becomes increasingly relevant as model complexity grows. Rich models may include nuisance parameters that are only mildly identifiable, yet critical for model realism. Such models depend heavily on information in the prior to provide restraints on the parameter values.
Phycas is software for Bayesian phylogenetic analysis written largely in C++ but with a Python 2.x interface mediated by the Boost Python library. Phycas expects data files to be in NEXUS format, and uses the NEXUS class library (Lewis 2003) as its NEXUS file parser. Although currently restricted to nucleotide data and unrooted trees, Phycas offers Bayesian model selection methods that are not commonly found in Bayesian software packages, such as marginal likelihood estimation by the generalized steppingstone (SS) method, posterior predictive model selection using Conditional Predictive Ordinates (CPOs), and the Gelfand-Ghosh (GG) method, tree topology priors that allow polytomies, and edge length priors that control tree length. Phycas uses the GTR substitution model (and submodels of GTR), with optional discrete gamma and invariable sites rate heterogeneity models. Data partitioning is allowed, with tree topology and edge lengths linked across partition subsets and all other model parameters unlinked. These and other features of Phycas are detailed in the sections that follow.

PYTHON INTERFACE
Phycas is largely written in the C++ computing language; however, it is used via a Python interface made possible by the Boost Python library (http:// www.boost.org/doc/libs/1_55_0/libs/python/doc/, last accessed January 23, 2015). Consequently, the user creates a simple Python script to run Phycas (in this way Phycas is similar to the P4 program of Foster 2004a). The most basic analysis possible is given by the following example script: from phycas import * mcmc.data_source = 'mydata.nex' mcmc() Submitting the above script in the Python interpreter first loads all Phycas modules into Python (from phycas import *), specifies the file mydata.nex as the data source, and conducts a default Bayesian Markov chain Monte Carlo (MCMC) analysis (mcmc()). In general, command settings (e.g., data_source) are specified by separating the command name (e.g., mcmc) from the setting by a period, while calling the command as if it were a Python function (e.g., mcmc()) results in execution of the command. A list of available commands may be obtained by typing help after issuing the from phycas import * statement to load Phycas code into the Python interpreter. The current values for a command (e.g., mcmc) may be seen by typing 526 SYSTEMATIC BIOLOGY VOL. 64 mcmc.curr, and more verbose usage hints are obtained by typing mcmc.help.
To reduce the learning curve, Phycas provides a command named scriptgen that creates Python scripts for common Phycas analyses. For example, calling scriptgen as follows generates a Python script file named runss.py containing commands for conducting a steppingstone analysis using the GTR+I+G model using data from mydata.nex: scriptgen.analysis = 'ss' scriptgen.model = 'gtr+i+g' scriptgen.data = 'mydata.nex' scriptgen.out.script = 'runss.py' scriptgen() An extensive tutorial and reference is provided in the manual (see Section "Availability").

MODEL SELECTION
Phycas offers several methods for estimating the marginal likelihood of a model-the Harmonic Mean (HM), Thermodynamic Integration (TI), and Generalized Steppingstone (SS) methods-and two posterior predictive approaches to model selection-GG and CPO.

Marginal Likelihood Estimation
The marginal likelihood is the total probability of the data conditional on a particular model of interest. The marginal likelihood is a weighted average likelihood over all possible combinations of parameter values, where the weights are provided by the joint prior distribution. Integrating out all model parameters leaves a quantity that does not depend on any particular parameter value combination, and instead represents the average fit of the model to the data. Because all parameters are integrated out of each model, the marginal likelihood is comparable across models, and models with higher marginal likelihoods are preferred over models with lower marginal likelihoods. Much recent attention has focused on estimating the marginal likelihood accurately. Despite being strongly biased, the HM method (Newton and Raftery 1994) long dominated marginal likelihood estimation, at least in phylogenetics, due to the fact that it is easily computed using the same posterior sample used to make inferences and is thus essentially cost free. The HM marginal likelihood estimate is provided when the Phycas sump() command is executed. Lartillot and Philippe (2006) provided TI as an accurate alternative to HM. Phycas implements the annealing-melting form of TI, allowing nonuniform spacing of values as suggested by Friel and Pettitt (2008), Lepage et al. (2007), and Xie et al. (2011). Xie et al. (2011) and Fan et al. (2011) introduced the SS method, that achieves the accuracy of TI but is slightly more computationally efficient. Phycas implements two versions of SS. The original version (Xie et al. 2011) follows a path from posterior to prior, and has recently been implemented in both BEAST (Baele et al. 2012;Drummond et al. 2012), MrBayes 3.2 (Ronquist et al. 2012), and the recently released RevBayes (https:// github.com/revbayes/revbayes/wiki, last accessed January 23, 2015) (Höhna et al. 2014a,b). Alternatively, generalized SS ) follows a path between the posterior and a reference distribution. In practice, the reference distribution used by Phycas is simply the joint prior parameterized using a preliminary sample from the posterior distribution by matching first and second moments of marginal posterior distributions. For example, the default prior distribution for the gamma shape parameter governing rate heterogeneity is Exponential(1.0). If the marginal posterior mean of the gamma shape parameter from a preliminary MCMC analysis is 0.236, then an Exponential(1/0.236) density would be used for this component of the reference distribution. The two versions of SS become equivalent if the prior is used as the reference distribution in generalized SS. Recently, Holder et al. (2014) described a reference distribution for tree topology, allowing generalized SS to estimate the marginal likelihood when tree topology is variable. This method is used in Phycas whenever an SS analysis is requested and tree topology is not fixed.

Posterior Predictive Approaches
Phycas offers two posterior predictive approaches to Bayesian model selection: The GG and the CPO methods . Although marginal likelihoods focus attention on the goodness-of-fit of a model to the observed data, posterior predictive approaches address how well the model of interest does in predicting new data. It is natural to expect a good model to generate data (e.g., through simulation) that is nearly indistinguishable from real data, and posterior predictive model selection works by comparing predicted data (data simulated from the model) to the actual data used to parameterize the model. Gelfand and Ghosh (1998) described a posterior predictive model selection criterion that balances predictive variance against goodness-of-fit. Paying attention to the predictive variance is important because otherwise a model that makes wild predictions can be selected in preference to a model that is much more precise in its predictions. The first posterior predictive approaches for model selection to be specifically applied in phylogenetics involved posterior predictive P values (Bollback 2002;Suchard et al. 2003). This method computes a statistic 0 from the original data set and also from each of n data sets simulated from the model of interest. Each simulated data set uses parameter values from a parameter vector sampled via 527 MCMC from the posterior distribution. If the value 0 falls within the middle 95% of the distribution of simulated statistics, the model is considered adequate. One possible drawback of this approach is that a model that produces a distribution of simulated statistics with large variance can easily be considered adequate; that is, the P value approach can reward a model for being very imprecise in its predictions. In the GG approach, models are penalized for predictive variance and rewarded for goodness-of-fit, and thus a model cannot win if it fails to be both reasonably accurate as well as reasonably precise when compared with competing models.
A key aspect of posterior predictive approaches lies in the choice of statistic used in the assessment. Currently, Phycas uses a very basic statistic that nevertheless captures important aspects of variability as well as nucleotide composition. For each data set (original and simulated), the statistic S represents the vector of counts of 15 categories of sites: categories 1-4 are constant sites (all A, all C, all G, all T); categories 5-10 have only two of the four nucleotide states present (A+C, A+G, A+T, C+G, C+T, G+T); categories 11-14 have only three of the four nucleotide states present (A+C+G, A+C+T, A+G+T, C+G+T); and the 15th category consists of highly variable sites containing all four nucleotide states (A+C+G+T). To do well, a model must produce simulated data sets that are similar to the original data set in the frequencies of these 15 categories of sites. This categorization is arbitrary but nevertheless captures important aspects of a nucleotide sequence alignment, such as nucleotide composition and rate heterogeneity, and binning sites into just 15 categories results in large samples for most bins.
The second approach to posterior predictive model selection available in Phycas is the method of CPO. The CPO method was first described by Geisser (1980) and was later considered by Gelfand et al. (1992) and Chen et al. (2000), and was introduced into phylogenetics by Lewis et al. (2014). It offers posterior predictive cross validation with essentially no additional computational cost beyond the initial MCMC required to produce a sample from the posterior distribution. The key quantity used to compare models in this approach is the probability of the data (y i ) observed at a focal site i given the data from all other sites except the focal site (y (−i) ): CPO i = p(y i |y (−i) ). These CPOs may be used to assess how well the model fits each site, and when summed across sites produces the LPML (Log PseudoMarginal Likelihood), which can be used as an overall criterion for comparing models. It might appear that k separate analyses would be required for k sites, each time excluding one site; however, it is possible to compute all k CPOs using only a single MCMC analysis exploring the posterior distribution. During a CPO analysis, Phycas saves all site log-likelihoods each time the MCMC analysis samples parameters and trees. The log(CPO) estimate for the i-th site equals the log of the harmonic mean of the site likelihoods saved for site i.

POLYTOMY ANALYSES
Phycas implements the tree topology prior described by Lewis et al. (2005). This generalized topology prior places nonzero prior probability mass on unresolved tree topologies, providing one possible solution to the problem known as "Bayesian Star Tree Paradox" (Suzuki et al. 2002;Lewis et al. 2005;Yang and Rannala 2005;Steel and Matsen 2007;Susko 2008;Wang and Yang 2014). The paradox lies in the infrequent but surprisingly strong support for fully resolved tree topologies when the truth is a star tree under traditional priors that are uniform over all fully resolved tree topologies. Other than Phycas, the only software applications that currently offer such a prior are Crux (Evans 2009) and P4 (Foster 2004a,b). Phycas implements two versions of this tree topology prior. In the first version, the polytomy prior, a tree topology having m internal nodes is C times more probable than a tree topology with m+1 internal nodes. This version does not take into account the number of possible tree topologies in each resolution class, where resolution class m comprises all tree topologies having m internal nodes. An alternative version, the resolution class prior, results in resolution class m having prior probability C times greater than resolution class m+1. Both the polytomy and resolution class priors provide a means (by specifying C > 1) for gently encouraging trees to contain polytomies, while also allowing a flat prior across all possible binary and nonbinary tree topologies (by specifying C = 1). Gentle prior encouragement of polytomies (e.g., C = e = 2.71828···) prevents artifactual support for groupings not present in the true tree without sacrificing resolution in cases of true support (Lewis et al. 2005).

EDGE LENGTH PRIORS
Phycas offers a choice of priors for edge lengths. The traditional approach associates a Uniform(0, 2 ) or Exponential(1/) prior with each of the 2n−3 edges in an unrooted tree, where is the prior mean and n is the number of tips (taxa). Phycas requires prior distributions to be proper (integrate to 1.0) and to have support equal to the domain of the parameter to which it is applied, thereby disallowing a Uniform prior on edge lengths because the domain of edge length parameters is (0,∞). A hyperparameter (and its associated hyperprior) can be defined in Phycas to govern the Exponential edge length mean (as in Suchard et al. 2001). The use of hierarchical models allows the investigator to avoid having to choose a value for the edge length prior mean, instead allowing the edge length hyperparameter to be updated along with the other model parameters. This is a pure Bayesian way of accomplishing the goal of empirical Bayesian approaches that set the edge length prior mean to the mean or mode of the maximum likelihood edge length estimates (e.g., Brown et al. 2010). Yang and Rannala (2005) suggested that allowing internal edges to have a different (typically smaller) VOL. 64 mean than external edges as a way of avoiding overcredibility artifacts. Phycas allows separate priors for internal and external edge lengths in a nonhierarchical model, and allows separate hyperpriors for internal and external edge lengths under a hierarchical model. The posterior means of the two hyperparameters can be examined to assess whether internal edges typically have smaller or larger means than external edges. The use of hierarchical models allows the use of the prior suggested by Yang and Rannala (2005) without the danger of assuming an overly informative internal edge length prior mean, which can result in a prior model that forces internal edge lengths to be smaller than values preferred by the likelihood. Marshall et al. (2006), Brown et al. (2010), and Marshall (2010) identified a problem in which fixed edge length priors yielded tree lengths that were often much larger than the corresponding maximum likelihood estimates of tree length. Rannala et al. (2012) identified the central problem and suggested a solution. The problem is that when an Exponential() prior is applied to each of n e edge length parameters, the tree length prior induced is Gamma(n e , 1/), which has mean n e / and variance n e / 2 . Assume that an Exponential(10) edge length prior is applied to each edge in an unrooted tree with n e = 100 edges. Although this prior seems reasonable (mean 0.1, [SD] 0.1) for each edge length individually, the tree length prior has mean 10 and standard deviation 1. This tree length prior is surprisingly informative, and would lead to overestimated posterior mean tree length if the true tree length were small (e.g., 1), which is often the case for phylogeographic data sets in which there are many taxa (many below species level) but low molecular divergence. Rannala et al. (2012) suggested applying a gamma prior to tree length and allowing a Dirichlet distribution conditional on tree length to govern the individual edge length proportions. This compound Dirichlet prior distribution can be used in Phycas by setting model.tree_length_prior = TreeLengthDist( T , T , , c), where T , T , , and c are defined exactly as in the Rannala et al. (2012) paper, namely: T and T are the shape and scale of the gamma prior distribution governing tree length, is the Dirichlet parameter governing each external (terminal) edge length proportion, and c is the Dirichlet parameter governing each internal edge length proportion.

EXAMPLE
As an example of some of the analyses possible using Phycas, consider the gene psaB from the green algal data-set analyzed by Lewis et al. (2014). This file contains nucleotide sequence data (1791 sites) for the chloroplast protein coding gene psaB from 35 green algal taxa in the class Chlorophyceae, order Sphaeropleales. To estimate the marginal likelihood of these data, we created a file containing the following text: from phycas import * scriptgen.analysis='ss' scriptgen.model='gtr+i+g' scriptgen.datafile='sphaero_psaB.txt' scriptgen.out.script.prefix='psaBss' scriptgen() Executing this file in Python 2.7 produced a Phycas script file named psaBss.py containing the commands necessary to carry out a generalized SS analysis under an unpartitioned GTR+I+G model. Executing psaBss.py in Python 2.7 results in several files, the most important of which is ss-sump-log.txt, which shows the estimated log marginal likelihood: −23474.34855.
An exploratory CPO analysis may be conducted to assess the potential benefits of partitioning by codon position, and a second steppingstone analysis may be used for confirmation. To perform the CPO analysis, create a phycas script named psaBcpo.py by executing the following script in Python: from phycas import * scriptgen.analysis= 'cpo' scriptgen.model='gtr+i+g' scriptgen.datafile='sphaero_psaB.txt' scriptgen.out.script.prefix='psaBcpo' scriptgen() To define a partition that divides sites into three subsets according to their codon position, add these lines to psaBcpo.py just before the line (beginning mcmc.data_source) that specifies the data file: m = model() partition.addSubset (subset(1, 1791, 3), m, 'first') partition.addSubset(subset(2, 1791, 3), m, 'second') partition.addSubset (subset(3, 1791, 3), m, 'third') partition() These lines define three subsets (named, arbitrarily, first, second, and third, that each contain one-third of the 1791 total sites. The command subset(x, y, z) produces a list of site indices starting with site x (where 1 is the first site), ending with site y, and advancing z sites at every step. Executing the psaBcpo.py in Python conducts the CPO analysis. Phycas generates an R script that if executed in R will generate the plot shown in Figure  values. This makes sense because third codon position sites are relatively fast evolving, and fast-evolving sites produce data patterns that are difficult to predict, even using the true model. The middle section corresponds to second codon position sites. These are generally the most slowly evolving sites, and clearly have the highest CPO values on average, although many exceptions are evident.
Estimating the marginal likelihood requires only that the partition information be inserted into the file psaBss.py before it is executed again in Python. The log marginal likelihood when sites are partitioned by codon position is −22191.41873, which represents nearly 1283 log units improvement in model fit. The LPML values obtained by summing log(CPO) values across all sites also favor partitioning by codon position: LPML is −23328.61002 for unpartitioned data compared to −22001.99350 for partitioned data (nearly 1327 log units higher).
Although not explicitly illustrated here, polytomy analyses were performed using an early version of Phycas in Lewis et al. (2005). The data set needed to reproduce Figure  Posterior predictive model selection: Lewis et al. (2014) LIMITATIONS One limitation of Phycas pointed out in review is that it lacks some features that might be expected in a Python module, such as programmatic access to sampled parameters and trees during an MCMC analysis. Such access to runtime results would enable much better integration into existing Python analysis pipelines and would save effort involved in writing additional code to process saved parameter, tree, and output files. Our original goal was to provide software that produced output files similar in format to other popular Bayesian software applications such as MrBAYES (Ronquist et al. 2012) and BEAST (Drummond et al. 2012), allowing use of visualization tools such as Tracer (Rambaut et al. 2013) and FigTree (Rambaut 2014). We wished to avoid creating a new ad hoc scripting language just for Phycas, and chose Python because of its power and simplicity; however, the current version of Phycas lacks an Application Programming Interface (API) and thus has the behavior of a stand-alone computer program, with both input and output effected via plain text files.

AVAILABILITY
Phycas is free and open source; source code and a PDF manual are available from phycas.org and github.com/plewis/phycas. Precompiled versions for Windows 7/8 and MacOS 10.9 are also available for download from phycas.org. FUNDING POL was supported by the National Science Foundation (grant numbers EF-0331495, DEB-1036448, and DEB-1354146) and the Alfred P. Sloan Foundation (grant number 98-4-5 ME). MTH was supported by National Science Foundation grant DEB-0732920. DLS was supported by the National Evolutionary Synthesis Center (NESCent), National Science Foundation grant EF-0423641. Additional support was provided by the Department of Ecology and Evolutionary Biology at the University of Kansas, and by the Department of Ecology and Evolutionary Biology and the Bioinformatics VOL. 64 Facility of the Biotechnology/Bioservices Center at the University of Connecticut.