Motivation: A variety of probabilistic models describing the evolution of DNA or protein sequences have been proposed for phylogenetic reconstruction or for molecular dating. However, there still lacks a common implementation allowing one to freely combine these independent features, so as to test their ability to jointly improve phylogenetic and dating accuracy.
Results: We propose a software package, PhyloBayes 3, which can be used for conducting Bayesian phylogenetic reconstruction and molecular dating analyses, using a large variety of amino acid replacement and nucleotide substitution models, including empirical mixtures or non-parametric models, as well as alternative clock relaxation processes.
Availability: PhyloBayes is freely available from our web site http://www.phylobayes.org. It works under Linux, Mac OsX and Windows operating systems.
Supplementary information:Supplementary data are available at Bioinformatics online.
The field of phylogenetics has been particularly prolific over the recent years. Many new models have been proposed, such as mixture models, accounting for site-specific effects (Huelsenbeck and Suchard, 2007; Lartillot and Philippe, 2004; Le et al., 2008a, b; Pagel and Meade, 2004; Wang et al., 2008), or flexible molecular clocks (Drummond et al., 2006; Lepage et al., 2007; Thorne et al., 1998). However, many of these developments are often available through distinct implementations, sometimes under different statistical paradigms (maximum likelihood versus Bayesian), making comparative evaluations more difficult, and preventing potentially powerful model combinations to be applied to empirical data. In particular, mixture models of amino acid replacement have resulted in important advances in model fit and phylogenetic reconstruction (Lartillot et al., 2007; Le et al., 2008b; Wang et al., 2008), suggesting that their use in molecular dating analyses may also result in fundamental improvements. Yet, current molecular dating software packages do not implement such sophisticated substitution models.
In this direction, we propose a Bayesian phylogenetic reconstruction program, PhyloBayes 3. As its two main distinguishing features, this program gathers a large class of recently published models accounting for variations of the substitution patterns along the sequences, and rate variations along lineages (relaxed clocks). Overall, PhyloBayes 3 makes it possible to use a large spectrum of substitution models for both phylogenetic reconstruction and molecular dating analyses.
2.1 Substitution models
Gathering several recent developments about the use of mixtures in statistical phylogenetics, PhyloBayes 3 proposes a wide range of models accounting for site-specific variations of several features such as: More classical site-homogeneous models are also implemented, such as JTT (Jones et al., 1992), WAG (Whelan and Goldman, 2001) or LG matrices (Le and Gascuel, 2008), for proteins, and the general time-reversible (GTR) model for protein and nucleic acid data. In addition to these prespecified settings, PhyloBayes 3 allows users to enter their own matrix or mixture of matrices or profiles.
the equilibrium frequencies of the substitution process, using either a Dirichlet process (Lartillot and Philippe, 2004) or pre-specified empiricals mixture of profiles (Le et al., 2008a; Wang et al., 2008);
the entire substitution matrix again using a Dirichlet process, or pre-specified empirical mixtures of matrices (Le et al., 2008b).
2.2 Molecular dating
Currently available molecular dating software programs propose either branch i.i.d. models of rate variations (Akerborg et al., 2008; Drummond et al., 2006), or autocorrelated model of rate variations (Kishino et al., 2001; Lepage et al., 2007; Rannala and Yang, 2007). In the case of branch i.i.d. models, the rate of evolution on each branch is independent from that of neighboring branches. In autocorrelated models, on the other hand, the rate follows a diffusion process along the lineages, so that trends in the overall substitution rate may last over several successive branches of the tree.
Differences also exist in the way the likelihood is computed. A first approach consists in using a normal approximation around the maximum likelihood estimate (Thorne et al., 1998). A computationally more intensive but more exact approach requires to combine relaxed clock models with the classical pruning algorithm for computing the likelihood. Dynamic programming algorithms have been proposed to reduce the computational burden entailed by such exact computations (Akerborg et al., 2008). In our case, we use data augmentation methods (see below), which also result in much more tractable computations in practice.
Finally, there are several ways in which fossil calibrations can be enforced. In the hard constraint approach, calibrations are considered as totally certain, so that calibrated nodes are never allowed to fall outside the specified intervals provided by fossil evidence (Kishino et al., 2001). Alternatively, the soft bound approach allows for smoothly decreasing probability outside the intervals provided by the fossil calibrations (Inoue et al., 2009; Rannala and Yang, 2007; Yang and Rannala, 2006).
Subsuming all these developments, PhyloBayes 3 implements both autocorrelated and non-autocorrelated models, and allows for molecular dating analyses both with or without the normal approximation, thereby making all substitution models currently implemented in PhyloBayes accessible to molecular dating analyses. It also accepts either hard or soft fossil constraints, thus allowing extensive comparisons of alternative approaches for estimating divergence dates.
2.3 Monte Carlo methods
On the algorithmic side, PhyloBayes relies on classical Metropolis–Hastings Monte Carlo methods, combined with more sophisticated sampling algorithms based on data augmentation and conjugate Gibbs sampling (Lartillot, 2006; Mateiu and Rannala, 2006). These latter algorithmic developments proved essential for proper mixing in the high-dimensional spaces entailed by the more complex models proposed by the program, in particular the infinite mixtures. They are also a key ingredient of the molecular dating analyses without the normal approximation.
Correctly assessing convergence is a particularly important aspect of Markov chain Monte Carlo, in particular under complex non-parametric models. In this respect, we propose several convergence diagnostics, consisting in measuring the discrepancy between the posterior averages obtained from several independent runs, as well as estimating the decorrelation time and the effective size of several summary statistics. These diagnostics are also available as automated stopping rules, telling the program to stop once the diagnostics indicate a sufficiently good convergence.
The implementation was checked against several alternative software programs under equivalent models, as well as using self-consistency tests based on simulated data (see Supplementary Material).
2.4 Model comparison and assessment
PhyloBayes 3 offers several approaches for Bayesian model comparison and assessment, such as Bayes factor computation for comparing relaxed molecular clock models under the normal approximation, cross-validation and posterior predictive testing. Based on our experience thus far, tentative guidelines for model choice can be provided: the CAT-GTR model, which is a Dirichlet process mixture of profiles of equilibrium frequencies combined with general exchange rates (i.e. an infinite mixture of matrices sharing the same set of exchange rates, and differing only by their equilibrium frequencies) is the best overall model, although its fit breaks down for smaller datasets (<1000 aligned positions), for which empirical mixtures then provide good alternatives. On the other hand, for very large datasets, the computational cost of CAT-GTR may be prohibitive, in which case combining an infinite mixture of profiles with flat exchange rates (the CAT model) offers a good compromise between computational speed and model fit.
We conducted an analysis on a previously published phylogenomic dataset encompassing 15 553 aligned positions for 52 eukaryotic taxa (Philippe et al., 2007). We first compared alternative models by cross-validation (the procedure to follow is described in the manual, see Supplementary Material). The CAT-GTR model appears to be the best model, followed by the CAT model.
Using posterior predictive tests, the models were assessed for their ability to account for saturation and site-specific amino acid propensities (diversity). The CAT-GTR model is the only model not rejected by both tests, while CAT is only marginally rejected for the diversity test, and passes the saturation test (Table 1). Interestingly, the WLSR model (Wang et al., 2008) also passes the saturation (but not the diversity) test, and this, in spite of its low cross-validation fit. All other models are rejected by the two tests.
As both cross-validation and posterior predictive assessments strongly suggest the use of CAT-GTR, this model was used to infer the tree and perform a molecular dating analysis. The tree was identical to that obtained under the CAT model, although with globally higher posterior probability support values (Fig. S1 in the Supplementary Material; see also Philippe et al., 2007). For the molecular dating analysis, we used a log-normal autocorrelated clock relaxation model, a birth–death prior on divergence times, combined with soft calibrations (Rannala and Yang, 2007; Yang and Rannala, 2006). The resulting chronogram, automatically provided as a postscript file by the program, is displayed in Figure 1. Imposing hard bounds for the calibrations, using a uniform prior on divergence times, or alternative clock relaxation models, did not result in significant changes in the divergence date estimates (Fig. 2 in the Supplementary Material).
The overall analysis took approximately 1 month, occupying 32 nodes on a distributed memory cluster (Intel Q9550 bi-processors), >70% of the CPU being used by the cross-validation analysis. The inference of the tree topology took 2 weeks on two independent processors, and the molecular dating analysis required ∼1 week on eight nodes.
We wish to thank Le Si Quang, Olivier Gascuel, Huai-Chun Wang and Andrew Roger for providing the models; Sebastien Bigaret for his technical help in compiling under windows; Hervé Philippe, Frédéric Delsuc and Emmanuel Douzery for their extensive beta testing of the program; Nicolas Rodrigue and three anonymous referees for their comments on the manuscript.
Funding: French ANR Mitosys program; Université de Montréal; Canada Foundation for Innovation.
Conflict of interest: none declared.