-
PDF
- Split View
-
Views
-
Cite
Cite
Dan L. Warren, Anthony J. Geneva, Robert Lanfear, RWTY (R We There Yet): An R Package for Examining Convergence of Bayesian Phylogenetic Analyses, Molecular Biology and Evolution, Volume 34, Issue 4, April 2017, Pages 1016–1020, https://doi.org/10.1093/molbev/msw279
- Share Icon Share
Abstract
Bayesian inference using Markov chain Monte Carlo (MCMC) has become one of the primary methods used to infer phylogenies from sequence data. Assessing convergence is a crucial component of these analyses, as it establishes the reliability of the posterior distribution estimates of the tree topology and model parameters sampled from the MCMC. Numerous tests and visualizations have been developed for this purpose, but many of the most popular methods are implemented in ways that make them inconvenient to use for large data sets. RWTY is an R package that implements established and new methods for diagnosing phylogenetic MCMC convergence in a single convenient interface.
Bayesian inference using Markov chain Monte Carlo (MCMC) in phylogenetics involves inferring posterior distributions (e.g. of phylogenetic trees and model parameters) given a set of prior beliefs and a molecular sequence alignment. This approach has become one of the primary methods used to infer phylogenies from molecular data (Ronquist et al. 2012; Bouckaert et al. 2014). However, practical applications of MCMC to phylogenetic problems are complicated by a problem inherent to MCMC methods; it is frequently difficult to determine whether the chain has undergone enough iterations and whether enough samples have been taken to accurately infer the posterior distributions of clades and model parameters.
MCMC methods allow researchers to make inferences about the parameters of interest, such as the phylogenetic tree topology, while integrating out the uncertainty in other parameters, such as the model of molecular evolution (Gilks et al. 1996). It is not necessary to explore the entire space of possible solutions to do this; rather an MCMC chain is said to have “converged” when further exploration of the solution space does not change the inferred posterior probability distributions beyond some user-specified tolerance. Failure to appropriately diagnose non-convergence can lead to premature termination of chains, resulting in inappropriate estimates of the tree topology, clade support values, and model parameters.
There are attributes of the phylogeny problem that make achieving and assessing convergence difficult. Phylogenetic problems often involve estimating many interacting model parameters (such as rates of evolution and dates of divergence), as well as the topology of the phylogenetic tree (which may itself interact with inferred model parameters). Interactions among continuous parameters can make exploring the space of possible solutions difficult, because efficient exploration requires coordinated changes among more than one parameter. On top of this, exploring the space of all possible phylogenetic tree topologies (“tree space”) is also difficult; the number of possible tree topologies is astronomical even for small numbers of taxa, and adjacent solutions can differ considerably in their posterior probability. As such, tree space can contain local optima (Whidden and Matsen 2015) in which the MCMC can become stuck, and so fail to converge. Interactions among parameters mean that the failure of a single parameter to converge can lead to poor inferences of other parameters. Thus, assessing MCMC convergence requires the analysis of all parameters, including the tree topology itself.

RWTY plots examining the behavior of individual chains. Data are two chains from Williams et al. (2013). RWTY allows users to visualize the amount of time chains have spent in different areas of tree space both as a heatmap (A) and a scatter plot (B). RWTY also allows users to visualize likelihoods and model parameters (C and D) and tree topology (E and F) as a function of chain length. In panel G, the cumulative change in split frequencies is plotted as a function of chain length, where the solid line gives the mean standard deviation of split frequencies as a function of chain length and increasingly lighter ribbons give the limits of the 75%, 95%, and 100% quantiles. RWTY also plots the cumulative posterior probability estimates for clades as a function of chain length, and highlights splits that are likely to be problematic using a metric (WCSF) that weights changes in split frequencies by their position in the chain (H).
In order to deal with the shortcomings of convergence diagnostics based on likelihoods and model parameters, AWTY (Are We There Yet, Nylander et al. 2008) provided novel convergence diagnostics based on directly examining the posterior probabilities of clades as a function of chain length. These diagnostics help users detect when multiple topological optima are being explored, and can help estimate the number of trees necessary to achieve accurate posterior estimates (fig. 1, panel H). Comparison of such plots from multiple replicate chains can further assist users in diagnosing problems with phylogenetic MCMC analyses because well-behaved replicate chains will infer similar posterior distributions. It must be noted that none of these diagnostic plots is sufficient to positively diagnose convergence, but at minimum they represent a much stricter set of necessary conditions for accepting the output of an MCMC when compared to simply examining likelihood or parameter plots alone.
The AWTY software (Nylander et al. 2008) has been widely used since its release, and is now part of the standard toolbox of investigators using MCMC methods in phylogenetics. However, the package is only available through an online interface, it runs slowly on large tree files, it does not read tree files from the latest phylogenetic MCMC software, and its code is not open source and so cannot be extended or further developed by the community.
RWTY is a new package that leverages the functionality of R packages for phylogenetics (Paradis et al. 2004; Schliep 2011), statistical analysis (Plummer et al. 2006; Wickham 2007; Schloerke et al. 2011; R Core Team 2016), and visualization (Wickham 2009; Schloerke et al. 2011; de Vries and Ripley 2013; Garnier 2016) to provide a suite of functions for visualizing and analyzing the performance of MCMC chains. RWTY provides a single environment in which to analyze the convergence of all parameters in a phylogenetic MCMC analysis, including continuous parameters and those associated with the tree topology. RWTY accepts input from popular phylogenetic MCMC packages, currently including MrBayes (Ronquist et al. 2012), BEAST (Bouckaert et al. 2014), and RevBayes (Höhna et al. 2016). In addition, trees may be manually loaded from any format that can be coerced into an ape multiphylo object, and parameter data from any format that can be converted to an R data frame. RWTY provides access to many existing and new methods for assessing convergence of phylogenetic MCMC analyses. For example, it produces plots of the posterior probability of sampled clades similar to those produced by AWTY (Nylander et al. 2008) (fig. 1, panel H), it allows visualization of MCMC exploration of tree space in a manner similar to TreeSetViz (Amenta and Klingner 2002) (fig. 1, panels A and B), and it allows users to examine traces, posterior probability distributions, and effective sample sizes (ESS) of model parameters similar to Tracer (Rambaut et al. 2014) (fig. 1, panels C–F).

RWTY plots for between-chain comparisons. Data are six MCMC chains from an analysis of three loci, two runs per locus (Williams et al. 2013). In the top panel, posterior probability estimates from each pair of chains are shown in a scatter plot (below diagonal), and summary statistics are given above the diagonal. The center panel shows a chain comparison dendrogram, in which the branch length separating each pair of chains represents the average standard deviation of split frequencies between those chains. In the lower panel, the mean and 75%, 95% and 100% quantiles of the standard deviation of split frequencies across all chains is plotted against chain length.

Pairwise examination of topology, model parameters, and likelihoods for a single chain from an analysis of a fungus dataset (Hibbett et al. 1997). Histograms along the diagonal plot the posterior probability distribution for each parameter, with red indicating values outside of the 95% confidence interval. Relationships between parameters are displayed as scatter plots below the diagonal and contour plots above the diagonal. Points in the scatter plots are colored according to generation from the MCMC chain, with lighter colors representing points later in the chain.
Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.
Acknowledgments
This work was funded in part by Australian Research Council fellowship awards to D.L.W. and R.L.
References
Author notes
Associate Editor: Michael Rosenberg