Computational Performance and Statistical Accuracy of *BEAST and Comparisons with Other Methods

Under the multispecies coalescent model of molecular evolution, gene trees have independent evolutionary histories within a shared species tree. In comparison, supermatrix concatenation methods assume that gene trees share a single common genealogical history, thereby equating gene coalescence with species divergence. The multispecies coalescent is supported by previous studies which found that its predicted distributions fit empirical data, and that concatenation is not a consistent estimator of the species tree. *BEAST, a fully Bayesian implementation of the multispecies coalescent, is popular but computationally intensive, so the increasing size of phylogenetic data sets is both a computational challenge and an opportunity for better systematics. Using simulation studies, we characterize the scaling behavior of *BEAST, and enable quantitative prediction of the impact increasing the number of loci has on both computational performance and statistical accuracy. Follow-up simulations over a wide range of parameters show that the statistical performance of *BEAST relative to concatenation improves both as branch length is reduced and as the number of loci is increased. Finally, using simulations based on estimated parameters from two phylogenomic data sets, we compare the performance of a range of species tree and concatenation methods to show that using *BEAST with tens of loci can be preferable to using concatenation with thousands of loci. Our results provide insight into the practicalities of Bayesian species tree estimation, the number of loci required to obtain a given level of accuracy and the situations in which supermatrix or summary methods will be outperformed by the fully Bayesian multispecies coalescent.


INTRODUCTION
In recent years, a number of new techniques have applied next-generation sequencing to phylogenetics and phylogeography (McCormack et al. 2013). These new methods include target enrichment strategies (Mamanova et al. 2010) like exon capture (Bi et al. 2012), anchored phylogenomics (Lemmon et al. 2012), and ultra-conserved elements (Faircloth et al. 2012), as well as RAD sequencing (Baird et al. 2008;Davey et al. 2011). As a result, genome-wide samples of large numbers of loci from multiple individuals and multiple species have become increasingly common. This trend is rapidly shifting the modus operandi of systematic biology from phylogenetics to phylogenomics. This move to phylogenomics has also heralded a rapid development and uptake of species tree inference methods that acknowledge and model the discordance among individual gene trees. As with the field of phylogenetics, there is a broad acceptance that probabilistic modelbased methods are preferable; however, the amount of data produced by next-generation technologies has also spurred the development of faster methods that do not utilize all the available data and employ statistical shortcuts such as admitting no uncertainty in individual gene trees Liu et al. 2009).

Bayesian Species Tree Estimation
The theory of incomplete lineage sorting and its implications for phylogenetic inference has been appreciated for some time (Pamilo and Nei 1988), and early approaches to applying this theory inferred the species tree that minimizes deep coalescences using gene tree parsimony (Maddison 1997;Page and Charleston 1997;Slowinski and Page 1999). The fully probabilistic application of the theory to molecular sequence analysis has only begun more recently with the introduction of Bayesian implementations of the multispecies coalescent (Rannala and Yang 2003;Edwards et al. 2007;Liu 2008;Liu et al. 2008;Heled and Drummond 2010). This model embeds gene trees within a birth-death or pure Yule species tree, and within each lineage (or branch) of the species tree, gene trees are assumed to follow a coalescent process (Heled and Drummond 2010). Prior to the development of these methods, it was necessary to assume that the history of each gene is shared and equal to the history of the species tree being studied.
However, gene trees evolve within a species tree and the approximation of equating them becomes increasingly problematic as one samples more loci, when in reality each have distinct gene tree topologies and divergence times. The multispecies coalescent brings together coalescent and birth-death models of timetrees into a single model. It describes the probability distribution of one or more gene trees that are nested inside a species tree. The species tree describes the relationship between the sampled species, or sometimes, sampled populations that have been separated for long periods of time relative to their population sizes. In the latter case it may be referred to as a population tree instead.
The initial implementations of the multispecies coalescent made very simple assumptions including no 381 382 SYSTEMATIC BIOLOGY VOL. 65 recombination within each locus and free recombination between loci. Although these simple assumptions can be robust to violation, including some forms of gene flow ) (but see Leaché et al. 2014), researchers have begun to acknowledge that additional processes (such as hybridization) may need to be incorporated (Joly et al. 2009;Kubatko 2009;Chung and Ané 2011;Yu et al. 2011;Camargo et al. 2012). A number of simulation studies have also looked at various facets of performance of Bayesian species tree estimation including the influence of missing data (Wiens and Morrill 2011), the influence of low rates and rate variation among loci (Lanier et al. 2014) and comparisons of performance with "supermatrix" concatenation approaches (DeGiorgio and Degnan 2010; Larget et al. 2010;Leaché and Rannala 2011;Bayzid and Warnow 2013).
Although these modeling advances are exciting, in the face of a next-generation data deluge, this study asks and answers the following, heretofore unanswered questions: (i) How do fully Bayesian multispecies coalescent methods scale to data sets of hundreds of loci? (ii) How much more accurate will phylogenetic species tree estimates be with more sequence data? (iii) When should one use a multispecies coalescent approach instead of computationally more efficient Bayesian supermatrix approaches, or summary methods which do not use all available data? To address the first of these questions, we investigate the computational performance of the *BEAST implementation of the multispecies coalescent (Heled and Drummond 2010), so as to assess the feasibility of conducting phylogenomic analyses using existing computational tools. To shed light on the second question, we investigate how estimation accuracy improves with increasing loci.
To address the final question, we investigate how the statistical accuracy of the multispecies coalescent compares with concatenation across a broad range of conditions. We also investigate the statistical accuracy of the multispecies coalescent, supermatrix and summary methods using simulations based on two published sequence data sets; RAD tag sequences from a study of the Sino-Himalayan plant clade Cyathophora (Eaton and Ree 2013), and RNA-seq assemblies from a study of primates (Perry et al. 2012). Cyathophora, a section of the genus Pedicularis originating in the late Miocene or the Pliocene, is probably no older than 8 Ma (Yang and Wang 2007) and is therefore a shallow study system. In contrast, primates are a deep study system, as the oldest split in this order is estimated to have occurred in the Cretaceous around 80 Ma (Tavaré et al. 2002;Steiper and Young 2006;Wilkinson et al. 2011).

METHODS
Using simulation, we investigated the trends in computational performance and statistical accuracy of the multispecies coalescent model as implemented in BEAST 2 (*BEAST), and its statistical accuracy relative to other methods of species tree inference. In designing these simulation studies there were a number of parameters to consider. The key parameters that might determine performance of inference under the multispecies coalescent are as follows: n : The number of species. Another factor which may influence *BEAST performance is whether the molecular evolution of each locus has been more or less clock-like. Of all these parameters it is the number of loci n l , the number of sites in a single locus n s , and the number of individuals per species n i that are largely determined by experimental design. In addition, a complete specification of a multispecies coalescent model requires a speciation model (parameterized model of the species tree), a substitution model (model of the relative rates and base frequencies), and a clock model describing the absolute rate of evolution across the branches of each gene tree. In the following sections we describe the choices of parameters, models, and simulation conditions for our computational experiments.
Species and gene trees for all experiments were simulated using biopy (http://www.cs.auckland.ac. nz/∼yhel002/biopy/, last accessed December 25, 2015), which simulates gene trees contained within species trees according to the multispecies coalescent process. Sequence alignments were also simulated using biopy for experiments 1 and 2, and Seq-Gen (Rambaut and Grass 1997) was used to simulate nucleotide alignments for experiment 3.

Experiment 1: Performance of *BEAST with Increasing
Numbers of Loci The first set of simulations we performed was primarily aimed at understanding the effect that increasing the number of loci has on the computational performance and statistical accuracy of Bayesian species tree estimation. We simulated 100 random (rapidly speciating) species trees of each of three different sizes, n = 5,8,13, using the birth-death process (Kendall 1948;Nee et al. 1994;Gernhard 2008). In all cases, the speciation rate was = 1 and the extinction rate was = 0.2 (nominally per million years). For 5-species trees we considered n i = 2,4,8, for 8-species trees n i = 2,4 and for 13-species trees n i = 2. For each combination of n and n i we simulated up to 256 gene trees. Gene alignments were simulated from these gene trees 383 using an HKY substitution model (Hasegawa et al. 1985) and a strict clock. All sequences were simulated with a substitution rate of 1% per lineage per million years, a transition/transversion ratio of 4, equal base frequencies and a strict clock. For each *BEAST analysis, the substitution rate was fixed at 1%, and a single value and set of base frequencies for all loci was estimated. The locus length was 200 sites each to mimic short-read nextgeneration sequence data. Finally, we drew successively larger subsets of each group of alignments to form a set of *BEAST analyses (Heled and Drummond 2010). We considered increasing numbers of loci on a logarithmic scale, that is n l ∈{2,4,8,16,32,64,128,256}.
If the effective sample size (ESS) of either the log posterior or the age of the species tree in an analysis was not ≥200 after the initial MCMC chain was completed, we used the resume function in BEAST 2 (Bouckaert et al. 2014) to extend the MCMC chain from the final state of the previous run, until sufficient samples were obtained to achieve a minimum ESS of 200. For each combination of n l , n and n i , MCMC chains were resumed until at least 90 out of 100 replicates had sufficient ESS values. All statistics and trees were logged at a sampling rate of 1 sample per 25,000 states, and the MCMC chains that needed extension were combined into a single long chain. Pseudocode for the experimental protocol can be found in Algorithm S1 in Supplementary Material on Dryad at http://dx.doi.org/10.5061/dryad.02tf9.
ESS per hour was not calculated using the total CPU time for the combined chain because resumed runs were not restricted to a single type of CPU and hence were not directly comparable. Instead, the initial MCMC chain for each condition and replicate was restricted to a single type of CPU (Intel E5-2680 @ 2.70 GHz), and million states per hour of CPU time was calculated based on the number of states and CPU time of the initial chain. To calculate ESS per million states, the ESS of the age of the species tree was divided by the million post-burnin states in the combined chain. To calculate ESS per hour, ESS per million states was multiplied by million states per hour. All replicates were used to calculate average ESS rates, including those with ESS values <200.
The main measure of error used in this study, "relative species tree error," incorporates both topological and branch length error by building on the previously described measure "rooted branch score" (RBS; Heled and Bouckaert 2013). Given two trees T 1 and T 2 , the sets of monophyletic clades c present in each tree are defined as C 1 and C 2 . The length of the branch which extends rootward from the most recent common ancestor (MRCA) of a clade is defined as b(c). Given these definitions, the rooted branch score is defined as the sum of all absolute differences in branch lengths b(c) between trees T 1 and T 2 : By convention, the branch length of a clade that is missing from a tree is zero, so the topological error of absent or erroneous clades will be weighted by the true or estimated branch length respectively. We define the relative species tree error e T to be the posterior expectation of the rooted branch score distance RBS between the estimated species treeT and the true species tree T true , normalized by the tree length of the true species tree L true : This measure summarizes the error over the entire posterior distribution by averaging the RBS for each i posterior sampleT i drawn from the entire set of posterior samples of size k. We normalize by the length of the true species tree to make the error comparable between species trees of differing units and/or number of species. Replicates with insufficient ESS values were excluded when calculating average relative species tree error, because the posterior distributions of species trees for those replicates might be inadequately sampled.
A post hoc analysis was performed to investigate the residual variation in ESS rates and relative species tree error, after accounting for the number of loci, individuals and species in each replicate. Spearman's rank correlation was used to calculate correlation coefficients between the residuals and various tree and alignment parameters. P-values for each correlation were computed using asymptotic t approximation, and then corrected for multiple comparisons based on 48 tests per set of residuals (Benjamini and Hochberg 1995).
Mean population size was calculated as the mean of all per-branch effective population sizes. Species tree asymmetry is the variance 2 N in the number of nodes between each tip and the tree root (Kirkpatrick and Slatkin 1993). Mean tree height difference is the mean difference in height between each gene tree and the species tree. Mean deep coalescences is the mean number of deep coalescences for each gene as calculated by DendroPy 4.0.3 (Sukumaran and Holder 2010). The mean parsimonious mutations is the parsimonious (minimum) number of mutations required per site given the true gene tree, again calculated by DendroPy. Mean variable site count is the mean number of sites per locus with more than one extant allele, and mutations per variable site is the total number of parsimonious mutations required divided by the total number of variable sites.
Experiment 1 was performed using the Pan cluster provided by New Zealand eScience Infrastructure and hosted at the University of Auckland (http://www.eresearch.auckland.ac.nz/en/centre-foreresearch/research-facilities/computing-resources.html, last accessed December 25, 2015). This high performance compute cluster provides access to Linux compute 384 SYSTEMATIC BIOLOGY VOL. 65 nodes with 2.7 and 2.8GHz Intel Xeon CPUs, and approximately 8 GB of RAM per CPU core.

Experiment 2: Comparing a Bayesian Multispecies Coalescent Approach with a Bayesian Supermatrix Approach
In the second set of simulations, we compare the statistical accuracy of the multispecies coalescent to partitioned concatenation, both as implemented in BEAST 2. We refer to these methods as *BEAST and Bayesian supermatrix respectively. Specifically we tested the hypothesis that the comparative accuracy would depend on mean branch length in coalescent units of (2N e ) −1 .
For every combination of n = 4,5,6,8 and n l = 1,2,4, we simulated species trees with a range of branch lengths in coalescent units. In order to vary branch lengths, species trees were simulated with expected root heights of R = 1 2 ,1,2,4,8,16 (nominally in millions of years) and population sizes chosen from N e = 1 4 , 1 2 ,1 (nominally in units of million individuals), changing the coalescent branch length unit numerator and denominator respectively. Additional expected root heights were included where the most accurate method switches from *BEAST to Bayesian supermatrix, to obtain denser sampling in that part of parameter space.
Species trees were generated under the pure birth Yule model (Yule 1924). The birth rate for each combination of parameters was set to = 1 R n k=2 1 k , that is, the birth rate which generates trees with an expected root height of R. These settings roughly correspond to mammalian nuclear genes of species with an effective population size of one-quarter, one half or one million individuals.
A single individual per species was simulated for all loci. We used the Jukes-Cantor substitution model (Jukes and Cantor 1969) and a strict clock model for each locus, but with rate variation between loci. The mutation rate for the first locus was fixed at 0 = 0.01, and the rates for other loci drawn from the range [ 0 /F, 0 ×F]. We used F = 3, giving a factor of 9 between the fastest and slowest possible rates. The rate was drawn in log space, so there is equal density of slower and faster rates around 0 . The number of sites per alignment (n s ) was fixed at 1000.
We generated 100 replicates for each combination of n, n l , R and N e . For each unique combination of n, R and N e only one set of 100 species trees was generated and used (regardless of n l ) to minimize species tree sampling error when analyzing the effect of increasing n l . Gene trees and extant sequences were generated separately for each replicate and for each value of n l .
Both Bayesian supermatrix and *BEAST analyses used a Yule prior on the species tree, with a uniform prior of [ 1 / 100 ,100] on , and a separate partition per locus each with a strict clock model, where the clock rate of the first partition was fixed to the truth ( 0 ) and the other rates were estimated. The *BEAST effective population size hyperparameter (popMean) was given a uniform prior in the range [ 1 5 ,5], and all population sizes were estimated.
The Bayesian supermatrix analysis used a fixed chain length of 4 million states, sampling every 1000 states. The *BEAST analysis used a fixed chain length of 40 million states, sampling every 10,000 states. The ESS values of the posterior, likelihood and prior statistics of each chain were estimated, and replicates where the ESS was <200 for any of those statistics were discarded. For each combination of n, n l and method there were never more than 4% of replicates discarded for this reason ( Figure S10 in Supplementary Material available on Dryad). As with experiment 1, this experiment was performed using the NeSI Pan cluster.

Experiment 3: Many-method Comparison of Species Tree
Inference using Parameters Estimated from Two Phylogenomic Data Sets The purpose of the third set of simulations was twofold: to check that the trends in statistical accuracy observed for the first two sets of simulations held for empirically derived simulations, and to compare statistical accuracy across a range of species tree inference methods. To simulate more realistic trees and sequences, we derived a range of properties and phylogenetic parameters from two empirical phylogenomic data sets for use as simulation parameters.
The biallelic species tree inference method SNAPP (Bryant et al. 2012) was used to estimate speciation birth rates and effective population sizes because it did not require phasing the sequence data. To estimate base frequencies, substitution rates, between-site rate variation, and between-locus rate variation, we used a Bayesian supermatrix analysis with a Yule prior on the species tree. A detailed description of sequence data processing and SNAPP and BEAST settings is given in Supplementary Material available on Dryad.
We simulated 100 replicates each of "deep" and "shallow" Yule species trees of n = 12 and n = 8 respectively, using the inferred empirical birth rates, with per-branch population sizes picked from a gamma distribution of shape 2 and a mean equal to the mean inferred population sizes. For the deep species trees we simulated 512 gene trees, and for the shallow species trees we simulated 4096 gene trees within each species tree, each with two individuals per species.
For each simulated gene tree, we chose a strict clock rate from the gamma distribution defined by the inferred shape parameters and scale parameters. Nucleotide sequences were simulated for every locus using the empirically derived GTR+G base frequencies, substitution rates, and gamma rate variation from the applicable study. As the shallow study used 64nt RAD tags, we picked that fixed length for sequence 385 simulations based on that study. For simulations based on the deep study, each simulated alignment length was randomly sampled (with replacement) from the original alignment lengths of the deep study.
All GTR+G rates were estimated for *BEAST and Bayesian supermatrix analyses. For RAxML analyses, only GTR+G substitution rates were estimated and empirical base frequencies were used. Clock rate distribution parameters and clock rates for each locus were estimated for *BEAST and Bayesian supermatrix analyses. Loci were not partitioned for RAxML analyses, so per-locus clock rates could not be estimated for that method. The RAxML maximum likelihood algorithm used was "new rapid hillclimbing." Pairwise distances matrices calculated by RAxML were used to generate neighbor-joining trees using the BIONJ algorithm implemented in PAUP* version 4.0a142 (http://paup.csit.fsu.edu/, last accessed December 25, 2015). *BEAST and BEAST trees are implicitly rooted because they are ultrametric, and RAxML and BIONJ trees were midpoint rooted.
MP-EST uses gene trees as input data, which were inferred using RAxML. The same settings used for RAxML species tree inference were used for gene tree inference, and gene trees were midpoint rooted. For each replicate MP-EST was set to make 10 independent runs, and the species tree with the highest pseudo-likelihood was retained for further analysis.
The BEAST and *BEAST chains were run on the Raijin cluster provided by the National Computational Infrastructure (http://nci.org.au/systems-services/ national-facility/peak-system/raijin/, last accessed December 25, 2015). This cluster provides access to Linux compute nodes with 2.6 GHz Intel Xeon Sandy Bridge CPUs, and 4 GB of RAM was requested per run. Further details of BEAST and *BEAST chains are provided in Supplementary Material available on Dryad. RAxML and MP-EST were run on the cluster provided by the Genome Discovery Unit of the Australian Cancer Research Foundation Biomolecular Resource Facility. Jobs on this cluster ran on Linux compute nodes with a variety of Intel Xeon and AMD Opteron CPUs, and 2 GB of RAM was requested per RAxML or MP-EST job.

Experiment 1: Performance of *BEAST with Increasing Numbers of Loci
Computational performance.-We evaluated the scaling of computational performance of *BEAST as a function of the number of loci analyzed. We recorded the elapsed computational time for each replicate analysis running in a single thread. This was then used to calculate the effective number of samples per hour (ESS per hour), to measure the computational effort required to produce a sample from the posterior for a given number of loci. The ESS per hour relationship (Fig. 1a, S3 in Supplementary Material available on Dryad) suggests that a power law fits the scaling of computational performance. The linear relationship in the log-log plot indicates that a power law fits well for the range from 32 to 256 loci. We extrapolate that for n = 5, n i = 2 and n l ≥ 32, ESS per hour follows a power law with a slope and intercept of −3.06±0.04 and 16.34±0.18, respectively.
Applying this functional relationship, we could estimate the computational cost to analyze a similar data set with a larger number of loci. For example, given 5 species and 2 individuals in the simulation, the predicted ESS per hour is 0.54 for 256 genes, which indicates it would take approximately 369 CPU hours to attain an ESS of 200. We can therefore estimate that a similar analysis of 1024 loci would take roughly 1064 CPU days. Nevertheless, an analysis of this size might be achieved within 2 months by parallelizing the problem into 20 independent MCMC chains for 2 months each and discarding a few days of burnin from each of them, to achieve on the order of 10 independent samples from each chain.
Variation in ESS per hour between replicates was observed under all tested conditions ( Figure S3 in Supplementary Material available on Dryad). The slowest replicate relative to the median rate for any condition was a 5 species, 2 individuals and 256 genes outlier, 94× slower than the median rate for that combination (Fig. 1a). This replicate would require approximately 1500 CPU days to attain an ESS of 200. However, this was an extreme case as the next slowest replicate for that combination was another outlier only 6.4× slower than the median rate, and would require only 100 CPU days to attain the same ESS value.
The slope of the expected computational performance as a function of number of loci does not vary with the number of species or the number of individuals (Fig. 1b), although a larger range of n and n i would need to be examined to understand the scaling relationship of computational performance with those quantities. For analyses larger than 5 species and 2 individuals, the power law range appears to begin at n l ≥ 16. Combining all simulation results, a multiple linear regression describing a response variable Y (e.g., ESS per hour) as a function of three explanatory variables: number of loci n l , number of species n, and number of individuals per species n i , can be constructed Solid shapes indicate the median value for each category, and regression lines were calculated using all replicates for each category. c) Relative error for 5 species each with 2 individuals, with each box-and-whisker showing the variance in relative error between replicates. Numbers above the graph area indicate how many replicates were included for each number of loci. d) The relative error in the estimated species tree as a function of the number of loci, with trend lines for each combination of number of species and individuals per species. Solid shapes indicate the median value for each category, and regression lines were calculated using all replicates for each category with sufficient ESS. as follows: log(Y) = 1 log(n l )+ 2 n+ 3 n i +. ( Taking the ESS per hour as the response variable, the linear regression estimates of the coefficients are 1 =−2.81±0.02, 2 =−0.42±0.01, 3 =−0.46±0.01, and the intercept is = 17.98±0.13. At least within the range of parameters examined here, it appears that the 1 coefficient is not greatly influenced by n and n i (Fig. 1b).
We also considered the scaling of the number of effective samples per million states (ESS per million states) in the MCMC analyses. This quantity is complementary to our first result; it is easier to investigate as it does not require running all simulations on identical and dedicated hardware. Computational time for methods like *BEAST is dominated by the phylogenetic likelihood, which is calculated for all site patterns given a proposed tree (Yang et al. 1994). Because *BEAST infers a separate gene tree for each locus, the time per state will be linear with the number of loci assuming the average number of site patterns per locus is independent of the total number of loci. This assumption of independence holds for experiment 1 because loci were subsetted uniformly.
Adapting the terminology of Equation (3), the slope of ESS per hour ( 1h ) will be simply related to the slope of ESS per million states ( 1s ): 1h = 1s +1. However because CPU time per site pattern depends on the specific hardware employed, the intercept of ESS per hour ( h ) cannot be predicted from that of ESS per million states ( s ).
As expected, ESS per million states also exhibits a power law in the number of loci ( Figure S4 in Supplementary Material available on Dryad). By assigning the ESS per million states to Y in the multiple linear regression in Equation 3, the estimated coefficients are 1 =−1.87 ±0.02, 2 =−0.28±0.01, 3 =−0.24±0.01, and the estimated intercept is = 9.07 ±0.12. The difference in slope between ESS per million states and ESS per hour is (−1.87)−(−2.81) = 0.94, very close to 1 as predicted. As with ESS per hour, observations used for the linear regression were restricted to n l ≥ 32 for the 5 species, 2 individual case and n l ≥ 16 for other cases.
Using the example of 5 species and 2 individuals, the slope and intercept are −1.97 ±0.04 and 7.86± 0.18 respectively, so the predicted ESS per million states for 256 individuals is 0.047 ( Figure S4a

Statistical accuracy.-
We also calculated the relative error in the species tree estimate for each replicate. For some larger analyses it was challenging to achieve acceptable ESS values for every replicate, even with chain lengths of several billion states and access to highperformance computational infrastructure. To retain the larger analyses without biasing statistical accuracy, we excluded replicates in which the ESS of either the log posterior or the species tree age was smaller than 200. All remaining replicates were used for a linear regression analysis of the contribution of the number of loci to relative species tree error. This analysis revealed a power law relationship from 2 to 256 loci (Fig. 1c, S5 in Supplementary Material available on Dryad). Given 5 species and 2 individuals, the slope and intercept are −0.435±0.007 and −0.889±0.026 respectively, so the relative species tree error predicted by the power law for 256 loci is 0.037. By extrapolation, we would therefore estimate that the relative error of a 1024 loci analysis would decrease to 0.037 × 1024 256 −0.435 ≈ 0.020.
Linear regression analysis of relative species tree error for all combinations of n and n l showed little variation in the trend line slope between conditions (Fig. 1d). By assigning the relative species tree error to Y in the multiple linear regression in Equation (3) Material available on Dryad. Trends in topology-only accuracy inferred using rooted Robinson-Foulds (rRF) scores are also presented in Supplementary Material as Dryad, Figure S9 and Table S12 available on Dryad. Finally, we also analyzed the number of species tree topologies sampled in each posterior distribution. It appears that for the analyses involving 8 and 13 species there is a rapid reduction in the number of topologies in the 95% credible set with increasing numbers of loci, but it does not follow a power law ( Figure S7 in Supplementary Material available on Dryad).
Post hoc analysis of convergence and species tree error.-Experiment 1 was designed to investigate the relationship between the number of loci n l , number of species n and number of individuals n i on ESS rates and statistical accuracy. Although these variables explained most of the variation in ESS rates and accuracy, residual variation was present between the 100 replicates of each combination of n l , n and n i (Fig. 1a and c). The correlations between this residual variation and a collection of phylogenetic statistics that could be extracted from the simulated trees and alignments were studied in a post hoc analysis.
The only tree or alignment statistic that was significantly correlated with ESS per hour consistently across all conditions was mean tree height difference (Table 1). This statistic is the mean difference in height between each gene tree and the species tree. The positive correlation observed for this parameter suggests that when gene trees are taller relative to the species tree, the ESS rate will be higher and *BEAST will converge more quickly.
In contrast to ESS per hour, several statistics were consistently significantly correlated with relative species tree error ( Table 2). The height of the species tree and the number of variable sites per locus were negatively correlated with relative error. This result is somewhat intuitive, as taller species trees will have longer branches which are easier to resolve, and the number of variable sites is an obvious proxy for the amount of information in each locus. Relative error was positively correlated with the mean number of deep coalescences and the number of mutations per variable site. Those correlations suggest that data sets with more incomplete lineage sorting will be more difficult to resolve, and that saturated sites may increase uncertainty.

Experiment 2: Statistical Accuracy of *BEAST Relative to
Bayesian supermatrix To assess the statistical accuracy of the *BEAST relative to the standard Bayesian supermatrix approach, we conducted a simulation study where we simulated species trees with a broad range of mean branch lengths for varying numbers of species and loci. Gene coalescences occur prior to species divergence times, and the severity of this discrepancy will depend on species tree branch lengths in units of coalescent time. Because the multispecies coalescent accounts for this 388 SYSTEMATIC BIOLOGY VOL. 65  phenomenon but the Bayesian supermatrix approach does not, we expected the multispecies coalescent to outperform the Bayesian supermatrix approach for trees with shorter branch lengths. The "species tree error ratio" e T a /e T b is a measure of the comparative accuracy and is specified as follows, where a is *BEAST and b is Bayesian supermatrix: Values below 1 indicate lower error, or equivalently superior accuracy, when using *BEAST instead of Bayesian supermatrix. For all numbers of species tested, the statistical accuracy of *BEAST was superior to Bayesian supermatrix for trees with shorter mean branch lengths (Fig. 2). Using LOESS regression, it is clear that as the number of loci increases, *BEAST performance improves relative to Bayesian supermatrix because for a given mean branch length, the species tree error ratio decreases as the number of loci increases (Fig. 2).
For all numbers of species and loci tested, there is a mean branch length crossover point where for shorter mean branch lengths, *BEAST is expected to outperform Bayesian supermatrix, and vice versa for longer mean branch lengths. The crossover point depends on the number of loci; as the number of loci increases, the point shifts right (Fig. 2), indicating that *BEAST is expected to outperform Bayesian supermatrix for a larger range of mean branch lengths, consistent with the general trend of improved performance of *BEAST when increasing the number of loci.
Within the parameter region explored in this experiment, depending on the number of species, loci and the effective population sizes, the crossover point was found in the range 0.382(2N e ) −1 to 5.416(2N e ) −1 ( Figure S11 in Supplementary Material available on Dryad). For mean branch lengths shorter than 0.382(2N e ) −1 , *BEAST was preferred regardless of the parameters explored, even when using a single locus (Fig. 2). The crossover point given a single locus was always below 0.5(2N e ) −1 ( Figure S11 in Supplementary Material available on Dryad) and given longer mean branch lengths the relative performance of Bayesian supermatrix was higher than for multilocus inference (Fig. 2). This implies that *BEAST is still useful for single-locus studies of species trees with short branches, but should be applied with caution.

Experiment 3: Inferred Parameters of Phylogenomic Data
Sets and Multi-method Comparison Sequence data sets from two published studies were realigned and reanalyzed to calculate their empirical properties and phylogenetic parameters. Besides the expected difference in speciation rate (which for the shallow study rate was over six times faster, corresponding to much shorter branch lengths), the shallow plant study sequences were very AT rich, whereas the deep primate study sequences were moderately GC rich (Table 3). C T substitutions were FIGURE 2. Species tree error ratio (*BEAST/BEAST) as a function of the average species tree branch length (in coalescent units) for trees of 4, 5, 6, and 8 species. Data points are below 1 (black line) where the *BEAST error is lower than the BEAST error, indicating that *BEAST was more accurate than BEAST. Data points above 1 show the opposite. Only results with both mean branch lengths and error ratios between 0.1 and 10.0 are included. The red, green, and blue lines show the local regression for one, two and four locus estimates, respectively. The shaded region indicates where the crossover point depended on the combination of simulation parameters chosen-*BEAST was always preferred for average branch lengths shorter than this zone. a greater proportion of all substitutions for the deep study, but the between-site gamma rate variation was flatter. The mean effective population size N e of the deep study was estimated to be only 2.4% that of the shallow study.
The original publication of Cyathophora sequences and phylogeny suggested that P. rex subsp. rockii is sister to subsp. rex and subsp. lipskyana (Eaton and Ree 2013). The most common species tree topology seen in both SNAPP and Bayesian supermatrix posterior distributions supports this placement (Figures S16 and S17 in Supplementary Material available on Dryad). The original study left open the question of P. thamnophila monophyly but raised the possibility that the apparent paraphyly of this species, as replicated by our reanalysis, is an artifact of introgression (Eaton and Ree 2013). Species trees inferred by SNAPP and Bayesian supermatrix from reanalysis of the deep phylogenetic study ( Figure S18,S19) agreed with the accepted primate phylogeny (Perry et al. 2012 All inferred parameters are rounded to three significant figures or one decimal place, whichever is more precise.

Analysis of empirical-based simulations.-
We simulated species trees, gene trees, and sequences based on the estimated parameters of both data sets (Table 3), and refer to these simulations as shallow and deep phylogenetic simulations respectively. The mean branch length of the simulated shallow species trees was 0.539(2N e ) −1 , compared with 159.8(2N e ) −1 for the simulated deep species trees. We computed the relative species tree error for all *BEAST analyses of these simulations.
The relative species tree errors for all values of n l and n i considered were computed for both simulation types. A power law appeared to fit the relationship between relative error and number of loci for values of n l ≥ 2, so log-log linear regression analyses were restricted to n l ≥ 2. The log-log slope connecting relative error and the number of loci appears mostly independent of n i for shallow phylogenetic simulations. For deep simulations, the trend lines for n i = 1 and n i = 2 were very close, implying that multiple individuals did not improve accuracy for those simulations (Figure 3).
This result is consistent with the initial set of simulations reported in "Statistical accuracy." However, the log-log slopes varied substantially between *BEAST inference of shallow and deep phylogenetic simulations. The difference in power law exponents inferred using multiple linear regression (Tables S13 and S14 in Supplementary Material available on Dryad) between shallow and deep simulations was (−0.365)−(−0.568) = 0.203.
Results from the initial simulation study, detailed in "Computational performance," suggest that a power law relationship of ESS and number of loci only applies to *BEAST analyses of 16 to 32 loci and above. As we only inferred deep phylogenetic trees utilizing up to 8 The relative species tree error as a function of the number of loci for empirical-based simulations. Both shallow and deep phylogenetic simulation results are presented. Solid and hollow shapes are the median value for each category, and regression lines were calculated using all replicates for each category.
loci and shallow phylogenetic trees up to 32 loci using *BEAST, we cannot make firm conclusions regarding the scaling laws of ESS performance using this set of simulations.

Alternative methods for multi-locus phylogenetic inference.-
The second analysis we conducted based on the empirically derived shallow and deep phylogenetic simulations was a comparison of common multi-locus methods of species tree inference. This encompassed the Bayesian multispecies coalescent (*BEAST), Bayesian supermatrix (BEAST), Maximum-likelihood supermatrix (RAxML), neighbor-joining (BIONJ), and summary coalescent (MP-EST) methods. As some methods provide only a single best tree estimate in place of a posterior distribution of trees, we used common ancestor summary trees (CAT; Heled and Bouckaert 2013) for *BEAST and Bayesian supermatrix analyses in this comparison.
Based on relative species tree error, *BEAST outperformed all other methods for any given number of loci for the shallow simulations. The statistical accuracy of Bayesian supermatrix, RAxML and BIONJ all plateaued beyond 64 loci for the shallow simulations, whereas *BEAST appears to follow a power law as previously suggested (Fig. 4a). The statistical accuracy of all methods improves with increasing numbers of loci for the deep simulations, however we limited the simulations to a maximum of 8 loci when running *BEAST. The statistical accuracy of all methods tested was similar up to 8 loci, but for larger numbers of loci Bayesian supermatrix analysis was superior and BIONJ was inferior to RAxML (Fig. 4b).
A major factor causing the poor performance of methods other than *BEAST for the shallow simulations is a bias when estimating pendant edge (also known as leaf or tip) length. Although the mean bias of estimated pendant edge length trends towards zero for *BEAST,  other methods converge on a bias of approximately 350%, meaning estimated pendant edges are on average 4.5× the true length (Fig. 4c). In contrast, there is only a small positive bias using methods other than *BEAST for the deep simulations (Fig. 4d).
Relative species tree error incorporates both topological error and branch length error. To separate these two components, we calculated the mean rRF score as a measure of purely topological error-estimated topologies more distant from the truth will have higher rRF scores. For shallow simulations, *BEAST was the best-performing method, and the topological accuracy of both *BEAST and MP-EST was improved given two individuals per species (Fig. 4e). For deep simulations, all methods other than *BEAST and MP-EST converged at near-zero topological error given 512 loci (Fig. 4f). *BEAST was limited to a maximum of 8 loci, but its performance for a given number of loci was very close to Bayesian supermatrix. The topological accuracy of MP-EST was inferior to all other methods analyzed.

DISCUSSION AND CONCLUSIONS
We have demonstrated by simulation that the multispecies coalescent (as implemented in *BEAST) can be applied to some problems involving hundreds of loci. In order to analyze the performance of *BEAST with hundreds of loci under various conditions, with 100 replicates per condition and given finite computational resources, we made choices partly based on computational expediency. These included relatively limited numbers of species and individuals, and assuming a strict molecular clock. More complexity in the sense of more parameters to estimate, for example denser taxon sampling or relaxed clocks, would be expected to require more computational time than the analyses reported here.
Researchers studying the evolutionary histories of organisms are not burdened by the need to test hundreds of replicates across many conditions, and can therefore conduct larger analyses using *BEAST. For example, a recent study of Neotropical cotingas (Cotingidae: Aves) applied *BEAST to resolve a species tree of 67 extant bird lineages, and used a lognormal relaxed clock for each locus with molecular rate calibrations to infer absolute divergence times. ESS rates for all logged statistics were greater than 200 and convergence was also confirmed graphically, demonstrating that *BEAST can be applied to real phylogenetic data sets with many taxa, and may also be used with a relaxed clock (Berv and Prum 2014).

Power Laws Describe *BEAST Scaling Behavior
For the various numbers of species, individuals and loci analyzed in this study, power laws could be used to describe the observed trends in computational performance of *BEAST, and in the statistical accuracy of the fully Bayesian multispecies coalescent. In terms of computational performance, this provides a benchmark for the efficiency of Bayesian MCMC approaches to inference under the multispecies coalescent. Our results are a product of the particular algorithm design decisions that the authors of *BEAST have made, and we hope that power law exponents can be improved upon by subsequent efforts to produce more efficient algorithms for inference under the multispecies coalescent model.
In contrast, the power law that describes the decrease in estimation uncertainty associated with inference of the species tree with increasing number of loci is a fundamental property of the model itself, and will hold regardless of the details of the algorithmic approach to inference under this model. It therefore represents a fundamental feature of the problem of species tree inference. With these results, it is possible to extrapolate what one might expect to achieve by expanding data from a small pilot study to a more comprehensive sample of the genomic material of a set of study species or individuals.
The decrease in relative species tree error given different numbers of species and individuals was investigated in experiment 1. Other phylogenetic parameters were fixed, including the locus length, substitution model and population size distributions. Possibly because of this, the variation in power law exponents was minimal. Experiment 3 in contrast compared shallow and deep phylogenies with larger and smaller population sizes respectively, and associated alignments of short fixed-length loci and longer variable-length loci respectively. Clock rate variation and substitution model rates also differed between conditions. Power law exponents did vary between experiment 1 and both the shallow and deep inferences in experiment 3; exponents were −0.433, −0.365 and −0.568 respectively. This is important because larger exponents imply a greater decrease in relative species tree error, so additional loci will lead to a larger improvement in accuracy of inferred species trees than with a smaller exponent.
Given a hypothetical pilot study of 16 loci, it may be of interest what the decrease in error would be for a full study of 256 loci. Because the number of loci in this scenario is increased 16 times, the reduction in relative species tree error of the full study compared with the pilot study would be 1.0−16 −0.433 ≈ 70% if the study is similar to experiment 1, 1.0−16 −0.365 ≈ 64% if it is similar to the shallow phylogenetic simulations, or 1.0−16 −0.568 ≈ 79% if it similar to the deep phylogenetic simulations. What these calculations should remind us about the power law relationship is that expanding data from 1 to 16 loci provides as great an increase in statistical accuracy as expanding from 16 to 256 loci. That is, for each subsequent locus added there is a diminishing return with regards to statistical accuracy.
The power laws describing computational performance can also be used to predict the increase in computational time and chain length required to achieve sufficient sampling of the posterior distribution.

393
In experiment 1, the power law coefficient for the log number of loci was −2.81 for ESS per hour and −1.87 for ESS per million states. Given the previous example going from 16 to 256 loci, the amount of time required for sufficient sampling of data sets similar to experiment 1 would increase by 16 2.81 ≈ 2408 times. The chain length (number of states) required would increase by 16 1.87 ≈ 180 times.
Some residual variation in ESS rates was observed after accounting for the number of individuals, species, and loci in each analysis. This was unsurprising as the operators used by *BEAST are stochastic (Höhna and Drummond 2012), so even when applied to the same data ESS rates are expected to vary between runs. Consistent with this expectation, the only nonstochastic contribution identified in our post hoc analysis was a moderate correlation between residual ESS per hour and the average gene and species tree height difference.
It is possible that the parameters which were kept constant in our analysis (e.g., the substitution rate, or the number of sites per loci, or the choice of a strict molecular clock) may change the relationship between the number of loci and computational performance or statistical accuracy. Given a sequence data set with substantially different properties from experiment 1, increasing the number of loci might have a smaller or larger effect on computational performance.

*BEAST Compared with Other Methods
A previous simulation study which analyzed the scaling behavior of *BEAST and other methods used just two species trees to report on topological accuracy given a range (5, 10, 25, and 50) of number of loci, and produced ambiguous results (Bayzid and Warnow 2013). Because we simulated a new species tree for each replicate, we are able to make more general observations regarding relative performance. As expected, the relative performance of *BEAST is higher when branch lengths are shorter. The relative performance of *BEAST is also higher as the number of loci is increased (Fig. 2).
The primary measure we chose to explore statistical accuracy, relative species tree error, incorporates both branch length and topological error. This measure is particularly relevant for molecular dating and downstream analyses of macroevolution and ecology. For example, the PD C measure of phylogenetic diversity and the BiSSE model of binary character influence on birth and death rates both assume accurate tree topologies and branch lengths (Maddison et al. 2007;Cadotte et al. 2008). When inferring species trees with shorter branch lengths, *BEAST using tens of loci outperformed supermatrix methods by this measure, even when other methods were able to utilize thousands of loci (Fig. 4a).
If instead branch lengths are irrelevant for a study, *BEAST still outperformed other methods for a given number of loci when inferring the topology of shallow species trees (Fig. 4e). However, when using thousands of loci, other methods were able to outperform *BEAST because *BEAST was restricted to tens of loci.
For certain species trees concatenation is statistically inconsistent (Roch and Steel 2015) and might not outperform *BEAST even when using thousands of loci. For deeper phylogenetic trees, *BEAST performed similarly to the Bayesian supermatrix method, which in turn was superior to RAxML given larger numbers of loci ( Fig. 4b and f). Unpartitioned concatenation is known to potentially change the branch lengths and topology of estimated trees relative to partitioned concatenation (Kainer and Lanfear 2015), so this difference may be due to method configuration rather than a quality of the statistical method employed (maximum likelihood). Regardless, as *BEAST requires substantially more computational time, concatenation methods may be preferable in this case.
Multispecies coalescent methods assume free recombination between loci, and no recombination within loci. Short sequences dispersed throughout a genome, including RAD tags, can be justifiably used with coalescent methods as violations of both assumptions are likely to be limited. However, shortcut coalescence methods like MP-EST suffer from high gene tree estimation error when applied to these short sequences (Mirarab et al. 2014a;Springer and Gatesy 2016). In our study, MP-EST was inferior to *BEAST and similar to concatenation when inferring shallow phylogenies using short, RAD tag-like sequences (Fig. 4e). When inferring deep phylogenies MP-EST was inferior to both *BEAST and concatenation (Fig. 4f), despite the longer loci used for those simulations.
Newer fast multispecies coalescent methods such as ASTRAL (Mirarab et al. 2014b) and SVDquartets (Chifman and Kubatko 2014) may perform better at inferring species tree topology-the latest iteration of ASTRAL is both faster and less sensitive to gene tree error than MP-EST (Mirarab and Warnow 2015). However because these methods compute unrooted species trees without branch lengths, they cannot be compared with other methods using relative species tree error or rRF scores.

Practical Implications for Applied Phylogenetics
Systematists can use the results of this study as a guide to choosing an appropriate phylogenetic method. If both a priori estimates or boundaries of root height (clade age) and extant effective population sizes are available for a particular study system, and the Yule process is a good fit for that system, an approximate estimate of branch length in coalescent units can be made before selecting a particular method.
Previous work has shown that the expected mean branch length of a Yule tree is equal to 1/2 (Steel and Mooers 2010). Under the Yule model this value is related to the expected root height: where R is the expected root height and H n is the n th harmonic number (where n is the number of species). The expected branch lengthb in coalescent units of (2N e ) −1 is therefore: The mean root height of the shallow simulations was 0.01315, and the mean of the reciprocal extant population sizes 1/N e was 302.05. The approximate branch length in coalescent units based on these averages is: 01315 H n −1 ·302.05 = 0.578. (7) This approximate value is quite close to the sample mean of simulated branch lengths; 0.539(2N e ) −1 . Based on the results of experiment 2, this value ofb is towards the lower bound of the crossover zone, and *BEAST will be preferred under most conditions (Fig. 2). As with experiment 1, parameters which were kept constant may move this crossover point to be more or less favorable to *BEAST.
The results of experiment 3 will inform researchers with access to phylogenomic data in the order of hundreds or thousands of loci on how to select an appropriate inference method. If branch lengths are at all important, either for reporting divergence times or for downstream analyses which require a species tree, using a subset of loci with *BEAST will be superior to using all loci with other methods tested for shallow phylogenies (Fig. 4a). If instead only the topology of the species tree is of interest, concatenation methods may be superior to fully Bayesian multispecies coalescent methods like *BEAST until improvements can be made to their computational performance ( Fig. 4e and f).

Open Questions in Phylogenomic Inference
Our results point to a number of areas for further research into the performance of species tree inference.
When using a single locus for species tree inference, experiment 2 shows Bayesian supermatrix analysis outperforming *BEAST for trees with longer branch lengths. This may be due to the population size priors used in *BEAST. However, our many-method comparison shows similar performance for both methods given species trees with long branch lengths. Because deep phylogenetic trees from experiment 3 were longer than the longest trees from experiment 2, this may point to a zone of intermediate branch lengths where *BEAST performs poorly given a single locus.
For all simulations we assumed a constant rate of speciation, however many lineages of life have undergone rapid radiations. It may be that when inferring species trees of clades containing ancient rapid radiations the performance of phylogenetic methods is closer to the shallow simulations than the deep simulations, and hence *BEAST becomes the preferred method.
Sequence alignments were generated and subsetted uniformly for all simulations regardless of the number of loci used for each analysis. In practice, researchers may reasonably choose longer, more informative loci when subsetting phylogenomic data sets for use with methods like *BEAST which are computationally intensive. This may improve the relative performance of *BEAST given a subset of the most informative loci relative to supermatrix or summary methods using thousands of loci.
However, whole proteins and transcripts can span genomic regions hundreds of thousands of nucleotides long, so recombination within loci will be common. The use of whole proteins or transcripts with coalescent methods has been dubbed "concatalescence" to reflect this violation Springer 2013, 2014). If these long sequences are instead split into their constituent exons, the assumption of free recombination between loci may be violated due to short intronic distances. Further studies are needed to resolve which violation is less harmful to statistical accuracy.