StarBEAST2 Brings Faster Species Tree Inference and Accurate Estimates of Substitution Rates

Abstract Fully Bayesian multispecies coalescent (MSC) methods like *BEAST estimate species trees from multiple sequence alignments. Today thousands of genes can be sequenced for a given study, but using that many genes with *BEAST is intractably slow. An alternative is to use heuristic methods which compromise accuracy or completeness in return for speed. A common heuristic is concatenation, which assumes that the evolutionary history of each gene tree is identical to the species tree. This is an inconsistent estimator of species tree topology, a worse estimator of divergence times, and induces spurious substitution rate variation when incomplete lineage sorting is present. Another class of heuristics directly motivated by the MSC avoids many of the pitfalls of concatenation but cannot be used to estimate divergence times. To enable fuller use of available data and more accurate inference of species tree topologies, divergence times, and substitution rates, we have developed a new version of *BEAST called StarBEAST2. To improve convergence rates we add analytical integration of population sizes, novel MCMC operators and other optimizations. Computational performance improved by 13.5× and 13.8× respectively when analyzing two empirical data sets, and an average of 33.1× across 30 simulated data sets. To enable accurate estimates of per-species substitution rates, we introduce species tree relaxed clocks, and show that StarBEAST2 is a more powerful and robust estimator of rate variation than concatenation. StarBEAST2 is available through the BEAUTi package manager in BEAST 2.4 and above.


Supplementary figures
Figure S1: Frequency of five-taxon species tree topologies sampled from a birth-death prior distribution. Topologies were sampled by simulating trees using biopy (red), or by using a StarBEAST2 MCMC chain (blue). Frequencies are identical apart from noise, indicating that StarBEAST2 is mathematically correct. Three levels of probability are evident from left to right -high probability balanced topologies e.g. ((a,b), ((c,d),e)), middle probability intermediate topologies e.g. (((a,b),(c,d)),e), and low probability unbalanced topologies e.g. ((((a,b),c),d),e).
Unique tree topology (ordered by frequency) Tree topology frequency

Simulated StarBEAST2
• 0 1 Figure S2: Frequency of five-taxon gene tree topologies sampled from a multispecies coalescent prior distribution. Topologies were sampled by simulating gene trees within species trees using biopy (red), or by using a StarBEAST2 MCMC chain (blue). Gene trees were sampled with two clock rates, 0.5 (circles) and 2.0 (triangles), although clock rate should not affect topology or node heights in units of time. Frequencies are identical apart from noise, indicating that StarBEAST2 is mathematically correct. Three levels of probability are evident from left to right -high probability balanced topologies e.g. ((a,b), ((c,d),e)), middle probability intermediate topologies e.g. (((a,b),(c,d)),e), and low probability unbalanced topologies e.g. ((((a,b),c),d),e).  Figure S3: Probability densities of five-taxon species tree node heights sampled from a birth-death prior distribution. Node heights were sampled by simulating trees using biopy (red), or by using a StarBEAST2 MCMC chain (blue). Probability densities are plotted separately for each ranked node giving four peaks, one for each internal node. Node height probability densities are identical apart from noise, indicating that StarBEAST2 is mathematically correct.  Figure S4: Probability densities of five-taxon gene tree node heights sampled from a multispecies coalescent prior distribution. Node heights were sampled by simulating gene trees within species trees using biopy (red), or by using a StarBEAST2 MCMC chain (blue). Gene trees were sampled with two clock rates, 0.5 and 2.0, and node heights from both sets of gene trees were combined as clock rate should not affect topology or node heights in units of time. Probability densities are plotted separately for each ranked node giving four peaks, one for each internal node. Node height probability densities are identical apart from noise, indicating that StarBEAST2 is mathematically correct. Branch substitution rate Density Simulated StarBEAST2 Figure S5: Probability densities of five-taxon gene tree branch rates produced by a species tree uncorrelated lognormal (UCLN) relaxed clock. Branch rates were sampled by simulating gene trees within species trees using biopy and simulating species tree branch rates using scipy (red), or by using a Star-BEAST2 MCMC chain (blue). Gene trees were sampled with two clock rates, 0.5 and 2.0, resulting in two peaks. Branch rate probability densities are identical apart from noise, indicating that StarBEAST2 is mathematically correct. Branch substitution rate Density Simulated StarBEAST2 Figure S6: Probability densities of five-taxon gene tree branch rates produced by a species tree uncorrelated exponential (UCED) relaxed clock. Branch rates were sampled by simulating gene trees within species trees using biopy and simulating species tree branch rates using scipy (red), or by using a Star-BEAST2 MCMC chain (blue). Gene trees were sampled with two clock rates, 0.5 and 2.0, resulting in two peaks. Branch rate probability densities are identical apart from noise, indicating that StarBEAST2 is mathematically correct. Figure S7: Number of erroneous per-species substitution rates estimated using BEAST concatenation and StarBEAST2. Erroneous branch rates are those with 95% highest posterior density (HPD) credible intervals which do not include the true rate of 1. Numbers above a branch are the counts of erroneous branches for concatenation, and those below are the counts for StarBEAST2, both out of 96 replicates. The first count is the number of branch rates inferred to be faster than 1, and the second is the number inferred to be slower than 1. Figure S8: Number of erroneous branch lengths estimated using BEAST concatenation and Star-BEAST2. Erroneous branch lengths are those with 95% highest posterior density (HPD) credible intervals which do not include the true simulated length for a given branch. Numbers above a branch are the counts of erroneous branches for concatenation, and those below are the counts for StarBEAST2, both out of 96 replicates. The first count is the number of branch lengths inferred to be longer than the true length, and the second is the number inferred to be shorter.  Figure S12: Estimates of species tree branch rates using unphased BEAST concatenation versus Star-BEAST2. Estimated rates are the posterior expectations of each branch rate from each replicate. Root branch rates, which were fixed at 1, were excluded. In blue are simple linear regression lines of best fit, and in red are the y = x lines showing a perfect relationship between estimates and truth. N = 30.         3 CoordinatedUniform proposal example

Supplementary tables
The CoordinatedUniform operator will pick one non-leaf, non-root species tree node uniformly at random to be moved, dubbed S. Given a species tree with the topology ((A,B),C),D there are two non-leaf, non-root nodes: the common ancestor of species A and B (AB) and the common ancestor of species A, B and C (ABC). We illustrate an example where the ABC node is chosen ( Figure S13).  Figure S13: Illustration of the CoordinatedUniform move. An example gene tree is embedded within a four-taxon asymmetric species tree. Internal gene tree nodes are numbered, species leaves lettered. The ABC node is represented by a star in the middle step. Dashed lines are connected component and ABC node child and parent branches.
The step after selecting a species tree node is to determine which internal gene tree nodes will also be moved. CoordinatedUniform moves every gene tree node s in every gene tree which meets all of the following requirements: 1. at least one descendant individual of s is also a descendant individual of the left child of S 2. at least one descendant individual of s is also a descendant individual of the right child of S 3. all descendent individuals of s are also descendent individuals of S Descendent individuals of a gene tree node s are gene tree leaf nodes which have s as an ancestor.
Descendent individuals of a species tree node S are gene tree leaf nodes belonging to any extant species which has S as an ancestor. In the example, the left child of S is the AB node, and the right child of S is the C node. The following internal nodes in the gene tree meet requirement (1) 12, 13, 14, 15, 16, 20, 22; requirement (2) 13, 16, 17, 21, 22; and requirement (3) 12, 13, 14, 15, 16, 17, 20. The only nodes in the example meeting all three requirements are 13 and 16. These nodes are connected by a gene tree branch, making them a single connected component. The connected component has three child branches (dashed red, yellow and blue lines) and one parent branch (dashed orange line).
To determine the lower bound for the move, we must first identify the shortest branch out of all connected component child branches and species tree node S child branches. In the example the connected component child branches have lengths of 3, 7 and 8.5, and the species tree node child branches have lengths 4 and 6 ( Figure S13). The lower bound is therefore the original species tree node height t(S) 3 = 6 3 = 3.
To determine the upper bound we use the shortest branch out of all connected component parent branches and the species tree node S parent branch. The parent branch of the single connected component has a length of 2.5, and the parent branch of S has a length of 4. The upper bound is therefore t(S) + 2.5 = 6 + 2.5 = 8.5.
A new species tree node height t 0 (S) between 3 and 8.5 is chosen uniformly at random. Within this range, if the connected component node heights are changed by ⌘ = t 0 (S) t(S), no topology changes are induced or required. In our example the new height is 5, so ⌘ = 5 6 = 1. After the height of the species tree node S and all connected component gene tree nodes are changed by ⌘ = 1 the proposal is complete ( Figure S13).