Complexity of the simplest species tree problem

Abstract The multispecies coalescent model provides a natural framework for species tree estimation accounting for gene-tree conflicts. Although a number of species tree methods under the multispecies coalescent have been suggested and evaluated using simulation, their statistical properties remain poorly understood. Here, we use mathematical analysis aided by computer simulation to examine the identifiability, consistency, and efficiency of different species tree methods in the case of three species and three sequences under the molecular clock. We consider four major species-tree methods including concatenation, two-step, independent-sites maximum likelihood, and maximum likelihood. We develop approximations that predict that the probit transform of the species tree estimation error decreases linearly with the square root of the number of loci. Even in this simplest case, major differences exist among the methods. Full-likelihood methods are considerably more efficient than summary methods such as concatenation and two-step. They also provide estimates of important parameters such as species divergence times and ancestral population sizes,whereas these parameters are not identifiable by summary methods. Our results highlight the need to improve the statistical efficiency of summary methods and the computational efficiency of full likelihood methods of species tree estimation.


Introduction
The multispecies coalescent (MSC) model (Rannala and Yang 2003) combines the phylogenetic process of species divergences with the population genetic process of coalescent and naturally accommodates delayed coalescence (also known as incomplete lineage sorting, Maddison 1997), the phenomenon in which gene sequences fail to coalesce in their most recent common ancestor but do so only in more ancient ancestors. Delayed coalescence causes the gene tree for a gene or genomic region to differ from the species tree and is the most important factor for gene-tree-species-tree discordance (Maddison 1997;Nichols 2001;Szollosi et al. 2015). The MSC provides a natural framework for estimating species trees accounting for genealogical heterogeneity among genes or across the genome (Edwards 2009;Xu and Yang 2016;Kubatko 2019;Rannala et al. 2020).
Two lines of research into the MSC have provided the foundation for species tree methods. The first concerns the probabilities of different gene tree topologies (Hudson 1983;Pamilo and Nei 1988) and algorithms for their efficient calculation given the species tree (Degnan and Salter 2005;Degnan and Rosenberg 2006). The gene tree distribution can be used in the two-step method of species tree estimation, by inferring gene trees for the individual loci and then applying maximum likelihood (ML) to counts of gene tree topologies (as in STELLS, Wu 2012). Nevertheless, widely used two-step methods, including ASTRAL (Mirarab et al. 2014) and MP-EST (Liu et al. 2010), are simpler, and estimate species trees for species triplets (assuming the molecular clock) or quartets (without the clock) and then assemble the subtrees to produce a species-tree estimate for all species. Studies of gene-tree probabilities led to the discovery of the anomaly zone, the region of the parameter space in which the most probable gene tree has a different topology from the species tree (Degnan and Salter 2005;Degnan and Rosenberg 2006). In the anomaly zone, the two-step method, which uses the most common gene tree as the species tree estimate, will be inconsistent.
The second line of research into MSC is the development of the joint probability distribution of the gene Mol. Biol. Evol. 0(0):1-15 doi:10.1093/molbev/mst012 Advance Access publication xxx 1 MBE tree and coalescent times (Rannala and Yang 2003). This forms the basis for exact methods of inference, including ML (Yang 2002;Dalquen et al. 2017) and Bayesian methods (Liu and Pearl 2007;Heled and Drummond 2010;Yang and Rannala 2014;Ogilvie et al. 2017;Rannala and Yang 2017). While heuristic methods use summaries of the data, exact methods use the multi-locus sequence alignments directly and naturally accommodate phylogenetic reconstruction errors and uncertainties (Xu and Yang 2016;Kubatko 2019;Rannala et al. 2020). Simulation has been used to examine the performance of different species-tree methods (e.g., Leaché and Rannala 2011;Mirarab et al. 2014;Chou et al. 2015;Xu and Yang 2016). A limitation of simulation is that it can examine only a small portion of the parameter space and the results often have limited applicability. Analytical results on the efficiency of different methods have been lacking. Here we analyze species tree estimation under the MSC in the case of three species, with one sequence from each species per locus. We focus on closely related species and assume the JC mutation model (Jukes and Cantor 1969) and the molecular clock. We are in particular interested in the efficiency of the various methods, measured by the probability of recovering the correct species tree.
We consider four inference methods: (i) ML (a full likelihood method under the MSC applied to the multilocus sequence alignments), (ii) 2-STEP (or majority-vote), (iii) concatenation (CONCAT), and (iv) independent-sites ML (ISML, also known as coalescent-aware concatenation or CONCAT) (Xu and Yang 2016). ML is the fulllikelihood method and calculates the likelihood function using the multi-locus sequence alignments or a sufficient summary. The 2-STEP method estimates the gene tree at each locus and then uses the most common gene tree as the species tree estimate. It does not account for the uncertainties in the estimated gene trees. For the case of three species considered here, 2-STEP is equivalent to the maximum pseudo-likelihood method (MP-EST) (Liu et al. 2010). Concatenation applies ML to the concatenated sequences, assuming that the same tree underlies all sites in the super alignment. In the case considered here, concatenation is equivalent to STEAC (Liu et al. 2009), which uses average coalescent times over loci as data to infer a gene tree, which is the species tree estimate. ISML (or CONCAT) estimates the species tree by ML under the assumption that all sites, both from the same locus and from different loci, have independent gene trees (Xu and Yang 2016). This was suggested as an improvement to SVDQUARTETS of Chifman and Kubatko (2014). All four methods considered here use ML, but the likelihood function is applied to different summaries of the same data. Here we refer to the fulllikelihood or full-data method as the ML method, while all other methods concatenation,and ISML) are considered heuristic summary methods: 2-STEP uses the (estimated) gene tree topologies, while concatenation and ISML use the site-pattern counts pooled across loci. We derive approximations to the error rate of species tree estimation by the different methods and assess their accuracy. We use the theory to characterize the differences in the use of information in the data by different methods.

MBE
The data consist of sequence alignments at m loci. Under the JC mutation model, the data at locus i can be summarized as counts of five site patterns: xxx, xxy, yxx, xyx, and xyz, where x, y, z are any three distinct nucleotides. Let those counts be x x j=0 x i j = n to be the number of sites (sequence length) at each locus. Let f i j = x i j /n be the frequencies. Let data at all m loci be Given the gene tree and coalescent times at locus i, the probability of the sequence data, f (x x x i |G i ,t t t i ), is given by the multinomial distribution for the five site patterns. For example, given gene tree G 1 with node ages t i0 and t i1 ( fig. 1b), the site-pattern probabilities, The probabilities for gene trees G 2 or G 3 are given by symmetry. Then the sequence data or the five site-pattern counts at the locus have the multinomial probabilities The ML method of species tree estimation The log-likelihood function for species tree S 1 with parameters θ θ θ 1 is given by summing over the gene trees and integrating over the coalescent times.
is the MSC density for the gene tree and coalescent times at locus i (eq. 2), and f (x x x i |G i ,t t t i ) is the probability of the sequence data at locus i given the gene tree (eq. 4). The log likelihood functions, 2 (θ θ θ 2 ) and 3 (θ θ θ 3 ), for S 2 (with parameters θ θ θ 2 ) and S 3 (with θ θ θ 3 ) are defined similarly.
Maximizing the log-likelihood function (eq. 5) with respect to the parameters will lead to a log-likelihood value for the given species tree, and the species tree that achieves the highest is the ML species tree. This is not analytically tractable. The program 3S implements the method by explicitly summing over the gene trees (G i ) and by using Gaussian quadrature to calculate the 2-D integrals over t t t i (eq. 5) (Yang 2002;Zhu and Yang 2012;Dalquen et al. 2017). This is used in simulations.
In this paper, ζ represents the error probability of species tree estimation. Thus the bounds Φ(−h) ≤ ζ < 2Φ(−h) suggest that when m → ∞, the probit transform of the species-tree error probability, Φ −1 (ζ ), where Φ −1 is the inverse CDF of N(0, 1), decreases linearly with √ m. For practical calculations for finite m in this paper, eq. 6 is more accurate (see Appendix A) and will be used later.
The error rate for the ML method (eq. 5) is analyzed in Appendix B. When the number of loci m → ∞, the MLEθ θ θ j → θ θ θ * j in species tree S j , j = 1, 2, 3. Note that S 1 represents the true model and θ θ θ * 1 are the true parameter values, while S 2 and S 3 are misspecified models and θ θ θ * 2 and θ θ θ * 3 are the best-fitting or pseudo-true parameter values. The Kullback-Leibler distance D 12 from S 2 to S 1 is where l j (θ θ θ * j ) ≡ log f (x|S j , θ θ θ * j ), with x to be one data point (or site pattern counts at one locus), and where the integral means summation over all possible data outcomes at a locus. We use the per-locus log-likelihood values to compare the three species trees:z j ≡ 1 m j (θ θ θ j ), j = 1, 2, 3. When m is large, these have the means E(z j ) ≈ E(l j (θ θ θ * j )) ≡ μ j , with μ 1 − μ 2 = D 12 , and the variance matrix 1 m Σ, where Σ = {σ jk } and σ jk ≡ Cov(l j (θ θ θ * j ), l k (θ θ θ * k )). The error of the ML method, e ML = P{ 1 (θ θ θ 1 ) < max( 2 (θ θ θ 2 ), 3 (θ θ θ 3 ))}, is then given by Theorem 1 as Eq. 11 cannot be used to calculate the error rate for ML as D 12 and σ jk are not easily computable. It predicts a linear relationship between Φ −1 (e ML ) and √ m. This is confirmed by simulation ( fig. 2a -c ).
Precise results may be obtained in special cases. In the case of one locus (m = 1), the ML gene tree is the ML species tree except for rare datasets: the true species tree S 1 is recovered if x i1 > max(x i2 , x i3 ). In rare datasets of extreme divergence, even if x i1 > max(x i2 , x i3 ), ties for gene trees are possible, with the star tree being as good as the binary trees (Yang 2000), while ML under MSC favors S 1 . One such dataset is x i = (4, 13, 12, 11, 50), in which case the three gene trees as well as the star tree achieve the same likelihood, while ML under MSC favors S 1 . However, such datasets involve sequences more divergent than random sequences have vanishingly small probability when n is large. Thus we ignore them and consider all methods to be equivalent when m = 1. With one locus, it is impossible to identify all parameters in the MSC model: there are four parameters and only three independent site-pattern frequencies ( f i0 , f i1 , f i2 + f i3 for S 1 , for example).
The case of one site per locus (n = 1) is analyzed later in the section on ISML. Numerical calculations on a model species tree are presented in table 1. They will be discussed later in comparison with other methods.
In the case of n → ∞, the gene tree (including the coalescent times) at each locus is given without errors. The likelihood is then the product of MSC densities of gene trees across the loci (eq. 2). This likelihood has singularities, with one or more species trees achieving infinite likelihood (Liu et al. 2010;Yang 2014). In the case of three species considered here, only one species tree (given by the smallest coalescent time) achieves infinite likelihood and will be the unique species-tree estimate, so that the estimation can proceed despite the singularity (Yang 2014, p.360, Problem 9.4). Let the smallest coalescent/divergence time between species across all loci be t ab ,t bc and t ca . If t ab is the smallest among the three, then species tree S 1 achieves infinite likelihood, by collapsing on the coalescent time t ab ; that is, 1 (θ θ θ 1 ) → ∞ asτ 0 =τ 1 = t ab andθ 1 → 0 (see eq. 2) (Yang 2014, p.338-9), while the other two species trees have finite likelihood.
Given S 1 as the true species tree, both t bc and t ca are > τ ABC ( fig. 1b). If sequences a and b coalesce in population AB at any of the m loci, t ab will be smaller than both t bc and t ca , and S 1 will be the ML species tree. Thus an incorrect species tree is inferred only if a and b do not coalesce in AB at any of the m loci and are not the first to coalesce in the root population ABC. Thus

Concatenation
Sequence alignments at the m loci are merged into a super-alignment of length nm, and the data are the sitepattern counts pooled across loci: The likelihood function is given by the multinomial probability of eq. 4 except that (Yang 1994b;2000). We discuss the error MBE rate of concatenation below in the section on the ISML method.
We also examine biases in parameter estimation using concatenation. We use species tree S 1 with τ ABC = 0.02, τ AB = 0.01, θ ABC = 0.02 and θ AB = 0.01 to simulate m = 10 4 loci each with n = 250 sites. We obtain MLEŝ t 0 andt 1 on gene tree G 1 from the concatenated data for comparison with the MLEsτ 0 andτ 1 on species tree S 1 in the MSC model (eq. 5). With so much data, both concatenation and ML recover the true tree with near certainty. The MLEs under the MSC (obtained using the 3S program) are very close to the true values, while concatenation (BASEML in PAML, Yang 2007) produced seriously biased estimates (table 2). Even the relative age,t 0 /t 1 = 1.92, differs from τ ABC /τ AB = 2, which means that molecular clock dating analysis using concatenated data will produce biased time estimates (Angelis and dos Reis 2015; Ogilvie et al. 2017;Tiley et al. 2020).

ISML
The ISML method assumes that all sites in the super-alignment are i.i.d. Like concatenation, the data are summarized as pooled site-pattern counts, However, ISML is coalescent-aware and uses the MSC model to calculate the probabilities for the site patterns. By averaging the conditional sitepattern probabilities of eq. 3 over the MSC density of gene trees and coalescent times of eq. 2, we derive the marginal site-pattern probabilities,p p p = (p 0 , · · · ,p 4 ), as and a 1 + c 1 , although these do not appear to permit simple biological interpretations. The cases for S 2 and S 3 are given by symmetry.
The likelihood function (or the probability for the pooled site-pattern counts) for each species tree is f

THEOREM 3. (a) If the true species tree is S
Proof . (a) Each of the marginal site pattern probabilitiesp j , j = 0, · · · , 4, is a sum over the four gene trees of figure 1b: G 1a , G 1b , G 2 and G 3 . The three gene trees G 1b , G 2 , and G 3 have the same densities (eq. 2). Together their contribution to the site pattern xxy is the same as that to the pattern yxx or pattern xyx. If the gene tree is G 1a (with any coalescent times t 0 > t 1 ), site pattern xxy will have a higher probability than yxx or xyx, with p 1 > p 2 = p 3 . Averaging over all the four gene trees, we Let q 1 =p 1 (S 1 ,θ θ θ 2 ) and q 2 = p 2 (S 1 ,θ θ θ 2 ), and we have (S 1 ,θ θ θ 2 ) > (S 2 ,θ θ θ 2 ). In other words, even if we useθ θ θ 2 (the MLE for S 2 ) to calculate the likelihood for species tree S 1 , tree S 1 will have a higher likelihood than S 2 . Sinceθ θ θ 2 may not be optimal for S 1 , it follows that (S 1 ,θ θ θ 1 ) ≥ (S 1 ,θ θ θ 2 ) > (S 2 ,θ θ θ 2 ).

Theorem 3 means that ISML infers species tree
To study the error rate for ISML (or CONCAT), let p i j , j = 0, · · · , 4 be the site-pattern probabilities at any locus i. Data at each locus are represented by the site-pattern At n = 1, they are given by the multinomial distribution as jk , can be generated by simulating gene trees with coalescent times and calculating the sitepattern probabilities (eq. 3) (table S1). This distribution is 3-dimensional (for f i0 , f i1 , and f i2 = f i3 under S 1 ), indexed by four parameters (θ θ θ 1 in S 1 ), and is a mixture distribution with 4 components corresponding to the four gene trees of figure 1b. It reflects the coalescent fluctuation in gene genealogies.
For any finite 1 ≤ n < ∞, the variances are given by jk are the variances/covariances over the coalescent process. These are calculated for a set of parameter values in table S1. The variances of f i j are thus weighted averages of variances at n = 1 and ∞.
The approximation e ISML ≈ ζ N (m,p 1 −p 2 , Σ) is very accurate, with errors < 0.002 in the simulation of table 1. At large n, accommodating correlation is useful as ζ N0 which ignores correlation is less accurate (see fig. 4 for the case of n = ∞). For example, the correlation 153, and −0.181 at n = 1, 1000, and ∞, respectively (table S1).
We now consider parameter estimation by ISML. Theorem 3 allows species tree estimation by ISML without knowledge of the MLE of the parameters. With data of x. j , j = 0, · · · , 4, there are only three observations (three free proportionsf 0 ,f 1 , andf 2 +f 3 in the case of S 1 ). As there are four parameters in the MSC model, it is impossible to identify all of them.
If we assume θ 0 = θ 1 = θ (as in Tian and Kubatko 2016), all three parameters (τ 0 , τ 1 , θ ) will be identifiable. As c 0 = c 1 = 0, eq. 13 simplifies tō where a 0 , a 1 and b are defined in eq. 13 with θ 0 = θ 1 = θ . By equating the observed site-pattern frequencies to their expected probabilities (eq. 17), we have This always has a unique positive root. Givenθ , the estimatesτ 0 andτ 1 are given by eq. 18, which are guaranteed to be positive. Thus under the assumption θ 0 = θ 1 , the ISML method provides estimates of the three parameters in the model: θ , τ 0 and τ 1 . As there is a one-to-one correspondence between the parameters and the multinomial proportions, the estimates are consistent and approach the true values when m → ∞ for any n ≥ 1 if the assumption of θ 0 = θ 1 is correct (table 3c&d). However, the pooled sitepattern counts or average site-pattern frequencies are summaries of the original data and are not sufficient statistics. It then follows that the ISML estimates will be less efficient and have larger asymptotic variances than the MLEs obtained from the full data under the same model assumption of θ 0 = θ 1 (table 3, case c). Furthermore, if θ 0 = θ 1 , assuming θ 0 = θ 1 will lead to biased and inconsistent parameter estimates even if the same species tree estimate is produced. In other words if θ 0 = θ 1 , the ISML method assuming θ 0 = θ 1 will produce a consistent estimate of the species tree and inconsistent estimates of the model parameters (table 3, cases e&f).

Two-step method (majority vote)
In the 2-STEP method we estimate gene trees at individual loci and then use the most common gene tree topology as the species tree estimate. Under JC, the ML gene tree for locus i (which is also the UPGMA tree) is tree G j if x i j is the largest among x i1 , x i2 , and x i3 (Yang 1994b;2000); site patterns xxy, yxx and xyx 'support' gene trees G 1 , G 2 , and G 3 , respectively. There is no need for numerical optimization to obtain the ML tree at each locus.
Let g 1 , g 2 and g 3 be the probabilities that the estimated gene tree is G 1 , G 2 and G 3 , respectively; that is, g 1 = P{x i1 > max(x i2 , x i3 )}, and so on. These are functions of all four parameters in the MSC model (θ θ θ 1 ) as well as the sequence length n, and can be computed numerically (Yang 2002, eq. 12) or by simulation. Under JC and the clock, g 2 = g 3 < g 1 < P(G 1 |S 1 , θ θ θ 1 ) (Yang 2002). This result has several implications. First, g 1 < P(G 1 ) means that phylogenetic errors inflate gene-tree-species-tree discordance and lead to underestimation of the internal branch length in the species tree (Yang 2002). Second g 1 < P(G 1 ) also means that use of estimated (rather than true) gene trees leads to reduced probability for recovering the correct species tree. Third, g 1 > g 2 = g 3 means that the 2-STEP estimate of the species tree is consistent even if estimated gene trees are used.
Let the number of loci at which G 1 is the ML tree be , where the indicator function I a = 1 if statement a is true and 0 otherwise. Similarly define m 2 and m 3 to be the counts for the two mismatching gene trees. The correct species tree is inferred if and only if m 1 > max(m 2 , m 3 ). Thus the error rate can be approximated by e 2-STEP ≈ ζ (m, g 1 , g 2 ) (eq. 8).
The accuracy of this approximation is assessed in table 1 at different values of n with m = 1000 and with parameter values τ 0 = 0.02, τ 1 = 0.019, θ 0 = 0.01, and θ 1 = 0.05. Consider first the case of n = 1. The gene tree is resolved if the single site at the locus has site patterns 1, 2, or 3, but is unresolved if the site has patterns 0 or 4. Whether we ignore loci with ties (with MBE site patterns 0 or 4) or break ties evenly (assigning 1 3 to each gene tree) does not affect the species tree estimate. Thus g 1 (1) =p 1 and g 2 (1) =p 2 (eq.13) and the error is e 2-STEP ≈ ζ (m,p 1 ,p 2 ). This is equivalent to e ISML ≈ ζ N (m,p 1 −p 2 , Σ) for ISML, consistent with the fact that at n = 1 all methods considered here are equivalent.
If n = ∞, the estimated gene trees will be the true gene trees so that g 1 = P(G 1 ) and g 2 = P(G 2 ). The error rate is then ζ (m, P(G 1 ), P(G 2 )) = ζ (1000, 0.3594737, 0.3202631) = 0.1132, close to 0.114 from simulation. At n = 1000, the proportions of estimated gene trees are g 1 = 0.33273 and g 2 = 0.30811, so that ζ (m, g 1 , g 2 ) = 0.264, close to 0.260 by simulation (table 1). These are much larger than 0.114 at n = ∞, suggesting that with n = 1000 sites in the sequence, the estimated gene trees have substantial errors and uncertainties.
The approximations ζ ZLY and ζ (eq. 8) give nearly identical results. The error rate is found to be very sensitive to the precise values of g 1 and g 2 . Overall, the approximation is good, with errors within or close to 1%.

Numerical comparison of different methods
We use simulation to compare the different species-tree estimation methods and to assess the reliability of our approximations. We use a challenging species tree with parameters τ 0 = 0.02, τ 1 = 0.019, θ 0 = 0.01, and θ 1 = 0.05. The error is plotted against the number of loci (m) when the number of sites per locus is fixed at n = 1, 2, or 1000 ( fig. 2).
In the case of one site per sequence (n = 1), all four methods considered in this study are equivalent, with the species tree given by the most frequent pooled site pattern (i.e., the greatest of x ·1 , x ·2 , and x ·3 ). With one site, the independent-sites assumption is correct, and ML and ISML are exactly the same. As discussed earlier, concatenation and 2-STEP also select the species tree according to the pooled site patterns. Treatment of ties among x ·1 , x ·2 , x ·3 has very minor effects on the error rate. For n = 1 and m = 1000, simulation gave the error estimate e = 0.642 if ties are broken evenly (table 1)  In the case of n = 2 sites per locus, ISML (= concatenation), 2-STEP and ML are all distinct. To see that concatenation and 2-STEP may produce different species trees, consider the case of m = 3 loci and n = 2 sites. If the dataset at the three loci are 11, 02, and 00, where 0-4 represent the five site patterns, concatenation will infer the correct species tree S 1 (as x ·1 = 2, x ·2 = 1, x ·3 = 0), while 2-STEP will have a tie between S 1 and S 2 (as m 1 = 1, m 2 = 1, m 3 = 0). If the dataset at the three loci are 33, 01, and 14, concatenation will have a tie between S 1 and S 3 (as x ·1 = 2, x ·2 = 0, x ·3 = 2) while 2-STEP will infer the correct species tree (as m 1 = 2, m 2 = 0, m 3 = 1). We also confirm that at n = 2 ML differs from all three summary methods and can identify and consistently estimate all four parameters in the MSC model. Indeed ML is far more efficient for species tree estimation than the summary methods when n = 2 ( fig. 2b&b ). While the summary methods improve only slightly when n changes from 1 to 2, there is a major performance boost for ML ( fig. 3a). This may be due to the fact that the model is fully identifiable with n = 2 but not when n = 1. The predicted linear relationship between Φ −1 (e) and √ m holds well for the three summary methods ( fig. 2b ). For ML, if we remove the first two points (for m = 10 and 20), the relationship is nearly linear, with y = −0.0022x + 0.0391, with R 2 = 0.97. The most interesting case is with n 1, since in real datasets n may be in the range 50-5000, say. We used n = 1000 in fig. 2c&c . As in the case of n = 2, there is a large performance divide between ML and the three summary methods (ISML = CONCAT and 2-STEP), while the summary methods have similar performance. The approximate linear relationship between Φ −1 (e) and √ m holds well for all methods. The superior performance of ML persist in the limit of n = ∞ ( fig. 3b). For example, e ML,∞ = 0.45 and 0.01 for ML at m = 10 and 100, respectively, compared with e 2-STEP,∞ = 0.60 and 0.46 for 2-STEP or e ISML,∞ = 0.62 and 0.51 for ISML. The differences between ML and 2-STEP reflect the information in the coalescent times or gene-tree branch lengths. The differences between ML and ISML reflect the information in the variation of sitepattern frequencies among loci, as ISML uses only the averages across loci. Fig. 3c examines the error rates of different methods while nm = 10 4 is fixed. At the two ends (n = 1 or m = 1), all four methods are equivalent, with e =0.587 at n = 1 and m = 10 4 , and e =0.646 at m = 1 and n = 10 4 . Note that when n = 1 and m → ∞, the error e → 0, whereas if m = 1 and n → ∞, the error e = 1 − g 1 (n) → 1 − P(G 1 ) = 0.6405. The high error at m = 1 even when n = ∞ is because a single gene tree (with coalescent times), even if known with certainty, does not contain much information about the MSC process. Away from the two ends (n > 1 or m > 1), ML is considerably more efficient than the summary methods ( fig. 3c). The case of m = 10 4 (n = 1), at which e ML = 0.587, and the case of m = 2 (n = 5000), at which e ML = 0.487, make an interesting contrast. In the first case all sites are i.i.d., while in the second there are only two independent genes, each of 5000 sites in complete linkage. One might expect data of independent sites to be more informative than two loci with correlated sites at the same locus (e.g., Long and Kubatko 2018), but the opposite is true. With n = 1, not all model parameters are identifiable, and this non-identifiability issue appears to impact species tree estimation as well (Shi and Yang 2018, p.172). With nm fixed, the smallest error e ML occurs at intermediate values of n and m, around n = m = 100, although performance is similar over a large range of n ( fig. 3c).
In table 1, we calculated the species-tree error probability using eqs. 6 and 8, as well as two pairs of bounds (ζ L1 , ζ U1 ) and (ζ L2 , ζ U2 ) (Theorem 1, Appendix A), for comparison with the simulation results. The asymptotic results are expected to apply when the sequence length n is fixed while the number of loci m → ∞. Here m is fixed at 1000, so that b < 2 for all cases (table 1), and is too small for the asymptotic approximations to be reliable. As a result, eqs. 6 and 8 are more accurate.

Errors of species tree estimation by different methods
Under the MSC model, data at different loci are i.i.d., so that the number of loci (m) constitutes the sample size in the statistical model. Thus we have derived approximations to the error rate for different methods when m increases, with the sequence length n fixed. For large m, the error can be approximated by where c is a constant. This is seen to apply to all four methods considered in this study (ML, ISML = concatenation, and 2-STEP) (see table 4 for a summary). The theory for ML in Appendix B applies generally to ML selection of nonnested models, whether one model (which may and may not be the true model) fits the data better than the others, judged by the K-L divergence to the true data-generating model. In particular, the theory applies to conventional phylogenetic reconstruction without the MSC model. For example, figure 5 applies the same prediction to simulation results on four-taxa trees from Yang (1997). Previously Susko (2011) developed a large-sample approximation to the log-likelihood difference between two trees and to the probability that each tree will be the ML tree in the case of four-species without the molecular clock. It was assumed that the internal branch length in the tree is small and approaches 0 at the rate of n − 1 2 or faster when the number of sites n increases. In our analysis, we take the conventional approach of fixing the parameters when the data size increases.
We note that in problems of parameter estimation, the standard error for the parameter estimate or the width of the confidence interval typically decreases at the rate of n − 1 2 , so that quadrupling the data size halves the interval. In contrast, the probability of recovering the best-fitting model approaches 1 much faster. As the probit transform of the error decreases linearly with √ n, it will soon reach a point beyond which the precise error probability is of no practical significance: for example, Φ −1 (e) = −3 means e = 0.0013 while Φ −1 (e) = −5 means e = 2.9 × 10 −7 . The different dynamics between model selection and parameter estimation when the data size grows is consistent with the fact that we tend to obtain extreme support for phylogenies inferred in large datasets (Yang and Zhu 2018).

Implications of our study to species tree methods
While the species tree problem studied here is the simplest, it has the complexities of the general problem. Furthermore, we have represented all major species tree methods in our analysis. We expect ML to be asymptotically similar to Bayesian inference as both are full-data methods.
We have assumed the JC mutation model and the molecular clock. Our results are thus applicable to shallow species phylogenies and may not apply to distantly related species for which the JC model may be inadequate for multiple-hit correction and the molecular clock may be seriously violated. In the case of three species examined in this paper, concatenation and ISML always produce the same species tree estimate. However, in more general settings with four or more species and when the clock is violated and unrooted trees are used, concatenation and ISML are known to be different. In particular, concatenation (as well as 2-STEP) can be inconsistent (Roch and Steel 2015), while ISML is a coalescent-aware method and is always consistent.
The ISML method considered here is similar to SVDQUARTETS (Chifman and Kubatko 2014). Both are summary methods based on pooled site-pattern counts. SVDQUARTETS is sometimes described as a site pattern-based method (e.g. Kubatko 2019). This is not a helpful description. Site-pattern counts for different loci ({ f i j }) are sufficient statistics under the model and carry the same amount of information as the sequence alignments at the same loci so that it makes no difference whether site patterns or sequences are used. Indeed virtually all methods involving likelihood calculation on sequences operate on site patterns instead of sites. Instead what matters is whether site patterns are pooled across loci. In the original data, the sites of the same locus share the same gene tree and the variation among loci provides information about parameters of the coalescent process such as the ancestral population sizes. Pooling sites across loci means that such information is lost (Shi and Yang 2018). As a result, the pooled sitepattern counts are unable to identify all parameters of the MSC model even if they can identify the species tree topology. Previously Long and Kubatko (2018) found in simulations that SVDQUARTETS performed better in datasets of 600 coalescent-independent sites (m = 600, n = 1 in the notation of this paper) than in data of two genes each of 300 bp (m = 2, n = 300), and suggested that this is because "[t]he 600 sites observed from 600 distinct gene trees give independent genealogical information about the species tree, though indirectly, while the 300 sites for each of the two genes can give a reasonable indication of the individual gene trees, but still provide only two observed gene genealogies." Our analysis suggests that this is not a correct interpretation. When the information in the data is used properly (as in the ML method), there is in fact more information in two genes each of 300bp than in 600 independent sites (fig. 3c).
To understand the issue of parameter unidentifiability and the potential information loss for species tree estimation due to the pooling of sites across loci in SVDQUARTETS, consider the simple random-effects model where the treatment effect α i ∼ N(0, σ 2 a ) and the error e i j ∼ N(0, σ 2 e ). Parameters in the model include the grand mean μ and the variance components σ 2 a and σ 2 e . It is obvious that if there are no replications within treatment (n = 1) or if the observations (y i j ) are pooled across treatments, the between-treatment variation and within-treatment errors will be confounded so that σ 2 a and σ 2 e will not be identifiable even though μ still is. In species tree estimation, pooling site patterns across loci (as in ISML and SVDQUARTETS) causes some parameters of the MSC model to become unidentifiable even though the species tree still is. This issue of information loss due to averaging over the whole genome may be even more serious for methods designed for data of single nucleotide polymorphisms (SNPs) (Leaché and Oaks 2017), such as SNAPP (Bryant et al. 2012), because the removal of constant sites in the SNP data causes further loss of information (even if the ascertainment bias is accounted for in the method).
An important difference between ISML and SVDQU-ARTETS is that ISML applies ML to the pooled sitepattern counts while SVDQUARTETS uses a criterion based on linear invariants to avoid the ML optimization (Xu and Yang 2016). Use of a non-ML criterion is expected to lead to further reduction in efficiency, in addition to information loss due to the pooling of sites across loci (Chou et al. 2015;Xu and Yang 2016;Shi and Yang 2018).
The MSC model analyzed in this paper assumes free recombination among loci and no recombination between sites of the same locus. Data for such analysis are typically loosely linked short genomic segments that are far apart from each other so that recombination within a locus is rare while different loci are nearly independent (e.g. Takahata et al. 1995;Burgess and Yang 2008;Lohse et al. 2016). Both assumptions of free recombination among loci and no recombination within locus are expected to be violated in real data analysis, and the impact of within-locus recombination is of particular concern. The ML method considered in this paper assumes no recombination (with r = 0) while ISML (and SVDQUARTETS) assumes free recombination (r = ∞). The relative performance of the methods will depend on the true recombination rate: ML may be expected to perform better than ISML if r is close to 0, while ISML may be superior if r is large. At very high recombination rates, it may even be possible for ML (assuming r = 0) to be inconsistent since the method is similar to concatenation and merges sites of the same locus with different histories into one sequence. In contrast ISML is consistent for all values of r. Previously Lanier and Knowles (2012) found in a computer simulation that species-tree estimation was robust to moderate levels of within-locus recombination (see also discussions in Xu and Yang 2016;Edwards et al. 2016). It will be interesting to evaluate the relative performance of modern species-tree estimation methods (including ISML and SVDQUARTETS) under realistic recombination rates.

Simulation
We use a challenging species tree with parameters τ 0 = 0.02, τ 1 = 0.019, θ 0 = 0.01, and θ 1 = 0.05 ( fig. 1a). A C program is written to simulate gene trees and sequence alignments for the case of three species/sequences, under the JC model (Jukes and Cantor 1969) with the clock. To simulate the gene tree and the sequence alignment for each locus, we generate an exponential coalescent waiting time (s 1 ) with mean θ 1 /2. If s 1 < τ 1 , the gene tree is G 1a , and another exponential waiting time s 0 is generated with mean θ 0 /2 to get t 0 = τ 0 + s 0 and t 1 = s 1 . If s 1 > τ 1 , the gene tree is one of G 1b , G 2 , G 3 , chosen at random, and two coalescent waiting times (s 1 and s 0 ) are generated with means θ 0 /6 and θ 0 /2, respectively, so that t 1 = τ 0 + s 1 and t 0 = t 1 + s 0 ( fig. 1b). The gene tree and node ages (t 0 ,t 1 ) are then used to calculate the site-pattern probabilities for the locus (eq. 3), and the site-pattern counts are generated from multinomial sampling (eq. 4). Each dataset consists of m loci with the sequence length of n sites. We use a large number of replicates (typically R = 10 6 or 10 8 ) so that sampling errors due to a limited number of replicates is not a concern. Species tree estimation by concatenation (= ISML) and 2-STEP is done by counting site patterns.
For the ML method (eq. 5), we used the simulation program MCCOAL, which is part of the BPP program (Yang 2015), to simulate the gene trees and sequence alignments. The data are then analyzed using the ML program 3S (Yang 2002;Dalquen et al. 2017). The JC model is used to simulate and analyze data.

MBE
Appendix A. Proof of Theorem 1 (a) Define the random variable where y 1 =z 1 − 1 2 (z 2 +z 3 ) ∼ N(Δμ, s 2 1 /m) and y 2 = Here we treatz 1 ,z 2 andz 3 as normal variables, according to the central limit theorem as m → ∞. As Cov(y 1 , y 2 ) = 0 and both y 1 are y 2 are normal variables, they are independent. Then where a = s 2 /s 1 , b = Δμ √ m/s 1 , and φ (x; μ, σ 2 ) is the probability density function (PDF) for N(μ, σ 2 ) while φ (x) is the PDF for N(0, 1). The last integral has been studied by Yang and Rodríguez (2013, SI) in a different context and can be written as or, by letting t = a − tan θ , with dθ = − 1 t 2 +1 dt, as Eqs. A4 and A5 can be calculated using Gaussian quadrature and match direct calculations using the CDF for the bivariate normal distribution for (z 1 −z 2 ,z 1 −z 3 ). When Δμ = 0, we have b = 0 and In the symmetrical case of Δμ = 0, σ 2 1 = σ 2 2 , and σ 12 = σ 23 (with a = 1 √ 3 , b = 0), this gives 1 2 + 1 π tan −1 1 √ 3 = 2 3 , as expected. In this case the three variablesz 1 ,z 2 and z 3 have the same probability of being the greatest so that the error is 2 3 . To avoid numerical integration, we note that y 2 ∼ N 0, 1 2m (σ 2 2 − σ 23 ) , and |y 2 | is a folded normal variable with mean and variance Collecting all terms in eq. A8, we get If we assume that y is approximately normally distributed, as in Zharkikh and Li (1992) and Yang (1996), then eq. 6 follows. Note that eq. 6 can also be written . Because |y 2 | has a folded normal distribution and is not a normal variable, the error of approximation of eq. 6 does not approach zero when m → ∞. For instance, in the symmetrical case (Δμ = 0, σ 2 1 = σ 2 2 , and σ 12 = σ 23 ), eq. 6 gives Φ 1 √ 2π−1 = 0.66824, not 2 3 . This level of accuracy is acceptable for our calculations for finite m in this paper, as the precise value of the error is unimportant if the error is nearly zero, but eq. 6 may not give the correct asymptotic error rate when m → ∞ ( fig. 6, a = 10).
(b) To study the asymptotic behavior of the error probability ζ when m → ∞, we derive bounds on ζ . From eq. A3, where the first integral is to be the distance from the origin (0, 0) to . 7), and the second integral is By considering the area of integration ( fig. 7), it is obvious that where the equality holds when b = 0. Let or as in eq. 7. The equality in the lower bounds is achieved at b = 0. Note that the bounds apply to all a > 0 and b > 0. We use the bounds (ζ L1 , ζ U1 ) in Theorem 1 and in the calculation of table 1. The width of the interval is Φ(−h) 1 − 2 π tan −1 a ≤ Φ(−h) ≤ ζ , so that using any value inside the interval as the estimate will give an error of approximation that is smaller than the error probability ζ .
Next we consider the upper bound in eq. A15 when h or b is large. Note that Thus for large h, Φ(−h) is bounded by Let ε > 0 such that Φ(−(h + ε)) = αΦ(−h) for 0 < α < 1; in other words, ε is the offset at the probit level to reduce the probability by a fraction. From eq. A20, In particular, for α = 1 2 , we have Thus for large h, we have as in eq. 7. It may be noteworthy that for large h, a very small change at the probit level, of about 1 h log 2, changes the probability by a factor of 2.

MBE
The integral over the arc ABA (the shaded area) is 2A = P{z 1 <z 2 ,z 1 <z 3 }. This is smaller than Φ(−b) · ψ/ π 2 and greater than the integral over the area shaded with the brick pattern: these give the bounds (ζ L2 , ζ U2 ) in Appendix A. The pink dashed lines are t = (x + b)/(ka) and t = −(x + b)/(ka). They cross the blue lines at A and A , with the length of the line segment OA to be r = b √ a 2 k 2 +1 a(k−1) . Note that the integral over the circle x 2 + t 2 < r 2 is 1 − e − 1 2 r 2 .