Weighting by Gene Tree Uncertainty Improves Accuracy of Quartet-based Species Trees

Abstract Phylogenomic analyses routinely estimate species trees using methods that account for gene tree discordance. However, the most scalable species tree inference methods, which summarize independently inferred gene trees to obtain a species tree, are sensitive to hard-to-avoid errors introduced in the gene tree estimation step. This dilemma has created much debate on the merits of concatenation versus summary methods and practical obstacles to using summary methods more widely and to the exclusion of concatenation. The most successful attempt at making summary methods resilient to noisy gene trees has been contracting low support branches from the gene trees. Unfortunately, this approach requires arbitrary thresholds and poses new challenges. Here, we introduce threshold-free weighting schemes for the quartet-based species tree inference, the metric used in the popular method ASTRAL. By reducing the impact of quartets with low support or long terminal branches (or both), weighting provides stronger theoretical guarantees and better empirical performance than the unweighted ASTRAL. Our simulations show that weighting improves accuracy across many conditions and reduces the gap with concatenation in conditions with low gene tree discordance and high noise. On empirical data, weighting improves congruence with concatenation and increases support. Together, our results show that weighting, enabled by a new optimization algorithm we introduce, improves the utility of summary methods and can reduce the incongruence often observed across analytical pipelines.

. Counters w * * are defined for each node w in each gene tree, and Q is defined globally. Here, X,Y,Z are distinct colors of A, B, and C. Let u,v be the children of w; e be the parental edge of w; p be the parent of w; Px,w be the path between x and w; s(P) = 1− ê∈P (1−s(ê)); m(i,j) = MRCA of i and j. Counters for leaves are set to zero unless explicitly noted. For each counter, we show a recursive equation on top and the equivalent non-recursive definition on the bottom. w X (u X +v X )e −l(e) for internal node w; e −l(e) for leaf node w colored X i e −l(Pi,p) for all leaf nodes i colored X under w Pi,j) for all leaf nodes i colored X and j colored X/Y under w i,j e −l(Pi,j) 1−s(P m(i,j),p ) for all leaf nodes i colored X and j colored X/Y under w i,j,k e −l(Pi,j)−l(Pk,p) s(P m(i,j),m(i,k) ) for leaf nodes i colored X, j colored X/Y, k colored Z under w, and m(i,j) under m(i,k) for all leaf nodes h,i colored X, j colored Y, k colored Z, and w = MRCA h,i,j,k Q G∈G w (w AA|BC +w BB|AC +w CC|AB ) for internal nodes w in G G∈G h,i,j,k w G (hi|jk) for leaf nodes h,i,j,k in G where h,i have the same color and i,j,k have different colors; when species coloring matches all gene trees, Q = W [A|B|C] = G∈G W (A|B|C,G) (Proposition 5).    Species tree shape with parameters E1-6 and E1-7 are used. Results with aBayes supports are labeled wASTRAL-s and wASTRAL-h; results with SH-like support are labeled wASTRAL-s* and wASTRAL-h*.
x        ..I 6 . The species tree with SU branch lengths S † is created by multiplying each branch length in S * with a corresponding multiplier; the multipliers are jointly drawn from some distribution and are drawn independently across gene trees. Gene tree G * is sampled under MSC process from S * independent of S † . However, it inherits the same division of its lineages into segments as S * at the same locations. The gene tree with SU branch lengths G is created by translating branch lengths of G * into SU by multiplying the CU length of each of segment I i by Λ Ii S † , the multiplier associated with the segment I i in S † and hence G.

Supplementary Algorithm
Algorithm S1 Recursive placement algorithm. Place inserts the species i into an existing species tree S and computes tripartition scores W (A|B|C,G) := G∈G W (A∩L G |B ∩L G |C ∩L G ,G) for all tripartitions resulting from adding i onto each branch of S. A global counter Q and a set of per-node counters w A ,w B ,w C ,w + .. ,w − .. ,w ..|. ,w ..|.. are all initialized to 0. OptimalTreeDP is defined in Algorithm S2. Each gene tree is rooted on an arbitrary branch e and the support of e is kept for the branch on one side of the root and zero support is given to the branch on the other side of root. L v is the set of leaves under v.
if w is not the root then 33: 34:

41:
RecursiveUpdate(the parent of w) 42: procedure UpdateCounters(w,X) w is a leaf, X is a color

47:
return Q Algorithm S2 The Algorithm S2 of O(n 2 kH logn) running time. At start, the function is called as with L S ,G,r as input.

11:
if DPTree(P ) available then return DPTree(P ) Algorithm S3 The DAC algorithm of O(n 1.5+ k) running time given some assumptions. OptimalTreeDP and NaivePlacement are defined in Algorithm S2, and Place is defined in Algorithm S1. At start, the function is called as with L S ,G,r as input. for j ∈ C ∅ do 23: 24: if C ∅ = ∅ then 25:

26:
Add all elements of W i to W *

Proofs
Weighting by support: Proof of Proposition 1 and Theorem 1 For ease of reference, we reproduce Table 2 from the main paper here: Recall that the expected value and variance of α G,Q across genes is denoted byᾱ Q and σ 2 α .
Proof . To prove the Proposition, we start with the following lemma.
Lemma 1. For each estimated gene tree G with a given α G,Q , Proof . From Table 2, we can compute similarly, From this lemma, we can prove the proposition. First, assume α G,Q is drawn from a discrete distribution. Then, It is straightforward to change these calculations to use integral instead of sum and P(α G,Q ) to the PDF in the case that the distribution of α G,Q is continuous. It means that in order to produce a statistically consistent estimator using (1), the following equation must be satisfied for the true species tree topology S * and any species tree topology S: Notice that proving for any quartet Q = {a,b,c,d} we have E w G (ab|cd)−w G (ac|bd) ≥ 0 and E w G (ab|cd)−w G (ad|bc) ≥ 0 where S * Q = ab|cd is sufficient to prove (9); on the other hand, proving for any quartet Q = {a,b,c,d} where the internal branch of S * Q corresponds to only one branch in S * , Thus, from Proposition 1, we have guaranteed statistical consistency for wASTRAL for support under Similarly, we have guaranteed statistical consistency for unweighted ASTRAL under To prove Theorem 1, we only need to prove that D is a proper subset of D.

Weighting by length: Proof of Propositions 2 and 3 and Theorem 2
Before providing the proofs, we remind the reader of one property of the coalescent model. According to the coalescent model, at any point along a branch of the species tree with i gene tree lineages, the time (i.e., distance) x to the next coalescent event, reducing the number of lineages to i−1, is exponentially distributed with the rate i 2 , resulting in probability density function (PDF): and the two lineages that coalesce are independent of x.
Proposition 2. For a true quartet species tree S * with topology ab|cd and input gene trees G generated under the naive model with any multiplier λ, let f be the distance between anchors of S * . As f − → 0, given Proof . We analyze balanced and unbalanced trees separately.
Case 1: Unbalanced trees (i.e., the root of S * has a terminal branch as a child). W.o.l.g., we assume the root branch is located on branch leading to d.
Let p,q, and r be the MRCA nodes of (a,b), (a,c), and (a,d) on rooted species tree S * , respectively. Let p and r be the points of coalescence of leaves a,b and leaves c,d on the rooted gene tree G, respectively.
Let f X (x) be the probability density that x is the CU difference in heights of (p,p ) and p is the lowest point of coalescence. Notice that by (10): Let f Z|X (z;x) be the probability density that z is the CU difference in heights of (r,r ), conditioned on that x is the CU difference in heights of (p,p ) and p is the lowest point of coalescence. Notice that: We specify three coalescence scenarios by indicator functions δ 1 ,δ 2 ,δ 3 : Note that Similarly, since only scenarios 2 and 3 have deep coalescence events that may lead to gene tree disagreement with the species tree, and by the symmetry of all three topologies under scenarios 2 and 3, and since w G (ab|cd)w G (ac|bd) = 0, We next compute both elements of (11) as well as some elements of (12) (others will not be necessary).
• δ 1 : When G has topology ab|cd, p must be the lowest point of coalescence. Thus, • δ 2 : When G has topology ab|cd, p must be the lowest point of coalescence. Thus, • δ 3 : When G has the topology ab|cd, either p or q must be the lowest point of coalescence, and by symmetry, the two cases must have the same PDFs. Thus, Replacing in (11), we get f +O(f 2 ) ; and replacing in (12), we get from which our assumption of Var[X G ] = Ω(1) follows.
Let p,q, and r be the MRCA nodes of (a,b), (c,d), and (a,d) on rooted species tree S * , respectively. Let p and q be the points of coalescence of leaves a,b and leaves c,d on the rooted gene tree G, respectively.
Let x, x 0 , y, and y 0 be the CU difference in heights of points (p,p ), (p,r), (q,q ), and (q,r), respectively.
We specify three coalescence scenarios by indicator functions δ 1 ,δ 2 ,δ 3 : Note that Similarly, since only scenarios 3 have deep coalescence events that may lead to gene tree disagreement with the species tree, and by the symmetry of all three topologies under scenarios 3, and since w G (ab|cd)w G (ac|bd) = 0, • δ 1 : Here, • δ 2 : Here, • δ 3 : Similar to the unbalanced case, when G has the topology ab|cd, either p or q must be the lowest point of coalescence, and by symmetry, the two cases must have the same PDFs. Thus, Replacing in (13), we get and replacing in (14), we get from which our assumption of Var[X G ] = Θ f (1) follows.
Thus, in both balanced and unbalanced cases, Proposition 3. For a true quartet species tree S * with topology ab|cd and input gene trees G generated under the variable rate model, let f be the distance between anchors of S * and L be the total length of all other branches. Assume that for every branch segment I, the variance of its multiplier is bounded Proof . We follow the same logic in proof of Proposition 2.
Case 1: Unbalanced trees. Let P (x) be functions to random variables denoting SU difference in heights of points (p,p ) where p is x CU distance above p; let R(z) be functions to random variables denoting SU difference in heights of points (r,r ) where r is z CU distance above r. Note that P (f +y 0 )+R(z) = P (f +y 0 +z) where P (f +y 0 ) denote the SU length of (p,r). Let random variable Λ := l S † (a,p)+l S † (b,p)+ l S † (c,r)+l S † (d,r) be the total SU terminal branch lengths and the constant value L be the CU distance corresponding to Λ. • δ 1 : When G has topology ab|cd, p must be the lowest point of coalescence. Thus, • δ 3 : When G has the topology ab|cd, either p or q must be the lowest point of coalescence, and by symmetry, the two cases must have the same PDFs. Thus, Replacing in (11), by Jensen's inequality, we get And replacing in (12), we get from which our assumption of Var[X G * ] = Θ f (1) follows.
Let F P (u;x), F R (v;z), and F Λ (w) be the CDF of P (x), R(z), and Λ respectively; let F P RΛ (u,v,w;x,z) and F P RΛ (u,v,w;x,z) be the joint CDF and the joint PDF. Let F −1 P (t;x), F −1 R (t;z), and F −1 Λ (t) be the inverse function of CDF of P (x), R(z), and Λ respectively. Then, Thus, for any 0 < t 0 < 1, By Chebyshev's inequality (using t − 1 2 0 as the constant), F −1 .
Case 2: Balanced tree. Let P (x) be functions to random variables denoting SU difference in heights of points (p,p ) where p is x CU distance above p; let Q(y) be functions to random variables denoting SU difference in heights of points (q,q ) where q is y CU distance above q. Note that P (x 0 +z)−P (x 0 ) = Q(y 0 +z)−Q(y 0 ) where P (x 0 ) and Q(y 0 ) denote the SU length of (p,r) and (q,r), respectively. Let random variable Λ := l S † (a,p)+l S † (b,p)+l S † (c,q)+l S † (d,q) be the total SU terminal branch lengths and the constant value L be the CU distance corresponding to Λ.
• δ 3 : Similar to the unbalanced case, when G has the topology ab|cd, either p or q must be the lowest point of coalescence, and by symmetry, the two cases must have the same PDFs. Thus, and replacing in (14), for any 0 < t 0 < 1, ) +O(f ) , from which our assumption of Var[X G ] = Θ f (1) follows. Thus, for both balanced and unbalanced trees, the variance is bounded the by same expression, and thus in both cases, Proof . We start with proving this theorem under the conditions of Proposition 2. Recall X G := w G (ab|cd)−w G (ac|bd) and Y G := δ G (ab|cd)−δ G (ac|bd), and letX G = 1 k G∈G X G andȲ G = 1 k G∈G Y G . Recall also that under Proposition 2, proved below, under conditions of Theorem 2, we have Var[X G ] = Ω(1) and Similarly, we can compute the ratio of mean and variance for Y (corresponding to unweighted ASTRAL): Given Proposition 2, we can use Berry-Esseen theorem to derive where Φ denotes CDF of the standard Normal distribution. Since k = Θ(f −2 ), and