Advancing admixture graph estimation via maximum likelihood network orientation

Abstract Motivation Admixture, the interbreeding between previously distinct populations, is a pervasive force in evolution. The evolutionary history of populations in the presence of admixture can be modeled by augmenting phylogenetic trees with additional nodes that represent admixture events. While enabling a more faithful representation of evolutionary history, admixture graphs present formidable inferential challenges, and there is an increasing need for methods that are accurate, fully automated and computationally efficient. One key challenge arises from the size of the space of admixture graphs. Given that exhaustively evaluating all admixture graphs can be prohibitively expensive, heuristics have been developed to enable efficient search over this space. One heuristic, implemented in the popular method TreeMix, consists of adding edges to a starting tree while optimizing a suitable objective function. Results Here, we present a demographic model (with one admixed population incident to a leaf) where TreeMix and any other starting-tree-based maximum likelihood heuristic using its likelihood function is guaranteed to get stuck in a local optimum and return an incorrect network topology. To address this issue, we propose a new search strategy that we term maximum likelihood network orientation (MLNO). We augment TreeMix with an exhaustive search for an MLNO, referring to this approach as OrientAGraph. In evaluations including previously published admixture graphs, OrientAGraph outperformed TreeMix on 4/8 models (there are no differences in the other cases). Overall, OrientAGraph found graphs with higher likelihood scores and topological accuracy while remaining computationally efficient. Lastly, our study reveals several directions for improving maximum likelihood admixture graph estimation. Availability and implementation OrientAGraph is available on Github (https://github.com/sriramlab/OrientAGraph) under the GNU General Public License v3.0. Supplementary information Supplementary data are available at Bioinformatics online.


Overview
In Section 4 of the main paper, we provide a high-level algorithm description for OrientAGraph , which takes a vector X of f -statistics as input and searches for a maximum likelihood (ML) admixture graph (N, Θ N ) with h admixture events. The ML search proceeds as follows.
1. Search for a ML starting tree N 0 , rooting it at outgroup g.

Return (N i , Θ Ni )
OrientAGraph is implemented on top of TreeMix [5]. Steps 1 and 2b are the same as TreeMix.
Step 2a can be performed using the same heuristic as TreeMix or using an exhaustive search (Sections 1.3 and 1.4). The exhaustive search can be executed by using the -addmigs option in OrientAGraph.
Step 2b is new to OrientAGraph and can be performed using a heuristic or an exhaustive search for the MLNO (Section 1.5). The exhaustive search can be executed using the -mlno options in OrientAGraph; we plan to implement the heuristic described in Section 7 in the near future.

Data Structures
The admixture graph data structure implemented by TreeMix, and thus OrientAGraph, must include a directed phylogenetic network N C with an edge labeling L : E(N C ) → {0, 1}. The network N C must have the following properties: • the root is a vertex with indegree 0 and outdegree 2, • the leaves are vertices with indegree ≥ 1 and outdegree 0, and • internal vertices have indegree ≥ 1 and outdegree 2.
The edge labeling ψ must have the following properties: • for the root vertex, both of the outgoing edges must be labeled 0, • for any non-root vertex, exactly one of the incoming edges must be labeled 0, and • for any non-root, non-leaf vertex, at least one of the outgoing edges must be labeled 0.
Such an edge labeling ψ is said to be tree-based, and we can verify whether a given an edge labeling for N C is tree-based in O(|V (N C )|) time when the number h of admixture nodes is fixed. Any network N for which a tree-based labeling exists is referred to as tree-based; a tree-based labeling for N , if one exists, can be found in linear time via a reduction to 2-SAT [1]. Although N C is not a binary phylogenetic network, it can be viewed as such, as we now describe. Let N be a binary, directed phylogenetic network, and let ψ be a tree-based labeling of N . Because N is binary, every admixture node in N has indegree 2 and outdegree 1. For some admixture node in N , let e, f denote the two incoming edges and let g denote the outgoing edge. Because ψ is tree-based, either e and g are drawn as part of base tree (in which case f is drawn as gene flow); otherwise, f and g are drawn as part of the base tree (in which case e is drawn as gene flow). As discussed in Section 2.2 of the main paper, in the former case, TreeMix forces f and g to have length 0, and in the latter case, TreeMix forces e and g to have length 0. It follows that we do not need to store edge g and can contract it (Definition 1). Contracting edges in N until there are no edges with outdegree 1 produces a contracted version of the binary phylogenetic network N , denoted N C , that satisfies the requirements given above. This is the data structure implemented by TreeMix, and thus OrientAGraph.

Definition 1 (Refinements and Contractions
). An edge contraction corresponds to deleting edge e = u → v but not its vertices u and v from the graph G and then identifying u and v (i.e. combining u and v into a single vertex in the natural way). A vertex refinement is simply the opposite of a contraction.

Edge Addition Neighborhood
In Step 2a, TreeMix and OrientAGraph search for a ML network in what we will call the gene flow edge addition neighborhood of a given the contracted version of a binary network N C i with tree-based labeling ψ i . First, we consider the edge addition neighborhood of a binary network. An edge addition takes a binary network N i as input and returns a binary network N i+1 such that there exists an edge e ∈ E(N i+1 ) whose deletion from N i+1 produces a network that is isomorphic to N i , after suppressing vertices with indegree 1 and outdegree 1. This is encoded as the following operation.
Definition 2 (Edge Addition Operation). An edge addition is an operation in which two different edges of a binary, directed phylogenetic network N i , the source edge e s = u s → w s and the target edge e t = u t → w t , are subdivided with new vertices v s and v t , respectively, and then an edge e is added from v s to v t . An edge addition is only legal if the resulting network N i+1 is a binary, directed phylogenetic network.
Now we consider the case where N i has a tree-based labeling ψ i . If we require the new edge e to be labeled as gene flow in N i+1 and e s and e t to be labeled as part of the base tree in N i , then the labeling ψ i can be extended to N i+1 , denoted ψ i+1 , in the natural way (i.e. ψ i+1 (e ) = 1 and ψ i+1 (u s → v s ) = ψ i+1 (v s → w s ) = 0). Nothing is done for the remaining edges (i.e. ψ i (e) = ψ i+1 (e) for all e ∈ E(N i ) ∩ E(N i+1 )). We refer to edge additions of this form as being in the gene flow edge addition neighborhood of N i .
Recall that TreeMix and OrientAGraph operate on the contracted version of a binary network, so they would immediately contract the edge v t → w t . Therefore, edge additions can be implemented on the contracted version of a binary network N C i by subdividing an edge e s = u s → w s ∈ E(N C i ) with a new vertex v s , and then adding an edge e from v s to some target vertex v t ∈ V (N C i ). In order for this edge addition to be legal, w s = v t (as the edge addition would produce a parallel edge in N C i+1 ) and v t cannot be on a some directed path from the root to w s (as the edge addition would produce a cycle in N C i+1 ). TreeMix and OrientAGraph constrain their search to the gene flow edge addition neighborhood, so e s must be labeled as part of the base tree in N C i (note that we can view the edge addition as being on the incoming edge to v t that is labeled as part of the base tree). Lastly, TreeMix restricts edge additions to meet some additional requirements that are likely important for preventing issues when estimating numerical parameters (see function PopGraph::is legal migration). It appears that these constraints can be verified by looking locally around e s and v t . We will broadly refer to edge additions that meet the requirements of TreeMix (and thus OrientAGraph) as being in the gene flow edge addition neighborhood of N C i .

Maximum Likelihood Edge Addition
We now turn to the task of searching for a ML network in the gene flow edge addition neighborhood of N C i , defined in Section 1.3. TreeMix builds a subset of possible edge additions in this neighborhood that seem promising by using the residual matrix to identify pairs of populations for which the current model is a poor fit to the input data. For each edge addition in this subset, TreeMix determines the legality and then, if possible, fits parameters and evaluates the log-likelihood function. The most computationally intensive step of checking the legality of an edge addition on edge e s = u s → w s and vertex v t is to determine whether v t is on any directed path from the root to w s (because an edge addition in this case produces a cycle). OrientAGraph enables an exhaustive search of the gene flow edge addition neighborhood (Algorithm 1), so it makes sense to compute this information for each vertex in a dynamic programming preprocessing step. Afterward, the legality of an edge addition is fast to compute. However, if an edge addition is legal, then we still need to fit parameters and evaluate the log-likelihood function. Doing this work for the |V (N C i )| × |E(N C i )| possible (gene flow) edge additions will not scale to large admixture graphs, although it is suitable for modestly-sized graphs.

Maximum Likelihood Network Orientation
Lastly, we discuss how to find the maximum likelihood network orientation (MLNO) of a binary, directed phylogenetic network N with root r and h admixture nodes (i.e. vertices with indegree > 1). Let Y = V (N ) \ L(N ) \ {r}, and recall that an orientation of the undirected version of N , denoted N | u , is uniquely determined by a set A ⊆ Y of h admixture nodes (i.e. nodes with indegree 2) and an edge e r that determines the position of the root [2]. If a valid orientation exists for some pair A, e r , it can be found in linear time by running Algorithm 2 (adapted from [2]). To exhaustively search the orientation neighborhood of N , we consider the directed networks found by orientating N | u for every possible subset of Y with size h (note that we fix the root r on an edge e r , typically chosen so that the root is incident to an outgroup).
There are some technicalities to consider as OrientAGraph operates on the contracted-version of a binary phylogenetic network, denoted N C , with tree-based labeling ψ. First, we need to refine vertices in N C (Definition 1), until every vertex with indegree 2 has outdegree 1. If all vertices in N C have indegree ≤ 2, then applying this process to N C produces a unique binary network. Otherwise, the resulting network, denoted N R , will have stacked admixture nodes (i.e. there exists an edge that has an admixture node as its source and its target) and the branching order of admixture nodes in this stack will have been arbitrarily resolved. It is possible that the chosen branching order could impact orientation, but we speculate that if there exists a set of stacked admixture nodes, the utility of orientation will be greatest for the first edge addition that produces an admixture node in a stack (also note that there will be issues with parameter identifiability in the case of stacked admixture nodes). Then, after applying the orientation algorithm, we need form the contracted version of the resulting network and evaluate its likelihood using a tree-based labeling. The entire process is summarized in Algorithm 3.
We want to emphasize that this work is only necessary because we implemented MLNO on top of TreeMix, with the goal of taking advantage of TreeMix's existing functions for fitting parameters, computing likelihood, etc. Moving forward, it would be advantageous, at least from the perspective of computational complexity, to take the approach of admixturegraph (setting all admixture edges to length 0) so that we can optimize parameters without finding a tree-based labeling (and to store a binary tree rather than the contracted version of one). For these reasons, we did not focus on optimizing the compute kernels specific to TreeMix.

Algorithms
Algorithm 1: Exhaustive search ML network in the (gene flow) edge addition neighborhood of N C i ; see Section 1.4 above and Section 4 in the main paper.
Input : A network N C i with tree-based labeling ψ i , and the vector X of observed f -statistics (note that N C i is the contracted version of a binary network) Output: A ML network N C i+1 with tree-based labeling ψ i+1 in the gene flow edge addition neighborhood of N C i (note that N C i+1 is the contracted version of a binary network). If the result is a network with a worse score than N C i , then N C i is returned.
if v t ∈ P root [w s ] then continue // Verify other (local) requirements of TreeMix so that edge additions do not cause numerical issues.
Algorithm 2: Reorient a directed phylogenetic network given a set of admixture nodes and a location for the root using the algorithm from [2].
Input : A directed binary phylogenetic network N , a set A ⊂ V (N ) of vertices to be designated as admixture nodes, and an edge e r ∈ E(N ) on which to place the root. Output: A reoriented version of N Function Reorient(N, A, e r ): // Step 1: Set current/desired in-degrees of vertices, and set edges as undirected. if not e.isDirected then e ← SwapSourceAndTarget(e) X ← X ∪ {e} for e ∈ X do e.isDirected ← T rue; n ← n + 1; t ← GetTarget(e) t.currentIndegree ← t.currentIndegree + 1 if t.currentIndegree == t.desiredIndegree then vqueue.PushBack(t) // Step 4: Check for success. if n == |E(N )| then return N else return N ull Algorithm 3: Exhaustive search ML network in the orientation neighborhood of N C i ; see Section 1.5 above and Section 4 in the main paper. Note that this is pseudocode showing the algorithm structure and that the work related to transforming data structures and evaluating different labelings could be avoided by optimizing parameters differently, i.e. it is a result of building on top of existing code.

Input :
A network N C (with tree-based labeling ψ, h admixture nodes, and root r with edge e r incident to the outgroup), and the vector X of observed f -statistics (note that N C is the contracted version of a binary network) Output: A MLNO network of N C with a tree-based labeling.

Topological Accuracy
We computed topological accuracy as the triplet distance compute the true and estimated admixture graph topology, using the first algorithm from [3] implemented in ntd. Triplets are rooted phylogenetic trees on three leaves and can be identified in a phylogenetic network by restricting it to three populations; if at least one of the populations is admixed, the restricted graph implies multiple triplets. The triplet distance between two networks is the number of triplets in the true graph that are not in the estimated graph plus the number of triplets in the estimated graph that are not in the true graph. Before computing the triplet distance, we rooted all estimated networks at the outgroup-this changes the position of the root but NOT the admixed populations. In some cases, TreeMix identifies gene flow into the outgroup; in this case, we place the root directly above this admixture node, as shown in Figure S2a Figure S4: On M4, TreeMix and OrientAGraph (with -mlno only) recover different directed graph topologies for different population orders. All of the topologies returned were incorrect and had a triplet distance of 14 to the true admixture graph. TreeMix and OrientAGraph (with -mlno only) returned a graph with a log-likehood score of 137 on 60% and 86% of population orders, respectively. Some graphs returned by these methods with these scores are shown above.  Figure S11: On M7, we ran TreeMix and OrientAGraph, both with 100 different population addition orders, selected uniformly at random. TreeMix returned a graph with triplet distance of 20 to the correct admixture graph on 67% of runs. OrientAGraph (MLNO only) returned a graph with triplet distance of 16 on 55% of runs. OrientAGraph (ALLMIGS only and MLNO+ALLMIGS) returned a graph with a triplet distance of 20 on 100% of runs. Both of these graphs seem to have the same base tree, but there is an edge of length 0 (it looks like there is a vertex with outdegree 3, which is not allowed by TreeMix's data structure). Different resolutions in the branching order around this edge would explain why these graphs have the same likelihood score but slightly different triplet scores.  Figure S13: On M7, TreeMix and OrientAGraph find incorrect directed admixture graph topologies with likelihood score of -2883. We gave OrientAGraph the true graph topology as input, so that it could fit its parameters and compute its likelihood score (373). There is a difference in residuals of these graphs. Note that the standard error of all f 2 -statistics was set to 0.0001. Table S1: Results for estimating admixture graphs on given the f -statistics implied by the models in Figure 3 in the main text. Admixture graphs were estimated using TreeMix as well as two versions of admixturegraph , one using the MLNO subroutine only and the other using an exhaustive search for the best admixture edge addition (ALLMIGS). We report the log-likelihood score, the topological error (i.e. the triplet distance) between the model admixture graph and the estimated admixture graph, and the running time in seconds. All methods were run 100 times, as discussed in Section 5 in the main text, so the running time is the average of 100 runs. * Indicates there was a difference in the log-likelihood score or triplet distance for the graphs returned on different runs, in which case, we report the average.