MIPUP: minimum perfect unmixed phylogenies for multi-sampled tumors via branchings and ILP

Abstract Motivation Discovering the evolution of a tumor may help identify driver mutations and provide a more comprehensive view on the history of the tumor. Recent studies have tackled this problem using multiple samples sequenced from a tumor, and due to clinical implications, this has attracted great interest. However, such samples usually mix several distinct tumor subclones, which confounds the discovery of the tumor phylogeny. Results We study a natural problem formulation requiring to decompose the tumor samples into several subclones with the objective of forming a minimum perfect phylogeny. We propose an Integer Linear Programming formulation for it, and implement it into a method called MIPUP. We tested the ability of MIPUP and of four popular tools LICHeE, AncesTree, CITUP, Treeomics to reconstruct the tumor phylogeny. On simulated data, MIPUP shows up to a 34% improvement under the ancestor-descendant relations metric. On four real datasets, MIPUP’s reconstructions proved to be generally more faithful than those of LICHeE. Availability and implementation MIPUP is available at https://github.com/zhero9/MIPUP as open source. Supplementary information Supplementary data are available at Bioinformatics online.


Background
Cancer is an evolutionary disease, with new mutations accumulating over time. Tumor genomes may carry up to thousands of mutations and one of the major challenges in cancer research is to distinguish between driver and passenger mutations. Furthermore, tumors are composed of several genetically distinct subpopulations, each harboring driver mutations. Identifying the set of mutations that belong to each subpopulation may help pinpoint which (gene) mutations are drivers. Moreover, understanding the order in which each driver mutation occurs will provide us with a more comprehensive view of tumor evolution. This can lead to a better understanding (Campbell et al., 2008;Nik-Zainal et al., 2012), and help in diagnosis and therapies (Newburger et al., 2013).
High-throughput sequencing can offer a moderately-priced, genome-wide perspective of the mutations involved in the subclones of a tumor, as opposed to other more targeted methods such as singlecell sequencing, fluorescence in situ hybridization (FISH), or silver in situ hybridization (SISH) (Malikic et al., 2015). However, the main drawback is that, by nature, more cell subpopulations are mixed in each sample.
Given such tumor high-throughput sequencing data, several questions pertain to it: what are the subpopulations of the tumor, in what proportion they occur, and what is the evolutionary relation among them. In case there is an evolutionary relation, the cell subpopulations are also called subclones of the tumor. Various computational methods have been proposed to address these questions, each answering a subset (or all) of them. Some methods assume as input a single sequencing sample from a tumor Schwartz and Shackney, 2010;Strino et al., 2013), whereas, as we will review in Section 1.2 below, other start the analysis with multiple samples.
In this paper we propose a multi-sample method for finding the tumor evolution, called MIPUP (minimum perfect unmixed phylogenies). MIPUP works by solving a problem equivalent to the Minimum-Split-Row problem proposed by Hajirasouliha and Raphael (2014), asking to minimally decompose the samples so that they form a perfect phylogeny. This phylogeny model is a common one, also used by e.g. Malikic et al. (2015), Popic et al. (2015), Jiao et al. (2014), El-Kebir et al. (2015). The method of this paper exploits a relation between perfect phylogenies and branchings in a directed acyclic graph from (Hujdurovi c et al., 2018). Based on it, we give here a simple and efficient Integer Linear Programming (ILP) formulation for this problem.
We tested MIPUP against four other popular tools for discovering the tumor evolution, CITUP (Malikic et al., 2015), LICHeE (Popic et al., 2015), AncesTree (El-Kebir et al., 2015), and Treeomics (Reiter et al., 2017). We also tried testing against PASTRI (Satas and Raphael, 2017), but we could not run it (see the Supplementary Material). Under the perfect phylogeny assumption, over a range of scenarios (read coverage 100 1000 and 10000, a number of samples from 5 to 20) and 100 random trees simulated for each of these scenarios, MIPUP proved the most accurate in reconstructing the shape of the phylogenetic tree. This was measured as a proportion of how many of the original ancestor-descendant relations in the original tree were kept also in the reconstructed tree, as done also in (Popic et al., 2015) and in (El-Kebir et al., 2015). Our experiments show that, with respect to the overall two best performing tools among these four, MIPUP improves this metric by up to 34% for read coverage 100, by up to 11% for read coverage 1000, and by up to 20% for read coverage 10 000. In some cases, MIPUP reconstructs more than 92% of all relations, also on low coverage datasets. MIPUP also appeared resilient to a low number of loss of mutation events, which violate the perfect phylogeny assumption.
We also tested MIPUP and LICHeE on four real datasets. We manually inspected the output of both, and compared them to the reconstructions given in the papers the datasets were published in. We observe that, even though both tools output overall comparable trees, MIPUP's results are generally more faithful to the original reconstructions, and require much less input parameters to fix.

Related work
In this section we review several methods that analyze multi-sample data from tumors. A few methods, such as Salari et al. (2013) and of van Rens et al. (2015), are primarily focused on improving the variant calling results in each sample. Many other methods are instead focused on reconstructing the evolutionary tree of the tumor using multiple samples. Among these latter methods, CITUP (Malikic et al., 2015), LICHeE (Popic et al., 2015) and AncesTree (El-Kebir et al., 2015) assume only the variant allele frequencies (VAFs) of the mutations. Other methods, such as PhyloWGS (Deshwar et al., 2015), Canopy (Jiang et al., 2016), SPRUCE (El-Kebir et al., 2016), also explicitly take into account copy-number aberrations.
Method CITUP works by exhaustively enumerating all possible trees with up to N max nodes (where N max is provided by the user), and decomposing each sample into several nodes of this tree. The fit between each sample and the tree is one minimizing a Bayesian information criterion on the VAF values. This fit is computed either exactly, with quadratic integer programming, or with a heuristic iterative method. The best tree is then output, together with the decompositions of each sample as nodes of this tree.
Method LICHeE also tries to fit the VAF values to a phylogenetic tree, but with an optimized search for such a tree. Mutations are first assigned to clusters based on their frequencies (a mutation can belong to more clusters). Then clusters are transformed to binary absence/presence vectors (with wildcards), based on two thresholds below which, and above which, the value is transformed into a 0 or a 1, respectively. Values in between are marked with a wildcard. The containment relation between these vectors induces a directed acyclic graph. Spanning trees of this graph are exhaustively enumerated, and the ones best compatible with the mutation frequencies are output.
Method AncesTree derives an ILP for the so-called VAF factorization problem (VAFF), namely the problem of determining the composition of each sample, including the number and proportion of clones in each sample, and a tree that describes the ancestral relationships between all clones. As the authors argue, this problem generalizes several previous formulations, including the abovementioned Jiao et al., 2014;Malikic et al., 2015;Strino et al., 2013). The implementation behind AncesTree uses a more complex model than the VAFF problem, that also accounts for errors and is solved with a Mixed ILP. El-Kebir et al. (2015) also argue that in the case of a single input sample, the VAFF problem generalizes the so-called Perfect Phylogeny Mixture Problem also proposed by Hajirasouliha and Raphael (2014), see (El-Kebir et al., 2015, p. i64). Note that El-Kebir et al. (2015) propose an ILP for the initial VAFF problem, which is thus also applicable to the Perfect Phylogeny Mixture Problem. However, this problem is not equivalent to the problem underlying MIPUP, as it only asks for some decomposition of the samples into a perfect phylogeny, not necessarily a minimal one. Therefore, we cannot directly compare the efficiency of the ILP from this paper with the ILP of El-Kebir et al. (2015). See Table 1 for an overview of the advances relative to these two problems.

Overview of the approach
In this section we give an informal overview of our approach. We refer the reader to Figure 1 for a visual overview.
Assume we obtained samples r 1 ; . . . ; r m from a tumor. Using a somatic point mutation caller, such as VarScan 2 (Koboldt et al., 2012), we can detect the somatic single nucleotide variants (SSNVs) present in each sample and derive their VAF values from the read alignments over their positions. Denote these SSNVs by c 1 ; . . . ; c n . We then build a binary matrix M with rows labeled r 1 ; . . . ; r m and columns labeled c 1 ; . . . ; c n , such that M i;j ¼ 1 if and only if the VAF value of SSNV c j in sample r i is greater or equal to a given threshold t.
Matrix M is the input to our problem. From it, we would like to infer (i) the individual subclones of the tumor making up each sample r i (i.e., the binary pattern of SSNVs in each such subclone) and (ii) the evolutionary relation among these unknown subclones.
Let us now make these notions more precise. In this paper we consider the model and problem formulation proposed by Hajirasouliha and Raphael (2014). This considers as evolutionary relation among the tumor subclones the so-called perfect phylogeny model, in line with previous studies such as (El-Kebir et al., 2015;Jiao et al., 2014;Malikic et al., 2015;Popic et al., 2015). This assumes that (i) all mutations in the parent cells are passed to the descendants, and (ii) once a mutation occurs at a particular site, it does not occur again at that site (the "infinite sites assumption"). Being mixtures of subclones of the tumor, the rows of M may not necessarily form a perfect phylogeny. Thus, we would like to split each row r i of M into a set of rows R i so that the resulting matrix M 0 does correspond to a perfect phylogeny. (See Definition 2.2 for a formal definition of the split operation, and Figure 2 for an example of a matrix M and a matrix M B obtained by splitting the rows of M.) Hajirasouliha and Raphael (2014) proposed to perform this split so that the resulting matrix is "minimal". Such parsimony criterion is often employed when modeling real-life problems, and it is one of the most basic investigations one can perform.
More specifically, Hajirasouliha and Raphael (2014) proposed that M 0 has the minimum number of rows. In terms of perfect phylogeny trees, this means that we are looking to split each sample into a collection of subclones forming a perfect phylogeny, and the total number of subclones from all samples is minimum. We will call this problem MinimumConflict-FreeRowSplit (MCRS), see Section 2.2. Hajirasouliha and Raphael (2014) claimed that the MCRS problem is NP-hard (and gave an incorrect proof), and in (Hujdurovi c et al., 2015, 2016) a correct hardness proof was given. Hujdurovi c et al. (2016) also proposed a polynomial-time heuristic algorithm for it based on coloring co-comparability graphs and tested it on real samples.
As opposed to the above heuristic algorithm, in this paper we propose an exact algorithm for the MCRS problem. We obtain this by using a recent result from (Hujdurovi c et al., 2018) showing that the problem is equivalent to a problem related to finding an optimal branching in a directed acyclic graph. A branching is a subgraph in which every vertex has out-degree at most 1. We formally describe this correspondence in Section 2.3. Using this branching formulation, we then show in Section 2.4 that the MCRS problem can be expressed using ILP, and solve it using the CPLEX ILP solver.
See Table 1 for a summary of these results.

Problem formulation
A binary matrix M 2 f0; 1g mÂn is a matrix having m rows and n columns, and all entries 0 or 1. Each row of such a matrix is a vector in f0; 1g n ; each column is a vector in f0; 1g m . We will denote by R M ¼ ðr i Þ 1 i m and C M ¼ ðc j Þ 1 j n the families of rows and columns of M, respectively. The entry of M at row r i and column c j will be denoted by M i;j or M ri;j when appropriate. For brevity, we will often write "the number of distinct rows (resp., columns) of M" to mean Overview of the approach. In (a) we illustrate a tumor with six subclones labelled A; . . . ; F . In (b) we illustrate a binary matrix M 0 such that every row is a tumor subclone, and every column is an SSNV found in at least one of the subclones (here the SSNVs are labeled c1; . . . ; c8). A 1 indicates presence and a 0 indicates absence of that SSNV in a subclone. In (c) we show the perfect phylogeny tree that gave rise to these patterns of mutations; here every subclone is a leaf of the tree and every SSNV labels an edge (and only one) of the tree. The SSNVs present in a subclone are the ones labeling the path from the root of the tree to the corresponding leaf. For example, the SSNVs present in subclone A are fc1; c4; c6g, which are the same as the columns containing a "1" on row A in matrix M from (b). In practice, each sequencing sample may generally contain more than a single subclone of a tumor. In (d) we show four samples r1; . . . ; r4 sequenced from the tumor, some combining more than one subclone. In (e) we show the binary matrix M indicating presence/absence of the SSNVs in each of these four samples. Observe that each row r i of M is the bitwise OR of the binary rows of M 0 corresponding to the subclones that are in sample r i . For example, sample r 1 contains subclones A, B, C, and thus row r 1 of M is the bitwise OR of rows A, B, C of M 0 . Figure 1f shows the same perfect phylogeny tree as in (c), in which we again mark the phylogeny nodes being combined in each sample r i . Matrix M is the input to our problem, and matrix M 0 and the phylogeny tree corresponding to M 0 are the unknowns that must be reported in output "the maximum number of pairwise distinct rows (resp., columns) of M". Two rows (resp., columns) are considered distinct if they differ as binary vectors. All binary matrices in this paper will be assumed to contain no row in which all entries are 0. DEFINITION 2.1. Given a matrix M, three distinct rows r, r 0 ; r 00 of M and two distinct columns i and j of M, we denote by M½ðr; r 0 ; r 00 Þ; ði; jÞ the 3 Â 2 submatrix of M formed by rows r, r 0 ; r 00 and columns i, j (in this order). Two columns i and j of a binary matrix M are said to be in conflict if there exist rows r; r 0 ; r 00 of M such that M½ðr; r 0 ; r 00 Þ; ði; jÞ ¼ We say a binary matrix M is conflict-free if there exist no two columns of M that are in conflict.
The rows of a binary matrix M are the leaves of a perfect phylogenetic tree if and only if M is conflict-free, see (Estabrook et al., 1975;Gusfield, 1997). Moreover, if this is the case, then the corresponding phylogenetic tree can be retrieved from M in time linear in the size of M (Gusfield, 1991). As such, we formulate our problems just in terms of finding optimal conflict-free matrices. REMARK 2.1. We are following here the formalism on perfect phylogenies from (Gusfield, 1991). Namely, each row of a matrix is a leaf of the phylogenetic tree, and columns label edges. However, a leaf whose in-coming edge has no label is in fact an internal node of the evolution, that is, it has no "private" mutations. See for example Figure 1c where leaves C and E have no labels on the in-coming edges. We follow the same formalism in the trees output by MIPUP, see Figure 3. DEFINITION 2.2. Let M 2 f0; 1g mÂn . Label the rows of M as r 1 ; r 2 ; . . . ; r m . A binary matrix M 0 2 f0; 1g m 0 Ân is a row split of M if there exist a partition of the set of rows of M 0 into m sets R 1 ; R 2 ; . . . R m such that for all i 2 f1; . . . ; mg, r i is the bitwise OR of the binary vectors in R i . The set R i of rows of M 0 is said to be the set of split rows of row r i (with respect to M 0 ). For simplicity, we defined a row split as a binary matrix M 0 for which a suitable partition of rows exists. However, throughout the paper we will make a slight technical abuse of this terminology by considering any row split M 0 of M as already equipped with an arbitrary (but fixed) partition of its rows R 1 ; . . . ; R m satisfying the above condition.
We denote by cðMÞ the minimum number of rows in a conflictfree row split M 0 of M. Formally, the minimum conflict-free row split problem is defined as follows: MinimumConflict-FreeRowSplit (MCRS): Input: A binary matrix M. Task: Compute cðMÞ and find a conflict-free row split M 0 of M with cðMÞ rows.  (Gerlinger et al., 2014). The last row of square gray nodes in the trees of MIPUP are the original samples. The oval nodes are the rows in which the input matrix is split. Notice that, due to our tree building algorithm, they are drawn as leaves of the phylogeny. However, if their in-coming edge has no label (i.e., no mutations occurring on that edge) then they are actually internal nodes of the evolution, recall Remark 2.1. For example, node R1 is internal to the evolution. Arrows indicate the composition of the original samples in terms of split rows. The legend contains the equalities among split rows; only one split row in each equality class is a node of the tree

The branching formulation
In this section we review the formulation from (Hujdurovi c et al., 2018) of the MCRS problem in terms of branchings in a directed acyclic graph (DAG). We refer the reader to (Hujdurovi c et al., 2018) for the proof of this equivalence. In Section 2.4 we will use this formulation to write an ILP for the problem. The announced equivalence between the MCRS and the MUB problems is captured in the following result.   Figure 2 for an example of a binary matrix M with two branchings B 1 and B 2 of its containment digraph and the corresponding row splits.
The proof of Theorem 2.1 from (Hujdurovi c et al., 2018) shows that the B-split of M is conflict-free and has jUðBÞj rows. This means that if we have a branching minimizing jUðBÞj, then the B-split of this branching is an optimal solution for the MCRS problem.

ILP formulation
The notion of B-split can be used to transform an optimal solution to the problem of computing one of the parameters fb; cg to an optimal solution for the other parameter. The problem formulation in terms of b is directly expressible in terms of packing and covering constraints, and thus leads to a natural integer programming formulation of the MUB problem. We will express the ILP only in terms of finding the value bðMÞ. However, the optimal branching attaining this value can be trivially retrieved from the values of the variables in an optimal solution of the ILP. REMARK 2.2. It is easy to check that the decision version of the MCRS problem is in NP and thus admits a polynomially-sized certificate. Furthermore, since Integer Linear Programming is NP-hard, it follows that there exists a polynomially sized ILP formulation of the MCRS problem. However, applying Theorem 2.1 allows to obtain a direct and simple polynomially-sized ILP formulation for it, which will also turn out to be efficient in practice.
Let M be the input binary matrix to the problem, and let D M ¼ ðV; AÞ be its containment digraph. Our goal is to find a branching B of D M minimizing the number of elements in U(B). We introduce the following binary variables: • for every edge ðu; vÞ 2 A, we introduce a variable x u;v with the intended meaning that x u;v ¼ 1 if and only if ðu; vÞ 2 B; • for all v 2 V and for all r 2 v, we introduce a variable y r;v , meaning y r;v ¼ 1 if and only if r is uncovered in v with respect to B.
Consider the following integer program: min P v2V P r2v y r;v subject to X ðu;vÞ2A THEOREM 2.2. The optimal value of the above integer program is bðMÞ.
PROOF. Let OPT denote the optimal value of the above ILP.
First, we prove that OPT bðMÞ. Let B be a branching of D M such that jUðBÞj ¼ bðMÞ. Define a binary vector x 2 f0; 1g A by setting x u;v ¼ 1; if ðu; vÞ 2 B; 0; otherwise: For every v 2 V and every r 2 v set y r;v ¼ 1 if and only if r is uncovered in v with respect to B. The objective function value at (x, y) equals to the sum, over all v, of the number of uncovered elements in v with respect to B, that is, the size of U(B). The definition of a branching implies that constraints (1) are satisfied. Consider now a constraint of type (1). Let v 2 V and r 2 v. If y r;y ¼ 1, then the constraint holds due to the non-negativity of the x-variables. If y r;v ¼ 0, then r is covered in v with respect to B. This implies that there exists an arc ðu; vÞ 2 B such that r 2 u. Since ðu; vÞ 2 B, it holds x u;v ¼ 1 and thus the constraint is satisfied in this case. It follows that (x, y) is a feasible solution of the ILP with objective function value jUðBÞj, therefore OPT jUðBÞj ¼ bðMÞ. The proof of the other inequality is similar. Let (x, y) be an optimal solution to the ILP and let B be the set of arcs ðu; vÞ 2 A such that x u;v ¼ 1. Constraints (1) guarantee that B is a branching of D M . Constraints (2) and the optimality of (x, y) imply that for all v 2 V and all r 2 v, we have y r;v ¼ 1 if and only if P u2N À A ðvÞ:r2u x u;v ¼ 0. Indeed, if the above sum is at least 1, then setting y r;v to 0 would result in a feasible solution with strictly smaller objective function value. Therefore, y r;v ¼ 1 if and only if ðu; vÞ 6 2 B for all u 2 N À A ðvÞ such that r 2 u, which is in turn equivalent to the condition r 6 2 [ v 0 2N À B ðvÞ v 0 , that is, r is uncovered in v (with respect to B). It follows that the objective function value at (x, y) equals the total number of uncovered pairs, that is, the size of U(B). We conclude that B is a branching such that jUðBÞj ¼ OPT, which implies bðMÞ OPT. h The above integer program has p ¼ jAj þ P v2V jvj binary variables and q ¼ jVj þ P v2V jvj constraints. In terms of the binary matrix M, the numbers of variables and constraints can be described as: , and o denote the number of columns, the number of comparable pairs of columns (with respect to the containment relation), and the number of ones in the matrix obtained by taking from M exactly one copy from each set of identical columns, respectively. If M is m Â n, then the number of variables is Oðnðm þ nÞÞ and the number of constraints is O(mn).

Implementation
MIPUP is implemented in Java and uses the CPLEX ILP solver. MIPUP can report all optimal solutions, or at most a user-provided number of optimal solutions. The input format is the same as for LICHeE, namely a matrix with VAF values of each SSNV in each sample. As input we also assume a threshold t to transform VAF values into binary ones. LICHeE applies a further filtering to the input matrix, namely removing those weak SSNVs whose binary presence/absence pattern in the samples appears strictly less than k times (option minClusterSize) in the entire matrix (default k ¼ 2). We also provide a Python script that, given t and k, filters the matrix in this manner.
Apart from an optimal conflict-free row split binary matrix, MIPUP also outputs the perfect phylogeny tree corresponding to it. We label each edge of the tree with the set of mutations that occurred along the edge. The label format is Sjnjmean6std, where S is an internal name for the group of mutations (the mutations corresponding to each group are output in a separate file), n is the cardinality of S, mean is the mean value of their VAF values, in all samples, and std is the standard deviation of their VAF values. See the caption of Figure 3 for further details on the layout of the phylogenetic trees.

Simulated data
We performed an evaluation of simulated data as done in (El-Kebir et al., 2015) and in (Popic et al., 2015). Our evaluation pipeline is freely available at https://github.com/huanyannizu/Data-simulation-and-evaluation-in-MIPUP. We created uniformly at random a tree with c nodes (i.e., clones), and randomly chosen a node as root. This was done using an algorithm based on Prü fer's encoding of a labeled tree (Prü fer, 1918). We randomly assigned n mutations to the nodes of this tree, making sure each node gets at least one mutation. Our main experiments are with c ¼ 10 and n ¼ 100, as in (El-Kebir et al., 2015). In order to see how the tools scale, we also tested MIPUP, LICHeE, and Treeomics with c ¼ 20, n ¼ 200 and c ¼ 30, n ¼ 300.
Note that, under the perfect phylogeny assumption, the mutations in a node must be iteratively propagated to all descendants of a node. To test also loss of mutation events, we added a further parameter d 2 f0; 1; . . . ; 9g that denotes the number of times one of these propagation events of a mutation in some node v (that may have originated in v or in an ancestor of v) is not propagated to a child u of v (and thus to none of the descendants of u). Note that d ¼ 0 corresponds to the perfect phylogeny assumption. We then assigned to each node a random cell population size between 100 and 200.
We created a number of m samples from the tree as follows. Each sample randomly selects 2-4 nodes of the tree, and will include all cells and mutations in those nodes. As in (El-Kebir et al., 2015), we then created three matrices, U, B, F: usage matrix U 2 R mÂc is such that an entry (r i , c j ) contains the fraction of cells of clone c j out all the cells in sample r i ; clonal matrix B 2 f0; 1g cÂc is such that an entry (c i , c j ) equals 1 iff c i ¼ c j , or c i is a descendant of c j in the tree; VAF value matrix F 2 R mÂc equals 1 2 UB and contains the true VAF values of all mutations in each clone. See (El-Kebir et al., 2015, Fig. 1) for details. We then unpack matrix F into F unpack 2 R mÂn , which has a column for each mutation, so that the column corresponding to mutation m j from clone c k is the same as column c j of F.
Note that tools MIPUP, LICHeE and CITUP accept in input VAF values. However, tools AncesTree and Treeomics require reads counts. For this reason, we simulated reads counts as in done in (El-Kebir et al., 2015). Given a read coverage a 2 f100; 1000; 10 000g, we draw the number of reads containing mutation m j in sample r i as y ri;mj $ PoissðaÞ. We then draw the number of reads containing the variant allele as x ri;mj $ Bionomialðy ri;mj ; F ri;mj Þ. The number of reads containing the reference allele is y ri;mj À x ri;mj . The values x ri;mj =y ri;mj are thus noisy VAF values that are used as input also for MIPUP, LICHeE, CITUP.
We evaluated how well the tools are able to reconstruct the original tree, as done in (Popic et al., 2015) and (El-Kebir et al., 2015). Given the original tree, and given two mutations m i and m j in clones c i and c j , we say that m i is an ancestor (resp. descendant) of m j if c i is an ancestor (resp. descendant) of c j . An AD pair is an ordered pair (m i , m j ) of mutations such that m i is an ancestor of m j . Note that two mutations in the same node are not an AD pair. Given an output tree reported by each tool, we computed the fraction of AD pairs in the original tree that were present in the output tree.
Note that MIPUP, CITUP, and Treeomics can report more output "best" trees. (In MIPUP's case, unless otherwise stated, we output all optimal trees.) In this case, we report three results for them, "Best"-the tree achieving the best results under our metric; "Avg"-the average metric over all reported trees, and "Std"-their standard deviation. Note that results "Best" are usually unattainable in practice.
The results for ðc; nÞ ¼ ð10; 100Þ are in Table 2. For none, or very few, losses of mutation (d 2) MIPUP is generally the best performing tool. As d increases, Treeomics becomes the best performing tool. However, for large values of d, the results of all tools are significantly worse than under the perfect phylogeny assumption (d ¼ 0). See Table 2 for results for d ¼ 9 and the Supplementary Material for all other values of d. While it appear that Treeomics produces better results as we increase the number of loss mutation events, it is worth noting that all models (MIPUP, CITUP, Treeomics and LICHeE) do assume the perfect phylogeny model. Manually checking the outputs, we observe that one reason why MIPUP performs better is that other tools (especially CITUP and AncesTree) combine more parent-child nodes of the initial tree into a single node, and thus are not able to recover the initial AD pairs from these nodes (for example, in a few cases, AncesTree outputs a tree made up of a single node).
As seen from Table 3, MIPUP (even when outputting all optimal solutions) and LICHeE generally run in less than two seconds, and Treeomics generally runs in less than one minute. The running time of CITUP and AncesTree is an order of magnitude higher and more variable.

Real data
We experimented on four real datasets: ultra-deep-sequencing of clear cell renal cell carcinoma (ccRCC) (Gerlinger et al., 2014) (also analysed by LICHeE), high-grade serous ovarian cancer (HGSC) by (Bashashati et al., 2013), breast cancer xenoengraftment in immunodeficient mice (Eirew et al., 2015) and (four) clonally related uterine leiomyomas (Mehine et al., 2015). The first three datasets are public and were also considered by Popic et al. (2015). The public datasets can also be found in the MIPUP repository, together with the experiment results, and the scripts and parameters used to run them. We ran only MIPUP and LICHeE on these real datasets.
In Supplementary Table S1 we show an overview of the sizes of the input matrices. In Figure 3 we show the results on the RMH008 samples from the ccRCC study of Gerlinger et al. (2014). The results on other samples are shown and discussed in the Supplementary Material.
Even though the results of LICHeE and MIPUP generally agree, in many instances there are many slight differences among them, and MIPUP is generally closer to the original phylogenies proposed in the papers analyzing the datasets. For example, on sample RMH008 from Figure 3, MIPUP reports that samples R6 and R4 are combinations of two phylogeny nodes, which lie on a tree branch together with R1, R2, and R3, and on a tree branch together with R5, R7, and R8. This is in line with LICHeE's prediction and with (Gerlinger et al., 2014). However, there are some differences: in line with (Gerlinger et al., 2014) (right branch), MIPUP reports that R6 is made up of some SSNVs common only to R3, as opposed to all of R1, R2, R3 in LICHeE's case. It also reports that R6 is made up of SSNVs common to R4, R5, R7 (node R6_2), in line with (Gerlinger et al., 2014) (left branch), as opposed to all of R4, R5, R7, R8 in LICHeE's case.
Moreover, in order to run LICHeE accurately, the user must guess many input parameters, while in MIPUP's case the user must fix only one, the threshold for converting a VAF value into a binary one. In fact, for many of the samples in the ccRCC dataset analyzed by LICHeE, the input parameters were chosen by LICHeE's authors as different from the default values.

Conclusion
MIPUP solves exactly and efficiently a natural problem related to minimally unmixing sequencing samples so that they fit a perfect phylogeny. We tested MIPUP against a large number of competing tools, and shown that MIPUP reconstructs the original tree (under the ancestor-descendant metric) significantly better. On real data, MIPUP generally has more faithful reconstructions than LICHeE, with much less input parameters to guess correctly. On the methodological side, MIPUP's novelty is in the reduction of a phylogeny problem to a branching problem and in the search for the optimum phylogeny embedded in the ILP formulation itself.
We believe that MIPUP's performance stems from two ingredients. First, from a much simpler problem formulation. Second, MIPUP's most significant increase in performance is for low read coverage, where noisy data can have greater effects on methods using VAF values explicitly. MIPUP transforms VAF values to binary ones. Since MIPUP does not try to reconstruct the proportion of each clone in each sample, but only their ancestral relation, this suggests that transforming VAF values into binary ones is actually a more resilient choice for this scenario and thus an advantage for MIPUP.