## Abstract

**Motivation:** High-throughput sequencing of tumor samples has shown that most tumors exhibit extensive intra-tumor heterogeneity, with multiple subpopulations of tumor cells containing different somatic mutations. Recent studies have quantified this intra-tumor heterogeneity by clustering mutations into subpopulations according to the observed counts of DNA sequencing reads containing the variant allele. However, these clustering approaches do not consider that the population frequencies of different tumor subpopulations are correlated by their shared ancestry in the same population of cells.

**Results:** We introduce the *binary tree partition* (*BTP*), a novel combinatorial formulation of the problem of constructing the subpopulations of tumor cells from the variant allele frequencies of somatic mutations. We show that finding a BTP is an NP-complete problem; derive an approximation algorithm for an optimization version of the problem; and present a recursive algorithm to find a BTP with errors in the input. We show that the resulting algorithm outperforms existing clustering approaches on simulated and real sequencing data.

**Availability and implementation:** Python and MATLAB implementations of our method are available at http://compbio.cs.brown.edu/software/

**Contact:**braphael@cs.brown.edu

**Supplementary information:**Supplementary data are available at *Bioinformatics* online.

## 1 INTRODUCTION

Cancer is a disease driven by somatic mutations that accumulate in the genome during the lifetime of an individual. High-throughput sequencing technologies now provide an unprecedented ability to measure these somatic mutations in tumor samples (Ding *et al.*, 2013). Application of these technologies to cohorts of cancer patients has revealed a number of new cancer-causing mutations and cancer genes (Kandoth *et al.*, 2013; Lawrence *et al.*, 2013; Vogelstein *et al.*, 2013). Cancer sequencing studies have also demonstrated that most tumors exhibit extensive *intra-tumor heterogeneity* characterized by individual cells in the same tumor harboring different complements of somatic mutations (Ding *et al.*, 2012; Gerlinger *et al.*, 2012; Nik-Zainal *et al.*, 2012; Schuh *et al.*, 2012; Shah *et al.*, 2012). Such heterogeneity is a consequence of the fact that cancer is an evolutionary process in a population of cells. The *clonal theory* of cancer evolution (Nowell, 1976) posits that the cells of a tumor descended from a single founder cell. This founder cell contained an advantageous mutation leading to a clonal expansion of a large population of cells descended from the founder. Subsequent clonal expansions occur as additional advantageous mutations accumulate in descendent cells. A sequenced tumor sample thus consists of multiple subpopulations of tumor cells from the most recent clonal expansions (Fig. 1).

Nearly all cancer sequencing efforts thus far sequence DNA from a *single* sample of a tumor at a *single* time. This is because of technical limitations: the most cost-effective DNA sequencing technologies (e.g. Illumina) require input DNA from many tumor cells, and such samples are typically available only when patients undergo surgery (Occasionally, paired samples from two time points, such as a primary tumor and a metastasis, are also sequenced.). While sequencing of multiple samples from the same tumor (Gerlinger *et al.*, 2012; Newburger *et al.*, 2013; Salari *et al.*, 2013) or single-cell sequencing (Hou *et al.*, 2012; Navin *et al.*, 2011; Xu *et al.*, 2012) might eventually provide even better datasets to assess intra-tumor heterogeneity, technical considerations have limited their applicability utility thus far. Thus, there is tremendous interest in methods that infer the relative proportion of different subpopulations of tumor cells in a single sample.

Several recent studies (Ding *et al.*, 2012; Nik-Zainal *et al.*, 2012; Shah *et al.*, 2012) have demonstrated that it is possible to infer the subpopulations of tumor cells by counting the number of DNA sequence reads that contain a somatic mutation. For single-nucleotide mutations, or variants, the *variant allele frequency (VAF)* is defined as the fraction of DNA sequence reads covering the variant position that contains the variant allele rather than the reference/germ line allele. The VAF provides an estimate of the fraction of tumor chromosomes containing the mutation, but with error due to the stochastic nature of the sequencing process (Fig. 1). In addition, technologies currently used in cancer sequencing studies produce short reads that rarely contain more than one somatic mutation. Thus, for any pair of somatic mutations, the only information available to distinguish their subpopulation of origin is the VAF.

To overcome substantial variability in measured VAFs, a common approach is to cluster VAFs and from these clusters infer the number and proportion of various subpopulations of tumor cells in the sample. A number of techniques have been introduced to perform this clustering, with Dirichlet Process Mixture models and related non-parametric models being particularly popular, as they do not fix the number of clusters in advance (Miller *et al.*, forthcoming; Nik-Zainal *et al.*, 2012; Shah *et al.*, 2012). VAF clusters correspond to tumor subpopulations, and the *cellular fraction*, or fraction of tumor cells containing the cluster of somatic mutation, is derived from the VAF of the clusters. In the simplest case, the VAF directly determines the cellular fraction: e.g. a cluster with VAF = 0.5 corresponding to homozygous mutations in 50% of tumor cells, or heterozygous mutations in 100% of tumor cells. However, in practice, the inference of cellular fraction is complicated by copy number aberrations and normal admixture (percentage of the tumor sample that is normal cells), and these two factors are themselves correlated (Carter *et al.*, 2012; Oesper *et al.*, 2013; Strino *et al.*, 2013). We will not consider such complications here; rather we restrict attention to heterozygous somatic mutations outside of copy number aberrations; assumptions that were made in sequencing studies including Ding *et al.* (2012).

With two exceptions (Jiao *et al.*, 2014; Strino *et al.*, 2013), current techniques for clustering VAFs treat each subpopulation independently and do not consider that these frequencies are correlated by the fact that they are partitions of the same cellular population. Thus, these approaches do not explicitly construct an evolutionary history of the accumulated somatic mutations in the cells. Strino *et al.* (2013) performs a heuristic search over possible trees, and we discuss Jiao *et al.* (2014) below.

*Contributions*. In this article, we formulate the problem of inferring subpopulations of tumor cells from VAF data obtained from a single tumor sample as the combinatorial problem of constructing a *Binary Tree Partition (BTP)*. We show that the problem of finding a BTP is NP-complete and present a $23\u2212o(1)$-approximation algorithm for a related $max\u2061$-BTP problem. The approximation algorithm is based on *Local Search* and we use the well-known packing bound of Hurkens and Schrijver (1989) for the purpose of analysis. Next, we define ε-BTP, a generalization of the BTP problem that allows for the possibility that VAFs are observed with errors and some VAFs are not observed. We present a straightforward recursive algorithm to find an ε-BTP and show that this algorithm outperforms existing VAF clustering approaches on simulated and real data. This recursive algorithm is fast in practice and runs in less than a minute, on a single CPU, for each of our simulated or real samples.

## 2 TUMOR SUBPOPULATIONS AND THE BTP

In this section, we formulate the problem of determining tumor subpopulations from VAF clusters. The clonal theory of cancer evolution proposes that the cancerous cells in a tumor are the result of multiple waves of somatic mutation and clonal expansion. Given the relationship between sequence coverage (1000–10 000× for targeted studies) and number of tumor cells that are sequenced in a tumor sample (millions), we first assume that any somatic mutation reported in the data must be present in an appreciable fraction of tumor cells. This implies that the observed somatic mutations were present in at least one clonal expansion. Second, as in other recent studies (Jiao *et al.*, 2014; Salari *et al.*, 2013), we assume that somatic mutations follow the infinite sites assumption such that at most one single mutation occurs at a genomic locus (e.g. single position) during the evolution of the tumor. It follows from this assumption that if a mutation β occurs in a tumor cell subsequent to a mutation α, then the fraction of cells containing mutation α must be at least as large as the fraction of cells containing β. This condition was also recently noted in Jiao *et al.* (2014).

We assume that at any particular time in the cancer progression *at most* one cell in the tumor population acquires a new mutation leading to a clonal expansion. We emphasize that this assumption restricts only the number of clonal expansions that *begin* at a given time and not the number of clonal expansions that are ongoing at one time. Under this assumption, each clonal expansion splits the present tumor cell population into exactly two subpopulations: the subpopulation *P* of cells containing the newly acquired somatic mutation and the subpopulation *P′* of cells without the mutation (and possibly with a different set of new mutations occurring later in time). Thus, the ancestral history of the sequenced tumor cell population is represented by a rooted *binary* tree with nodes corresponding to populations of cells at each clonal expansion, and edges indicating the ancestral relationships between these populations. Consequently, each node *v* in the tree has a set *M _{v}* of somatic mutations that accumulate along the unique edge

*p*that connects

_{v}*v*to its parent (As the root

*r*does not have a parent, the set

*M*represents the somatic mutations that accumulate before the first clonal expansion. Alternatively, we can let

_{r}*p*be a

_{r}*hidden*edge that connects the founder cell of the tumor (represented by the root

*r*) to the normal cell from which it is derived.). Each node

*v*also has a frequency

*a*representing the proportion of sequenced tumor cells with mutations

_{v}*M*(Fig. 1). Note that most tumor samples contain admixture by normal cells with no detectable somatic mutations, and thus in general

_{v}*a*< 1. A consequence of these assumptions is that every internal node

_{r}*v*in the tree satisfies the

*children sum to parents (CSP)*condition: $\u2211u child of vau=av$. We make the following definition:

Given a multiset $L={a1,\u2026,an}$ with 0 < *a _{i}* ≤ 1, a BTP for $L$ is a complete rooted weighted binary tree

*T*= (

*V*,

*E*) with nodes $V={v1,\u2026,vn}$ such that

*v*has weight

_{i}*a*and every internal node satisfies the CSP condition.

_{i}Figure 1 shows a simple example of a complete binary tree in which the CSP conditions are satisfied for every internal node (also see Supplementary Appendix D.6 for a more general example). Recall that a complete rooted binary tree is a binary tree wherein there is a unique node of degree 2 (the root), and every node in the tree is either a leaf or has exactly two children. Our goal is to construct such a tree from the measured VAF data. We define the following.

Given a multiset $L={a1,\u2026,an}$, find a BTP for $L$ if one exists.

Note that in some cases, at the split defined by a clonal expansion, cells from *P ′* may survive to the present, without undergoing additional clonal expansions (e.g. such cells may cease dividing, or senesce). In this case, there are no mutations that exclusively occur in *P′*. We discuss this case in Section 5 below.

## 3 COMPLEXITY OF the BTP PROBLEM

In this section, we outline the proof of the following theorem.

*The BTP problem is NP-complete.*

The proof of this theorem relies on the idea of finding a set of *conflict-free triangles* in the multiset $L$. This idea is also useful below for deriving an approximation for a related problem of finding a $max\u2061$-BTP, and so we now define the relevant concepts.

Suppose $L={a1,\u2026,an}$ is a multiset of *n* elements. For any distinct *i*, *j* and *k* such that *a _{i}* +

*a*=

_{j}*a*, we define the ordered pair $(k,{i,j})$ as a

_{k}*triangle*in the multiset. See Supplementary Appendix A.1 for an example. We call

*k*as the

*peak*of the triangle and

*i*,

*j*as the

*tails*of the triangle.

We say that triangles $t=(k,{i,j})$ and $t\u2032=(k\u2032,{i\u2032,j\u2032})$ are *in conflict* if *k* = *k′* or ${i,j}\u2229{i\u2032,j\u2032}\u2260\u2205$. In other words, two triangles are in conflict if and only if they either share a common peak or a common tail. A set *Z* of triangles is *conflict-free* if no pair of triangles in *Z* are in conflict. If *T* is a BTP for $L$, for each internal node *a _{k}* and its children

*a*and

_{i}*a*, we have

_{j}*a*=

_{k}*a*+

_{i}*a*, by CSP. Therefore, $(k,{i,j})$ is a triangle in $L$, and thus,

_{j}*T*corresponds to a set of conflict-free triangles in $L$: each internal node and its children form a triangle in $L$, and no two triangles share a common peak or a common tail.

Because the number of nodes in a complete binary tree is always odd, $|L|$ being an odd number is a necessary condition for the existence of a BTP for $L$. For a multiset $L$, with $|L|=2q\u22121$, the size of a conflict-free set of triangles is at most *q* − 1. This is because each triangle has exactly two tails, and in a set of conflict-free triangles, all the tails must be distinct elements.

We have the following theorem, whose proof is in Supplementary Appendix A.2.

*Suppose *$L={a1,\u2026,a2q\u22121}$*. *$L$*has a BTP if and only if there exists a set of q **−** 1 conflict**-**free triangles in *$L$*.*

The proof of NP-completeness of the BTP problem (Theorem 3) is by a reduction from the Numerical Matching with Target Sums (NMTS) problem, (Garey and Johnson, 1979). An instance of NMTS is a triple $I=(X,Y,B)$ where $X,Y,B\u2282\mathbb{Z}+$, and $X={x1,\u2026,xm},Y={y1,\u2026,ym}$ and $B={b1,\u2026,bm}$. The goal is to find two permutations $\pi X$ and $\pi Y$ on ${1,\u2026,m}$ such that $x\pi X(i)+y\pi Y(i)=bi$ for *i* = 1 … ,*m*.

*Let *$I=(X,Y,B)$*be an instance of **NMTS**. Then, *$I$*has a solution if and only if a particular multiset *$LI$*has a BTP (i.e. a set of 2m **−** 1 conflict-free triangles). Moreover, each solution of *$I$*can be obtained (in polynomial time) from a BTP of *$L$*, and vice versa.*

For a given instance $I=(X,Y,B)$, let $x^i=4(m+1)xi$$+1,y^i=4(m+1)yi+3$ and $b^i=4(m+1)bi+4,1\u2264i\u2264m$. Moreover, let $\beta j=\u2211k=1jb^k,2\u2264j\u2264m$. Now we construct an instance $LI$ of BTP (i.e. a multiset), with 4*m* − 1 elements as follows: $LI={x^1,\u2026,x^m}\u222a{y^1,\u2026,y^m}\u222a{b^1,\u2026,b^m}\u222a$${\beta 2,\u2026,\beta m}.$

(⇒) First assume that we have two permutations $\pi X,\pi Y$ on ${1,\u2026,m}$ such that $x\pi X(i)+y\pi Y(i)=bi.$ By definition of $x^i,y^i$ and $b^i$ we have also $x^\pi X(i)+y^\pi Y(i)=b^i$. Now we construct a set of conflict-free triangles for $LI$ of size 2*m* − 1: for each $i\u2208{1,\u2026,m}$, add the triangle $(b^i,{x^\pi X(i),y^\pi Y(i)})$. In addition, for each $i\u2208{1,\u2026,m\u22121}$, we add all triangles $(\beta i+1,{b^i,b^i+1})$. Thus, by Theorem 4 we obtain a BTP from π* _{X}* and π

*in polynomial time.*

_{Y}(⇐) Suppose *S* is a set of 2*m* − 1 conflict-free triangles for $LI$. We claim that for each $x^i$ a triangle $(b^\gamma (i),{x^i,y^\alpha (i)})$ exists, where α and γ are two permutations on {1, … ,*n*}. Note that this completes the proof, by taking π* _{X}* = γ

^{−}

^{1}and $\pi Y=(\gamma \u22121\u25cb\alpha )$ for the instance $I$. Node $x^i$ cannot be the root, as the largest number in $LI$ is β

*. Therefore, $x^i$ has a sibling*

_{m}*s*and a parent

*p*. For all $i\u2208{1,\u2026,m}$, we have $x^i=1,y^i=3,b^i=4$ and β

*= 4*

_{i}*i*, all in

**mod**4(

*m*+ 1). Thus, if $x^i+s=p\u2208LI$, we have $s=3(mod\u20614(m+1))$ and $p=4(mod\u20614(m+1))$. This implies $s=y^\alpha (i)$ and $p=b^\gamma (i)$ for some α(

*i*) and γ(

*i*). Finally, because all the elements of $LI$ are presented uniquely in

*T*, α and γ are two permutations on ${1,\u2026,m}$. Note that we construct π

*and π*

_{X}*from*

_{Y}*T*in polynomial time, and the proof is complete.□

We note that the proof above shows that the BTP problem in NP-complete in the *strong sense*, i.e. the problem is still NP-complete if the elements of the multiset are polynomially bounded. In addition, the analogous partition problem for non-binary trees is also NP-complete by reduction from the subset sum problem. See Supplementary Appendix A.6.

## 4 A $23\u2212o(1)$ APPROXIMATION FOR MAX-BTP

In the previous section, we showed that for a given multiset $L$ of 2*q* − 1 elements, each BTP for $L$ corresponds to a collection of *q* − 1 *conflict-free* triangles. Because $L$ can have at most *q* − 1 conflict-free triangles, we define the $max\u2061$-BTP problem to be the problem of finding the *maximum sized set of conflict-free triangles*. This is a closely related problem to the BTP problem: in the context of VAF data, the maximum sized set of conflict-free triangles denotes partial information about the ancestral relationships among mutations. Moreover, for a multiset of *m* elements if $max\u2061$-BTP has a solution of size Δ then a BTP with *k* = *m* − 1 − 2Δ additional nodes can be found. See Supplementary Appendix A.7.

We derive a $23\u2212o(1)$ approximation algorithm for the $max\u2061$-BTP problem for $L$. The algorithm is based on Local Search. We start with any collection of conflict-free triangles in $L$ as a solution and iteratively add another triangle as follows. For a fixed constant *t* ≥ 1, we iteratively replace any subcollection of *s* ≤ *t* triangles in the solution with *s* + 1 triangles of $L$ such that the new collection still contains only conflict-free triangles.

It is easy to see that the above local search terminates in polynomial time. Let *OPT* be the size of the optimal solution. Because we cannot have more than *q* − 1 conflict-free triangles in a solution, *OPT* ≤ *q* − 1. After each iteration, the size of the collection increases by 1, and because *t* is a constant, the search procedure at each iteration is polynomial time. Similar to the technique that was used in Hajirasouliha *et al.* (2007), we use the packing bound of Hurkens and Schrijver (1989) to prove the following theorem (Proof in Supplementary Appendix A.7).

*There exists a polynomial time algorithm that gives an approximated solution to the problem of finding maximum set of conflict-free triangles within a factor of *$23\u2212\delta $*for any δ > 0.*

## 5 THE ε-BTP PROBLEM

Typically on real data, a BTP will not exist—either because the frequencies *a _{i}* are determined with some error or the VAF data does not capture the frequency of a subpopulation that does not have mutations that exclusively occur in that subpopulation (VAFs provide information only about the proportion of cells with a mutation, and do not provide information about proportions of cells that have a specific mutation and lack another mutation.). In this section, we introduce the ε-BTP to account for these scenarios. Suppose we have the multiset $L\u223c={a\u223c1,\u2026,a\u223cm}$ of observed frequencies and a corresponding

*VAF error vector*ε = (ε

_{1}, … ,ε

*) for $L\u223c$, where ε*

_{m}*is the maximum possible error in observing $a\u223ci$ for 1 ≤*

_{i}*i*≤

*m*. To account for subpopulations without distinguishing mutations, we may need to add

*auxiliary*frequencies to $L\u223c$ that correspond to the missing subpopulation frequencies. We make the following definitions.

Given a multiset $L\u223c={a\u223c1,\u2026,a\u223cm}$ with associated VAF error vector $\u03f5=(\u03f51,\u2026,\u03f5m)$, an ε-BTP with *k* ≥ 0 *auxiliary* nodes is a BTP for a multiset $L={a1,a2,\u2026am+k}$ such that for all $i\u2264m:|ai\u2212a\u223ci|\u2264\u03f5i$. We call the nodes *a _{m}*

_{+1}, … ,

*a*

_{m}_{+}

*the*

_{k}*auxiliary nodes*of the ε-BTP.

Given a multiset $L\u223c$ and an associated VAF error vector ε, find an ε-BTP of $L\u223c$ with minimum number of auxiliary nodes such that two auxiliary nodes are not siblings.

The constraint on auxiliary nodes in the definition of ε-BTP problem follows from the assumptions in our model of cancer progression: each branching in the cancer progression happens only when at least one clonal expansion starts. So, the VAF data captures the frequency of the newly formed subpopulation (see Section 2). Thus, at least one of the children of the current subpopulation node is not an auxiliary node.

It is straightforward to show that for any multiset $L$ of size *m*, it is always possible to obtain an ε-BTP with *k* = *m* − 1 auxiliary nodes (proof in Supplementary Appendix A.3). Also, when ε* _{i}* = 0 for all 1 ≤

*i*≤

*m*, a BTP exists for $L$ if and only if the corresponding ε-BTP has a solution with

*k*= 0 auxiliary nodes.

To outline our algorithm, we need the following definitions.

Given a VAF error vector ε, an ε-CSP tree is a (weighted) binary tree, such that for each internal node $a\u223ci$ we have $a\u223cj+a\u223ck\u2208[a\u223ci\xb1(\u03f5i+\u03f5j+\u03f5k)]$, where $a\u223cj$ and $a\u223ck$ are the children of $a\u223ci$.

We say that an ε-CSP tree $T\u223c$ for a multiset $L\u223c$ is *acceptable* if we can obtain a BTP, $\alpha (T\u223c)$, by replacing each $a\u223ci$ by a value *a _{i}* where $|ai\u2212a\u223ci|\u2264\u03f5i$. Note that $\alpha (T\u223c)$ is an ε-BTP for $L\u223c$. Also, note that an ε-CSP tree is not necessarily acceptable (See Supplementary Appendix D.8). However, one can easily check whether a given ε-CSP tree $T\u223c$ is acceptable by finding a collection of

*e*’s, where $|ei|\u2264\u03f5i$, satisfying the following constraints: $(a\u223ci+ei)=(a\u223cj+ej)+(a\u223ck+ek),$ for each internal nodes $a\u223ci$ and its children $a\u223cj,a\u223ck$. This can be easily done via a linear program, which we denote by $LP(T\u223c)$.

_{i}Our Rec-BTP algorithm (Algorithm 1) uses a recursive method that works as follows: at each recursion during the algorithm, we have (i) a partially constructed ε-CSP tree $T^$, (ii) a multiset of remaining frequencies $L^$ and (iii) the number of remaining auxiliary nodes that we are allowed to use. We check if $T^$ can be extended by attaching two elements of $L^$, or one element from $L^$ and an auxiliary node, to one of the leaves in $T^$ (we assign the auxiliary node’s weight accordingly). If $L^$ is empty, it means the algorithm has constructed an ε-CSP tree. So we output ${\alpha (T^)}$ if $LP(T^)$ has a feasible solution. Finally, Rec-BTP outputs all the ε-BTPs. Iterating over all values of *k* from 0 to *m* − 1, the algorithm will find the smallest *k* such that there exists an ε-BTP.

Later in Section 6, for the purpose of benchmarking our results, in case of multiple ε-BTP outputs, we choose only the tree whose list of node frequencies has the minimum root mean square deviation (RMSD) from the original VAFs data (defined below in Section 6).

## 6 EXPERIMENTAL RESULTS

### 6.1 Simulated data

We generate simulated mutation data from all complete rooted binary trees with 3, 5, 7 and 9 nodes (Supplementary Appendix D.7). For each tree topology, we generate 1000 random BTPs by assigning a weight *a _{i}* to each node

*i*, as follows. For the root

*r*, we set

*a*= 1, assuming that our tumor sample is pure, i.e. is not contaminated with normal cells. Next, we proceed down the tree: for each parent with weight

_{r}*a*, we select a pair of real numbers

_{i}*a*and

_{j}*a*for the children uniformly at random such that the CSP condition (

_{k}*a*=

_{i}*a*+

_{j}*a*) is satisfied. Finally, we generate a set

_{k}*M*of somatic mutations for each node, with |

_{i}*M*| selected uniformly from [50 400] independently for each node. We assume all somatic mutations in the set

_{i}*M*happened independently because the parental cell was created. For each such BTP and set of mutations, we generate a

_{i}*VAF data*corresponding to each tumor subpopulation. Ideally, the VAF of a tumor subpopulation from node

*v*equals

*a*. However, because the observed frequencies are estimated from alignments of sequencing data, the observed frequencies will deviate from the true values. We assume that the observed VAFs for mutations of a subpopulation

_{v}*v*are normally distributed with mean

*a*and standard deviation σ. Specifically, for each node

_{v}*v*, let

*X*be a set of |

_{v}*M*| samples from $N(av,\sigma )$. Here, we present the results when σ = 2, with results on σ = 1, 4 in Supplementary Appendix B. The VAF data for the tumor sample is thus $X=\u222avXv$. Note that we simulate VAFs directly rather than the number of mapped reads containing a mutation. Because we assume that our sequence coverage is high (>1000×), it follows that the corresponding binomial (or negative binomial) distribution of read counts is well approximated by the normal distribution. For lower coverages, the asymptotic normal approximation may not model the data as accurately; nevertheless, the simulations provide a comparison of the different methods on high-quality data.

_{v}For each simulated VAF dataset *X*, we estimate the number and frequencies of tumor subpopulations using two non-parametric clustering algorithms: (i) Accelerated variational Dirichlet process Gaussian mixture model (AVDPM; Kurihara *et al.*, 2006), as implemented in https://sites.google.com/site/kenichikurihara/academic-software/variational-dirichlet-process-gaussian-mixture-model, which is a general clustering method that we apply directly on VAF data, and (ii) *SciClone* (Miller *et al.*, forthcoming), which is a recent algorithm (with available software but no published paper) that estimates tumor composition from VAF data by clustering the data using a mixture of Gaussian model. Parameter settings for each method are given in Supplementary Appendix C. Also, because SciClone runtimes were extremely long, we down sampled the mutation data by a factor of 20. For each of our synthetic dataset, we implanted the fraction of the mutations (together with their corresponding VAF) on a synthetic chromosome with neutral copy number compatible with the SciClone input format (See Supplementary Appendix C). We ran SciClone with default parameters on each dataset and then extracted the means of reported clusters from SciClone.

From the output of each clustering algorithm, we obtain the input for our Rec-BTP algorithm. We compute the sample mean $a\u223ci$ and standard deviation σ* _{i}* for each cluster

*C*, and set the VAF error ε

_{i}*for the subpopulation frequency of cluster*

_{i}*C*equal to $1.96\xb7c\sigma i|Ci|$, where

_{i}*c*is a constant set to 3. Note that $1.96\xb7\sigma i/|Ci|$ is the radius of the empirical 95% confidence interval in estimating the true subpopulation frequency.

We set $L={a\u223c1,\u2026,a\u223cm}$ and $\u03f5=(\u03f51,\u2026,\u03f5m)$ as the input to Rec-BTP. We find the minimum *k* for which there exists a ε-BTP for $L$ with *k* auxiliary nodes. In many cases, there are multiple BTPs for $L$ with exactly *k* auxiliary nodes. In these cases, we select a single BTP with the *minimum cost*$costX(T)=\u2211i=1s\u2211x\u2208Di|x\u2212ai|2$, where $s=n\u2212k,X$ is the VAF dataset and $Di={x\u2208X|arg minj|aj-x|=i}$ is the subset of elements of *X* that are closest to *a _{i}*.

We compare the results of our Rec-BTP algorithm (applied to clusters from both AVDPM and SciClone) to the original clusters output from these algorithms over the 1000 randomly constructed BTPs for each of the seven tree topologies. We use two measures to compare the estimated subpopulation frequencies: (i) the *number* of subpopulations and (ii) the RMSD between the set of estimated subpopulation frequencies and the true subpopulation frequencies.

*Number of subpopulations*. Figure 2 shows that Rec-BTP outputs the correct number of clusters more frequently than the clustering methods. For trees with 3 and 5 nodes, Rec-BTP does not improve the AVDPM clusters much. However, with larger number of nodes the advantage of Rec-BTP grows. Rec-BTP provides a large improvement over the SciClone clusters.

We further examined the scenarios where each algorithm reported the correct and incorrect number of clusters. Figure 3 compares the fraction of cases where the clustering method and the Rec-BTP assisted method report too few, the correct number or too many clusters. We see that most of the cases where Rec-BTP reports the correct number of cases (blue squares) are those where the clustering algorithm reported too few clusters and the Rec-BTP algorithm created additional clusters. Only in the case of 3 and 5 nodes do AVDPM and SciClone determine the correct number of clusters (green squares) in an appreciable fraction of cases. Overall, we see that the clustering methods tend to underestimate the correct number of clusters (first columns in each table in Fig. 3). In a significant fraction of these cases, Rec-BTP adds auxiliary nodes to obtain the correct number of clusters, although this becomes more difficult with larger trees. We also see that SciClone tends to underestimate the number of subpopulations more frequently than AVDPM.

*Accuracy of the subpopulation frequencies**.* We compare the estimated population frequencies and true population frequencies for each method using the RMSD. Suppose *a*_{1} ≥ … ≥ *a _{n}* are the true subpopulation frequencies, and $a\u223c1\u2265\u2026\u2265a\u223cm$ are the estimated subpopulation frequencies. If

*m*=

*n*, the RMSD is $1n\u2211i=1n(ai\u2212a\u223ci)2$. If $m\u2260n$, we add zeros to the shorter sequence so the two sequences have equal length. The zeros reflect the fact that we have not estimated the frequencies of some subpopulations.

Table 1 gives RMSD for AVDPM, SciClone and Rec-BTP built from these clusters. Specifically, we provide the RMSDs when AVDPM or SciClone give (i) the same and (ii) less than the correct number of subpopulations. For some tree topologies, there is no sample in which both Rec-BTP and AVDPM/SciClone give the same number of subpopulations equal to the true number of subpopulations, which we denote by N/A.

Tree | Rec-BTP = AVDPM | Rec-BTP > AVDPM | Rec-BTP = SciClone | Rec-BTP > SciClone | ||||
---|---|---|---|---|---|---|---|---|

Rec-BTP | AVDPM | Rec-BTP | AVDPM | Rec-BTP | SciClone | Rec-BTP | SciClone | |

T3 | 0.9 ± 0.8 | 0.7 ± 1.1 | 1.3 ± 0.6 | 45.1 ± 11.3 | 1.2 ± 0.4 | 1.5 ± 0.4 | 2.6 ± 1.1 | 43.7 ± 11 |

T5 | 6.9 ± 5.6 | 7.5 ± 5.2 | 9 ± 12.8 | 17.4 ± 14.1 | 1 ± 0.5 | 1.2 ± 0.4 | 4.2 ± 6.4 | 15.7 ± 9.2 |

T7_{A} | 8.6 ± 4.8 | 8.2 ± 4.5 | 9.2 ± 5.9 | 10.1 ± 6.5 | N/A | N/A | 6.9 ± 7.7 | 12.8 ± 7.5 |

T7_{B} | 8.4 ± 2.8 | 8.1 ± 2.4 | 8.8 ± 6.2 | 11.7 ± 7.1 | N/A | N/A | 6.2 ± 6.9 | 13.2 ± 6.8 |

V 9_{A} | N/A | N/A | 8.9 ± 4.7 | 8 ± 4.6 | N/A | N/A | 6.4 ± 5.4 | 10.8 ± 5.5 |

V 9_{B} | 2 ± 0 | 2.2 ± 0 | 8.1 ± 4.5 | 8.3 ± 4.4 | N/A | N/A | 5.6 ± 4.9 | 11.4 ± 5.2 |

V 9_{C} | N/A | N/A | 7.5 ± 4.5 | 9.9 ± 4.4 | N/A | N/A | 6.1 ± 5.1 | 10.1 ± 4.5 |

Tree | Rec-BTP = AVDPM | Rec-BTP > AVDPM | Rec-BTP = SciClone | Rec-BTP > SciClone | ||||
---|---|---|---|---|---|---|---|---|

Rec-BTP | AVDPM | Rec-BTP | AVDPM | Rec-BTP | SciClone | Rec-BTP | SciClone | |

T3 | 0.9 ± 0.8 | 0.7 ± 1.1 | 1.3 ± 0.6 | 45.1 ± 11.3 | 1.2 ± 0.4 | 1.5 ± 0.4 | 2.6 ± 1.1 | 43.7 ± 11 |

T5 | 6.9 ± 5.6 | 7.5 ± 5.2 | 9 ± 12.8 | 17.4 ± 14.1 | 1 ± 0.5 | 1.2 ± 0.4 | 4.2 ± 6.4 | 15.7 ± 9.2 |

T7_{A} | 8.6 ± 4.8 | 8.2 ± 4.5 | 9.2 ± 5.9 | 10.1 ± 6.5 | N/A | N/A | 6.9 ± 7.7 | 12.8 ± 7.5 |

T7_{B} | 8.4 ± 2.8 | 8.1 ± 2.4 | 8.8 ± 6.2 | 11.7 ± 7.1 | N/A | N/A | 6.2 ± 6.9 | 13.2 ± 6.8 |

V 9_{A} | N/A | N/A | 8.9 ± 4.7 | 8 ± 4.6 | N/A | N/A | 6.4 ± 5.4 | 10.8 ± 5.5 |

V 9_{B} | 2 ± 0 | 2.2 ± 0 | 8.1 ± 4.5 | 8.3 ± 4.4 | N/A | N/A | 5.6 ± 4.9 | 11.4 ± 5.2 |

V 9_{C} | N/A | N/A | 7.5 ± 4.5 | 9.9 ± 4.4 | N/A | N/A | 6.1 ± 5.1 | 10.1 ± 4.5 |

*Note*: Bold face text indicates best performance.

In cases where both methods (Rec-BTP and a clustering algorithm) return the same number of subpopulations, they have similar performance in estimating the subpopulation frequencies. Also, as mentioned earlier, these results are for the simulated input data in which the variant allele frequencies deviate from their true value with standard deviation σ = 2. When σ = 1 Rec-BTP performs even better.

### 6.2 Comparison with Phylosub

As noted in the introduction, Jiao *et al.* (2014) is a recent method that clusters VAF frequencies using a tree constraint. In particular, Jiao *et al.* (2014) replace the Dirichlet process mixture for clustering with a Bayesian non-parametric prior over trees satisfying a weak form of the CSP constraint. We compared our Rec-BTP algorithm with *PhyloSub*.

We generated VAF data from a collection of 400 random complete binary trees with three, five, seven and nine nodes with fixed topologies. For each tree, we generated 100 random instances. In contrast to the simulations in the previous section, here we used only one of the two topologies for trees with seven nodes and only one of the three possible topologies for trees with nine nodes. We converted each random simulated VAF data to a *PhyloSub* input identical to the procedure performed for the simulations in the *PhyloSub* paper. In creating *PhyloSub* inputs, we assumed the total read counts for every single nucleotide variants (SNV) position is 10 000 (i.e. an ideal uniform coverage of 10 000× for the *PhyloSub* input) and assumed every SNV is heterozygous. On each dataset, we ran the Markov chain Monte Carlo (MCMC) method of *PhyloSub* 100 times, each with 5000 MCMC iterations as per Jiao *et al.* (2014), and used the reported *top trees* (i.e. those trees with best log likelihood) in our comparison.

We found that *PhyloSub* produced trees with many more nodes than the simulated value, and significantly more than Rec-BTP or SciClone (Fig. 4 and Supplementary Appendix D.9]). Because *PhyloSub* usually tends to report trees with a higher number of nodes, we also considered the size of the smallest tree reported by *PhyloSub* in their provided list of *top trees* for each input. While this value was smaller, it was still much larger than the true value or the values from the other approaches.

The large number of clusters produced by *PhyloSub* might result from the fact that the method does not assume that the trees are binary. The output trees contain many different topologies. Nevertheless, it is surprising that *PhyloSub* does not find binary trees when the data are produced from this topology. As *PhyloSub* reports a higher number of nodes than Rec-BTP, we were unable to directly compare the provided frequencies of the clusters of each method.

### 6.3 Acute myeloid leukemia sequencing data

We tested our algorithm on VAFs obtained from deep read counts information for SNVs from an acute myeloid leukemia sample (AML1/UPN933124) using data from Ding *et al.* (2012). We used the 386 SNVs reported in the primary AML sample, obtaining the tumor VAF data directly from Supplementary Table S5a in Ding *et al.* (2012). Note that Ding *et al.* (2012) also report data from a relapse sample following chemotherapy. As the relapse-specific mutations form only one cluster, we do not analyze the BTP problem for this sample in our study. Nevertheless, the generalization of the ε-BTP problem for the case where the input data contains two types of VAFs (e.g. both tumor- and relapse-specific mutations) is an interesting open problem. We first ran SciClone on the VAF data, obtaining four distinct VAF clusters with means of 47.17, 33.17, 22.42, 3.65%. We then ran Rec-BTP on these clusters, fixing the root of the BTP as an additional node with frequency 100%, reflecting the fact that the tumor sample was pure and started from a single founder clone (Ding *et al.*, 2012).

Figure 5 shows the resulting ε-BTP with corresponding multiset of population frequencies: $LRec\u2212BTP=$ {100, 53.75, 46.25, 42.86, 32.25, 21.5, 3.39}. Ding *et al.* (2012) present a history of clonal expansions that implies an ε-BTP for the multiset ${100,53.12,12.74,29.04,5.1}$ with two auxiliary nodes (Fig. 4b). There are two such ε-BTP, depending on the relative order of two clonal expansions (Fig. 4), one with subpopulation frequencies: $L1$ = {100, 53.12, 46.88, 12.74, 34.14, 29.04, 5.1} and another with frequencies $L2=$ {100, 34.14, 65.86, 53.12, 12.74, 29.04, 5.10}. However, because the frequencies reported in Ding *et al.* (2012) are scaled according to the estimated 93.72% purity of their sample, it is necessary to multiply the frequencies in $L1$ and $L2$ by 0.9372 before comparing with the clusters obtained from the VAF data.

We compare these different subpopulation frequencies estimates by computing the average *ℓ*_{1} norm between the VAF for each mutation and the closest subpopulation. Table 2 shows this measure for each of the subpopulation multisets. We see that the Rec-BTP gives a better fit to the VAF data than both estimates $L1$ and $L2$ given in Ding *et al.* (2012). This shows that using the tree constraint in the BTP provides additional information that is useful for clustering real VAF data.

Metric | $LRec\u2212BTP$ | $L1$ | $L2$ |
---|---|---|---|

Average ℓ_{1} norm | 1.6367 | 3.3309 | 1.8376 |

Average ℓ_{2} norm | 0.1161 | 0.2221 | 0.1331 |

Metric | $LRec\u2212BTP$ | $L1$ | $L2$ |
---|---|---|---|

Average ℓ_{1} norm | 1.6367 | 3.3309 | 1.8376 |

Average ℓ_{2} norm | 0.1161 | 0.2221 | 0.1331 |

Note that calculation of *ℓ*_{1} and *ℓ*_{2} norms for the SciClone cluster means would result in 2.4816 and 0.1823, respectively. However, because the number of SciClone clusters is only 4, these numbers cannot be reasonably compared with the ones calculated for the trees in Figure 5.

## 7 DISCUSSION

In this article, we provide the first rigorous combinatorial formulation of the problem of inferring the composition of tumor subpopulations constrained by a tree in a single tumor sample from the variant allele frequencies (VAFs) of somatic mutations. In our formulation, we introduced a novel definition of the BTP and the ε-BTP. We showed that the problem of finding a BTP (and hence an ε-BTP) is in general NP-complete; however, we derived an approximation algorithm for a related problem, $max\u2061$-BTP. We developed a recursive algorithm for the ε-BTP that works well in practice and showed the advantages on this algorithm on simulated and real sequencing data.

These results show the utility of the BTP, but also suggest additional areas of further investigation. In particular, it would be interesting to combine the clustering of VAFs and the construction of the BTP into a single model. While there has been some work to combine these two steps using a machine learning approach (Jiao *et al.*, 2014), the complexity of the corresponding inference problem is unknown. In our tests, we were unable to obtain satisfactory results using this model, suggesting there is room for additional improvements. One possible direction is to use MCMC or other sampling approaches over the space of BTPs, perhaps combining this inference into a graphical model that better models the features of real sequencing data [e.g. as used in pyClone (Shah *et al.*, 2012; Roth *et al.*, 2014)]. Finally, the extension of the BTP and related approaches to multiple samples from the same tumor (taken at the same or different times) will be increasingly useful as such data become available.

## ACKNOWLEDGEMENT

The authors thank Li Ding and Michael McLellan for assistance with the AML data.

*Funding*: This work was supported by National Science Foundation CAREER Award (CCF-1053753 to B.J.R.) and the National Institutes of Health (R01HG5690 to B.J.R.). B.J.R. is also supported by a Career Award at the Scientific Interface from the Burroughs Wellcome Fund, an Alfred P. Sloan Research Fellowship. I.H. is also supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Postdoctoral Fellowship.

*Conflict of Interest*: none declared.

## REFERENCES

*t*of which have an SDR, with an application to the worst-case ratio of heuristics for packing problems

## Author notes

^{†}The authors wish it to be known that in their opinion, the first two authors should be regarded as Joint First Authors.

## Comments