## Abstract

Phylogenetic trees are widely used to display estimates of how groups of species are evolved. Each phylogenetic tree can be seen as a collection of *clusters*, subgroups of the species that evolved from a common ancestor. When phylogenetic trees are obtained for several datasets (e.g. for different genes), then their clusters are often contradicting. Consequently, the set of all clusters of such a dataset cannot be combined into a single phylogenetic tree. *Phylogenetic networks* are a generalization of phylogenetic trees that can be used to display more complex evolutionary histories, including *reticulate events*, such as hybridizations, recombinations and horizontal gene transfers. Here, we present the new Cass algorithm that can combine any set of clusters into a phylogenetic network. We show that the networks constructed by Cass are usually simpler than networks constructed by other available methods. Moreover, we show that Cass is guaranteed to produce a network with at most two reticulations per biconnected component, whenever such a network exists. We have implemented Cass and integrated it into the freely available Dendroscope software.

**Contact:**l.j.j.v.iersel@gmail.com

**Supplementary information:**Supplementary data are available at *Bioinformatics* online.

## 1 INTRODUCTION

*Phylogenetics* studies the reconstruction of evolutionary histories from genetic data of currently living organisms. A (rooted) *phylogenetic tree* is a representation of such an evolutionary history in which species evolve by mutation and speciation. The leaves of the tree represent the species under consideration and the root of the tree represents their most recent common ancestor. Each internal node represents a speciation: 1 species splits into several new species. Thus, mathematically speaking, such a node has indegree 1 and outdegree at least 2. In recent years, a lot of work has been done on developing methods for computing (rooted) *phylogenetic networks* (Gambette, 2009; Gusfield *et al.*, 2007b; D.H.Huson *et al.*, submitted for publication; Nakleh, 2009; Semple, 2007), which form a generalization of phylogenetic trees. Next to nodes representing speciation, rooted phylogenetic networks can also contain *reticulations*: nodes with indegree at least 2. Such nodes can be used to represent the recombinations, hybridizations or horizontal gene transfers, depending on the biological context. In addition, phylogenetic networks can also be interpreted in a more abstract sense, as a visualization of contradictory phylogenetic information in a single diagram.

Suppose we wish to investigate the evolution of a set 𝒳 of taxa (e.g. species or strains). Each edge of a rooted phylogenetic tree represents a *cluster*: a proper subset of the taxon set 𝒳. In more detail, an edge (*u*, *v*) represents the cluster containing those taxa that are descendants of *v*. Each phylogenetic tree *T* is uniquely defined by the set of clusters represented by *T*. Phylogenetic networks also represent clusters. Each of their edges represents one ‘hardwired’ and at least one ‘softwired’ cluster. An edge (*u*, *v*) of a phylogenetic network *represents* a cluster *C* ⊂ 𝒳 *in the hardwired sense* if *C* equals the set of taxa that are descendants of *v*. Furthermore, (*u*, *v*) *represents C in the softwired sense* if *C* equals the set of all taxa that can be reached from *v* when, for each reticulation *r*, exactly one incoming edge of *r* is ‘switched on’ and the other incoming edges of *r* are ‘switched off’. An equivalent definition states that a phylogenetic network *N represents* a cluster *C in the softwired sense* if there exists a tree *T* that is displayed by *N* (formally defined below) and represents *C*. In this article, we will always use ‘represent’ in the softwired sense. It is usually the clusters in a tree that are of more interest, and less the actual trees themselves, as clusters represent putative monophyletic groups of related species. For a complete introduction to clusters see D.H.Huson *et al.* (submitted for publication).

In phylogenetic analysis, it is common to compute phylogenetic trees for more than one dataset. For example, a phylogenetic tree can be constructed for each gene separately, or several phylogenetic trees can be constructed using different methods. To accurately reconstruct the evolutionary history of all considered taxa, one would preferably like to use the set 𝒞 of all clusters represented by at least one of the constructed phylogenetic trees. In general, however, some of the clusters of the different trees will be incompatible, which means that there will be no single phylogenetic tree representing 𝒞. Therefore, several recent publications have studied the construction of a phylogenetic *network* representing 𝒞. Huson and Rupp (2008) describe how a phylogenetic network can be constructed that represents 𝒞 in the hardwired sense (a *cluster network*). A network is a *galled network* if it contains no path between two reticulations that is contained in a single *biconnected component* (a maximal subgraph that cannot be disconnected by removing a single node, see Fig. 1). Huson and Klöpper (2007) and Huson *et al.* (2009) describe an algorithm for constructing a galled network representing 𝒞 in the softwired sense.

Related literature describes the construction of phylogenetic networks from phylogenetic trees or *triplets* (phylogenetic trees on three taxa). A tree or triplet *T* is *displayed* by a network *N* if there is a subgraph *T*′ of *N* that is a subdivision of *T* (i.e. *T*′ can be obtained from *T* by replacing edges by directed paths). Computing the minimum number of reticulation required in a phylogenetic network displaying two input trees (on the same set of taxa) was shown to be APX-hard by Bordewich and Semple (2007). Bordewich *et al.* (2007) proposed an exact exponential-time algorithm for this problem and Linz and Semple (2009) showed that it is fixed parameter tractable (FPT), if parameterized by the minimum number of reticulations. The downside of these algorithms is that they are very rigid in the sense that one generally needs very complex networks in order to display the given trees.

The *level* of a binary network is the maximum number of reticulations in a biconnected component,^{1} and thus provides a measure of network complexity. Given an arbitrary number of trees on the same set of taxa, Huynh *et al.* (2005) describe a polynomial-time algorithm that constructs a level-1 phylogenetic network that displays all trees and has a minimum number of reticulations, if such a network exists (which is unlikely in practice). Given a triplet for each combination of three taxa, Jansson and coworkers (Jansson and Sung, 2006; Jansson *et al.*, 2006) and give a polynomial-time algorithm that constructs a level-1 network displaying all triplets, if such a network exists. The algorithm by van Iersel and Kelk (2009) can be used to find such a network that also minimizes the number of reticulations. These results have later been extended to level-2 (van Iersel and Kelk, 2009; van Iersel *et al.*, 2009a) and more recently to level-*k*, for all *k* ∈ ℕ (To and Habib, 2009). Although this work on triplets is theoretically interesting, it has the practical drawback that biologists do not work with triplets (but rather with trees or clusters) and that it is rather difficult to intuitively convey what it means for a triplet to be ‘in’ a network. An additional drawback is that these triplet algorithms need at least one triplet in the input for each combination of three taxa, while some triplets might be difficult to derive correctly. If, for example, one induces triplets from a set of trees, then this is likely not to give you a triplet for each combination of three taxa, if one or more input trees are not fully resolved or if some input trees do not have exactly the same set of taxa.

In this article, we present the algorithm Cass,^{2} which takes any set 𝒞 of clusters as input and constructs a phylogenetic network that represents 𝒞 (in the softwired sense). Furthermore, the algorithm aims at minimizing the level of the constructed network and in this sense Cass is the first algorithm to combine the flexibility of clusters with the power of level minimization. Cass constructs a phylogenetic tree representing 𝒞 whenever such a tree exists. Moreover, we prove that Cass constructs a level-1 or level-2 network representing 𝒞 whenever there exists a level-1 or level-2 network representing 𝒞, respectively. Experimental results show that also when no level-2 network representing 𝒞 exists, Cass usually constructs a network with a significantly lower level and lower number of reticulations compared with other algorithms. In fact, we conjecture that similar arguments as in our proof for level-2 can be used to show that Cass always constructs a level-*k* network with minimum *k*. We prove a decomposition theorem for level-*k* networks that supports this conjecture. Finally, we prove that Cass runs in polynomial time if the level of the output network is bounded by a constant.

We have implemented Cass and added it to our popular tree-drawing program Dendroscope (Huson *et al.*, 2007), where it can be used as an alternative for the cluster network (Huson and Rupp, 2008) and galled network (Huson *et al.*, 2009) algorithms. Experiments show that, although Cass needs more time than these other algorithms, it constructs a simpler network representing the same set of clusters. For example, Figure 2a shows a set of clusters and the galled network with four reticulations constructed by the algorithm in Huson *et al.* (2009). However, for this dataset also a level-2 network with two reticulations exists, and Cass can be used to find this network, see Figure 2b. Dendroscope now combines the powers of Cass and the two previously existing algorithms for constructing galled- and cluster networks.

## 2 LEVEL-*K* NETWORKS AND CLUSTERS

Consider a set 𝒳 of taxa. A *rooted (phylogenetic) network* (on 𝒳) is a directed acyclic graph with a single root and leaves bijectively labeled by 𝒳. The indegree of a node *v* is denoted δ^{−}(*v*) and *v* is called a *reticulation* if δ^{−}(*v*) ≥ 2. An edge (*u*, *v*) is called a *reticulation edge* if its head *v* is a reticulation and is called a *tree edge* otherwise. We assume without loss of generality that each reticulation has outdegree at least 1. Consequently, each leaf has indegree 1. When counting reticulations in a phylogenetic network, we count reticulations with more than two incoming edges more than once because, biologically, these reticulations represent several reticulate evolutionary events. Therefore, we formally define the *reticulation number* of a phylogenetic network *N* = (*V*, *E*) as

A directed acyclic graph is *connected* (also called ‘weakly connected’) if there is an undirected path (ignoring edge orientations) between each pair of nodes. A node (edge) of a directed graph is called a *cut-node* (*cut-edge*) if its removal disconnects the graph. A directed graph is *biconnected* if it contains no cut-nodes. A biconnected subgraph *B* of a directed graph *G* is said to be a *biconnected component* if there is no biconnected subgraph *B*′ ≠ *B* of *G* that contains *B*.

A phylogenetic network is said to be a *level-k network* if each biconnected component has reticulation number at most *k*.^{3} A phylogenetic network is called *binary* if each node has either indegree at most 1 and outdegree at most 2 or indegree at most 2 and outdegree at most 1. Note that the above definition of level generalizes the original definition (Choy *et al.*, 2005) for binary networks. A level-*k* network is called a *simple level*-≤*k network* if the head of each cut-edge is a leaf. A simple level-≤*k* network is called a *simple level-k network* if its reticulation number is precisely *k*. For example, Figure 2a is a simple level-4 network and Figure 2b is a simple level-2 network. A *phylogenetic tree* (on 𝒳) is a phylogenetic network (on 𝒳) without reticulations, i.e. a level-0 network.

Consider a set of taxa 𝒳. Proper subsets of 𝒳 are called *clusters*. We say that two clusters *C*_{1}, *C*_{2} ⊂ 𝒳 are *compatible* if either *C*_{1} ∩ *C*_{2} = ∅ or *C*_{1} ⊂ *C*_{2} or *C*_{2} ⊂ *C*_{1}. Consider a set of clusters 𝒞. We say that a set of taxa *X* ⊆ 𝒳 is *separated* (by 𝒞) if there exists a cluster *C* ∈ 𝒞 that is incompatible with *X*. The *incompatibility graph IG*(𝒞) of 𝒞 is the undirected graph (*V*, *E*) that has node set *V* = 𝒞 and edge set

## 3 DECOMPOSING LEVEL-*K* NETWORKS

In this section, we describe the general outline of our algorithm Cass. We show how the problem of determining a level-*k* network can be decomposed into a set of smaller problems by examining the incompatibility graph. Our algorithm will first construct a simple level- ≤*k* network for each connected component of the incompatibility graph and subsequently merge these simple level- ≤*k* networks into a single level-*k* network on all taxa.

We first give a formal description of the algorithm, which is illustrated by an example in Figure 3. After that we will explain why we can use this approach.

Consider a set of taxa 𝒳 and a set 𝒞 of input clusters. We assume that all singletons (sets {*x*} with *x* ∈ 𝒳) are clusters in 𝒞. Our algorithm proceeds as follows.

**Step 1.**Find the non-trivial connected components 𝒞_{1},…, 𝒞_{p}of the incompatibility graph IG(𝒞). For each*i*∈ {1,…,*p*}, let 𝒞_{i}′ be the result of collapsing unseparated sets of taxa as follows. Let . For each maximal subset*X*⊂ 𝒳_{i}that is not separated by 𝒞_{i}, replace, in each cluster in 𝒞_{i}, the elements of*X*by a single new taxon*X*, e.g. if*X*= {*b*,*c*} then a cluster {*a*,*b*,*c*,*d*} is modified to {*a*, {*b*,*c*},*d*}.**Step 2.**For each*i*∈ {1,…,*p*}, construct a simple level-≤*k*network*N*_{i}representing 𝒞_{i}′.**Step 3.**Let 𝒞^{*}be the result of applying the following modifications to 𝒞, for each*i*∈ {1,…,*p*}: remove all clusters that are in 𝒞_{i}, add a cluster 𝒳_{i}and add each maximal subset*X*⊂ 𝒳_{i}that is not separated by 𝒞_{i}. Construct the unique phylogenetic tree*T*on 𝒳 representing precisely those clusters in 𝒞^{*}. (Notice that each trivial connected component of the incompatibility graph is also a cluster in 𝒞^{*}.)**Step 4.**For each*i*∈ {1,…,*p*}, replace in*T*the lowest common ancestor*v*_{i}of 𝒳_{i}by the simple level-≤*k*network*N*_{i}as follows. Delete all edges leaving*v*_{i}and merge*T*with*N*_{i}by identifying the root of*N*_{i}with*v*_{i}and identifying each leaf of*N*_{i}labeled*X*by the lowest common ancestor of the leaves labeled*X*in*T*. Output the resulting network.

Notice that Steps 1, 3 and 4 are similar to the corresponding steps in algorithms for constructing galled trees (i.e. level-1 networks) and galled networks (Huson and Klöpper, 2007; Huson *et al.*, 2009; D.H.Huson *et al.*, submitted for publication). The reason why we use the same set-up in our algorithm, is outlined by Theorem 1. It shows that, when constructing a level-*k* network displaying a set of clusters, we can restrict our attention to level-*k* networks that satisfy the *decomposition property* (D.H.Huson *et al.*, submitted for publication), which intuitively says that the biconnected components of the network correspond to the connected components of the incompatibility graph. We now repeat the formal definition.

Since a cluster *C* ∈ 𝒞 can be represented by more than one edge in a network *N*, an *edge assignment* ϵ is defined as a mapping that chooses for each cluster *C*∈𝒞 a single tree edge ϵ(*C*) of *N* that represents *C*. A network *N* representing 𝒞 is said to satisfy the *decomposition property* w.r.t. 𝒞 if there exists an edge assignment ϵ such that:

for any two clusters

*C*_{1},*C*_{2}∈ 𝒞, the edges ϵ(*C*_{1}) and ϵ(*C*_{2}) are contained in the same biconnected component of*N*if and only if*C*_{1}and*C*_{2}lie in the same connected component of the incompatibility graph IG(𝒞).

### Theorem 1.

*Let* 𝒞 *be a set of clusters. If there exists a level-k network representing* 𝒞, *then there also exists such a network satisfying the decomposition property w.r.t.* 𝒞.

### Proof.

Let 𝒞 be a set of input clusters and *N* a level-*k* network representing 𝒞. Let 𝒞_{1},…, 𝒞_{p} be the non-trivial connected components of the incompatibility graph IG(𝒞). For each *i* ∈ {1,…, *p*}, we construct a simple level-≤*k* network *N*_{i} as follows. Let as before. For each maximal subset *X* ⊂ 𝒳_{i} (with |*X*|> 1) that is not separated by 𝒞_{i}, replace in *N* an arbitrary leaf labeled by an element of *X* by a leaf labeled *X* and remove all other leaves labeled by elements of *X*. In addition, remove all leaves with labels that are not in *X*_{i}. We tidy up the resulting graph by repeatedly applying the following five steps until none is applicable: (i) delete unlabeled nodes with outdegree 0; (ii) suppress nodes with indegree and outdegree 1 (i.e. contract one edge incident to the node); (iii) replace multiple edges by single edges, (iv) remove the root if it has outdegree 1 and (v) contract biconnected components that have only one outgoing edge. This leads to a level-*k* network *N*_{i}. Let 𝒞_{i}′ be defined as in Step 1 of the algorithm. By its construction, *N*_{i} represents 𝒞_{i}′. Furthermore, *N*_{i} is a simple level-≤*k* network, because if it would contain a cut-edge *e* whose head is not a leaf, then the set of taxa labeling leaves reachable from *e* would not be separated by 𝒞_{i}′ and would hence have been collapsed. Finally, the networks *N*_{1},…, *N*_{p} can be merged into a level-*k* network representing 𝒞 and satisfying the decomposition property by executing Steps 3 and 4 of the algorithm. ▪

Intuitively, Theorem 1 tells us that whenever there exists a level-*k* network *N* representing 𝒞, there also exists such a network *N*′ whose biconnected components correspond to the connected components of the incompatibility graph. Since *N*′ has level *k*, each biconnected component has level at most *k*. Hence, we can construct a simple level-≤*k* network for each connected component of the incompatibility graph. Subsequently, we can merge these simple level-≤*k* networks into a level-*k* network representing 𝒞. This is precisely what the set-up described above does.

Note finally that the statement obtained by replacing ‘level-*k* network’ by ‘network with *k* reticulations’ in Theorem 1 does not hold, as shown in Huson *et al.* (2009), based on Gusfield *et al.* (2007a).

## 4 SIMPLE LEVEL-*K* NETWORKS

This section describes how one can construct a simple level-*k* network representing a given set of clusters. We say that a phylogenetic tree *T* is a *strict subtree* of a network *N* if *T* is a subgraph of *N* and for each node *v* of *T*, except its root, it holds that the in- and outdegree of *v* in *T* are equal to the in- and (respectively) outdegree of *v* in *N*.

Informally, our method for constructing simple level-*k* networks operates as follows. See Figure 4 for an example. Cass loops over all taxa *x*. For each choice of *x*, *C*ass removes it from each cluster and subsequently collapses all maximal ‘ST-sets’ (‘strict tree sets’, defined below) of the resulting cluster set. The algorithm repeats this step *k* times, after which all leaves will be collapsed into just two taxa and the second phase of the algorithm starts. Cass creates a network consisting of a root with two children, labeled by the only two taxa. Then the algorithm ‘decollapses’, i.e. it replaces each leaf labeled by an ST-set by a strict subtree. Subsequently, Cass adds a new leaf below a new reticulation and labels it by the latest removed taxon. Since it does not know where to create the new reticulation, Cass tries adding the reticulation below each pair of edges. The algorithm continues with a new decollapse step followed by hanging the next leaf below a reticulation. These steps are also repeated *k* times. For each constructed simple level-*k* network, Cass checks whether it represents all input clusters. If it does, the algorithm outputs the resulting network, after contracting any edges that connect two reticulations.

The idea behind this strategy is as follows. Observe that any simple level-*k* network *N* (*k* ≥ 1) contains a leaf whose parent is a reticulation (since we assume that each reticulation has outdegree at least 1). If we remove this leaf and reticulation from *N*, the resulting network might contain one or more strict subtrees. To reconstruct the network, we need to identify these strict subtrees from the set of clusters. We will see below that each strict subtree corresponds to an ST-set. Moreover, for the case *k* ≤ 2, we prove that (without loss of generality) each maximal strict subtree corresponds to a *maximal* ST-set. Cass collapses the maximal ST-sets because it assumes that these correspond to the strict subtrees. Now observe that collapsing each maximal strict subtree of the network leads to a (not necessarily simple) level-(*k* − 1) network, which again contains a leaf whose parent is a reticulation. It follows that it is indeed possible to repeat the described steps *k* times. Finally, Cass checks if all clusters are represented and only outputs networks for which this is the case.

Let us now formalize this algorithm. Given a set *S* ⊆ 𝒳 of taxa, we use 𝒞∖*S* to denote the result of removing all elements of *S* from each cluster in 𝒞 and we use 𝒞|*S* to denote 𝒞∖(𝒳∖*S*) (the restriction of 𝒞 to *S*). We say that a set *S* ≠ 𝒳 is an *ST-set* (strict tree set) w.r.t. 𝒞, if *S* is not separated by 𝒞 and any two clusters *C*_{1}, *C*_{2} ∈ 𝒞|*S* are compatible. An ST-set *S* is *maximal* if there is no ST-set *T* with *S* ⊊ *T*. Informally, the maximal ST-sets are the result of repeatedly collapsing pairs of unseparated taxa for as long as possible.

We use Collapse(𝒞) to denote the result of collapsing each maximal ST-set *S* into a single taxon *S*. More precisely, for each cluster *C* ∈ 𝒞 and maximal ST-set *S* of 𝒞, we replace *C* by *C*∖*S*∪{{*S*}}. For example (omitting singleton clusters), if

*k*) in Algorithm 1. The actual implementation is slightly more complex and much more space efficient.

Figure 4 shows how the Cass(2) algorithm, for example, constructs a simple level-2 network. We will now show that Cass(1) and Cass(2) will indeed construct a simple level-1, respectively, level-2 network whenever this is possible.

### Lemma 1.

*Given a set of clusters* 𝒞, *such that IG*(𝒞) *is connected and any X* ⊊ 𝒳 *is separated*, Cass(1) *and* Cass(2) *construct a simple level-1, respectively, a simple level-2 network representing* 𝒞, *if such a network exists*.

### Proof.

The general idea of the proof is as follows. Details have been omitted due to space constraints. Assume *k*≤2. It is clear that any (simple) level-*k* network *N* contains a reticulation *r* with a leaf, say labeled *x*, as child. Let *N*∖{*x*} denote the network obtained by removing the reticulation *r* and the leaf labeled *x* from *N*. This network might contain one or more strict subtrees. By the definition of ST-set, the set of leaf labels of each maximal strict subtree corresponds to an ST-set w.r.t. 𝒞∖{*x*}. However, in general not each such set needs to be a *maximal* ST-set. This is critical, because the total number of ST-sets can be exponentially large. Therefore, the main ingredient of our proof is the following. We show that whenever there exists a simple level-*k* network representing 𝒞, there exists a simple level-*k* network *N*′ representing 𝒞 such that the sets of leaf-labels of the maximal strict subtrees of *N*′∖{*x*} are the maximal ST-sets w.r.t. 𝒞∖{*x*}, with *x* the label of some leaf whose parent is a reticulation in *N*′. This is clearly true for *k* = 1. For *k* = 2, we sketch our proof below.

Let us first mention that the actual algorithm is slightly more complicated than the pseudocode in Algorithm 1. First, when Cass(*k*) constructs a tree, it adds a new ‘dummy’ root to this tree and creates an edge from this dummy root to the old root. Such a dummy root is removed before outputting a network. Second, whenever the algorithm removes a dummy taxon δ (which we use to model the situation when the previous leaf removal caused more than one reticulation to disappear), it makes sure that it does not collapse in the previous step.

Suppose there exists some level-2 network representing 𝒞. It can be shown that any such network is simple and that there exists at least one binary such network, say *N*. Since *N* is a binary simple level-2 network, there are only four possibilities for the structure of *N* (after removing leaves), see van Iersel *et al.* (2009a). These structures are called *generators*. In each case, *N*∖{*x*} contains at most two maximal strict subtrees that have more than one leaf. Furthermore, *N*∖{*x*} contains exactly one reticulation *r*′, below which hangs a strict subtree *T*_{r} with set of leaf labels *X*_{r} (possibly, |*X*_{r}| = 1 or |*X*_{r}| = 0).

First, we assume that *X*_{r} is not a maximal ST-set w.r.t. 𝒞∖{*x*}. In that case it follows that there is some maximal ST-set *X* that contains *X*_{r} and also contains at least one taxon labeling a leaf ℓ that is not reachable by a directed path from the reticulation of *N*∖{*x*}. We can replace ℓ by a strict subtree on *X* that represents 𝒞|*X*. Such a tree exists because *X* is an ST-set. We remove all leaves that label elements of *X* and are not in this strict subtree. Since there are now no leaves left below the reticulation, we can remove this reticulation as well. It is easy to see that the resulting network is a tree representing 𝒞∖{*x*}. Moreover, we show that in each case a leaf labeled *x* can be added below a new reticulation (possibly with indegree 3) in order to obtain a network *N*′ that represents 𝒞. Since *N*′ contains just one reticulation, it is clear that the maximal strict subtrees of *N*′∖{*x*} are the maximal ST-sets w.r.t. 𝒞∖{*x*}. Cass(2) reconstructs such a network with an indegree-3 reticulation by removing *x*, removing a dummy taxon δ, constructing a tree, adding a leaf labeled δ below a reticulation, adding a leaf labeled *x* below a reticulation, removing the leaf labeled δ and contracting the (now redundant) edges between the two reticulations. Note that this works because Cass(2) does not collapse in this case.

It remains to consider the possibility that *X*_{r} is a maximal ST-set w.r.t. 𝒞∖{*x*}. In this case, we modify network *N* to *N*′ in such a way that also the other maximal ST-sets w.r.t. 𝒞∖{*x*} appear as the leaf-sets of strict subtrees in *N*′∖{*x*}. We again use a case analysis to show that this is always possible in such a way that the resulting network *N*′ represents 𝒞. ▪

### Lemma 2.

Cass*runs in time O*(|𝒳|^{3k+2}·|𝒞|), *if k is fixed*.

### Proof.

Omitted due to space constraints. ▪

### Theorem 2.

*Given a set of clusters* 𝒞, Cass*constructs in polynomial time a level-2 network representing* 𝒞, *if such a network exists*.

### Proof.

Follows from Lemmas 1 and 2 and Theorem 1. ▪

We conclude this section by showing that for each *r*≥2, there exists a set of clusters 𝒞_{r} such that any galled network representing 𝒞_{r} needs at least *r* reticulations, while Cass constructs a network with just two reticulations, which also represents 𝒞_{r}. This follows from the following lemma.

### Lemma 3.

*For each r* ≥ 2, *there exists a set* 𝒞_{r}*of clusters such that there exists a network with two reticulations that represents* 𝒞_{r}*while any galled network representing* 𝒞_{r}*contains at least r reticulations*.

### Proof.

Omitted due to space constraints. ▪

## 5 PRACTICE

Our implementation of the Cass algorithm is available as part of the Dendroscope program (Huson *et al.*, 2007). To use Cass, first load a set of trees into Dendroscope. Subsequently, run the algorithm by choosing ‘options’ and ‘network consensus’. The program gives you the option of entering a threshold percentage *t*. Only clusters that appear in more than *t* percent of the input trees will be used as input for Cass. Choose ‘minimal network’ to run the Cass algorithm to construct a phylogenetic network representing all clusters that appear in more than *t* percent of the input trees.

Cass computes a solution for each biconnected component separately. If the computations for a certain biconnected component take too long, you can choose to ‘skip’ the component, in which case the program will quickly compute the cluster network (Huson and Rupp, 2008) for this biconnected component, instead. Alternatively, you can choose to construct a galled network, or to increase the threshold percentage *t*. See van Iersel *et al.* (2009b) for a user guide for Cass and all datasets used for this article. See Huson *et al.* (2007) for more information on using Dendroscope.

We have tested Cass on both practical and artificial data and compared Cass with other programs. The results (using *t* = 0) are summarized in Table 1 and Figure 5. For Table 1, several example datasets have been used, which have been selected in such a way as to obtain a good variation in number of taxa, number of clusters and network complexity. The first four datasets are the sets containing *all* possible clusters on 5, 6, 7 and 8 taxa, respectively. The other datasets have been constructed by taking the set of clusters in (arbitrary) networks of increasing size and complexity. Mostly networks with just one biconnected component have been used because, for networks with more biconnected components, both algorithms use the same method to decompose into biconnected components and then both find a solution for each biconnected component separately. For each dataset, we have constructed one network using Cass, which we call the Cass network, and one galled network using the algorithm in Huson *et al.* (2009). Two conclusions can be drawn from the results. First, Cass uses more time than the galled network algorithm. Nevertheless, the time needed by Cass can still be considered acceptable for phylogenetic analysis. Second, Cass constructs a much simpler network in almost all cases. For three datasets, the Cass network and the galled network have the same reticulation number and the same level. For all other datasets, the Cass network has a significantly smaller reticulation number, and also a lower level, than the galled network.

Data | GalledNetwork | Cass | |||||
---|---|---|---|---|---|---|---|

|𝒞| | |𝒳| | t | k | r | t | k | r |

30 | 5 | 0 s | 6 | 6 | 1 s | 4 | 4 |

62 | 6 | 0 s | 8 | 8 | 7 s | 5 | 5 |

126 | 7 | 0 s | 10 | 10 | 28 s | 6 | 6 |

254 | 8 | 6 s | 12 | 12 | 4 m 3 s | 7 | 7 |

42 | 10 | 0 s | 4 | 4 | 6 s | 4 | 4 |

38 | 11 | 0 s | 7 | 7 | 14 s | 5 | 5 |

61 | 11 | 0 s | 6 | 6 | 47 s | 5 | 5 |

77 | 22 | 0 s | 9 | 9 | 36 s | 3 | 3 |

75 | 30 | 0 s | 11 | 11 | 5 s | 2 | 2 |

89 | 31 | 0 s | 16 | 16 | 27 m 32 s | 4 | 4 |

180 | 51 | 0 s | 11 | 11 | 30 s | 2 | 2 |

193 | 57 | 0 s | 1 | 4 | 1 s | 1 | 4 |

270 | 76 | 0 s | 16 | 16 | 4 m 52 s | 2 | 2 |

404 | 122 | 1 s | 2 | 2 | 21 m 10 s | 2 | 2 |

135.8 | 31.9 | 1s | 8.5 | 8.7 | 4 m 19 s | 3.7 | 3.9 |

Data | GalledNetwork | Cass | |||||
---|---|---|---|---|---|---|---|

|𝒞| | |𝒳| | t | k | r | t | k | r |

30 | 5 | 0 s | 6 | 6 | 1 s | 4 | 4 |

62 | 6 | 0 s | 8 | 8 | 7 s | 5 | 5 |

126 | 7 | 0 s | 10 | 10 | 28 s | 6 | 6 |

254 | 8 | 6 s | 12 | 12 | 4 m 3 s | 7 | 7 |

42 | 10 | 0 s | 4 | 4 | 6 s | 4 | 4 |

38 | 11 | 0 s | 7 | 7 | 14 s | 5 | 5 |

61 | 11 | 0 s | 6 | 6 | 47 s | 5 | 5 |

77 | 22 | 0 s | 9 | 9 | 36 s | 3 | 3 |

75 | 30 | 0 s | 11 | 11 | 5 s | 2 | 2 |

89 | 31 | 0 s | 16 | 16 | 27 m 32 s | 4 | 4 |

180 | 51 | 0 s | 11 | 11 | 30 s | 2 | 2 |

193 | 57 | 0 s | 1 | 4 | 1 s | 1 | 4 |

270 | 76 | 0 s | 16 | 16 | 4 m 52 s | 2 | 2 |

404 | 122 | 1 s | 2 | 2 | 21 m 10 s | 2 | 2 |

135.8 | 31.9 | 1s | 8.5 | 8.7 | 4 m 19 s | 3.7 | 3.9 |

For each algorithm, the level *k* and reticulation number *r* of the output network are given as well as the running time *t* in minutes (m) and seconds (s) on a 1.67 GHz 2 GB laptop. The last row gives the average values.

Figure 5 summarizes the results of an application of Cass to practical data. This dataset consists of six phylogenetic trees of grasses of the *Poaceae* family, originally published by the Grass Phylogeny Working Group (2001) and reanalyzed in Schmidt (2003). The phylogenetic trees are based on sequences from six different gene loci, ITS, ndhF, phyB, rbcL, rpoC and waxy, and contain 47, 65, 40, 37, 34 and 19 taxa, respectively. We have compared the results of Cass not only with the galled network and the cluster network algorithms, but also with the very recently developed algorithms HybridInterleave (J.Collins *et al.*, submitted for publication) and PIRN (Y.Wu, submitted for publication). HybridInterleave computes the minimum number of reticulations required to combine two binary phylogenetic trees (on the same set of taxa) into a phylogenetic network that displays both trees. PIRN has the same objective as HybridInterleave but has the advantage that it can accept more than two trees as input (which are still required to be binary). On the other hand, HybridInterleave has the advantage that it is guaranteed to find an optimal solution. For this experiment, we compiled PIRN with the ILP (Integer Linear Programming) solver CPLEX 10.2. We considered all possible subsets of at least two of the six gene trees; 57 in total. For each subset, we first restricted the trees to the taxa present in all trees in the subset to make the input data compatible with HybridInterleave and PIRN. Then, we executed each program for a maximum of 5 min on a 2.83 GHz quad-core PC with 8 GB RAM and recorded the best solution it could find in that time frame. The full results are available in Table 2 in the Supplementary Material. Results for HybridInterleave (which could only be applied to pairs of trees) differ from the results reported in (J.Collins *et al.*, submitted for publication) because there trees with a different rooting were used. Our results show that Cass always found a solution (within 5 min) when the minimum level was at most 4, and sometimes when the minimum level was 5 or 6. We also see that, in all these cases, no program found a solution using fewer reticulations than Cass.

To obtain each of the graphs in Figure 5, we averaged over those subsets where all the programs had terminated within 5 min (which was the majority). Several conclusions can be drawn from these graphs. The main conclusion is that Cass on average required fewer reticulations than each of the other programs. That Cass uses fewer reticulations than PIRN can be explained by the fact that PIRN (as well as HybridInterleave) requires the output network to display all input trees. The networks constructed by Cass do not necessarily display the input trees, but still represent all clusters from the trees, and in general use fewer reticulations to do so. Figure 5a is noteworthy in this regard. It turns out that, when restricted to subsets of exactly two trees, Cass, PIRN and HybridInterleave always achieved the same optimum. This turns out not to be coincidence, but a mathematical consequence of extracting clusters from exactly two binary trees on the same taxa set (L.J.J.van Iersel and S.M.Kelk, in preparation). The advantages of Cass clearly become most visible when larger subsets of trees are used.

In terms of running time, PIRN and HybridInterleave are in general faster than Cass, but Cass has the significant flexibility that it is not restricted to binary (i.e. fully resolved) input trees and is not restricted to trees on the same taxa set. Compared with HybridInterleave, Cass also has the advantage that it is not restricted to two input trees and that it constructs an actual network rather than to only report the number of reticulations. Finally, because Cass is not restricted to binary trees, the user is free to choose only well-supported clusters from the input trees. Figure 6 is a nice example of this: this is the output of Cass when given all clusters that were in at least three of the six gene trees (i.e. *t* = 34%), without having to first restrict to those taxa common to all six trees (in this case, only four taxa were common to all six input trees). This example also illustrates that, when there exists a solution with a low level, Cass can handle large numbers of taxa and reticulations.

## 6 DISCUSSION

We have introduced the Cass algorithm, which can be used to combine any set of clusters into a phylogenetic network representing those clusters. We have shown that the algorithm performs well on practical data. It provides a useful addition to existing software, because it usually constructs a simpler network representing the same set of input clusters. Furthermore, we have shown that Cass provides a polynomial-time algorithm for deciding whether a level-2 phylogenetic network exists that represents a given set of clusters. This algorithm is more useful in practice than algorithms for similar problems that take triplets as input (Jansson and Sung, 2006; Jansson *et al.*, 2006; To and Habib, 2009; van Iersel and Kelk, 2009; van Iersel *et al.*, 2009a), because clusters are more biologically relevant than triplets and because the latter algorithms need at least one triplet for each combination of three taxa as input, while Cass can be used for any set of input clusters. Furthermore, Cass is also not restricted to two input trees, as the algorithms in Bordewich *et al.* (2007); J.Collins *et al.*, (submitted for publication); Linz and Semple (2009) and not to fully resolved trees on identical taxa sets, as the algorithms in Bordewich *et al.* (2007), J.Collins *et al.* (submitted for publication), Y.Wu *et al.* (submitted for publication). Finally, we remark that Cass can also be used when one or more multi-labeled trees are given as input. In this case, Dendroscope first computes all clusters in the multi-labeled tree(s) and subsequently uses Cass to find a phylogenetic network representing these clusters. Several theoretical problems remain open. First of all, does Cass always construct a minimum-level network, even if this minimum is three or more? Second, what is the complexity of constructing a minimum level network, if the minimum level *k* is not fixed but part of the input? Is this problem FPT when parameterized by *k*? Finally, it would be interesting to design an algorithm that finds a network representing a set of input clusters that has a minimum reticulation number. So far, not even a non-trivial exponential-time algorithm is known for this problem.

^{2}Named after the Cass Field Station in New Zealand.

^{3}Note that to determine the reticulation number of a biconnected component one only counts edges inside this biconnected component.

## ACKNOWLEDGEMENTS

We thank Mike Steel for organizing the Cass workshop in the Cass Field Station in February 2009, where we started this work. We thank Yufeng Wu for providing us with the source code for his PIRN program.

*Funding*: Allan Wilson Centre for Molecular Ecology and Evolution (to L.V.I.); Computational Life Sciences grant of The Netherlands Organisation for Scientific Research (NWO to S.K.); Regula Rupp by the Deutsche Forschungsgemeinschaft (PhyloNet project to S.K.).

*Conflict of Interest*: none declared.

## REFERENCES

*Poaceae*)

*Lecture Notes in Computer Science*

*Lecture Notes in Bioinformatics*

*Lecture Notes in Bioinformatics*

*k*phylogenetic networks are constructable from a dense triplet set in polynomial time

*Lecture Notes in Computer Science*

## Comments