## Abstract

**Summary:** We describe an algorithm and software tool for comparing alternative phylogenetic trees. The main application of the software is to compare phylogenies obtained using different phylogenetic methods for some fixed set of species or obtained using different gene sequences from those species. The algorithm pairs up each branch in one phylogeny with a matching branch in the second phylogeny and finds the optimum 1-to-1 map between branches in the two trees in terms of a topological score. The software enables the user to explore the corresponding mapping between the phylogenies interactively, and clearly highlights those parts of the trees that differ, both in terms of topology and branch length.

**Availability:** The software is implemented as a Java applet at . It is also available on request from the authors.

**Contact:**thomas.nye@mrc-bsu.cam.ac.uk

## 1 INTRODUCTION

Many biological analyses involve the construction of a phylogenetic tree for some set of sequence data, and a variety of methods for inferring phylogenies are available. However, the choice of phylogenetic method can have a strong influence on the tree obtained for a given set of sequence data, both in terms of its topology and branch lengths. In addition, different gene trees can be obtained for some fixed set of species, where each gene tree is based on a different set of orthologous sequences chosen for the analysis. Comparison between gene trees and species trees can reveal concensus patterns of evolution as well as genes that diverge from this pattern. Methods for comparing phylogenies that are capable of revealing where two trees agree or differ are therefore desirable, in order to assess the quality of phylogenetic trees and analyse different phylogenetic methods.

While a number of algorithms for tree comparison have been developed previously, it seems as though there is no standard tool widely used by the phylogenetics community for visualizing similarities and differences. Several previous approaches to comparing trees (Robinson and Foulds, 1981; Cole *et al*., 2000; Hon *et al*., 2001) have concentrated on finding the maximal common subtree, or have involved computing a series of transformations that map one tree into the other in order to express the dissimilarity between the trees as an edit distance or metric. Our approach is rather different: given two trees to compare, we match up branches that have a similar topological characteristic. The topological characteristic we consider is the partition of leaf nodes determined by each branch in a tree. This process of matching up branches within the two trees under comparison leads to a form of alignment between the trees as opposed to a chain of operations relating them or a metric specifying their dissimilarity. In fact, our approach to tree comparison can be thought of as being analogous to sequence alignment, where instead of conserved letters in a sequence we have branches that share topological features. The alignment obtained is best explored interactively, as implemented in the web-based tool we describe below. Other approaches to tree comparison (Munzner *et al*., 2003; Page, 1995) have matched nodes in two alternative rooted trees according to the ancestors the nodes share. We discuss these methods in a later section.

## 2 ALGORITHM

Suppose we are given two phylogenetic trees *T*_{1} and *T*_{2} that share the same set of leaves *L*. The trees may not necessarily be bifurcating, and can be either rooted or unrooted. For simplicity we assume the trees have the same number of branches. Our comparison algorithm has two stages. First every pair of edges (*i*, *j*) with *i* ∈ *T*_{1} and *j* ∈ *T*_{2} is assigned a score *s*(*i*, *j*), that reflects the topological similarity of the branches *i* and *j*. Secondly, branches in the two trees are paired up to optimize the overall score. More formally, this is equivalent to finding a bijection (i.e. a 1-to-l correspondence) *f* : *T*_{1} → *T*_{2} between the branches of the trees, that maximizes the quantity

*f*between branches in the two trees, which we refer to as an alignment.

### 2.1 Scoring branch pairs

Each edge *e* in a tree *T* defines a partition of the leaf nodes into two subsets: cutting the branch divides the tree into two subtrees, and this determines the partition of leaf nodes. The score *s*(*i*, *j*) for any pair of edges (*i*, *j*), *i* ∈ *T*_{1} and *j* ∈ *T*_{2}, is obtained by comparing the two corresponding partitions of the leaf nodes *L*. Suppose the pair (*i*, *j*) gives rise to the two partitions

*P*

_{i0},

*P*

_{i1}are the two disjoint subsets forming the partition of

*L*corresponding to branch

*i*, and similarly for

*P*

_{j0},

*P*

_{j1}. We can then count the number of elements of

*L*shared by the partitions: for

*r*,

*s*= 0, 1 define

*i, j*),

*a*represents the proportion of elements shared by the sets

_{rs}*P*and

_{ir}*P*.

_{js}The score *s*(*i*, *j*) is then defined by

To illustrate how the score is calculated, consider the following example. Suppose the set of leaf nodes is {*a*, *b*, *c*, *d*, *e*, *f*, *g*} and that for some pair of branches (*i, j*) we have the partitions

The form of the score *s*(*i*, *j*) defined by the above equation warrants a brief discussion. Given the two partitions corresponding to branches *i*, *j*, their respective subsets can be compared in two ways: either *P*_{i0} is compared with *P*_{j0} and *P*_{i1} to *P*_{j1}, or *P*_{i0} is compared with *P*_{j1} and *P*_{i1} to *P*_{j0}. The score *s*(*i*, *j*) corresponds to assigning to each of these two modes of comparison the score *a _{rs}* of the most dissimilar sets, and then picking the mode of comparison that maximizes this score. While other scoring systems could be used, ours has the advantage that when two similar partitions both consist of one large set and one much smaller set, the effect of dissimilarities between the two smaller sets is not dominated by the effect of the two larger sets. As an illustration of this, consider padding out the sets

*P*

_{i1}and

*P*

_{j0}in the example above with extra elements

*k*,

*l*,

*m*, etc.: the score

*s*(

*i*,

*j*) does not change.

### 2.2 Finding the optimal tree alignment

As described above, we seek a bijection *f* : *T*_{1} → *T*_{2} between the branches of the trees that maximizes the quantity given in Equation (1). This problem is solved in *O*(*n*^{3}) steps, where *n* is the number of leaves, by the Munkres algorithm (Munkres, 1957; Bourgeois and Lassalle, 1971).

## 3 IMPLEMENTATION

The software is available as a Java applet from the website given in the abstract. The two trees are displayed side-by-side. The mean topological score between internal branches matched under *f* is displayed, as a global measure of similarity. Clicking with the mouse on any branch in one tree highlights its matching branch in the other tree under the bijection *f*. Branch thickness is used to indicate the topological score *s*(*i*, *f*(*i*)) for each branch: thicker branches represent a lower score (although this can be modified by the user). In this way the user's attention is drawn to regions where the two tree topologies differ. Hovering the mouse-pointer over a branch causes its topological score to be displayed. A colour scheme is used to indicate how the well the lengths of each branch *i* and its match *f*(*i*) agree. Shift-clicking on a branch *i* highlights the branch *j* in the other tree with the highest score *s*(*i*, *j*).

An example of output from the software is shown in Figure 1. Two alternative phylogenies for strains of HIV viruses, taken from Roques *et al*. (2004), are shown. Further details are given in the figure caption.

**Fig. 1**

**Fig. 1**

## 4 OTHER APPROACHES

Many approaches to tree comparison involve calculating a single metric to express the dissimilarity between two trees. Of note is the Robinson–Foulds metric (Robinson and Foulds, 1981) that counts the number of partitions of leaf nodes that arise in one tree but not the other. However, there are other methods, similar to the approach presented here, that construct maps between the nodes of rooted trees. The TreeJuxtaposer software of Munzner *et al*. (2003) scores pairs of nodes according to the similarity of their sets of descendants. Given trees *T*_{1} and *T*_{2}, it constructs a map *f*_{1} : *T*_{1} → *T*_{2} between nodes of the two trees, such that each node in *T*_{1} is mapped to the node in *T*_{2} with the most similar set of descendants. Another map *f*_{2} : *T*_{2} → *T*_{1} is constructed in a similar way. The two maps are not necessarily 1-to-l, and although this approach is similar to ours, the lack of symmetry can make visualization less intuitive. However, as described in the previous section, shift-clicking on branches in our application provides access to equivalents of the maps *f*_{1} and *f*_{2}. The TreeMap software package implements a similar approach to TreeJuxtaposer (the method is described in Page, 1995), involving two maps *f*_{1}, *f*_{2} between the nodes of rooted trees. TreeMap is particularly adapted to the analysis of trees from host species and their cospeciating parasites.

Funding to pay the Open Access publication charges for this article was provided by the Medical Research Council.

*Conflict of Interest:* none declared.

## Comments