## Abstract

Summary: We describe an algorithm and software tool for comparing alternative phylogenetic trees. The main application of the software is to compare phylogenies obtained using different phylogenetic methods for some fixed set of species or obtained using different gene sequences from those species. The algorithm pairs up each branch in one phylogeny with a matching branch in the second phylogeny and finds the optimum 1-to-1 map between branches in the two trees in terms of a topological score. The software enables the user to explore the corresponding mapping between the phylogenies interactively, and clearly highlights those parts of the trees that differ, both in terms of topology and branch length.

Availability: The software is implemented as a Java applet at . It is also available on request from the authors.

Contact:thomas.nye@mrc-bsu.cam.ac.uk

## 1 INTRODUCTION

Many biological analyses involve the construction of a phylogenetic tree for some set of sequence data, and a variety of methods for inferring phylogenies are available. However, the choice of phylogenetic method can have a strong influence on the tree obtained for a given set of sequence data, both in terms of its topology and branch lengths. In addition, different gene trees can be obtained for some fixed set of species, where each gene tree is based on a different set of orthologous sequences chosen for the analysis. Comparison between gene trees and species trees can reveal concensus patterns of evolution as well as genes that diverge from this pattern. Methods for comparing phylogenies that are capable of revealing where two trees agree or differ are therefore desirable, in order to assess the quality of phylogenetic trees and analyse different phylogenetic methods.

While a number of algorithms for tree comparison have been developed previously, it seems as though there is no standard tool widely used by the phylogenetics community for visualizing similarities and differences. Several previous approaches to comparing trees (Robinson and Foulds, 1981; Cole et al., 2000; Hon et al., 2001) have concentrated on finding the maximal common subtree, or have involved computing a series of transformations that map one tree into the other in order to express the dissimilarity between the trees as an edit distance or metric. Our approach is rather different: given two trees to compare, we match up branches that have a similar topological characteristic. The topological characteristic we consider is the partition of leaf nodes determined by each branch in a tree. This process of matching up branches within the two trees under comparison leads to a form of alignment between the trees as opposed to a chain of operations relating them or a metric specifying their dissimilarity. In fact, our approach to tree comparison can be thought of as being analogous to sequence alignment, where instead of conserved letters in a sequence we have branches that share topological features. The alignment obtained is best explored interactively, as implemented in the web-based tool we describe below. Other approaches to tree comparison (Munzner et al., 2003; Page, 1995) have matched nodes in two alternative rooted trees according to the ancestors the nodes share. We discuss these methods in a later section.

## 2 ALGORITHM

Suppose we are given two phylogenetic trees T1 and T2 that share the same set of leaves L. The trees may not necessarily be bifurcating, and can be either rooted or unrooted. For simplicity we assume the trees have the same number of branches. Our comparison algorithm has two stages. First every pair of edges (i, j) with iT1 and jT2 is assigned a score s(i, j), that reflects the topological similarity of the branches i and j. Secondly, branches in the two trees are paired up to optimize the overall score. More formally, this is equivalent to finding a bijection (i.e. a 1-to-l correspondence) f : T1T2 between the branches of the trees, that maximizes the quantity

(1)
$∑i∈T1s(i,f(i)).$
These steps are described in more detail below. The outcome of the algorithm is the correspondence f between branches in the two trees, which we refer to as an alignment.

### 2.1 Scoring branch pairs

Each edge e in a tree T defines a partition of the leaf nodes into two subsets: cutting the branch divides the tree into two subtrees, and this determines the partition of leaf nodes. The score s(i, j) for any pair of edges (i, j), iT1 and jT2, is obtained by comparing the two corresponding partitions of the leaf nodes L. Suppose the pair (i, j) gives rise to the two partitions

$Pi0∪Pi1=Pj0∪Pj1=L,$
where Pi0, Pi1 are the two disjoint subsets forming the partition of L corresponding to branch i, and similarly for Pj0, Pj1. We can then count the number of elements of L shared by the partitions: for r, s = 0, 1 define
$ars=|Pir∩Pjs||Pir∪Pjs|.$
For a fixed branch pair (i, j), ars represents the proportion of elements shared by the sets Pir and Pjs.

The score s(i, j) is then defined by

$s(i,j)=max{min{a00,a11},min{a01,a10}}.$

To illustrate how the score is calculated, consider the following example. Suppose the set of leaf nodes is {a, b, c, d, e, f, g} and that for some pair of branches (i, j) we have the partitions

$Pi0={a,b},Pi1={c,d,e,f,g},Pj0={d,e,f,g},Pj1={a,b,c}.$
We then have
$s(i,j)=max{min{0,17},min{23,45}}=23.$

The form of the score s(i, j) defined by the above equation warrants a brief discussion. Given the two partitions corresponding to branches i, j, their respective subsets can be compared in two ways: either Pi0 is compared with Pj0 and Pi1 to Pj1, or Pi0 is compared with Pj1 and Pi1 to Pj0. The score s(i, j) corresponds to assigning to each of these two modes of comparison the score ars of the most dissimilar sets, and then picking the mode of comparison that maximizes this score. While other scoring systems could be used, ours has the advantage that when two similar partitions both consist of one large set and one much smaller set, the effect of dissimilarities between the two smaller sets is not dominated by the effect of the two larger sets. As an illustration of this, consider padding out the sets Pi1 and Pj0 in the example above with extra elements k, l, m, etc.: the score s(i, j) does not change.

### 2.2 Finding the optimal tree alignment

As described above, we seek a bijection f : T1T2 between the branches of the trees that maximizes the quantity given in Equation (1). This problem is solved in O(n3) steps, where n is the number of leaves, by the Munkres algorithm (Munkres, 1957; Bourgeois and Lassalle, 1971).

## 3 IMPLEMENTATION

The software is available as a Java applet from the website given in the abstract. The two trees are displayed side-by-side. The mean topological score between internal branches matched under f is displayed, as a global measure of similarity. Clicking with the mouse on any branch in one tree highlights its matching branch in the other tree under the bijection f. Branch thickness is used to indicate the topological score s(i, f(i)) for each branch: thicker branches represent a lower score (although this can be modified by the user). In this way the user's attention is drawn to regions where the two tree topologies differ. Hovering the mouse-pointer over a branch causes its topological score to be displayed. A colour scheme is used to indicate how the well the lengths of each branch i and its match f(i) agree. Shift-clicking on a branch i highlights the branch j in the other tree with the highest score s(i, j).

An example of output from the software is shown in Figure 1. Two alternative phylogenies for strains of HIV viruses, taken from Roques et al. (2004), are shown. Further details are given in the figure caption.

Fig. 1

Comparison of alternative phylogenies for HIV strains. The phylogenies were constructed for strains with fully sequenced genomes, by using sequences from two different genomic regions. Three groups of human strains (M, N and O) are shown together with four simian strains (the SIV group). The phylogenies exhibit two different positions for the N-group, one closer to the simian group (left), and the other closer to the human groups (right). This was probably caused by an ancient recombination event in the N-group ancestor. The thicker branches are those receiving a low topological score: in particular the thickest branches arise from the two alternative positions for the N-group. The topology of the O-group matches exactly between the phylogenies, as indicated by the thin branches, while the internal topology of the M-group differs quite widely. Branches drawn in red are longer than the corresponding branches in the other tree, with the intensity of the colour indicating the level of mismatch. Blue branches conversely denote shorter branches. The yellow markers indicate a match between branches (clicking on any branch results in its match being highlighted and the score displayed). The particular match illustrated here has a topological score of 67%. Although both trees contain a branch separating the N and simian clades from the rest of the tree, those branches are not identified under the constraint of finding the optimal 1-to-l map between branches.

Fig. 1

Comparison of alternative phylogenies for HIV strains. The phylogenies were constructed for strains with fully sequenced genomes, by using sequences from two different genomic regions. Three groups of human strains (M, N and O) are shown together with four simian strains (the SIV group). The phylogenies exhibit two different positions for the N-group, one closer to the simian group (left), and the other closer to the human groups (right). This was probably caused by an ancient recombination event in the N-group ancestor. The thicker branches are those receiving a low topological score: in particular the thickest branches arise from the two alternative positions for the N-group. The topology of the O-group matches exactly between the phylogenies, as indicated by the thin branches, while the internal topology of the M-group differs quite widely. Branches drawn in red are longer than the corresponding branches in the other tree, with the intensity of the colour indicating the level of mismatch. Blue branches conversely denote shorter branches. The yellow markers indicate a match between branches (clicking on any branch results in its match being highlighted and the score displayed). The particular match illustrated here has a topological score of 67%. Although both trees contain a branch separating the N and simian clades from the rest of the tree, those branches are not identified under the constraint of finding the optimal 1-to-l map between branches.

## 4 OTHER APPROACHES

Many approaches to tree comparison involve calculating a single metric to express the dissimilarity between two trees. Of note is the Robinson–Foulds metric (Robinson and Foulds, 1981) that counts the number of partitions of leaf nodes that arise in one tree but not the other. However, there are other methods, similar to the approach presented here, that construct maps between the nodes of rooted trees. The TreeJuxtaposer software of Munzner et al. (2003) scores pairs of nodes according to the similarity of their sets of descendants. Given trees T1 and T2, it constructs a map f1 : T1T2 between nodes of the two trees, such that each node in T1 is mapped to the node in T2 with the most similar set of descendants. Another map f2 : T2T1 is constructed in a similar way. The two maps are not necessarily 1-to-l, and although this approach is similar to ours, the lack of symmetry can make visualization less intuitive. However, as described in the previous section, shift-clicking on branches in our application provides access to equivalents of the maps f1 and f2. The TreeMap software package implements a similar approach to TreeJuxtaposer (the method is described in Page, 1995), involving two maps f1, f2 between the nodes of rooted trees. TreeMap is particularly adapted to the analysis of trees from host species and their cospeciating parasites.

Funding to pay the Open Access publication charges for this article was provided by the Medical Research Council.

Conflict of Interest: none declared.

## REFERENCES

Bourgeois
F.
Lassalle
J.C.
An extension of the Munkres Algorithm for the assignment problem to rectangular matrices
Commun. ACM
,
1971
, vol.
14
(pg.
802
-
804
)
Cole
R.
, et al.  .
An O(n log n) algorithm for the maximum agreement subtree problem for binary trees
SIAM J. Comput.
,
2000
, vol.
30
(pg.
1385
-
1404
)
Hon
W.K.
, et al.  .
Improved phylogeny comparisons: non-shared edges, nearest neighbor interchanges, and subtree transfers
LNCS
,
2001
, vol.
1969
(pg.
527
-
538
)
Munkres
J.
Algorithms for the assignment and transportation problems
J. SIAM
,
1957
, vol.
5
(pg.
32
-
38
)
Munzner
T.
, et al.  .
TreeJuxtaposer: scalable tree comparison using Focus+Context with guaranteed visibility
ACM Trans. graphics
,
2003
, vol.
22
(pg.
453
-
462
)
Page
R.D.M.
Parallel phylogenies: reconstructing the history of host-parasite assemblages
,
1995
, vol.
10
(pg.
155
-
173
)
Robinson
D.F.
Foulds
L.R.
Comparison of phylogenetic trees
Math. Biosci.
,
1981
, vol.
53
(pg.
131
-
147
)
Roques
P.
, et al.  .
Phylogenetic characteristics of three new HIV-1 N strains and implications for the origin of group N
AIDS
,
2004
, vol.
18
(pg.
1371
-
381
)

## Author notes

Associate Editor: Keith A. Crandall
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the orginial authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org