A Full Characterization of Evolutionary Tree Topologies

The topologies of evolutionary trees are shaped by the nature of the evolutionary process, but comparisons of trees from different processes are hindered by the challenge of completely describing tree topology. We present a full characterization of the topologies of rooted branching trees in a form that lends itself to natural tree comparisons. The resulting metric distinguishes trees from random models known to produce different tree topologies. It separates trees derived from tropical vs USA influenza A sequences, indicating that the different epidemiology of tropical and seasonal flu leaves strong signatures in the tree topology. Our approach allows us to construct addition and multiplication on trees, and to create a convex metric on tree topologies which formally allows computation of average trees.


Introduction
The availability and declining cost of DNA sequencing mean that data on the diversity, variation and evolution of organisms are more widely available than ever before. Increasingly, thousands of organisms are being sequenced at the whole-genome scale [1,2,3]. This has had particular impact on the study of pathogens, whose evolution occurs rapidly enough to be observed over relatively short periods. As the numbers of sequences gathered annually grow to the tens of thousands in many organisms, comparing this year's evolutionary and diversity patterns to previous years', and comparing one location to another, has become increasingly challenging. Despite the fact that evolution does not always occur in a tree-like way due to the horizontal movements of genes, phylogenetic trees remain a central tool with which we interpret these data.
The topologies of phylogenetic trees are of long-standing interest in both mathematics and evolution [4,5,6,7,8,9,10,11]. A tree's topology refers to the tree's connectivity structure, without reference to the lengths of its branches. A key early observation was that trees reconstructed from evolutionary data are more asymmetric than simple models predict. This spurred an interest in ways to measure tree asymmetry [8,12,13,14,15], in the power of asymmetry measures to distinguish between random models [16,8,17], and in constructing generative models of evolution that produce imbalanced trees [13,18,10]. Tree topologies carry information about the underlying evolutionary processes, and distributions of tree topologies under simple null models can be used to test hypotheses about evolution [9,10,19,7,11]. Recent work also relates fitness, selection and a variety of ecological processes to tree topology [20,21,22,23,24,18]. An additional motivation for studying the topologies of phylogenetic trees is that reconstructing branch lengths is challenging, particularly deep in a tree; there may be weak support for a molecular clock, and coalescent inference procedures may produce trees with consistent topology but differing root heights.
Tree topology is well established as carrying important information about macroevolutionary processes, but also carries information about evolution in the short term. In the context of pathogens, diversity patterns represent a combination of neutral variation that has not yet become fixed, variation that is under selection, complex demographic processes (host behaviour and contact patterns), and an array of ecological interactions. The extent to which tree topologies are informative of these processes is not well understood, though there have been studies on the frequency of cherries and tree imbalance [25,26,27] and simulation studies aiming to explore the question [29,28,30,31].
A key limitation in relating tree topologies to evolution and ecology has been the limited tools with which trees can be quantified and compared. Comparing tree topologies from different models of evolution or from different datasets requires comparing unlabelled trees, whereas most established tree comparison methods (e.g. the Robinson-Foulds [32] and Billera-Holmes-Vogtmann [33] metrics) compare trees with one particular set of organisms at the tips (i.e. one set of taxa, with labels). The tools at our disposal to describe and compare tree topologies from different sets of tips are limited, and have focused on scalar measures of overall asymmetry [5,34,17,14,12,15,35,36] and on the frequencies of small subtree topologies such as cherries [37,31,25] and r-pronged nodes [38]. Recently, kernel [39] and spectral [40] approaches have also been used.
Here we give a simple and complete characterization of all possible topologies for a rooted tree. Our scheme gives rise to natural metrics (in the sense of true distance functions) on unlabelled tree topologies. It provides an efficient way to count the frequencies of sub-trees in large trees, and hence can be used to compare empirical distributions of sub-tree topologies. It is not limited to binary trees and can be formulated for any maximum size multifurcation, as well as for trees with internal nodes with only one descendant (sampled ancestors). The resulting topology-based tree metrics separate trees derived from different random tree models. We use the approach to compare trees from human influenza A (H3N2); it can distinguish between trees from influenza sampled in the tropics vs that sampled in the USA.

Results
Briefly, with details in Materials and Methods, our approach is to label any possible tree topology, traversing the tree from the tips to the root and assigning labels as we go. The simplest case is to assume a binary tree, in which all internal nodes have two descendants. We give a tip the label 1. For every internal node, we list its descendants' labels (k, j), with k ≥ j. Using lexicographic sorting, list all possible labels (k, j): (1), (1, 1), (2, 1), (2, 2), (3, 1), (3, 2), (3, 3), ... We define the label of a tree topology whose root node has descendants (K, J) to be the index at which (K, J) appears in this list. Accordingly, a "cherry" (a node with two tip descendants) is labelled 2 because its descendants are (1, 1), which is the second entry in the list. A node with a cherry descendant and a tip descendant (a (2, 1), or a pitchfork) has label 3. The tree topology (k, j) (a tree whose root has a descendant with label k and one with label j) has label $\phi_2(k, j) = \tfrac{1}{2}k(k-1) + j + 1$. The scheme takes a different explicit form if there are multifurcations or internal nodes with a single descendant, but proceeds in the same way (see Supporting Information; the form for trees with no multifurcations but allowing for internal nodes with one descendant is $\phi_2(k, j) = \tfrac{1}{2}k(k+1) + j + 1$). We continue until the root of the tree has a label. Figure 1 illustrates the labels at the nodes of two binary trees. The label of the root node uniquely defines the tree topology. Indeed, tree isomorphism algorithms use similar labelling [41,42,43,44,45]. If $R_a$ and $R_b$ are the root nodes of binary trees $T_a$ and $T_b$, the tree topologies are the same if and only if $\phi_2(R_a) = \phi_2(R_b)$. The map between trees and labels is bijective: every positive integer corresponds to a unique tree topology and vice versa.

Figure 1: Illustration of the labels of the nodes of binary trees under the full bifurcating tree model $\phi_2(k, j) = \tfrac{1}{2}k(k-1) + j + 1$. Tips have the label 1. Labels of internal nodes are shown in black. The only difference between the trees in (a) and (b) is that in (b), the bottom-most tip from (a) has been removed. As a consequence, most of the labels are the same.
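The tip-to-root labelling can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the nested-pair tree representation, the `TIP` marker and the function names are assumptions made for the example, and the recursion is only suitable for small trees.

```python
TIP = None  # assumed representation of a tip; internal nodes are pairs (left, right)

def phi2(k, j):
    """Label of a binary tree whose root has subtrees labelled k and j."""
    if j > k:
        k, j = j, k  # enforce the convention k >= j
    return k * (k - 1) // 2 + j + 1

def label_nodes(tree, labels=None):
    """Return (root_label, labels), where labels collects the label of every node."""
    if labels is None:
        labels = []
    if tree is TIP:
        label = 1
    else:
        left, right = tree
        k, _ = label_nodes(left, labels)
        j, _ = label_nodes(right, labels)
        label = phi2(k, j)
    labels.append(label)
    return label, labels

cherry = (TIP, TIP)
pitchfork = (cherry, TIP)
root_label, all_labels = label_nodes((pitchfork, cherry))
print(root_label, sorted(all_labels))  # 6 [1, 1, 1, 1, 1, 2, 2, 3, 6]
```

Because the children's labels are ordered inside `phi2`, swapping the left and right subtrees of any node leaves every label unchanged, which is what makes the root label a description of the unlabelled topology.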
Metrics are an appealing way to compare sets of objects; defining a metric defines a space for the set of objects, in principle allowing navigation through the space, study of the space's dimension and structure, and the certainty that two objects occupy the same location if and only if they are identical. The labelling scheme gives rise to several natural metrics on tree topologies, based on the intuition that tree topologies are similar when they share many subtrees with the same labels. In the context of relating tree topologies to underlying evolutionary processes, a useful metric will be one that both distinguishes trees from processes known to produce distinct topologies, and that fails to distinguish trees from processes known to produce the same distribution of tree topologies.
There are several ways to sample random trees, known to produce trees of different topologies. These include models capturing equal vs different speciation rates, continuous time birth-death processes with different rates and others (see Methods). We used the metric arising from our labelling scheme to compare these. Figure 2 shows a visualization of the tree-tree distances between trees from different random models. The metric groups trees from each process together and distinguishes between them well. Summary statistics such as tree imbalance also distinguish some of these groups well (particularly the PDA, Aldous, Yule and biased speciation models), but imbalance does not substantially differ between the continuous-time branching models.
We also compared trees inferred from HA protein sequences of influenza A H3N2. Influenza A is highly seasonal outside the tropics [46], with the majority of cases occurring in winter. In contrast, there is little seasonal variation in transmission in the tropics. In addition, over long periods of time, influenza evolves in response to pressure from the human immune system, undergoing evolution particularly in the surface HA protein. This drives the 'ladder-like' shape of long-term influenza phylogenies [47,25,48,49], but would not typically be apparent in shorter-term datasets. With this motivation, we compared tropical samples to USA samples, and recent (2010-2015) global samples to early samples (pre-2010). Figure 3 shows that the tropical and USA flu trees are well separated by the metric. In contrast, the five-year (post-2010) and pre-2010 global samples occupy different regions of the projected tree space, but have some overlap. In these groups of trees, the underlying processes are similar, but the time frames and sampling density differ.
Natural metrics associated with the labelling scheme are all based on the bijective map $\phi$ between the tree space $\mathbb{T}$ and the natural numbers $\mathbb{N}$. Composing $\phi$ with bijective maps between $\mathbb{N}$ and other countable sets like the integers ($\mathbb{Z}$), the positive rational numbers ($\mathbb{Q}^+$), or the rationals ($\mathbb{Q}$) opens up further possibilities because we can take advantage of the properties (addition, multiplication, distance, etc.) of integer and rational numbers. If $\psi$ is a bijective map between $\mathbb{N}$ and one of these sets, then the composition $\psi \circ \phi$ is also bijective, and we can use it to define addition and multiplication operations on trees:

$$T_1 +_{\mathbb{T}} T_2 = (\psi \circ \phi)^{-1}\big(\psi(\phi(T_1)) + \psi(\phi(T_2))\big), \qquad T_1 \cdot_{\mathbb{T}} T_2 = (\psi \circ \phi)^{-1}\big(\psi(\phi(T_1)) \cdot \psi(\phi(T_2))\big), \tag{1}$$

where + and · are the usual addition and multiplication. Now the space of trees together with these definitions of addition and multiplication, $(\mathbb{T}, +_{\mathbb{T}}, \cdot_{\mathbb{T}})$, inherits all the algebraic properties of the set it is mapped into. For instance, $(\mathbb{T}, +_{\mathbb{T}}, \cdot_{\mathbb{T}})$ is a commutative ring if $\psi : \mathbb{N} \to \mathbb{Z}$. These constructions allow algebraic operations in the tree space $\mathbb{T}$. However, the choice of the map $\psi$ determines whether these operations are "meaningful" or "helpful" for applications of branching trees in biology or other fields. It turns out that the selection of a meaningful map is challenging. For example, we can use the labelling scheme to map tree topologies to the (positive and negative) integers. We first extend $\phi$ so that the label 0 corresponds to the empty tree $\emptyset$, i.e. the tree with no tips. Consider the following well-known map between $\mathbb{N}$ and $\mathbb{Z}$:

$$\psi_Z(n) = \begin{cases} n/2 & \text{if } n \text{ is even}, \\ -(n+1)/2 & \text{if } n \text{ is odd}. \end{cases}$$

The composition $\psi_Z \circ \phi$ is clearly bijective: each tree topology is mapped to a unique integer and each integer corresponds to a unique tree topology. A representation of ten trees is provided in Figure S1.

bioRxiv preprint doi: https://doi.org/10.1101/054544; this version posted July 11, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
To "add" or "multiply" trees, we can add or multiply their corresponding integers and then invert, as in Eq (1). This may seem intuitive for small trees; for example the sum of tree number 3 and tree number -1 gives tree number 2 which has one fewer tip than tree number 3. For larger trees, however, addition and multiplication operations are less intuitive and do not follow the numbers of tips.
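The arithmetic on trees described above can be sketched directly on the integer labels. The sketch below assumes the sign convention in which even labels map to n/2 and odd labels to -(n+1)/2 (the mirrored convention would serve equally well), and the function names are illustrative only. It operates on labels, so converting to and from actual trees is left to the labelling scheme.

```python
def psi_z(n):
    """Map a non-negative integer label to an integer (assumed sign convention)."""
    return n // 2 if n % 2 == 0 else -((n + 1) // 2)

def psi_z_inverse(z):
    """Inverse of psi_z: map any integer back to a non-negative label."""
    return 2 * z if z >= 0 else -2 * z - 1

def add_labels(a, b):
    """Label of the 'sum' of the trees with labels a and b."""
    return psi_z_inverse(psi_z(a) + psi_z(b))

def multiply_labels(a, b):
    """Label of the 'product' of the trees with labels a and b."""
    return psi_z_inverse(psi_z(a) * psi_z(b))

# Tree number 3 has label 6 and tree number -1 has label 1 under this convention;
# their sum is tree number 2, which has label 4.
print(psi_z(6), psi_z(1), add_labels(6, 1))  # 3 -1 4
```

Under this convention label 6 is the five-tip tree (3, 2) and label 4 is the four-tip tree (2, 2), so the sum does have one fewer tip, as in the small example in the text.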
Mapping tree topologies to other sets of numbers can help us to capture the space of tree topologies in new ways. A particularly nice property of a metric space is convexity: given two trees $T_1$ and $T_2$, there exists a tree $T_3$ lying directly between them, i.e. $d(T_1, T_2) = d(T_1, T_3) + d(T_3, T_2)$. Convex metrics are appealing because in a convex metric on tree topologies we can find the average tree topology for a set of trees, define a centre of mass topology, and further develop statistics on the space of tree topologies. We use the labelling scheme and a pairing of maps to construct a convex metric on tree topologies. To do this, we map tree topologies to the rational numbers, where the usual absolute value function is a convex metric (as there is always a rational number directly in between any two others). We use the prime decomposition, i.e. the unique product of prime factors of a number (e.g. 10 = 2 · 5). For a tree topology corresponding to integer n, we apply $\psi_Z$ to the exponents of all the prime factors of n + 1, and multiply the result (see Methods). For example, since $20 = 2^2 \cdot 5^1$, $\psi_{Q^+}(19) = 2^{\psi_Z(2)} 5^{\psi_Z(1)} = 2^{1} 5^{-1} = 2/5$. We denote this map $\psi_{Q^+}$; it takes each non-negative integer to a unique non-negative rational number, and vice versa (bijective). Applying $\psi_{Q^+}$ to the labels of tree topologies maps them bijectively to the non-negative rational numbers. We can add or multiply trees' corresponding rational numbers to perform operations in the space of tree topologies. In particular, we can use the usual absolute value distance function to define a convex metric space of tree topologies $(\mathbb{T}, d_{\mathbb{T}})$:

$$d_{\mathbb{T}}(T_1, T_2) = \big|\psi_{Q^+}(\phi(T_1)) - \psi_{Q^+}(\phi(T_2))\big|.$$

In this space we can find the average tree of a group of trees, and a 'direct path' between two trees.
Given n trees $T_1, \ldots, T_n$, the average tree is:

$$\bar{T} = (\psi_{Q^+} \circ \phi)^{-1}\Big(\frac{1}{n}\sum_{i=1}^{n} \psi_{Q^+}(\phi(T_i))\Big).$$

In other words, the average of a set of trees is the tree corresponding to the average of the trees' rational numbers under the map we have defined. Figure 4 illustrates this operation. There are infinitely many ways that we could map tree topologies to rational numbers. Any of them would give rise to a convex metric on the set of tree topologies. It would be most desirable if the resulting metric had some intuitive features, for example, if the trees lying directly between trees $T_1$ and $T_2$ (with $n_1$ and $n_2$ tips) had an intermediate number of tips between $n_1$ and $n_2$. The convex metric we have constructed does not have this particular intuitive property. This convex metric also relies on the prime factorisation of the tree labels, which is a challenge if large labels are encountered.

Discussion
The labelling scheme we present comprises a complete characterization of rooted tree topologies, not limited to fully bifurcating trees. Trees from processes known to produce different topologies are well separated in the metric that arises naturally from the scheme. This suggests applications in inferring evolutionary processes and in detecting tree shape bias [50,24,4]. The structure and simplicity of this comparison tool carry a number of advantages. Metrics have good resolution in comparing trees because the distance is only zero if tree topologies are the same. Empirical distributions of sub-tree topologies can easily be found and compared. And as we have shown, the approach can be extended to convex metrics on tree topologies, allowing averaging as well as algebraic operations (addition, etc.) in tree space. However, this approach does not seem likely to give rise to analytically tractable distributions of tree-tree distances, and in some cases, may not offer more useful resolution than a well-chosen collection of summary statistics.
Scalar measures of asymmetry are insufficient to characterize tree topologies. Here, imbalance measures do not distinguish between the continuous-time birth-death models with $R_0$ = 2, 5 but are quite different between the random processes, whereas the metric distinguishes all cases. Matsen [51] developed a method to define a broad range of tree statistics. Genetic algorithms uncovered tree statistics that can distinguish between the reconstructed trees in TreeBase [52] and trees from Aldous' β-splitting model, whereas imbalance measures do not [10]. However, the search-and-optimize approach is vulnerable to over-fitting, as the space of tree statistics is large. It is also reasonable to believe that due to ongoing decreases in the cost of sequencing, studies will increasingly analyze large numbers of sequences and reconstructed trees will have many tips. Any single scalar measure will likely be insufficient to capture enough of the information in these large trees to perform inference, motivating the development of metric approaches.
Large trees present a problem for many approaches to inference, including phylodynamic methods that rely on computationally intensive inference methods. In contrast, our scheme is better able to distinguish between groups of large trees than small ones (fewer than 100 tips). The tip-to-root traversal means that it is very efficient to construct the label set on very large trees (and the same traversal could, with little additional computation time, compute other properties that are naturally computed from tip to root, such as clade sizes, some imbalance measures and many of Matsen's statistics [51]). However, due to the large number of tree topologies, the labels themselves become extremely large even for relatively small trees. Our implementation used MD5 hashing to solve this problem, but hashing removes the ability to reconstruct the tree from its label. Also, there are $2^{128} \approx 3 \cdot 10^{38}$ possible hashed strings, which, while large, is less than the number of possible tree topologies, even restricting to 500 tips. Alternative labelling schemes may partially alleviate this, for example by subtracting from the label the minimum label for n tips, and only comparing trees of size n or greater. A related approach was used by Furnas [53] in developing algorithms to sample trees.
The large size of the labels is also a challenge when they are mapped to $\mathbb{Z}$, $\mathbb{Q}^+$ or $\mathbb{Q}$ to define a tree algebra or a convex metric. Small changes in the label value can determine visible changes in the topologies. Because the bijective maps are sensitive to small perturbations, the implementation requires the full label, with no hashing compression. However, for trees with 500 tips, we encountered labels of about one million digits. Handling such large numbers with full accuracy required heavy and slow computation. The search for the average tree as found in Figure 4 was only possible for small trees, as the map requires the prime factorization of the label.
Perhaps as it should be, the dominant difference in our scheme between a tree with ten tips and one with one hundred tips is the size of the tree. In this work we have chosen to detect differences that are not simply a reflection of the size of the tree. If we relax this constraint, the largest contribution to the distances will result from comparing the number of instances of the label 1 (tip) in two trees; this is necessarily larger than any other label copy number, and furthermore, a tree with more tips can have more cherries, pitchforks and any other subtree than a tree with fewer tips. It is straightforward to modify the metric $d_2$ to be relatively insensitive to tree size (see Supporting Information).
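The text does not say how the MD5 hashing was applied, so the following is only one plausible arrangement, offered as a sketch: each internal node's label is the MD5 digest of its two children's labels written in a canonical (sorted) order. The tree representation (`None` for a tip, a pair for an internal node) and the function name are assumptions.

```python
import hashlib

def hashed_labels(tree, labels=None):
    """Return (root_label, labels) using fixed-length MD5 strings as node labels.

    A tip is None and is labelled "1"; an internal node is a pair (left, right).
    """
    if labels is None:
        labels = []
    if tree is None:
        label = "1"
    else:
        left, right = tree
        a, _ = hashed_labels(left, labels)
        b, _ = hashed_labels(right, labels)
        first, second = sorted((a, b))  # canonical order, so child order is irrelevant
        label = hashlib.md5(f"{first},{second}".encode("ascii")).hexdigest()
    labels.append(label)
    return label, labels

cherry = (None, None)
left_heavy = ((cherry, None), None)
right_heavy = (None, (None, cherry))
print(hashed_labels(left_heavy)[0] == hashed_labels(right_heavy)[0])  # True
```

Every label is now 32 hexadecimal characters regardless of tree size, so the multiset of labels can still be counted and compared, but a label can no longer be decoded back into a tree and distinct topologies could in principle collide.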
Our scheme captures only the topology of the trees; there does not appear to be a natural way to incorporate branch lengths. One option is to add one or several terms to the distance function to incorporate more information (see Supporting Information). Linear combinations of our distances and other tree comparisons may turn out to be the most powerful approach to comparing unlabeled trees, allowing the user to choose the relative importance of scalar summaries, tree topology, spectra and so on while retaining the discriminating power of a metric. Ultimately, discriminating and informative tools for comparing trees will be essential for inferring the driving processes shaping evolutionary data.

Definitions
A tree topology is a tree (a graph with no cycles), without the additional information of tip labels and branch lengths. We use the same terminology as Mooers and Heard [9]. We consider rooted trees, in which there is one node specified to be the root. Tips, or leaves, are those nodes with degree 1. A rooted tree topology is a tree topology with a vertex designated to be the root. We use "tree topology", as we assume rootedness throughout. Typically, edges are implicitly understood to be directed away from the root. A node's descendants are the node's neighbors along edges away from the root. A multifurcation, or a polytomy, is a node with more than two descendants, and its size is its number of descendants (> 2). Naturally, a rooted phylogeny defines a (rooted) tree topology if the tip labels and edge weights are discarded. Phylogenies typically do not contain internal nodes with fewer than two descendants (sampled ancestors), but we allow this possibility in the tree topologies.

Labelling scheme
We label each tree topology according to the labels of the two clades descending from the root. In the simplest case (full binary trees), we call this label function $\phi_2$:

$$\phi_2(k, j) = \tfrac{1}{2}k(k-1) + j + 1, \qquad k \ge j.$$

The subscript 2 specifies that each node has a maximum of two descendants; the scheme can be extended to any fixed maximum number M of descendants, but then the explicit form of the label ($\phi_M$) is different.
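Because the map between binary tree topologies and positive integers is bijective, a label can also be decoded back into a tree. The sketch below assumes the full binary form of the label, ½k(k−1) + j + 1 with k ≥ j ≥ 1; the function names and the nested-pair tree representation are illustrative choices, not part of the source.

```python
from math import isqrt

def phi2(k, j):
    """Label of the binary tree whose root has subtrees labelled k >= j >= 1."""
    return k * (k - 1) // 2 + j + 1

def phi2_inverse(n):
    """Return (k, j) with k >= j >= 1 and phi2(k, j) == n, for n >= 2."""
    m = n - 1                        # m = k(k-1)/2 + j with 1 <= j <= k
    k = (isqrt(8 * m + 1) - 1) // 2  # largest k with k(k+1)/2 <= m
    if k * (k + 1) // 2 < m:
        k += 1                       # smallest k with k(k+1)/2 >= m
    j = m - k * (k - 1) // 2
    return k, j

def tree_from_label(n):
    """Rebuild the topology with label n (None is a tip, pairs are internal nodes)."""
    if n == 1:
        return None
    k, j = phi2_inverse(n)
    return (tree_from_label(k), tree_from_label(j))

def count_tips(tree):
    return 1 if tree is None else count_tips(tree[0]) + count_tips(tree[1])

print(phi2_inverse(6), count_tips(tree_from_label(6)))  # (3, 2) 5
```

Decoding is exact integer arithmetic, so it works for arbitrarily large labels, although the recursion depth grows with the height of the tree.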

Metrics on the space of rooted unlabelled shapes
There are several natural metrics suggested by our characterisation of tree topologies. Given two binary trees $T_a$ and $T_b$, we can write

$$d_0(T_a, T_b) = |\phi_2(R_a) - \phi_2(R_b)|.$$

Clearly $d_0$ is symmetric and non-negative. The tree isomorphism algorithm and the above labelling clearly show that $d_0 = 0 \iff T_a = T_b$, and the absolute value obeys the triangle inequality. However, it is not a particularly useful metric, in the sense that a large change in root label can result from a relatively "small" change, in intuitive terms, in the tree topology (such as the addition of a tip).
While each tree is defined by the label of its root, it is also defined (redundantly) by the labels of all its nodes. If the tree has n tips, the list of its labels contains n 1s, typically multiple 2s (cherries) and so on. Let $L_a$ denote the list of labels for a tree $T_a$: $L_a = \{1, 1, 1, \ldots, 2, 2, \ldots, \phi_2(R_a)\}$. The label lists are multisets because labels can occur multiple times. Define the distance $d_1$ between $T_a$ and $T_b$ to be the number of elements in the symmetric set difference between the label lists of two trees:

$$d_1(T_a, T_b) = |L_a \,\Delta\, L_b|.$$

Intuitively, this is the number of labels not included in the intersection of the trees' label lists. Formally, the symmetric set difference $A \Delta B = (A \cup B) \setminus (A \cap B)$ is the union of A and B without their intersection.
If A and B are multisets with A containing k copies of element x and B containing m copies of x, with k > m, we consider $A \cap B$ to contain m copies of x (these are common to both A and B). $A \Delta B$ has the remaining k − m copies. Each tree's label list contains more 1s (tips) than any other label. Accordingly, this metric is most appropriate for trees of the same size, because if trees vary in size, the metric can be dominated by differences in the numbers of tips. For example, if $L_a = \{1, 1, 1, 1, 2, 2, 4\}$ (four tips joined in two cherries, with root label 4) and $L_b = \{1, 1, 1, 2, 3\}$ (three tips, i.e. a pitchfork), then $L_a \Delta L_b = \{1, 2, 3, 4\}$, because there is a 1, a 2 and a 4 in $L_a$ in excess of those in $L_b$, and a 3 in $L_b$ that is not matched in $L_a$. Like $d_0$, $d_1$ is a metric: positivity and symmetry are clear from the definition. The cardinality of the symmetric difference is 0 if and only if the two sets are the same, in which case the root label is the same and the tree topologies are the same. That the symmetric difference obeys the triangle inequality is readily seen from the property $A \Delta C \subseteq (A \Delta B) \cup (B \Delta C)$.

Another natural metric that the labelling scheme induces is the L2 norm of the difference between two vectors counting the numbers of occurrences of each label. Let $v_a$ be a vector whose k-th element $v_a(k)$ is the number of times label k occurs in the tree $T_a$. Define the metric

$$d_2(T_a, T_b) = \| v_a - v_b \|_2.$$

Positivity, symmetry and the triangle inequality are evident, and again $d_2$ can only be 0 if $T_a$ and $T_b$ have the same number of copies of all labels (including the root label), which is true if and only if $T_a$ and $T_b$ have the same topology. This has a similar flavour to the statistic used to compare trees to Yule trees in [10], where the numbers of clades of a specific size were compared. We have used metric $d_2$ in the analyses presented in the Results.

Figure S1 illustrates tree topologies together with their labels under the map $\psi_Z$. We use this map and a map to the rational numbers to define a convex metric on tree topologies.
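The distances based on label multisets can be sketched with Python's `collections.Counter`, which behaves as a multiset. This is an illustrative sketch assuming the binary label ½k(k−1) + j + 1 and a nested-pair tree representation (`None` for a tip); it is not the authors' code, and the function names are made up for the example.

```python
from collections import Counter
from math import sqrt

def phi2(k, j):
    """Binary label for subtrees labelled k >= j >= 1."""
    return k * (k - 1) // 2 + j + 1

def label_counts(tree):
    """Counter mapping each label to the number of nodes of `tree` carrying it."""
    counts = Counter()

    def visit(node):
        if node is None:
            label = 1
        else:
            a, b = visit(node[0]), visit(node[1])
            label = phi2(max(a, b), min(a, b))
        counts[label] += 1
        return label

    visit(tree)
    return counts

def d1(ca, cb):
    """Size of the multiset symmetric difference of two label lists."""
    return sum(((ca - cb) + (cb - ca)).values())

def d2(ca, cb):
    """Euclidean distance between the label-count vectors."""
    return sqrt(sum((ca[x] - cb[x]) ** 2 for x in set(ca) | set(cb)))

cherry = (None, None)
balanced = label_counts((cherry, cherry))             # labels 1,1,1,1,2,2,4
caterpillar = label_counts(((cherry, None), None))    # labels 1,1,1,1,2,3,5
print(d1(balanced, caterpillar), d2(balanced, caterpillar))  # 4 2.0
```

Both trees have four tips, so the tip counts cancel and the distance comes entirely from the internal nodes: the second cherry and the root label 4 on one side, and the labels 3 and 5 on the other.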

Mapping tree topologies to the integers and rationals

Define the following map from $\mathbb{N}$ to $\mathbb{Q}$:

$$\psi_{Q^+}(n) = \prod_{i=1}^{\infty} p_i^{\psi_Z(a_i)}$$

if n > 0, or 0 if n = 0. Here, $p_i$ are all the prime numbers and $\prod_{i=1}^{\infty} p_i^{a_i}$ is the unique prime decomposition of n + 1. $\psi_Z$ is as defined above, mapping the non-negative integers to all integers by $\psi_Z(n) = n/2$ if n is even and $\psi_Z(n) = -(n+1)/2$ if n is odd. For example, since $12 = 2^2 \cdot 3^1$, $\psi_{Q^+}(11) = 2^{\psi_Z(2)} 3^{\psi_Z(1)} = 2^{1} 3^{-1} = 2/3$. $\psi_{Q^+}$ is injective, from the uniqueness of the prime factorization and the injectivity of $\psi_Z$. Since any rational number is a ratio of integers and $\psi_Z$ maps to all integers, the map is also surjective, and hence bijective. Therefore the composition $\psi_{Q^+} \circ \phi$ maps tree topologies bijectively to the non-negative rational numbers. In turn, $\mathbb{T}$ inherits all of the properties and structure of $\mathbb{Q}^+$. A distance metric $d_{\mathbb{T}}$ on $\mathbb{T}$ can be defined from the usual distance | · | of $\mathbb{Q}$:

$$d_{\mathbb{T}}(T_1, T_2) = \big|\psi_{Q^+}(\phi(T_1)) - \psi_{Q^+}(\phi(T_2))\big|.$$

Because the absolute value is a convex metric in $\mathbb{Q}$, this is a convex metric on unlabelled tree topologies. It can be used to find averages of a set of trees.
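A small sketch of the map to the rationals and of averaging, working on integer labels with exact `fractions.Fraction` arithmetic. It assumes the sign convention in which even exponents map to a/2 and odd exponents to -(a+1)/2, and it uses naive trial division, so it is only practical for small labels. All function names are illustrative.

```python
from fractions import Fraction

def psi_z(n):
    return n // 2 if n % 2 == 0 else -((n + 1) // 2)

def psi_z_inverse(z):
    return 2 * z if z >= 0 else -2 * z - 1

def factorize(n):
    """Prime factorisation by trial division, as a dict {prime: exponent}."""
    factors, p = {}, 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def psi_q(n):
    """Map a label n >= 0 to a non-negative rational via the factors of n + 1."""
    if n == 0:
        return Fraction(0)
    q = Fraction(1)
    for p, a in factorize(n + 1).items():
        q *= Fraction(p) ** psi_z(a)
    return q

def psi_q_inverse(q):
    """Recover the label from a rational in the image of psi_q."""
    if q == 0:
        return 0
    if q < 0 or q == 1:
        # With label 0 sent to 0 and n + 1 >= 2 otherwise, 1 is never produced.
        raise ValueError("not in the image of psi_q")
    n_plus_1 = 1
    for p, a in factorize(q.numerator).items():
        n_plus_1 *= p ** psi_z_inverse(a)
    for p, a in factorize(q.denominator).items():
        n_plus_1 *= p ** psi_z_inverse(-a)
    return n_plus_1 - 1

def average_label(labels):
    """Label whose rational image is the mean of the images of `labels`."""
    return psi_q_inverse(sum(psi_q(n) for n in labels) / len(labels))

print(psi_q(11), psi_q(19), average_label([11, 19]))  # 2/3 2/5 959
```

The average of labels 11 and 19 is label 959, which illustrates the point made in the text: the midpoint in this metric need not resemble either input in size, and the labels involved grow quickly.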

Simulations
We compared trees from different random processes and models. One of the most natural random processes modelling phylogenetic trees is the continuous-time homogeneous birth-death branching process, in which each individual gives rise to a descendant at a constant rate in time, and also risks removal (death) at a constant rate. With birth rate λ and death rate µ, the ratio λ/µ specifies the mean number of offspring of each individual in this process, and affects the topologies and branching times of the resulting branching trees. In the epidemiological setting, the link to branching times has been used to infer the basic reproduction number $R_0$ from sequence data [54,55]. We computed the distances between trees derived from constant-rate birth-death (BD) processes simulated in the package TreeSim in R [56]. One challenge is that the number of tips in the BD process after a fixed time is highly variable and depends on λ/µ. We aimed to detect shape differences that were not dominated by differences in the number of tips. Accordingly, we conditioned the processes to have 1500 taxa and then pruned tips uniformly at random to leave 700 tips remaining.

Figure S1: Some trees and their associated integers using the map $\psi_Z$ of Example 1. The numbering goes from -5 to 5, with the exception of 0 which corresponds to the "empty tree".

There are several other random models for trees. The Yule model is a model of growing trees in which lineages divide but do not die; in terms of tree topology it is the same as the Kingman coalescent and the equal rates Markov models.
In the 'proportional to distinguishable arrangements' (PDA) model, each unlabelled topology is sampled with probability proportional to the number of labelled trees on n tips with that unlabelled topology [57,9]. The "biased" model is a growing tree model in which a lineage with speciation rate r has descendant lineages with speciation rates pr and (1 − p)r. The Aldous' branching model that we use here is Aldous' β-splitting model with β = −1 [58]; in this model a distribution determines the (in general asymmetric) splitting densities upon branching. The Yule, PDA, biased and Aldous β = −1 models are available in the package apTreeshape in R [59]. We used p = 0.3 for the biased model, and sampled trees with 500 tips.
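The simulations in the paper were run with R packages; purely as an illustration of the Yule model described above, the following Python sketch grows a topology by repeatedly splitting a tip chosen uniformly at random. The dictionary representation, the function names and the seed are assumptions of the example and have no counterpart in the source.

```python
import random

def sample_yule(n_tips, rng):
    """Grow a Yule topology: split a uniformly chosen tip until there are n_tips tips.

    Returns a dict mapping node id -> (left_id, right_id), or None for a tip.
    """
    children = {0: None}
    tips = [0]
    next_id = 1
    while len(tips) < n_tips:
        i = rng.randrange(len(tips))
        node = tips[i]
        left, right = next_id, next_id + 1
        next_id += 2
        children[node] = (left, right)
        children[left] = None
        children[right] = None
        tips[i] = left        # the split tip is replaced by its two new tips
        tips.append(right)
    return children

def count_cherries(children):
    """Number of internal nodes whose two descendants are both tips."""
    return sum(
        1
        for pair in children.values()
        if pair is not None and children[pair[0]] is None and children[pair[1]] is None
    )

rng = random.Random(2016)  # arbitrary seed, for reproducibility only
tree = sample_yule(500, rng)
print(len(tree), count_cherries(tree))
```

A tree sampled this way can be passed through any tip-to-root labelling routine to obtain its label counts; the other models (PDA, biased, Aldous) would need their own samplers.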

Data
We aligned data of HA protein sequences from human influenza A (H3N2) in different settings reflecting different epidemiology. Data were downloaded from NCBI on 22 Jan 2016. In all cases we included only full-length HA sequences for which a collection date was available. The USA dataset (n = 2168)

Extension to multifurcations and sampled ancestors
A polytomy, or multifurcation, is an internal node with more than two descendants. In extending the scheme to handle polytomies we also extend it to allow for internal nodes with only one descendant.
We first explicitly work out the case where the maximum-size multifurcation is 4. Let 0 be the empty tree. Nodes may have 0, 1, 2, 3, or 4 descendants, and we write a general tree as (k, j, l, m), where k, j, l and m are the labels of the four trees descending from the root. Some of these may be empty (0) as not every node is a four-fold polytomy. As in the binary case, we use the convention that $k \ge j \ge l \ge m$, and sort the length-four strings lexicographically. Every possible tree T with a maximum-size multifurcation of four has a unique label $L_4(T)$ in this list. We seek to find an explicit expression for the label $L_4(T)$, the order in the list, for the tree (k, j, l, m).
The number of possible labels in the scheme with four characters, starting with k and sorted lexicographically, is $\binom{k+3}{k}$. To see this, note that each (k, j, l, m) with $k \ge j \ge l \ge m$ can be thought of as a path on a lattice, starting on the left at height k and descending to height 0 after three horizontal steps. The path has a total length of k + 3 steps, and of these, three must be steps to the right and k must be downward. The number of such paths is the number of ways of placing three rightwards steps amongst k + 3 steps, i.e. $\binom{k+3}{k}$. Extending this, we obtain the label $L_4$ of the tree (k + 1, 0, 0, 0), noting that $L_4(k, k, k, k)$ is the sum of the numbers of labels beginning with 1, 2, ..., k. $L_4(k+1, 0, 0, 0) = 1 + L_4(k, k, k, k)$ (and we write 1 as $\binom{3}{3}$):

$$L_4(k+1, 0, 0, 0) = \binom{3}{3} + \sum_{y=1}^{k} \binom{y+3}{y} = \sum_{y=0}^{k} \binom{y+3}{3}.$$

Rewriting the sum and making use of the identity $\sum_{y=0}^{k+c} \binom{y}{c} = \binom{k+c+1}{c+1}$, we have

$$L_4(k+1, 0, 0, 0) = \sum_{y=0}^{k+3} \binom{y}{3} = \binom{k+4}{4}, \qquad \text{i.e.} \qquad L_4(k, 0, 0, 0) = \binom{k+3}{4}.$$

To obtain $L_4(k, j, l, m)$, we note that $L_4(k, j, l, m) = L_4(k, 0, 0, 0) + L_3(j, 0, 0) + L_2(l, m)$. Following the same logic, this is

$$L_4(k, j, l, m) = \binom{k+3}{4} + \binom{j+2}{3} + \binom{l+1}{2} + \binom{m}{1}.$$

As in the binary case, the labels will grow unfeasibly large, but in principle this is a bijective map between trees whose maximum-size polytomy is four and the non-negative integers. Naturally, there is nothing special about size-four polytomies. If the maximum size is c, the scheme is

$$L_c(k_1, k_2, \ldots, k_c) = \sum_{i=1}^{c} \binom{k_i + c - i}{c - i + 1}, \qquad k_1 \ge k_2 \ge \cdots \ge k_c \ge 0.$$
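The lexicographic ranking can be checked by brute force for small cases. The sketch below assumes the closed form obtained by carrying the derivation through, namely the sum over positions i of the binomial coefficient C(k_i + c − i, c − i + 1), with the empty tree (all zeros) at rank 0. The function names are illustrative.

```python
from itertools import product
from math import comb

def label_c(children):
    """Closed-form rank of a non-increasing tuple (k_1 >= ... >= k_c >= 0)."""
    c = len(children)
    return sum(comb(k + c - i, c - i + 1) for i, k in enumerate(children, start=1))

def brute_force_ranks(c, max_label):
    """Rank every non-increasing c-tuple with entries <= max_label by sorting."""
    tuples = [
        t
        for t in product(range(max_label + 1), repeat=c)
        if all(t[i] >= t[i + 1] for i in range(c - 1))
    ]
    tuples.sort()  # Python compares tuples lexicographically
    return {t: rank for rank, t in enumerate(tuples)}

for c in (2, 3, 4):
    ranks = brute_force_ranks(c, max_label=5)
    assert all(label_c(t) == rank for t, rank in ranks.items())

print(label_c((2, 1, 1, 0)))  # 7
```

Restricting the first entry to at most `max_label` yields a prefix of the infinite sorted list, so the brute-force ranks are the true ranks and the comparison is meaningful.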


Extensions of the metric
As noted in the main text, the metrics $d_1$ and $d_2$ will be dominated by differences in the sizes of trees. It may be desirable to construct unlabelled metrics that are useful in comparing trees of different sizes with respect to their proportional frequencies of sub-trees. This is straightforward. We based the metric $d_2$ on vectors whose $i$th components were the number of sub-trees of label $i$; we can divide these vectors by the number of tips in the tree, $\hat{v}_a = \frac{1}{n_a} v_a$, and define a new metric $\hat{d}$. With small $\epsilon$, $\hat{d}$ will be small when the proportional frequencies of sub-trees are very similar, but will only be 0 if the trees have identical vectors and the same number of tips.
Furthermore, if there are particular labels $i$ that are of interest (for example those with relatively few tips, for a "tip-centric" tree comparison), weights $w$ can be chosen and applied to the vectors to emphasize some entries more than others. The same weighting can of course be applied to $\hat{v}$ in $d_2$.
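One possible reading of the normalised and weighted comparison is sketched below in Python rather than the authors' R. Several choices are assumptions made for illustration only: the binary labelling (tip label 1, and label $k(k-1)/2 + j + 1$ for a node whose children have labels $k \geq j$), the Euclidean norm, weights multiplied componentwise into the vector difference, the added term $\epsilon |n_a - n_b|$, and all function names.

```python
from collections import Counter
from math import sqrt


def count_labels(tree, counts):
    """Label a full binary tree given as nested 2-tuples; a tip is ().

    Assumed binary scheme: tip -> 1, node with child labels k >= j ->
    k * (k - 1) / 2 + j + 1. Every sub-tree label is tallied in `counts`.
    """
    if tree == ():
        counts[1] += 1
        return 1
    a, b = (count_labels(child, counts) for child in tree)
    k, j = max(a, b), min(a, b)
    lab = k * (k - 1) // 2 + j + 1
    counts[lab] += 1
    return lab


def proportional_vector(tree):
    """Sub-tree label counts divided by the number of tips (sparse dict)."""
    counts = Counter()
    count_labels(tree, counts)
    n_tips = counts[1]
    return {i: c / n_tips for i, c in counts.items()}, n_tips


def d_hat(tree_a, tree_b, eps=0.01, weights=None):
    """Assumed form: ||w * (v_a/n_a - v_b/n_b)||_2 + eps * |n_a - n_b|."""
    va, na = proportional_vector(tree_a)
    vb, nb = proportional_vector(tree_b)
    w = weights or {}  # labels without an explicit weight get weight 1
    squared = sum(
        (w.get(i, 1.0) * (va.get(i, 0.0) - vb.get(i, 0.0))) ** 2
        for i in set(va) | set(vb)
    )
    return sqrt(squared) + eps * abs(na - nb)


cherry = ((), ())
three_tips = (((), ()), ())
print(d_hat(cherry, three_tips))
print(d_hat(cherry, three_tips, weights={3: 0.0}))  # ignore label 3
```

Storing the vectors as sparse dictionaries avoids allocating entries for the very large labels that arise even in moderately sized trees.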
The labelling schemes induce natural metrics on tree topologies, which we have applied to random tree-generating processes known to give rise to different shapes, and to data from human influenza A. The metric's use of a bijective mapping to $\mathbb{N}^+$ means that it extends to a convex metric in $\mathbb{Q}^+$. However, the nature of the scheme means that it does not capture the lengths of branches. These are biologically relevant in many examples, because they reflect the (inferred) amount of time or genetic distance between evolutionary events, although particularly for branches deep in the tree structure they may be difficult to infer accurately.
To date, we are unaware of a metric (in the sense of a true distance function) on unlabelled trees that captures branch lengths, but there are several non-metric approaches to comparing unlabelled trees. In particular, Poon's kernel method [39] compares subset trees that are shared by two input trees, after first "ladderizing" the trees (arranging internal nodes in a left-right order with branching events preferentially to one side). Using a kernel function, this approach can quantify similarity between trees. One challenge is that where branch length is included, differences in overall scaling or units of the branch lengths can overwhelm structural differences. Lengths can be re-scaled (for example such that the height of both trees becomes 1), but rescaling methods may be sensitive to outliers or to the height of the highest tip in the tree. Lengths could also be set to 1 to compare topologies only. Recently, Lewitus and Morlon (LM) [40] used the spectrum of a matrix of all the node-node distances in the tree to characterise trees; this is naturally invariant to any node and tip labels. They used the Kullback-Leibler divergence between smoothed spectra as a measure of distance. If the spectrum uniquely defined a tree this would be a metric, as it is non-negative and obeys the triangle inequality. As it uses all node-node distances, this approach, requiring the spectrum of a non-sparse $(2n-1) \times (2n-1)$ matrix for a tree of $n$ tips, will become infeasible for large trees. Finally, it is always possible to compare summary features of trees, including the number of lineages through time, diversity measures, density of tip-tip distances, imbalance measures and other features of the topology.
These approaches can be combined with our metric to create novel metrics on unlabelled trees; as our metric satisfies $d(T_1, T_2) = 0 \iff T_1 = T_2$, any distance function of the form $d(T_1, T_2) = w_1 d_i(T_1, T_2) + w_2 C(T_1, T_2)$, where $C(T_1, T_2)$ is the LM tree difference, a kernel-based tree difference (not similarity), a distance between vectors of summary features, or a weighted sum of these, and the $w_i$ are positive, will be a metric. In this way we can extend the metric to incorporate branch lengths and to emphasize features of interest (i.e. those believed to be informative of an underlying process of interest), while retaining the advantages of a true distance metric.
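A short check of why such a weighted sum is a metric may be helpful. It relies on an added assumption, not spelled out in the text, that the second term $C$ is non-negative, symmetric, zero for identical trees, and satisfies the triangle inequality (that is, $C$ is at least a pseudo-metric):

```latex
% Assumptions: d_i is a metric; C is a pseudo-metric; w_1, w_2 > 0.
\begin{align*}
d(T_1, T_2) = 0
  &\iff w_1 d_i(T_1, T_2) = 0 \ \text{and}\ w_2 C(T_1, T_2) = 0
  && \text{(both terms are non-negative)} \\
  &\iff T_1 = T_2
  && \text{(since } d_i(T_1, T_2) = 0 \iff T_1 = T_2 \text{ and } C(T, T) = 0\text{)}, \\[4pt]
d(T_1, T_3)
  &= w_1 d_i(T_1, T_3) + w_2 C(T_1, T_3) \\
  &\leq w_1 \bigl[ d_i(T_1, T_2) + d_i(T_2, T_3) \bigr]
      + w_2 \bigl[ C(T_1, T_2) + C(T_2, T_3) \bigr] \\
  &= d(T_1, T_2) + d(T_2, T_3).
\end{align*}
```

Symmetry of $d$ follows directly from the symmetry of $d_i$ and $C$. If a candidate $C$ is not symmetric or does not satisfy the triangle inequality, this argument does not apply as written.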

Implementation
We have used R throughout and are developing an R package. Code is available on github at https://github.com/c The implementation assumes full binary trees and includes metrics d 1 and d 2 with the option of weighting.