A Linear Time Solution to the Labeled Robinson–Foulds Distance Problem

Abstract A large variety of pairwise measures of similarity or dissimilarity have been developed for comparing phylogenetic trees, for example, species trees or gene trees. Due to its intuitive definition in terms of tree clades and bipartitions and its computational efficiency, the Robinson–Foulds (RF) distance is the most widely used for trees with unweighted edges and labels restricted to leaves (representing the genetic elements being compared). However, in the case of gene trees, an important information revealing the nature of the homologous relation between gene pairs (orthologs, paralogs, and xenologs) is the type of event associated to each internal node of the tree, typically speciations or duplications, but other types of events may also be considered, such as horizontal gene transfers. This labeling of internal nodes is usually inferred from a gene tree/species tree reconciliation method. Here, we address the problem of comparing such event-labeled trees. The problem differs from the classical problem of comparing uniformly labeled trees (all labels belonging to the same alphabet) that may be done using the Tree Edit Distance (TED) mainly due to the fact that, in our case, two different alphabets are considered for the leaves and internal nodes of the tree, and leaves are not affected by edit operations. We propose an extension of the RF distance to event-labeled trees, based on edit operations comparable to those considered for TED: node insertion, node deletion, and label substitution. We show that this new Labeled Robinson–Foulds (LRF) distance can be computed in linear time, in addition of maintaining other desirable properties: being a metric, reducing to RF for trees with no labels on internal nodes and maintaining an intuitive interpretation. The algorithm for computing the LRF distance enables novel analyses on event-label trees such as reconciled gene trees. Here, we use it to study the impact of taxon sampling on labeled gene tree inference and conclude that denser taxon sampling yields trees with better topology but worse labeling. [Algorithms; combinatorics; gene trees; phylogenetics; Robinson–Foulds; tree distance.]

Gene trees, usually derived from gene sequence alignments, represent the phylogenetic relationships between the genes labeling the leaves of the tree. From such representation, we can infer the most plausible scenarios of evolutionary events leading to the observed gene family from an ancestral gene. For this purpose, reconciliation methods (reviewed in Boussau and Scornavacca 2020) embed a given gene tree T into a known species tree S. This process results in the labeling of the internal nodes of T with the type of events which gave rise to them. In this article, we address the problem of comparing such event-labeled trees, that is, trees with inner nodes labeled with the type of event at the origin of the bifurcation. Note that such an event-labeled tree does not fully represent a "reconciliation" R as defined in the literature, as it should also be indicated for each node of R the position in the species tree where the event took place. However, the event-labeling of T is sufficient to determine the orthology (genes related through speciation), paralogy (genes deriving from a duplication), or xenology (genes related through a horizontal gene transfer) relations between genes, with important functional implications (Gabaldon and Koonin 2013). For example, information on duplication and speciation node labeling is provided for the trees of the Ensembl Compara database (Vilella et al. 2009).
A large variety of pairwise measures of similarity or dissimilarity have been developed for trees with no labels on internal nodes. Among them are the methods based on counting the structural differences between the two trees in terms of path size, bipartitions or quartets for unrooted trees, clades or triplets for rooted trees (Estabrook et al. 1985;Critchlow et al. 1996;Cardona et al. 2010), or those based on minimizing a number of rearrangements that disconnect and reconnect subpieces of a tree such as nearest neighbor interchange (NNI), subtree-pruning-regrafting (SPR), or Tree-Bisection-Reconnection (TBR) moves (Jiang et al. 2000;Allen and Steel 2001;Hickey et al. 2008). While the latter methods are NP-hard (Lin et al. 2012), the former are typically computable in polynomial time. In particular, the Robinson-Foulds (RF) distance, defined in terms of bipartition dissimilarity for unrooted trees and clade dissimilarity for rooted trees (Mittal and Munjal 2015), can be computed in linear (Day 1985), and even sublinear time (Pattengale et al. 2007). For trees with unweighted edges, the RF distance is the most widely used distance, not only in phylogenetics but also in other fields such as in linguistics, for its computational efficiency, intuitive interpretation and the fact that it is a true metric. To address the distance's drawbacks, such as lack of robustness (a small change in a tree may cause a disproportional change in the distance) and skewed distribution, improved versions of the RF distance have also been developed (Lin et VOL. 71 similarity scores between bipartitions rather than simply counting the bipartitions that are different in the two trees (Smith 2020).
Classically defined in terms of bipartition or clade dissimilarity, the RF distance can also be defined in terms of edit operations on tree edges: the minimum number of edge contraction and extension needed to transform one tree into the other (Robinson and Foulds 1981). This formulation is closely related to the Tree Edit Distance (TED) defined on labeled trees (usually rooted, and sometimes with an order on nodes), that is, trees with nodes (including leaves) labeled on the same given alphabet , arising from many different applications in various fields (parsing, RNA structure comparison, computer vision, genealogical studies, etc). TED (Zhang and Shasha 1989) is defined in terms of a minimum cost path of node deletion (resulting in the deletion of the edge linking this node to its parent), node insertion (resulting in the creation of a new edge), and node change (label substitution). The general version of the problem on unordered labeled trees with a non-constant cost function on edit operations is NP-complete (Zhang et al. 1992), while most restrictions and variants that have been defined for that distance are solvable in polynomial time (Zhang 1993(Zhang , 1996Bille 2005;Schwarz et al. 2017).
However, the variants that have been considered for TED do not include the type of event-labeled trees we consider in this article with two different alphabets: an alphabet L for the leaves corresponding to the genes, and an alphabet for the inner nodes corresponding to the events. Moreover, leaves are not affected by edit operations. Notice that the problem is different from that of comparing two reconciliations of a gene tree. Exploring the reconciliation space of a given tree or comparing two different reconciliations, for example, with or without horizontal gene transfers, has been largely addressed in the reconciliation literature (Doyon et al. 2009(Doyon et al. , 2011Chan et al. 2015;Huber et al. 2018). Here, we address the problem of comparing different eventlabeled trees for the same gene family, in the sense that the trees can differ, not only in labels but also in topology.
We formulate our distance as an extension of the RF distance to inner node-labeled trees. Although the RF distance is not based on biological events, it is the most widely used distance for comparing phylogenetic trees for its intuitive definition in terms of clades or bipartitions of the trees, and its computational efficiency. Therefore, an extension of the RF distance to inner node-labeled trees is not expected to represent true evolutionary events, but should allow capturing the topological and labeling difference between two trees, and should be computable efficiently.
In Briand et al. (2020), we have presented ELRF, a first extension of RF for comparing inner node-labeled gene trees, expressed in terms of trees with a binary node labeling. ELRF is obtained by including a node flip operation, alongside edge contractions and extensions. While remaining a metric, ELRF turned out to be much more challenging to compute. As a result, we were only able to propose a heuristic to compute it efficiently.
In this article, we explore a different extension of RF in terms of edit operations on tree nodes rather than on tree edges, which is closer to the TED formulation. We show that, in contrast with ELRF, this new distance is computable in linear time, not only for two but for an arbitrary number of label types. We show that the new distance compares favorably to RF and ELRF by performing simulations on labeled gene trees of 182 leaves. Finally, we use our new distance in the purpose of measuring the impact of taxon sampling on labeled gene tree inference, and conclude that denser taxon sampling yields better predictions at the topological level but leads to worse evolutionary event labeling.

NOTATION AND CONCEPTS
Let T be a tree with node set V(T) and edge set E(T). Given a node x of T, the degree of x is the number of edges incident to x. We denote by L(T) ⊆ V(T) the set of leaves of T, that is, the set of nodes of T of degree one. Given a set L (species or genes), a tree T on L is a tree with a oneto-one relationship between L(T) and L. For simplicity of presentation, in this article, we make no difference between a leaf and the associated element of L.
A node of V(T)\L(T) is called an internal node. A tree with a single internal node x is called a star tree, and x is called a star node. An edge connecting two internal nodes is called an internal edge; otherwise, it is a terminal edge. Moreover, a rooted tree admits a single internal node r(T) considered as the root.
The root is said to be binary if it is of degree 2; any other internal node is said to be binary if it is of degree 3. The trees considered in this article are such that all internal nodes are of degree at least 3, except the root in the case of rooted trees, which is of degree at least 2.
We call N(x) ={y :{x,y}∈E(T)} the set of neighbors of an internal node x of T.
A subtree S of T is a tree such that V(S) ⊆ V(T), E(S) ⊆ E(T) and any edge of E(S) connects two nodes of V(S).
The bipartition of an unrooted tree T corresponding to an edge e ={x,y} is the unordered pair of clades L(T x ) and L(T y ), where T x and T y are the two subtrees rooted, respectively at x and y obtained by removing e from T. We denote by B(T) the set of nontrivial bipartitions of T, that is, those corresponding to internal edges of T.
The Robinson-Foulds distance (RF) It is defined for both rooted and unrooted trees. Notice however that computing the RF distance for two rooted trees can be reduced to computing the RF distance for the two corresponding unrooted trees obtained by grafting an edge linking the root to a dummy leaf (Briand et al. 2020). Conversely, the problem of computing the RF distance for two unrooted trees can be reduced to computing it for the rooted trees using an arbitrarily chosen leaf as the root (Day 1985). Notations and theoretical results of this article are given for unrooted trees (notice however that the algorithm presented below is implemented in terms of rooted trees). We denote by T L the set of unrooted trees on L. Given two unrooted trees T and T of T L , the Robinson-Foulds (RF) distance between T and T is the size of the symmetric difference between the bipartitions of the two trees. More precisely, The RF distance is equivalently defined in terms of an edit distance on edges (Robinson and Foulds 1981). However, as for node-labeled trees an additional substitution operation on node labels will be required, for the sake of standardization, we reformulate the edit operations to operate on nodes rather than on edges. Now, the Robinson-Foulds distance RF(T,T ) between T and T is the size of a shortest path of edge edit operations (i.e., node insertion and node deletion) transforming T into T .
Call a bad edge of T with respect to T (or similarly of T with respect to T; if there is no ambiguity, we will omit the "with respect to" precision) an edge representing bipartitions which are not shared by the two trees, that is, an edge of T defining a bipartition of B(T) which is not in B(T ). An edge which is not bad is said to be good. Terminal edges are always good.

GENERALIZING THE ROBINSON-FOULDS DISTANCE TO LABELED TREES
We are interested in comparing trees with information on internal nodes, which is the label of interest in this article. Therefore, in this article, we say that a tree T is labeled if each internal node x of T has a label (x) ∈ , being a finite set of labels. We denote by T L, the set of unrooted and labeled trees on L with labels from . For gene trees, labels are reasonably the type of event leading to the bifurcation, typically duplications, speciations, or horizontal gene transfers, but may also represent the mapping to the nodes of the species tree. Moreover, we mean by an "unlabeled tree" a tree with no labels on internal nodes (but yet with the labels L on leaves).
We generalize the RF distance to labeled trees by generalizing the edit operations defined above. This is simply done by introducing a third operation for node labels editing (Fig. 1).

Definition 2 (Labeled node edit operations). Three edit operations on internal nodes of a labeled tree T are defined as follows:
• Node deletion: Del(T,x,y) is an operation deleting an internal node x of T with respect to a neighbor y of x which is not a leaf, defined as in Definition 1.
• Node insertion: Ins(T,x,y,Z,) is an operation inserting an internal node x as a new neighbor of a nonbinary node y, and moving Z N(y) such that |Z|≥ 2, to be the neighbors of x, as defined in Definition 1. In addition, the inserted node x receives a label ∈ .
• Node label substitution: Sub(T,x,) is an operation substituting the label of the internal node x of T with ∈ .
For two trees T, T of T L, , we call the Labeled Robinson-Foulds distance between T and T and denote by LRF(T,T ) the size of a shortest path of labeled node edit operations transforming T into T . The two following lemmas state that, similarly to RF, LRF is a true metric. Moreover, LRF is exactly RF for unlabeled trees.
In the following, the unlabeled version of a tree T ∈ T L, is simply T ignoring its node labels. Lemma 1. The function LRF(T,T ) assigning to each pair (T,T ) ∈ T 2 L, the size of a shortest path of node edit operations transforming T into T defines a distance on T L, .
Proof for this lemma is available in the Appendix.
The next lemma directly follows from the fact that node substitutions are never applied in case of a label set restricted to a single label (i.e., for unlabeled trees).

Lemma 2. If is restricted to a single label, then for each pair
A previous extension of RF to labeled trees, based on edit operations on edges rather than on nodes, was introduced in Briand et al. (2020). This distance, which we call ELRF, was defined on three operations: • Edge extension Ext(T,x,X) creating an edge {x,y} and defined as a node insertion Ins(T,y,x,X,(x)) inserting a node y as a neighbor of x and assigning to y the label of x; • Edge contraction Cont(T,{x,y}) is equal to a node deletion Del(T,y,x) deleting y, but requires that (x) = (y), a condition which is not requested for LRF; • Node flip Flip(T,x,) assigning the label to x. We use the term "flip" to emphasize the fact that ELRF only supports two kinds of labels, unlike LRF.
Given two labeled trees T and T of T L, , ELRF(T,T ) is the size of the shortest path of edge extension, edge contraction, and label flip required to transform T to T . The following lemma makes the link between LRF and ELRF.

Lemma 3. For any pair
The proof for this lemma is available in the Appendix.
We now turn our attention to computing the edit distance LRF(T,T ) for a pair (T,T ) of trees of T L, .

Reduction to Islands
We define a subdivision of the two trees into pairs of maximal subtrees that can be treated separately.
While a good edge e in T has a corresponding good edge e in T (the one defining the same bipartition), a bad edge in T has no corresponding edge in T .
However, those bad edges may be grouped into pairs of corresponding islands (called maximum bad subtrees in Briand et al. (2020)), as defined below.

Definition 3 (Islands). An island of T is a maximal subtree I such that all its internal edges are bad edges of T, and all its terminal edges are good edges of T. The size of I, denoted (I), is its number of internal edges.
Notice that an island I of T may have no internal edge at all, i.e. it may be restricted to a star tree (if (I) = 0). Notice also that each bad edge of T belongs to a single island, while each good edge belongs to exactly two islands of T if it is an internal edge of T, or to a single island if it is a terminal edge of T.
The following lemma (similar to Lemma 3 in Briand et al. (2020)) shows that there is a one-to-one correspondence between the islands of T and those of T . The proof for this lemma is available in the Appendix.
For any island I of T, let I be the corresponding island of T . We call (I,I ) an island pair of (T,T ) (Fig. 3). Now, let I (T,T ) ={(I 1 ,I 1 ),(I 2 ,I 2 ),...,(I n ,I n )} be the set of island pairs of (T,T ). For 1 ≤ i ≤ n, let P i be a shortest path of labeled node edit operations transforming I i into I i . Then the path P obtained by performing consecutively P 1 ,P 2 ,...,P n (that we represent later as P 1 .P 2 .....P n ) clearly transforms T into T . Therefore we have As described in Briand et al. (2020), one major issue with ELRF is that good edge contractions may not be avoided in a shortest path of edit operations I' 4 I' 2 This representation highlights the subdivision of the two trees into the island pairs I (T,T ) ={(I 1 ,I 1 ),(I 2 ,I 2 ),(I 3 ,I 3 ),(I 4 ,I 4 )}. Notice that each dotted line is a terminal edge of its two adjacent islands.
transforming T into T , resulting in island merging. In other words, treating island pairs separately may not result in an optimal scenario of edit operations under ELRF, preventing the above inequality from being an equality. Interestingly, the equality holds for the LRF distance, as we show in the next section.

Computing the LRF Distance on Islands
We require an additional definition. Two trees I and I of an island pair are said to share a common label l ∈ if there exist x ∈ V(I) and x ∈ V(I ) such that (x) = (x ) = l. If I and I do not share any common label, then (I,I ) is called a label-disjoint island pair. Now let (I,I ) be an island pair. Transforming I into I can be done by reducing I into a star tree by performing a sequence of node deletions (if any, i.e., if I is not already a star tree) and then raising the star tree by inserting the required nodes to reach I . Only the unique node not deleted during the first step might require a label substitution; for all inserted nodes, the label can  be chosen to match that of I . However, if I and I share a common label l among their internal nodes, then the deletions can be done in a way such that the surviving node x of I is one with label (x) = l, thus avoiding the need for any substitution. The number of required operations is thus (I) deletions, followed by zero or one substitution, followed by (I ) insertions. Alternatively, the problem can be seen as one of reducing the two trees into star trees by performing (I)+(I ) deletions, in a way reducing the two islands into two star trees sharing the same label, if possible. Figure 4 depicts an example of such tree editing for a label-disjoint island pair.
The following lemma shows that reducing islands to star trees is optimal. The proof for this lemma is available in the Appendix.
We have obviously LRF(T,T ) ≤ (I,I )∈I (T,T ) LRF(I,I ). It remains to show that the symmetrical inequality also holds, that is, we cannot do better by merging islands, and thus pairs of islands can be considered separately.
The following lemma states that we can always find a sequence of operations, at each step maintaining or increasing the number of islands, that is, never merging islands. For a path P = (o 1 ,o 2 ,...o p ) transforming a tree T into a tree T and 1 ≤ k ≤ p, denote by T k the tree obtained from T after performing the sub-sequence of operations P k = (o 1 ,...o k ).
Lemma 6. Let T and T be two trees of T L, . There is a shortest path P = (o 1 ,o 2 ,...o p ) of edit operations transforming VOL. 71 FIGURE 5. A path P transforming T into T of the form P 1 .P 2 .P 3 .P 4 , each P i being a shortest path for the island pair (I i ,I i ). Here |P 1 |=6, |P 2 |=0, |P 3 |=1, and |P 4 |=0.
The proof for this lemma is available in the Appendix.
We are now ready to prove the equality leading to the efficient computation of the LRF distance of two trees (see Fig. 5 for an example).

Theorem 1.
Let I (T,T ) ={(I 1 ,I 1 ),(I 2 ,I 2 ),··· ,(I n ,I n )} be the island pairs of T and T . Then The proof for this theorem is available in the Appendix.
The next result directly follows from Lemma 5 and Theorem 1.

ALGORITHM
We present our algorithm for computing the LRF distance (Algorithm 1). The input is a pair of trees T 1 , T 2 of T L, . We show that LRF(T 1 ,T 2 ) can be computed in time O(n), where n =|L|.
The algorithm operates on rooted trees, without loss of generality. Indeed, recall that we can define a bijection turning an unrooted tree over n leaves into its corresponding rooted tree using an arbitrary leaf as a root (see Notation and Concepts section). Note also that there is a bijection between the bipartitions defined for each branch of the unrooted tree and the clades defined by each branch of its rooted counterpart. FIGURE 6. Our implementation of LRF in Python exhibits a linear runtime on random trees of up to 10,000 leaves. Even on these large trees, it takes less than a second to compute LRF on an AMD EPYC 7272 processor.
The key to devising an efficient algorithm is to index the clades of T 1 in O(1). This can be achieved using the data structure of Day (1985), who used it to introduce an algorithm to compute the conventional (unlabeled) Robinson-Foulds distance in O(n). To efficiently index the clades of T 1 , Day renumbers the leaves of T 1 using a postorder sequence. With this representation, each clade is defined by a contiguous sequence of leaves, for example, [5,6,7,8], which can thus be summarized by its smallest and highest values [5,8]. He stores the clades in a table X of n rows, where each clade [l,r] is stored either in row l or r. In this way, we can use X to check for the existence of clade [l,r] in constant time.
Our algorithm LRF (Algorithm 1) consists of four tree traversals, followed by one traversal of the aforementioned table X. The first tree traversal, of T 1 , renumbers the leaves of T 1 in postorder and stores the clades in X (line 1, buildX). The second tree traversal, of T 2 , identifies the clades of T 2 that are shared with T 1 using the efficient X lookup structure and marks the "good" clades (corresponding to good edges) as such in X (line 2, findGoodClades). The third and fourth tree traversals, of T 1 and T 2 , respectively (lines 3 and 4, getIslands), identify the islands that are separated by the good edges recorded in X, and update X with the size and labels present in the islands of T 1 and T 2 , respectively. Finally, we traverse X and use the size and labels of all matching islands to compute the LRF edit distance, using the formulae of Appendix-Corollary 1 (lines 8-9). Note that for a fixed number of labels, testing for the presence of a common label can be done in a constant time with respect to n. We provide more details on the implementation, pseudocode and complexity of the tree traversal in the Appendix. We provide an open source implementation of LRF in Python as part of the pyLabeledRF package (https://github.com/DessimozLab/pylabeledrf). To empirically confirm the linearity of the algorithm, we computed the LRF distance between random trees of size up to 10,000 leaves. The run time averaged over 100 trees per point was almost perfectly linear (Fig. 6).

EXPERIMENTAL RESULTS
To illustrate the usefulness of LRF, we performed two experiments. First, we compared LRF with RF and ELRF on a labeled gene tree with random edits. Second, we used LRF to tackle an open question in orthology inference: does labeled gene tree inference benefit from denser taxon sampling?

Empirical Comparison of LRF with RF and ELRF
To get a first sense of LRF's ability to measure the actual number of edits between two trees, we performed a simulation study alongside RF and ELRF. We retrieved the labeled tree associated with human gene NOX4 from Ensembl release 99 (Yates et al. 2020), containing 182 genes, including speciation and duplication nodes. Next, we introduced a varying number of random edits, with 10 replicates, as follows: with probability 0.3, the label of one random internal node was substituted (from a speciation label into a duplication one or vice versa); the remaining probability of 0.7 was evenly distributed among all internal edges (each implying a potential node deletion) and all nodes of degree > 3 (each providing the opportunity of a potential node insertion). For ELRF, consistent with its underlying model, we added the requirement that edge removal only affects edges with adjacent nodes with the same label.
For each of RF, LRF, and ELRF, we provide the distance as a function of the number of random edits (Fig. 7). As expected, the conventional RF distance returns the smallest values because it ignores labels; it however tracks quite well the expected number of node insertion and/or removal (dashed line). The two labeled RF alternatives performed similarly, but the heuristic for ELRF occasionally exceeded the true number of edit operations-a shortcoming that we do not have with LRF, as we have an exact algorithm for this distance. Both labeled RF variants tracked better the actual number of changes, until around 13 edits for LRF or ELRF, after which the minimum edit path starts to be often shorter than the actual sequence of random edits.

The Effect of Denser Taxon Sampling on Labeled Gene Tree
Inference We used LRF to assess the effect of species sampling for the purpose of labeled gene tree reconstruction. Consider the problem of reconstructing a labeled tree corresponding to homologous genes from 10 species. Our question is: is it better to infer and label the tree using these 10 species alone, or is it better to use more species to infer and label the tree, and then prune the resulting tree to only contain the leaves corresponding to the original 10 species? In principle, denser sampling allow to account for more information when resolving the relationship between genes, and it is known to improve unlabeled phylogenetic inference (Nabhan and Sarkar 2011). It is unclear however how the added species affects FIGURE 7. Empirical comparisons of the distance inferred for an increasing number of random edit operations (node insertion, deletion, and substitution) on the NOX4 gene tree (182 leaves), using the classical RF distance (top), the ELRF approximation (Briand et al. 2020;middle), and the LRF exact distance (bottom). the labeling, in one hand, improved sampling should make easier to solve difficult case like hidden paralogy (a duplication node appearing as a speciation node). In the other hand, an increased number of internal nodes may make the labeling more sensitive to errors in the topology.
First, using ALF (Dalquen et al. 2012), we simulated the evolution of the genomes of 100 extant species from a common ancestor genome containing 100 genes (Parameters: root genome with 100 genes of 432 nucleic acids each; species tree sampled from a birth-death model with default parameters; sequences evolved using the WAG model, with Zipfian gap distribution; duplication and loss events rate of 0.001). In the simulation, genes can mutate, be duplicated or lost. All the genes in the extant species can thus be traced back to one of these 100 ancestral genes and be assigned to the corresponding gene family. The 100 true gene trees, including speciation and duplication labels, are known from the simulation. In our run, the resulting gene trees had in average 100.11 leaves with a minimum of 65 and a maximum of 156.
To evaluate the inference process, among the 100 species, we randomly selected nested groups of 10,20,30,40,50,60,70,80, and 90 species. We considered the 10 species in the first group as the species of interest. All other species were used to potentially improve the reconstruction of the gene trees for the first 10 genomes. Then, for each group, we aligned the protein sequences translated from homologous genes using MAFFT L-INSi (Katoh and Standley 2013), inferred phylogenetic trees from the alignments using FastTree (Price et al. 2010), and annotated their nodes using either the species tree reconciliation or the species overlap algorithm ( Van der Heijden et al., 2007) as implemented in the ETE3 python library . Thus, the nodes of the tree were labeled as either duplication or speciation. Finally, we pruned both the inferred gene trees and the true trees to include only genes corresponding to the 10 species of interest. The resulting trees had in average 10.27 leaves. However, in our run, one pruned tree ended up containing only two genes (due to losses on early branches) and was thus excluded from the rest of the analysis. In order to assess the influence of the accessory species on the node labeling, we used a variant of the same strategy where the tree annotation step and pruning step were reversed, and the labeling of the simulated tree was only done based on the pruned tree with the 10 species of interest.
We used RF and LRF to assess the distance between the estimated and true labeled trees, for the various numbers of auxiliary genomes considered. For each number of accessory species and over all annotation scenarios, we computed the mean RF and LRF distance over all gene trees (Fig. 8, Fig. S1 of the Supplementary material available on Dryad at http://dx.doi.org/10.5061/dryad.2bvq83bpr).
The mean error either reflects the topological error (measured by the RF distance) or both topological and labeling errors (LRF distance). As the RF distance shows, topological error decreases as the number of auxiliary species increases. By contrast, the total error as measured by LRF does not substantially decrease when species reconciliation is performed on the augmented data set (i.e., including accessory species; Fig. 8 upper right). FIGURE 8. Denser taxon sampling decreases labeled tree estimation error: labeled gene trees reconstructed with an increasing number of auxiliary genomes (i.e., obtained by including the additional genomes during tree inference and labeling, followed by pruning) have a smaller RF (left column) and LRF (right column) distance to the true trees under different annotation strategy. Trees are labeled using the species tree reconciliation algorithm with (top row) or without (bottom row) the accessory species. The decrease of LRF is less important when these species are also used for labeling by species tree reconciliation methods (upper right) than when they are removed prior to reconciliation (lower right). Error bars depict 95% confidence intervals around the mean. This is due to an increase in labeling errors when the auxiliary species are included, which counterbalances the reduction in topological errors observed on the augmented trees. Thus, the best performance (lowest topological and labeling error) is obtained when we infer the tree using the augmented data set but perform the reconciliation after the additional species are removed (Fig. 8 lower right).
If we use the alternative approach of "species overlap" to infer the speciation and duplication nodes (Van der Heijden et al. 2007), we also observe this effect, albeit to a much lesser extent (Fig. S1 of the Supplementary material available on Dryad).
In conclusion, this simulation study indicates that denser species sampling generally improves gene tree inference but, if kept during reconciliation, these accessory species induce a higher event-labeling error if the true augmented species tree is provided as input. The best results are obtained by performing tree inference on the augmented data but reconciliation on the pruned trees. The improvement on the topological inference of the gene tree is in line with our expectations, confirming that increased sampling is generally beneficial even to solve relations between a finite number of genes and should be preferred whenever tractable. It is possible than a reduced "smart" sampling optimizing for diversity and to add resolution to the longest branches would be enough to reproduce most of benefit of a higher sampling. Accordingly, the higher topological improvement gained from accessory species is observed when adding the first 10 species. The observed effect of higher sampling on labeling gene trees is likely due to the sensitivity to errors of the species tree reconciliation algorithm. Indeed, even if the topology of the "target" species tree is improved by denser sampling, the absolute number of topological error on the larger tree is higher and single topological error can lead to drastic changes (principally overannotation of duplication events). The potential benefit of using higher sampling for labeling (e.g., case of hidden paralogy when the topology of the smallest gene tree is not enough for accurate labeling) seems to not be common enough in our simulation to offset this effect.

CONCLUSION
The LRF distance introduced here overcomes the major drawback of ELRF, namely the lack of an exact polynomial-time algorithm. Indeed, with ELRF, minimal edit paths can require contracting "good" edges, that is, edges present in the two trees (Briand et al. 2020). By 1400 SYSTEMATIC BIOLOGY VOL. 71 contrast, with LRF, we demonstrated that there is always a minimal path which does not contract good edges. Better yet, we proved that LRF can be computed exactly in linear time. The new formulation also maintains other desirable properties: being a metric, even for an arbitrary number of label types, and reducing to the conventional RF distance in the presence of trees with only one type of label.
The LRF distance is an extension of the RF distance, which is defined with a unit weight for the two edit operations, as otherwise there would be no correspondence with the definition using bipartitions and clades. Moreover, as edge or node insertion and deletion have no biological, but rather topological, meaning, it is hard to see on which criteria weights for these operations would be assigned. In the case of LRF however, changing a node label is related to an evolutionary event, and some substitutions from a duplication to a speciation node lead to an impossible evolutionary scenario. Banning those substitutions implies strong constraints on the intermediate allowed topologies. Alternatively, we may be interested in giving different weights to insertion/deletion versus substitution operations, penalizing node substitution more than node insertion/deletion. In that case however, our algorithm would not be exact anymore as merging islands may result in fewer substitution events. Accounting for the feasibility of substitution events or accounting for weighted events are interesting avenues for future work.
Finally, LRF constitutes a clear improvement over RF in the context of gene tree benchmarking, where trees inferred by various reconciliation models are compared using a distance measure (Altenhoff et al. 2016;Morel et al. 2020). Such an application was illustrated in the simulation study of the previous section, in which we observed that denser taxon sampling improves gene tree inference at the topological level but that it worsens tree reconciliation. This latter result could not have been obtained solely using RF, which serves to illustrate the biological relevance of LRF.

Proof of Lemma 1
Proof . The non-negative, identity, and triangular inequality conditions are obvious. For the symmetric condition, notice that we can reverse every edit operation in a path from T to T to obtain a path from T to T with the same number of events, and vice versa (insertions and deletions are symmetrical operations, and any substitution can be reversed by a substitution). We thus have LRF(T ,T) ≤ LRF(T,T ) and LRF(T,T ) ≤ LRF(T ,T), and equality follows.

Proof of Lemma 3
Proof . Let P be a path of edge edit operations and label flip transforming T into T such that |P|= ELRF(T,T ). Then, the sequence P obtained from P by replacing each edge extension by the corresponding node insertion, each edge contraction by the corresponding node deletion and each node flip by the corresponding node substitution is clearly a path of node edit operations of size |P |=|P|=ELRF(T,T ) transforming T into T . And thus LRF(T,T ) ≤ ELRF(T,T ). Figure 2 depicts an example where the inequality is strict.

Proof of Lemma 4
Proof . For 1 ≤ i ≤ k, let e i = (x i ,y i ), y i being a leaf of I, and let Y i = L(T y i ). As ∪ i Y i = L, any subtree I of T containing the set {e i } 1≤i≤k as I -terminal edges does not contain any other I -terminal edge. As T is a tree, for any 1 ≤ i = j ≤ k, there is only one possible path from x i to x j . Uniqueness follows. Suppose that such a subtree I is not an island. Then, it contains an internal good edge e = (x ,y ). In other words, there is a non-trivial bipartition of {Y i } 1≤i≤k which is also a bipartition in I. This contradicts the fact that I is an island of T. Finally, as all terminal edges of I are good edges of T , it follows that I is an island of T .

Proof of Lemma 5
Proof . The scenario depicted above for transforming I into I clearly requires (I)+(I ) node insertions and deletions, and an additional node label substitution in case I and I are label-disjoint. We can conclude that LRF(I,I ) ≤ (I)+(I ) if I and I share a common label and LRF(I,I ) ≤ (I)+(I )+1, if I and I are label-disjoint.
On the other hand, as all the edges of I are bad edges, they should be all removed, before reinserting those of I . Now, since an edit operation can remove or insert at most one edge, and the only operations removing an edge are node removal or node insertion, we clearly require at least (I)+(I ) node removals and insertions to transform the unlabeled form of the tree I into the unlabeled form of I . Furthermore, as deletions do not affect star nodes, at least one node in I should survive (i.e., not be affected by a node deletion). Thus, if the two trees are label-disjoint, then at least one node label substitution is required. We can then conclude that LRF(I,I ) ≥ (I)+(I ) if I and I share a common label and LRF(I,I ) ≥ (I)+(I )+1, if I and I are label-disjoint, which concludes the proof.

Proof of Lemma 6
Proof . Let P = (o 1 ,o 2 ,···o p ) be a shortest path transforming T into T . Denote (T k ,T ) = (I,I )∈I(T k ,T ) (I)+ (I ), and (T k ,T ) the number of label-disjoints pairs of I(T k ,T ).
Assume P contains an operation reducing the number of islands of T i−1 , and let o i be the last operation of that form, i.e. |I(T i−1 ,T )| > |I(T i ,T )|. Such an operation can only be a deletion Del (T i−1 ,x,y) where e ={x,y} is a good edge, thus merging the two islands I x , I y containing this good edge.
As, by assumption, o i is the last operation merging two islands, at that point each pair of islands is treated separately, and we deduce from the fact that P is a shortest path that LRF(T i ,T ) = (I,I ) On the other hand, there is a path from As o i is a deletion of a good edge e ={x,y}, it destroys the bipartition defined by this edge in T i−1 , consequently the corresponding edge in T becomes a bad edge. Therefore, On the other hand, let = (T i ,T )−(T i−1 ,T ) be the difference between the number of label-disjoint pairs of islands after performing the operation o i merging two pairs of islands (I 1 ,I 1 ) and (I 2 ,I 2 ).
• If both pairs (I 1 ,I 1 ) and (I 2 ,I 2 ) share a common label, then the merged pair also shares a common label, and thus = 0; • If both pairs (I 1 ,I 1 ) and (I 2 ,I 2 ) are label-disjoint, then after the merging the resulting pair of islands may or may not share a common label and thus −2 ≤ ≤−1; • If only one of the two pairs (I 1 ,I 1 ) and (I 2 ,I 2 ) share a common label, then after the merging, the resulting pair of islands may or may not share a common label and thus −1 ≤ ≤ 0.
Therefore, in all cases we have Therefore, replacing the sequence of operations (o i ,...o p ) on T i−1 by a sequence of operations solving each pair of islands separately leads to the same number of operations.

Proof of Theorem 1
Proof . LetP be a shortest path transforming T into T verifying the condition of Lemma 6, that is, not involving any removal of good edges. As islands can only share good edges, and good edges are never removed by any operation of P, islands are never merged during the process of transforming T into T , and thus can be treated separately. Let P i , 1≤ i ≤ n, be the subpath of edit operations transforming I i into I i . Each P i should be a shortest path from I i to I i as otherwise it can be replaced by a shortest path, contradicting the fact that P is a shortest path.

Algorithm Implementation
The algorithm LRF (Algorithm 1) consists of four tree traversals, followed by one traversal of a table X of n row, where n is the number of leaves of the trees T1 and T2. The first tree traversal (Algorithm 2) is essentially Day's BUILD(T,X) function. The only noteworthy difference is that each row of X contains additional attributes (i.e., additional columns): a Boolean variable to indicate whether a clade is shared among T 1 and T 2 (i.e., whether it is a "good" clade), and for good clades, the size and labels of the islands of T 1 and T 2 rooted in these clades.
The second tree traversal (Algorithm 3) is similar to Day's COMCLUST function. It traverses T 2 in postorder, renumbering the leaves in a way consistent with X (line 4), but keeping track not only of the minimum and maximum leaf number contained in each clade but also the number of such leaves. If a clade of T 2 is in T 1 , the set of leaves comprised in that clade will satisfy the property of forming a contiguous sequence L,L+1,..,R of length R−L+1 (l. 12). Furthermore, that clade will be stored Algorithm 4 getIslands(T,X,isT 1 ) 1: n = length(X) 2: function traverseT(t,parentIsland): for c ∈ t.children: 27: traverseT(c,islandId) 28: traverseT(T,n) 29: return in X either at row L (line 13) or R (l. 15); either way, we mark that clade as good (l. 14 or 16). For good clades, we store the clade information in each internal node of the tree (l. 15, l. 18); else we store a dummy clade [1,1] (l. 20). Like in Day (1985), all the operations performed at each node of the traversal are O(1), giving an overall runtime of O(n) for this second traversal as well.
Finally, the third and fourth tree traversals are performed using getIslands (Algorithm 4). As two islands are separated by a good edge, we can identify an island by the good edge (i.e., the good clade) in which it is rooted, and can thus store island information in the row of X which corresponds to that good edge. We thus traverse the tree, keeping track for each island of (i) the set of labels found in each island (attributes islandLabels1 and islandLabels2); (ii) the number of bad edges in each island (attributes islandSize1 and islandSize2). Lines 2-27 define the recursive function used to traverse the tree. We check whether a particular node is attached to the previous island by a good edge or a bad edge (l. 4-9). If it is a good clade (l. 10-17), we have just transitioned to a new island and thus initialize the island label sets and island size (l.13-14 for T 1 or l. 16-17 for T 2 ). If it is a bad clade (l. 18-25), then we are still part of previous island, so we just update the set of leaves of the previous island (l. 21 or l. 24), and increment its size counter by one (l. 22 or l. 25). All operations performed at each internal node are constant time, and the number of internal nodes is O(n), so the time complexity of the tree traversal is done in time O(n).

Theorem 2. The time complexity of algorithm 1 is O(n).
Proof . As described above, the algorithm consists of four tree traversal, each running in O(n), and one traversal of the table X containing n rows, with a constant number of operation per row. This gives an overall time complexity of O(n).