Gene tree and species tree reconciliation with endosymbiotic gene transfer

Abstract
Motivation: It is largely established that all extant mitochondria originated from a unique endosymbiotic event integrating an α-proteobacterial genome into a eukaryotic cell. Subsequently, eukaryote evolution has been marked by episodes of gene transfer, mainly from the mitochondria to the nucleus, resulting in a significant reduction of the mitochondrial genome, which has eventually disappeared completely in some lineages. However, in other lineages such as land plants, a high variability in gene repertoire distribution, including genes encoded in both the nuclear and mitochondrial genomes, is an indication of an ongoing process of Endosymbiotic Gene Transfer (EGT). Understanding how both nuclear and mitochondrial genomes have been shaped by gene loss, duplication and transfer is expected to shed light on a number of open questions regarding the evolution of eukaryotes, including rooting of the eukaryotic tree.
Results: We address the problem of inferring the evolution of a gene family through duplication, loss and EGT events, the latter considered as a special case of horizontal gene transfer occurring between the mitochondrial and nuclear genomes of the same species (in one direction or the other). We consider EGT events that either maintain (EGTcopy) or remove (EGTcut) the gene copy in the source genome. We present a linear-time algorithm for computing the DLE (Duplication, Loss and EGT) distance, as well as an optimal reconciled tree, for the unitary cost, and a dynamic programming algorithm that outputs all optimal reconciliations for an arbitrary cost of operations. We illustrate the application of our EndoRex software, analyze different cost settings on a plant dataset and discuss the resulting reconciled trees.
Availability and implementation: EndoRex implementation and supporting data are available on the GitHub repository via https://github.com/AEVO-lab/EndoRex.


Introduction
Genomics and cell biology investigations have revealed that all known eukaryotes descend from a common ancestral mitochondrial-containing cell that originated from the integration of an endosymbiotic α-proteobacterium into a host cell (Dyall and Johnson, 2000). After this early event, eukaryotic gene contents have been shaped by duplications, losses and Horizontal Gene Transfers (HGT) from one species to another, but also by Endosymbiotic Gene Transfers (EGT), mainly from the mitochondrion to the nucleus, in some cases leading to the total disappearance of the mitochondrion (Roger et al., 2017; Sloan et al., 2018).
Many questions regarding the ancestral mitochondrial proteome and gene content evolution remain open (Lang and Burger, 2012). One of the reasons is that, to date, comparative genomics studies have largely focused on multicellular eukaryotes, mainly animals and plants. While imprints of global evolutionary events at the genomic level are hardly visible on multicellular eukaryotes that have diverged too much from the Last Eukaryotic Common Ancestor (LECA), protists, known to have emerged close to the eukaryotic origin, are better candidates for such a comprehensive evolutionary study. Interestingly, a recent sequencing effort on jakobids (Gray et al., 2020) and malawimonads (Derelle et al., 2015) protist genomes has been undertaken by a consortium of protistologists (DeepEuk), suggesting that enough data will soon be available to allow further investigations on early-eukaryotic evolution.
In addition to having the appropriate datasets, understanding the concerted evolution of the eukaryotic mitochondrial and nuclear genomes also requires having the appropriate algorithmic tools. This problem can be seen as related to the host-parasite coevolution inference problem (Charleston and Perkins, 2006). Given a host tree and a parasite tree, cophylogenetic analysis consists in inferring a history of codivergence, parasite duplication, host switch or extinction events explaining the coevolution of hosts and parasites. However, nuclear and mitochondrial genomes can hardly be treated by the same kind of approach, as they evolve, through a different evolutionary model, together in the same species, and thus are related through the same species tree. Rather, inferring an endosymbiotic evolutionary history requires focusing on gene families and studying the movement of genes between the mitochondrial and nuclear genomes.
Inferring the evolution of gene families is the purpose of the gene-tree-species-tree-reconciliation field, seeking for a most parsimonious (El-Mabrouk and Noutahi, 2019; Goodman et al., 1979) or a most probable (Akerborg et al., 2009; Szöllősi et al., 2015) evolutionary scenario of gene gain and loss explaining the incongruence between a gene tree and a species tree. A most parsimonious reconciliation minimizing the number of Duplications (the D-distance) or the number of Duplications and Losses (the DL-distance) can be found in linear time using the LCA (Last Common Ancestor) mapping (Chen, 2000; Zhang, 1997; Zmasek and Eddy, 2001). Such an algorithm can actually be used to solve the cophylogenetic problem if operations are restricted to coevolution, duplication and extinction. Including HGT events (i.e. finding the DTL-distance) leads to an NP-hard problem if time-consistency is required, remaining polynomial otherwise (Bansal et al., 2012; Tofigh et al., 2011).
Bioinformatics, 37, 2021, i120-i132. doi: 10.1093/bioinformatics/btab328. ISMB/ECCB 2021. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
In this article, we introduce the reconciliation model accounting for EGT events, i.e. the special case of HGT events where genes are exchanged only between the mitochondrial and nuclear genomes of the same species. Although integration of the mitochondrial content into the nucleus is the most frequent event in the course of evolution of eukaryotes, the transfer from the nucleus to the mitochondrion has also been observed (Adams and Palmer, 2003). Here, we consider the exchange of genes in both directions. Moreover, we consider EGT events resulting in maintaining a gene copy in the source genome (EGTcopy), as well as those resulting in the removal or loss of function of the gene in the source genome (EGTcut).
Formally, given a gene tree for a gene family with a known mitochondrial or nuclear location for each gene copy, we seek a most parsimonious sequence of Duplication, Loss and EGT (DLE) events explaining the tree given a known species tree. First, based on the DL-distance and on the Fitch algorithm for weighted parsimony, we present, in Section 3, a linear-time algorithm for computing the DLE-Distance, as well as an optimal reconciled tree for the unitary cost. We then develop, in Section 4, a general dynamic programming algorithm that can be used to output all optimal reconciliations, for an arbitrary cost of operations, including possibly a different cost for an EGT from the mitochondrion to the nucleus, or conversely. This algorithm is linear in the size of the gene tree. It can be seen as an adaptation of the quadratic-time DTL algorithm for dated trees (Doyon et al., 2010), which allows transfers between any co-existing species. We finally illustrate, in Section 5, the application of our EndoRex software on clusters of orthologous mitochondrial protein-coding genes (MitoCOGs) (Kannan et al., 2014) of plants, analyze different cost settings and discuss the obtained reconciled trees.
For space reasons, some of the proofs are given in the Appendix.

Preliminaries
All trees are considered rooted. Given a tree $T$, we denote by $r(T)$ its root, by $V(T)$ its set of nodes and by $\mathcal{L}(T) \subseteq V(T)$ its leafset. A node $x$ is a descendant of $x'$ if $x$ is on the path from $x'$ to a leaf of $T$ and an ancestor of $x'$ if $x$ is on the path from $r(T)$ to $x'$; $x$ is a strict descendant (respectively strict ancestor) of $x'$ if it is a descendant (respectively ancestor) of $x'$ different from $x'$. Moreover, $x$ is the parent of $x' \neq r(T)$ if it directly follows $x'$ on the path from $x'$ to $r(T)$. In this latter case, $x'$ is a child of $x$. We denote by $E(T)$ the set of edges of $T$, where an edge is represented by its two terminal nodes $(x, x')$, with $x$ being the parent of $x'$. An internal node (a node which is not a leaf) is said to be unary if it has a single child and binary if it has two children. If not stated differently, the children of a binary node $x$ are denoted $x_l$ and $x_r$. Given a node $x$ of $T$, the subtree of $T$ rooted at $x$ is denoted $T[x]$.
A binary tree is a tree with all internal nodes being binary. If internal nodes have one or two children, then the tree is said to be partially binary.
The lowest common ancestor (LCA) in $T$ of a subset $L'$ of $\mathcal{L}(T)$, denoted $lca_T(L')$, is the ancestor common to all the nodes in $L'$ that is the most distant from the root.
A tree R is an extension of a tree T if it is obtained from T by grafting unary or binary nodes in T, where grafting a unary node x on an edge (u, v) consists in creating a new node x, removing the edge (u, v) and creating two edges (u, x) and (x, v), and in the case of grafting a binary node, also creating a new leaf y and an edge (x, y). In the latter case, we say that y is a grafted leaf.
Species and gene trees: The species tree $S$ for a set $R$ of species represents a partially ordered set of speciation events that have led to $R$. In this article, we consider that each species $r \in R$ has two genomes: $r_0$ corresponding to its mitochondrial genome and $r_1$ corresponding to its nuclear genome.
A gene family is a set $C$ of genes where each gene $x$ belongs to a given species $s(x)$ of $R$. A tree $T$ is a gene tree for a gene family $C$ if its leafset is in bijection with $C$. We will make no distinction between a leaf of $T$ and the gene of $C$ it corresponds to. We call $s(x)$ the species labeling of the leaf $x$. For a subset $G \subseteq C$ of genes, we write $s(G) = \{s(g) : g \in G\}$ for the set of species containing the genes of $G$.
Moreover, we assign to each gene $x$ of $C$ a Boolean value corresponding to the genome it belongs to. More precisely, $b(x) = 0$ if $x$ belongs to $s(x)_0$ and $b(x) = 1$ if $x$ belongs to $s(x)_1$. In this article, we assume that the mitochondrial or nuclear location of each extant gene is known. We call $b(x)$ the genome labeling of the leaf representing $x$.
An evolutionary history is represented by an event labeled tree, where the event label $\tilde{e}(x)$ of an internal node $x$ is its corresponding event. The event labeling of the internal nodes of a gene tree is obtained through reconciliation.

Reconciliation
Inside the species' genomes, genes undergo Speciation (Spe) when the species to which they belong do, but also Duplication (Dup), i.e. the creation of a new gene copy, Loss of a gene copy and Horizontal Gene Transfer (HGT) when a gene is transmitted from a source to a target genome. In this article, we consider special cases of HGTs, called EGTs, only allowing the transmission of genes from the mitochondrial genome to the nuclear genome of the same species, or vice-versa. Moreover, we consider two types of EGTs, EGTcopy and EGTcut, defined as follows (see Fig. 1):
• A gene $x$ belonging to $r_i$ is copied (or transferred) by an EGTcopy event to $r_j$ for $\{i, j\} = \{0, 1\}$ if it is copied from $r_i$ and inserted in $r_j$.
• A gene $x$ belonging to $r_i$ is transposed by an EGTcut event to $r_j$ for $\{i, j\} = \{0, 1\}$ if it is cut from $r_i$ and inserted in $r_j$.
Thus, in this article, the set of considered events is: $DLE = \{Spe, Dup, Loss, EGTcopy, EGTcut\}$. Notice that we do not consider general HGT events.
To define a DLE-Reconciliation, assume that we are given a species tree $S$, a gene tree $T$, a mapping $s$ from $\mathcal{L}(T)$ to $\mathcal{L}(S)$ and a mapping $b$ from $\mathcal{L}(T)$ to $\{0, 1\}$. We need to define how to extend $s$ and $b$ to the internal nodes of $T$. Given an extension $R$ of $T$ ($R$ can be equal to $T$), an extension of $s$ is a function $\tilde{s}$ from $V(R)$ to $V(S)$ such that, for each leaf $x$ of $T$, $\tilde{s}(x) = s(x)$. Moreover, an extension of $b$ is a function $\tilde{b}$ from $V(R)$ to $\{0, 1\}$ such that, for each leaf $x$ of $T$, $\tilde{b}(x) = b(x)$.
Fig. 1. The effect of an event on a node $x$ of a gene tree representing the gene $a$ belonging to the genome $s_i$ (denoted $x(s_i)$), where $s$ is a species and $i \in \{0, 1\}$ (for a species $s$, $s_0$ is the mitochondrial genome and $s_1$ the nuclear genome of $s$). The tree $S$ at the top right is the species tree, where $u$ and $v$ are the two species arising from the speciation of $s$. (Spe): Gives rise to a copy $a_u$ in $u_i$ and $a_v$ in $v_i$; (Dup): Preserves the copy $a$ in $s_i$ and gives rise to a new copy $b$ in $s_i$; (EGTcopy): Represents a transfer event from $s_i$ to $s_j$, where $j \in \{0, 1\}$ and $j \neq i$, preserving the copy $a$ in $s_i$ and giving rise to a new copy $a_j$ in $s_j$; (EGTcut): Represents a transposition event from $s_i$ to $s_j$ removing the copy $a$ in $s_i$ and creating a copy $a_j$ in $s_j$.
Definition 1 (DLE-Reconciliation). Let $C$ be a gene family where each $x \in C$ belongs to the genome $b(x)$ of a species $s(x)$ of the species set $R$. Let $T$ be a rooted binary gene tree for $C$ and $S$ be a rooted binary species tree for the species set $R$. A DLE-Reconciliation is a quadruplet $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ where the tree $R$ is a partially binary extension of $T$, $\tilde{s}$ is an extension of $s$ and $\tilde{b}$ is an extension of $b$ such that:
1. Each unary node $x$ with a single child $y$ is such that $\tilde{e}(x) = EGTcut$, $\tilde{s}(x) = \tilde{s}(y) = r$ and $\tilde{b}(x) \neq \tilde{b}(y)$; $x$ represents a transposition event with source genome $r_{\tilde{b}(x)}$ and target genome $r_{\tilde{b}(y)}$.
2. For each binary node $x$ of $R$ with two children $x_l$ and $x_r$, one of the following cases holds:
a. $\tilde{s}(x_l)$ and $\tilde{s}(x_r)$ are the two children of $\tilde{s}(x)$ in $S$ and $\tilde{b}(x_l) = \tilde{b}(x_r) = \tilde{b}(x)$, in which case $\tilde{e}(x) = Spe$;
b. $\tilde{s}(x_l) = \tilde{s}(x_r) = \tilde{s}(x) = r$ and $\tilde{b}(x_l) = \tilde{b}(x_r) = \tilde{b}(x)$, in which case $\tilde{e}(x) = Dup$, representing a duplication in $r_{\tilde{b}(x)}$;
c. $\tilde{s}(x_l) = \tilde{s}(x_r) = \tilde{s}(x) = r$ and $\tilde{b}(x_l) \neq \tilde{b}(x_r)$, in which case $\tilde{e}(x) = EGTcopy$; let $y$ be the element of $\{x_l, x_r\}$ such that $\tilde{b}(x) \neq \tilde{b}(y)$, then $\tilde{e}(x)$ is a transfer with source genome $r_{\tilde{b}(x)}$ and target genome $r_{\tilde{b}(y)}$.
A grafted leaf on a newly created node $x$ corresponds to a loss in $\tilde{s}(x)$.
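The case analysis for binary nodes in Definition 1 can be illustrated with a minimal Python sketch. The function name `classify_binary_node` and the dictionary `species_children` (mapping each internal species node to its two children) are hypothetical names introduced here for illustration; this is not the EndoRex implementation.

```python
from typing import Dict, Optional, Tuple


def classify_binary_node(
    sx: str, bx: int,
    sl: str, bl: int,
    sr: str, br: int,
    species_children: Dict[str, Tuple[str, str]],
) -> Optional[str]:
    """Return the event label of a binary node x from the species labels
    (sx, sl, sr) and genome labels (bx, bl, br) of x and its two children,
    or None when no case of the definition applies."""
    # Case a: the children map to the two children of sx, same genome everywhere.
    if {sl, sr} == set(species_children.get(sx, ())) and bl == br == bx:
        return "Spe"
    if sl == sr == sx:
        # Case b: same species and same genome for x and both children.
        if bl == br == bx:
            return "Dup"
        # Case c: same species, but the two children lie in different genomes.
        if bl != br:
            return "EGTcopy"
    return None


# Hypothetical species tree with a single internal node AB = (A, B).
children = {"AB": ("A", "B")}
print(classify_binary_node("AB", 0, "A", 0, "B", 0, children))  # speciation case
print(classify_binary_node("A", 0, "A", 0, "A", 1, children))   # EGTcopy case
```

A return value of `None` corresponds to label combinations that require additional grafted nodes (unary EGTcut nodes or loss leaves) before one of the three cases applies.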
As $R$ is an extension of $T$, each node in $T$ has a corresponding node in $R$. In other words, we can consider that $V(T) \subseteq V(R)$. In particular, the species labeling on $R$ induces a species labeling on $T$.
Given a cost function $c$ on DLE and a reconciliation $\mathcal{R} = \langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$, the cost $c(\mathcal{R})$ is the sum of costs of the induced events. In this article, we assume a 0 cost for speciations and positive costs for all the other events.
We are now ready to formally define the considered optimization problem.

DLE-Reconciliation Problem:
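Input: A species tree $S$ for a set of species $R$, a gene family $C$ on $R$, a gene tree $T$ for $C$, a species labeling $s$ and a genome labeling $b$ of $\mathcal{L}(T)$, and a cost function $c$ on DLE.
Output: A DLE-Reconciliation $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ of $T$ with $S$ of minimum cost with respect to $c$.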
In the next section, we first consider the case of a unitary cost, thus reducing the problem to minimizing the number of operations induced by a reconciliation. The cost $DLE(T, S)$ of the most parsimonious DLE-Reconciliation for $T$ and $S$ in the case of a unitary cost $c$ is called the DLE-Distance. We then extend the algorithmic developments to arbitrary costs, which allows, in particular, an EGTcopy or EGTcut event moving a gene from the mitochondrion to the nucleus to be given a different cost from a similar event moving a gene from the nucleus to the mitochondrion.
In the following section, we will refer to the DL-Reconciliation of $T$ and $S$. Recall that it is a triplet $\langle R_{DL}, \tilde{s}, \tilde{e} \rangle$ defined by only considering the cases of speciations, duplications and losses in Definition 1, and ignoring the binary assignment of genes. We denote by $DL(T, S)$ the DL-Distance, i.e. the minimum number of duplications and losses induced by a DL-Reconciliation. The DL-Reconciliation $\langle R_{DL}, \tilde{s}, \tilde{e} \rangle$ of cost $DL(T, S)$ is unique and verifies, for any internal node $x$ of $V(R_{DL}) \cap V(T)$:
1. $\tilde{s}(x) = lca_S(s(\mathcal{L}(T[x])))$;
2. if $\tilde{s}(x) \neq \tilde{s}(x_l)$ and $\tilde{s}(x) \neq \tilde{s}(x_r)$ then $x$ is a Speciation; otherwise $x$ is a Duplication.
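The two properties above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: trees are encoded as nested tuples with leaf names as strings (a hypothetical encoding chosen here), the LCA is found by a simple walk to the root rather than a constant-time method, and losses are counted with the standard rule for LCA-based DL reconciliation (one loss per species-tree level skipped on a child branch).

```python
def species_index(S):
    """Return (parent, depth) dictionaries for a species tree given as nested tuples."""
    parent, depth = {}, {}

    def walk(node, par, d):
        parent[node], depth[node] = par, d
        if isinstance(node, tuple):
            for child in node:
                walk(child, node, d + 1)

    walk(S, None, 0)
    return parent, depth


def lca(u, v, parent, depth):
    """Lowest common ancestor of species nodes u and v (naive upward walk)."""
    while u != v:
        if depth[u] >= depth[v]:
            u = parent[u]
        else:
            v = parent[v]
    return u


def dl_reconcile(T, S, s):
    """LCA mapping of gene tree T into S, with Spe/Dup labels and event counts.

    s maps each gene leaf name to a species leaf name.
    Returns (mapping, events, number of duplications, number of losses)."""
    parent, depth = species_index(S)
    mapping, events = {}, {}
    counts = {"Dup": 0, "Loss": 0}

    def walk(x):
        if isinstance(x, str):          # leaf: mapped to its own species
            mapping[x] = s[x]
            return
        xl, xr = x
        walk(xl)
        walk(xr)
        mapping[x] = lca(mapping[xl], mapping[xr], parent, depth)
        is_dup = mapping[x] in (mapping[xl], mapping[xr])
        events[x] = "Dup" if is_dup else "Spe"
        counts["Dup"] += int(is_dup)
        for c in (xl, xr):
            gap = depth[mapping[c]] - depth[mapping[x]]
            # A speciation accounts for one level by itself; a duplication does not.
            counts["Loss"] += gap if is_dup else gap - 1

    walk(T)
    return mapping, events, counts["Dup"], counts["Loss"]


# Hypothetical example: species tree ((A,B),C) and a gene tree with four genes.
S = (("A", "B"), "C")
T = (("a1", "b1"), ("a2", "c1"))
s = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
mapping, events, n_dup, n_loss = dl_reconcile(T, S, s)
print(events[T], n_dup, n_loss)
```

In this example the root of the gene tree maps to the root of the species tree, as does its right child, so the root is labeled as a duplication.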
We finally need to make the link between the species labelings of an optimal reconciliation and the well-known LCA-Mapping. This is formally stated in the following lemma.
Note that in the above statement, $V(T) \cap V(R) = V(T)$, and thus the intersection is redundant. We write it this way to emphasize that $x$ is a vertex of $R$ (which happens to also be in $T$), i.e. the LCA-Mapping here applies to the reconciled trees, not to the original gene tree $T$.

A linear-time algorithm for the DLE-distance
In this section, we consider a unitary cost c on DLE.
Consider a given extension $\tilde{b}_T$ of $b$ to the internal nodes of $T$. We first present an algorithm for computing a DLE-Reconciliation $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ of minimum cost, under the condition that $\tilde{b}(x) = \tilde{b}_T(x)$ for each $x \in V(T) \cap V(R)$. We will then show how a $\tilde{b}_T$ minimizing the DLE-Distance can be obtained.
Algorithm 1 computes the DLE-Reconciliation $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ from the DL-Reconciliation $\langle R_{DL}, \tilde{s}_{DL}, \tilde{e}_{DL} \rangle$ (see Fig. 2 for an example).
Lemma 2 (Optimality of Algorithm 1). Given a binary assignment $\tilde{b}_T$ of the nodes of $T$, Algorithm 1 outputs a DLE-Reconciliation $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ of minimum cost with the constraint that $\tilde{b}(x) = \tilde{b}_T(x)$ for $x \in V(R) \cap V(T)$.
It follows from Lemma 2 that if $\tilde{b}$ is known in advance for the nodes of $T$, a DLE-Reconciliation of minimum cost is obtained from Algorithm 1 with $\tilde{b}$ as input. We now focus on finding such a labeling $\tilde{b}$.
Lemma 3 (Necessary condition for $\tilde{b}$). There exists a DLE-Reconciliation $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ of minimum cost $DLE(T, S)$ such that, for any node $x$ of $T$ and its children $x_l$ and $x_r$ in $T$, $\tilde{b}(x) = \tilde{b}(x_l)$ or $\tilde{b}(x) = \tilde{b}(x_r)$.
Proof. Assume $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ is a most parsimonious DLE-Reconciliation with a lowest node $x$ not satisfying condition (1): $\tilde{b}(x) = \tilde{b}(x_l)$ or $\tilde{b}(x) = \tilde{b}(x_r)$. Thus we should have $\tilde{b}(x) \neq \tilde{b}(x_l) = \tilde{b}(x_r)$. Note that an EGTcut event must be present on at least one of the $(x, x_l)$ or $(x, x_r)$ branches. A reconciliation of lower or equal cost can be obtained by assigning $\tilde{b}(x) = \tilde{b}(x_l) = \tilde{b}(x_r)$ and removing this EGTcut event, reducing the cost by one. Let $p_x$ be the parent of $x$ in $R$ (note that if $x$ is the root, $p_x$ might not exist, in which case there is nothing else to do). If $\tilde{b}(x)$ is now different from $\tilde{b}(p_x)$, we add an EGTcut event between $p_x$ and $x$, yielding an alternate reconciliation of equal or lower cost.
We can reproduce the same transformation iteratively in a bottom-up fashion until condition (1) is satisfied for every node. □
For a node $x \in V(T)$, define $d(x) = 1$ if $x$ is a duplication in the DL-Reconciliation of minimum cost, and $d(x) = 0$ otherwise. Let $\tilde{b}$ be a binary labeling of $V(T)$. For any node $x$ of $T$, denote $\Delta_{\tilde{b}}(x) = 0$ if $x \in \mathcal{L}(T)$, otherwise
$$\Delta_{\tilde{b}}(x) = \max(0, |\tilde{b}(x) - \tilde{b}(x_l)| + |\tilde{b}(x) - \tilde{b}(x_r)| - d(x))$$
and define:
$$cost(T, S, \tilde{b}) = \sum_{x \in V(T)} \Delta_{\tilde{b}}(x).$$
Roughly speaking, $\Delta_{\tilde{b}}(x)$ reflects the number of label changes between $x$ and its children $x_l$ and $x_r$ in $T$, with the exception that a duplication is allowed a 'free' change since it can be turned into an EGTcopy node. For example, in Figure 2, $cost(T, S, \tilde{b}) = 2$ for the labeling $\tilde{b}$ of $T$ consistent with that of the left tree $R$ (Algo1+Fitch), and $cost(T, S, \tilde{b}) = 1$ for the labeling $\tilde{b}$ of $T$ consistent with that of the right tree $R$ (Algo1+Algo2), reflecting, for each one, the number of required EGTcut events.
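To make the per-node quantity and the labeling cost concrete, here is a minimal Python sketch that evaluates them for a given labeling. The nested-tuple tree encoding, the function names `delta` and `labeling_cost`, and the small example (which is not the tree of Figure 2) are assumptions made for illustration only.

```python
def delta(x, b, dup):
    """Per-node term: label changes towards the two children of x, minus one
    'free' change when x is a duplication in the DL-Reconciliation."""
    if isinstance(x, str):              # leaves contribute nothing
        return 0
    xl, xr = x
    changes = abs(b[x] - b[xl]) + abs(b[x] - b[xr])
    return max(0, changes - int(dup.get(x, False)))


def labeling_cost(T, b, dup):
    """Sum of the per-node terms over all nodes of the gene tree T."""
    if isinstance(T, str):
        return 0
    return delta(T, b, dup) + sum(labeling_cost(c, b, dup) for c in T)


# Hypothetical gene tree: leaves a, d in the mitochondrion (0); e, c in the nucleus (1).
w = ("d", "e")
x = ("a", w)
T = (x, "c")
leaves = {"a": 0, "d": 0, "e": 1, "c": 1}
dup = {x: True}                          # assume only x is a duplication

all_zero = {**leaves, w: 0, x: 0, T: 0}  # every internal node labeled 0
all_one = {**leaves, w: 1, x: 1, T: 1}   # every internal node labeled 1
print(labeling_cost(T, all_zero, dup), labeling_cost(T, all_one, dup))
```

In this example, labeling every internal node 1 lets the duplication node absorb one label change for free, so the second labeling is cheaper than the first.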

Lemma 4. The minimum cost of a DLE-Reconciliation between a gene tree $T$ and a species tree $S$ is
$$DLE(T, S) = DL(T, S) + \min_{\tilde{b}} cost(T, S, \tilde{b}).$$
Proof. By Lemma 2, Algorithm 1 correctly infers a minimum cost DLE-Reconciliation for a given $\tilde{b}$. Note that this DLE-Reconciliation is obtained from a DL-Reconciliation by turning some duplication nodes into EGTcopy nodes (which do not change the cost), and by grafting some EGTcut nodes. Thus, the latter are responsible for any possible change in cost from $DL(T, S)$ to $DLE(T, S)$. It follows that the cost of the returned DLE-Reconciliation is $DL(T, S)$, plus the number of grafted EGTcut nodes.
Let $\tilde{b}$ be a binary assignment of $T$ that minimizes $DLE(T, S)$ when $\tilde{b}$ is passed to Algorithm 1. By Lemma 3, we may assume that for any node $x$ and its children $x_l$ and $x_r$, $\tilde{b}(x) = \tilde{b}(x_l)$ or $\tilde{b}(x) = \tilde{b}(x_r)$. Thus $\Delta_{\tilde{b}}(x) \in \{0, 1\}$ for every $x$. Furthermore, $\Delta_{\tilde{b}}(x) = 1$ if and only if $x$ is a speciation node and an EGTcut node is grafted on the edge $(x, x_l)$ (if $\tilde{b}(x) \neq \tilde{b}(x_l)$) or on the edge $(x, x_r)$ (if $\tilde{b}(x) \neq \tilde{b}(x_r)$). In consequence, $cost(T, S, \tilde{b})$ counts exactly the number of graftings of EGTcut nodes. □
Since the most-parsimonious DL-Reconciliation is unique, the $DL(T, S)$ term in the above lemma is an invariant. Our goal is therefore to find the labeling $\tilde{b}$ that minimizes $cost(T, S, \tilde{b})$. This can be achieved by a slight modification of the Fitch algorithm (Fitch, 1971) computing, for a given tree with leaf labels, all possible label assignments of internal nodes minimizing the number of label changes along the edges of the tree.
We first need to recall some concepts on parsimony. Given a tree $T$ on a leafset $L$ of residues (generally nucleotides or amino-acids, but in this article $L = \{0, 1\}$ corresponding to the possible $\tilde{b}$ labeling), the weighted parsimony problem consists in assigning a residue $\tilde{b}(u) \in L$ to each internal node $u$ of $T$ in a way minimizing the total weight of the tree. More precisely, given a cost matrix $M$ on residues, the weight of $T$ is the sum of weights $M(\tilde{b}(u), \tilde{b}(v))$ for all $(u, v) \in E(T)$. An assignment of $T$ refers to the assignment of a residue to each internal node of $T$.
The Sankoff and Cedergren (1983) algorithm computes, in quadratic time, the minimum cost $min(T)$ of an assignment of $T$. Moreover, it finds all the assignments of $T$ leading to $min(T)$. When $M(a, a) = 0$ for all $a \in L$ and $M(a, b) = 1$ for $a \neq b$, weighted parsimony can be computed in linear time using the Fitch algorithm.
The Fitch algorithm consists of two phases. The first phase is recursive and reconstructs possible ancestral labels $L(x)$ for each node $x$ of $T$ and the overall minimum number of label changes required as follows: For each node $x$ of $T$ in a bottom-up traversal,
(1) if $x$ is a leaf, then $L(x) = \{b(x)\}$ and $cost(T[x]) = 0$.
(2) Else, let $x_l$ and $x_r$ be the children of $x$. If $L(x_l) \cap L(x_r) = \emptyset$, then $L(x) = L(x_l) \cup L(x_r)$ and $cost(T[x]) = cost(T[x_l]) + cost(T[x_r]) + 1$; else $L(x) = L(x_l) \cap L(x_r)$ and $cost(T[x]) = cost(T[x_l]) + cost(T[x_r])$.
The second phase of the algorithm reconstructs an assignment $\tilde{b}$ of $T$ that has a minimum cost, by computing $\tilde{b}(x)$ as follows: For each node $x$ of $T$ in a top-down traversal,
(1) if $x$ is the root, assign $\tilde{b}(x)$ to any label in $L(x)$.
(2) Else, let $x_p$ be the parent of $x$. If $\tilde{b}(x_p) \in L(x)$, then assign $\tilde{b}(x) = \tilde{b}(x_p)$, else assign $\tilde{b}(x)$ to any label in $L(x)$.
Fig. 2. The tree $R_{DL}$ at the top left, together with its node labeling, is the optimal DL-Reconciliation for the gene tree $T$ represented by the plain edges of $R_{DL}$ and the species tree $S$ at the top right. The two bottom trees are obtained by Algorithm 1 for two different $\tilde{b}$ labelings of internal nodes: the left labeling is obtained by the Fitch algorithm for weighted parsimony, while the right labeling is obtained by applying Algorithm 2. The left labeling gives rise to a non-optimal reconciliation with seven operations (two losses, one duplication, two EGTcopy and two EGTcut), while the right labeling gives rise to the DLE-Distance which is equal to six (two losses, three EGTcopy and one EGTcut). Rectangles represent duplications; triangles represent either EGTcopy or EGTcut events depending on whether the labeled node is binary or unary; dotted lines represent losses. A leaf $x_i$ represents a gene $x$ belonging to the genome $i$ (0 for mitochondrial and 1 for nuclear) of species $X$.
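The two phases of the Fitch algorithm described above can be sketched in Python as follows. This is a minimal illustration on a hypothetical example: the tree is encoded as nested tuples with leaf names as strings, and "any label in L(x)" is resolved by taking the smallest one, which is an arbitrary choice made here so that the output is deterministic.

```python
def fitch(T, b):
    """Unweighted Fitch parsimony on a binary tree T (nested tuples).

    b maps leaf names to labels. Returns (assignment, number of label changes)."""
    L = {}          # candidate label sets from the bottom-up phase
    changes = 0

    def up(x):
        nonlocal changes
        if isinstance(x, str):
            L[x] = {b[x]}
            return
        xl, xr = x
        up(xl)
        up(xr)
        common = L[xl] & L[xr]
        if common:
            L[x] = common
        else:                      # empty intersection: one extra label change
            L[x] = L[xl] | L[xr]
            changes += 1

    def down(x, parent_label, assign):
        # Keep the parent's label when possible, else pick any candidate label.
        assign[x] = parent_label if parent_label in L[x] else min(L[x])
        if isinstance(x, tuple):
            for child in x:
                down(child, assign[x], assign)

    up(T)
    assign = {}
    down(T, None, assign)
    return assign, changes


# Hypothetical gene tree with genome labels at the leaves.
w = ("d", "e")
x = ("a", w)
T = (x, "c")
leaf_labels = {"a": 0, "d": 0, "e": 1, "c": 1}
assign, changes = fitch(T, leaf_labels)
print(assign[T], assign[x], assign[w], changes)
```

Because nodes are identified by their nested tuples, this encoding assumes that leaf names are unique.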
The Fitch algorithm does not always find an optimal $\tilde{b}$ assignment because of duplications that can be turned into EGTcopy events. Algorithm 2 modifies the first phase of the Fitch algorithm to compute the DLE-Distance and an assignment $\tilde{b}$ of $T$ that leads to the DLE-Distance. The modification reflects the fact that a duplication node is allowed a 'free' change since it can be turned into an EGTcopy node (see Fig. 2 for an illustration).
Lemma 5. Algorithm 2 outputs, in linear time, the DLE-Distance $DLE(T, S)$ and a binary assignment $\tilde{b}$ of $T$ that leads to a most parsimonious DLE-Reconciliation.
Proof. It suffices to prove that the following statement holds for any node $x$ of $T$: for any label $b$ in $L(x)$, there exists a binary assignment $\tilde{b}$ of $T[x]$ such that $\tilde{b}(x) = b$ and $\tilde{b}$ minimizes $cost(T[x], S, \tilde{b})$.
2. If $x$ is not a leaf (Lines 6-20). Let $x_l$ and $x_r$ be the children of $x$, and assume that the statement holds for $x_l$ and $x_r$. Let $b \in L(x)$. Let $\tilde{b}_l$ and $\tilde{b}_r$ be two binary assignments of $T[x_l]$ and $T[x_r]$ that minimize $cost(T[x_l], S, \tilde{b}_l)$ and $cost(T[x_r], S, \tilde{b}_r)$, respectively, and such that [...] Let $\tilde{b}$ be the binary assignment of $T[x]$ obtained by merging $\tilde{b}_l$ and $\tilde{b}_r$ and extending it with $\tilde{b}(x) = b$.
In both cases, Algorithm 1 computes a DLE-Reconciliation with minimum cost $DLE(T[x_l], S) + DLE(T[x_r], S) + 1$ with a minimum increment of 1 for a Dup node in case (1), or by making $x$ an EGTcopy node in case (2), but no additional EGTcut node is required.
4. If x is a speciation node in the DL-reconciliation.
It is easy to see that both the first and the second phases of the algorithm have linear time complexity, thus the overall algorithm has a linear time complexity. □
As for the Fitch algorithm, Algorithm 2 does not output all the solutions of the DLE-Reconciliation problem leading to the DLE-Distance. This can, however, be achieved by adapting Sankoff and Cedergren's dynamic programming algorithm. Rather than doing so, we choose to introduce, in the next section, a more general dynamic programming algorithm that outputs all optimal solutions for an arbitrary cost of the DLE events, not only for the unitary cost.
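For the unitary cost, the quantity minimized in Lemma 4 can also be computed by a small two-state dynamic program in the spirit of Sankoff and Cedergren. The Python sketch below is not Algorithm 2 from the text; it is an exhaustive alternative written under assumptions made here: a nested-tuple tree encoding, a dictionary `dup` flagging the nodes that are duplications in the DL-Reconciliation, and the hypothetical function name `min_labeling_cost`.

```python
from itertools import product

INF = float("inf")


def min_labeling_cost(T, b, dup):
    """Minimum, over all 0/1 labelings of internal nodes, of the sum over nodes x of
    max(0, |b(x) - b(x_l)| + |b(x) - b(x_r)| - d(x)).

    Returns (minimum cost, table), where table[x][v] is the best cost within the
    subtree rooted at x when x carries label v."""
    table = {}

    def up(x):
        if isinstance(x, str):
            # A leaf can only carry its observed genome label.
            table[x] = [0 if v == b[x] else INF for v in (0, 1)]
            return
        xl, xr = x
        up(xl)
        up(xr)
        free = int(dup.get(x, False))   # one free change at duplication nodes
        table[x] = [
            min(
                table[xl][vl] + table[xr][vr]
                + max(0, abs(v - vl) + abs(v - vr) - free)
                for vl, vr in product((0, 1), repeat=2)
            )
            for v in (0, 1)
        ]

    up(T)
    return min(table[T]), table


# Hypothetical gene tree in which only node x is a duplication.
w = ("d", "e")
x = ("a", w)
T = (x, "c")
leaf_labels = {"a": 0, "d": 0, "e": 1, "c": 1}
best, table = min_labeling_cost(T, leaf_labels, {x: True})
print(best, table[T])
```

Each node tries the four label pairs of its children for each of its own two labels, so the running time is linear in the size of the gene tree. On this example the minimum is reached with the root labeled 1, whereas plain Fitch parsimony on the same leaf labels needs two label changes.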

Solving the DLE-reconciliation problem with arbitrary DLE costs
We now introduce a dynamic programming algorithm for general costs. We use $d$ and $k$ to denote the cost of a duplication and a loss, respectively. We use $q_0$ (respectively $s_0$) for the cost of an EGTcut (respectively EGTcopy) from the mitochondrial genome to the nuclear genome, and $q_1$ (respectively $s_1$) for the cost of an EGTcut (respectively EGTcopy) from the nuclear genome to the mitochondrial genome. Note that the subscripts of the EGT costs indicate the source of the switch. Also denote
$$q^*_0 = \min(q_0, s_0 + k) \quad \text{and} \quad q^*_1 = \min(q_1, s_1 + k).$$
Roughly speaking, $q^*_0$ represents the minimum cost required to switch from mitochondrial to nuclear genome inside a branch of $T$, and $q^*_1$ the minimum cost required in the other direction. The purpose of $q^*_0$ and $q^*_1$ is that a switch can be accomplished by an EGTcut event, but also by an EGTcopy event followed by a loss.
Let $x \in V(T)$. Note that $\tilde{s}(x)$ does not need to be inferred, since by Lemma 1, we can assume that $\tilde{s}(x) = lca_S(s(\mathcal{L}(T[x])))$. Our dynamic programming table only needs to store the optimal cost on $T[x]$ for each possible $\tilde{b}(x) \in \{0, 1\}$. This requires testing each of three possible events $\tilde{e}(x)$ at $x$, and the number of scenarios to consider at $x$ is therefore constant [this is the main reason for the gain in time compared to the algorithm of Doyon et al. (2010), which requires adding a dimension to the table corresponding to all possible species at $x$]. Let $b_x \in \{0, 1\}$. We denote by $D[x, b_x]$ the minimum cost of a DLE-Reconciliation $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ of $T[x]$ with $S$ in which $\tilde{b}(x) = b_x$ (or $\infty$ if no such reconciliation exists). Trivially, if $x$ is a leaf of $T$, we have
$$D[x, b_x] = \begin{cases} 0 & \text{if } b_x = b(x), \\ \infty & \text{otherwise.} \end{cases}$$
Assume now that $x$ is an internal node of $T$. Let $x_l$, $x_r$ be the children of $x$. For $s_1, s_2 \in V(S)$, let $path(s_1, s_2)$ denote the number of vertices on the path between $s_1$ and $s_2$ in $S$, including $s_1$ and $s_2$. Then define
$$l_x = path(\tilde{s}(x), \tilde{s}(x_l)) + path(\tilde{s}(x), \tilde{s}(x_r))$$
which counts the number of mandatory losses on the child branches of a node $x$ of $T$.
Y. Anselmetti et al.
To compute $D[x, b_x]$, we use three auxiliary values $D[x, b_x, e_x]$, where $e_x \in \{Spe, Dup, EGTcopy\}$ represents the event label of $x$ (note that $e_x$ cannot be an EGTcut event, since $x$ has two children).
If $\tilde{s}(x) = \tilde{s}(x_l)$ or $\tilde{s}(x) = \tilde{s}(x_r)$, then $D[x, b_x, Spe] = \infty$. Assuming this check has been performed, we have [...] and $D[x, b_x] = \min(D[x, b_x, Spe], D[x, b_x, Dup], D[x, b_x, EGTcopy])$. The value of interest is $\min(D[r(T), 0], D[r(T), 1])$.
Theorem 1. For any $x \in V(T)$ and $b_x \in \{0, 1\}$, the value of $D[x, b_x]$, as defined above, is equal to the minimum cost of a DLE-Reconciliation $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ of $T[x]$ with $S$ satisfying $\tilde{b}(x) = b_x$. Moreover, the minimum cost $\min(D[r(T), 0], D[r(T), 1])$ of a reconciliation of $T$ with $S$ can be computed in time $O(|V(T)| + |V(S)|)$.
Let us note that once the $D$ table is computed, a standard backtracking procedure allows one to reconstruct every optimal DLE-Reconciliation.

Experimental results
We implemented the above dynamic programming procedure in Python in a software called EndoRex, which supports arbitrary costs as input and returns a reconciled gene tree in Newick format. The Python source can be accessed at https://github.com/AEVO-lab/EndoRex. We then performed a variety of experiments on a dataset obtained from Kannan et al. (2014), as described below.

Kannan et al. (2014) dataset
For the reconstruction of evolutionary histories with EGT events, we used a dataset from Kannan et al. (2014) available at ftp://ftp.ncbi.nih.gov/pub/koonin/MitoCOGs. The dataset consists of 140 MitoCOGs extended with paralogs and nuclear protein-coding homologs from 2486 eukaryotes with complete mitochondrial genomes. MitoCOGs are clusters of orthologous genes for mitochondrial-encoded proteins generated using COG construction (Makarova et al., 2007; Yutin et al., 2009). A full description of the MitoCOG generation procedure is given in Kannan et al. (2014). Among the 140 MitoCOGs, 73 correspond to protein-coding gene families, 49 are hypothetical proteins and 18 are clusters for which the protein function is identified but not the gene name.
Among these 73 MitoCOGs, 13 are core-mitochondrial proteins that are shared by most of the 2486 mitochondrial genomes. Statistics on MitoCOGs of the Kannan et al. dataset are given in Table 1.
The 11 plant species are represented in 68 MitoCOGs with mitochondrial-encoded proteins and 41 MitoCOGs with nuclear-encoded proteins. We selected the clusters for which there were mitochondrial and nuclear encoded genes, yielding 28 MitoCOGs containing 326 protein-coding genes, including 184 encoded in the mitochondria and 142 in the nucleus. All the 28 MitoCOGs correspond to gene names that are present in the mitochondrial gene content review of Sloan et al. (2018). Table 2 gives information about the 28 MitoCOGs of the 11 plants dataset specifying the gene name, the protein metabolic pathway and the number of genes and species for each MitoCOG.
For each MitoCOG, we applied a pipeline to infer the evolutionary history of EGTs with DLE-Reconciliation along the species tree of the 11 plants. The topology of the species tree was taken from Kannan et al. (2014). We added the species Micromonas sp. RCC299 as the sister species of Ostreococcus tauri, as only these 2 among the 11 plant species belong to the Mamiellophyceae class. We also swapped the positions of P. patens and S. moellendorffii according to Puttick et al. (2018) (Fig. 3).
As for constructing gene trees, the first step of the pipeline was to align the protein sequences with MUSCLE (Edgar, 2004). In the second step, a maximum likelihood protein tree was inferred using RAxML (v8.2.4) with the PROTGAMMAGTRX evolutionary model (Stamatakis et al., 2014). NOTUNG (v.2.9.1.5) was then used to root the trees by minimizing the cost of a duplication-loss reconciliation with default parameters (loss cost: 1.0 and duplication cost: 1.5) (Stolzer et al., 2012).
The rooted protein trees obtained with this pipeline and the species tree of the 11 plants were given as input to the EndoRex software to infer a most parsimonious DLE-Reconciliation allowing for arbitrary costs for duplications, losses and EGTs.

EndoRex evolutionary events cost setting
As a reminder, we consider six parameters corresponding to the different evolutionary event costs: $d$ and $k$, the cost of, respectively, a gene duplication and loss; $q_0$ (respectively $s_0$), the cost of an EGTcut (respectively EGTcopy) from the mitochondrial genome to the nuclear genome; and $q_1$ (respectively $s_1$), the cost of an EGTcut (respectively EGTcopy) from the nuclear genome to the mitochondrial genome.
Note: MitoCOGs have been designed for mitochondrial-encoded genes, and nuclear-encoded genes have been included later. This explains why all nuclear-encoded MitoCOGs, and the corresponding species, are included in the mitochondrial-encoded sets of MitoCOGs and species.
We test five different cost settings for the application of EndoRex on the 11 plants dataset. Setting S1 corresponds to the default parameter values, with a unitary cost for all evolutionary events (allowing the DLE-Distance to be computed). For setting S2, the gene loss and duplication costs are those used in NOTUNG for rooting the protein trees, and the EGTcopy and EGTcut costs are set higher to reflect the fact that these evolutionary events are less frequent than gene duplications: $k = 1.0$, $d = 1.5$ and $q_0 = q_1 = s_0 = s_1 = 2.0$. In setting S3, we consider EGTcopy as less likely than EGTcut: $k = 1.0$, $d = 1.5$, $q_0 = q_1 = 2.0$ and $s_0 = s_1 = 3.0$. For setting S4, we differentiate the cost of a gene move from the mitochondria to the nucleus from that of a move from the nucleus to the mitochondria, to account for the fact that, during the evolution of eukaryotes, mitochondrial genes have been integrated into the nuclear genome, while the reverse is extremely rare: $k = 1.0$, $d = 1.5$, $q_0 = 2.0$, $q_1 = 3.0$, $s_0 = 3.0$ and $s_1 = 4.0$. Finally, setting S5 is the same as setting S4 except that we make no difference between the costs of EGTcopy and EGTcut events: $k = 1.0$, $d = 1.5$, $q_0 = 2.0$, $q_1 = 3.0$, $s_0 = 2.0$ and $s_1 = 3.0$.
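For reference, the five settings can be written down as plain data. The following is a minimal sketch, not EndoRex's actual configuration format: the `CostSetting` class, the `SETTINGS` table and the `scenario_cost` helper are hypothetical names introduced here, and only the numeric values come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostSetting:
    """Six event costs, named after the parameters in the text (hypothetical container)."""
    k: float   # gene loss
    d: float   # gene duplication
    q0: float  # EGTcut, mitochondrion -> nucleus
    q1: float  # EGTcut, nucleus -> mitochondrion
    s0: float  # EGTcopy, mitochondrion -> nucleus
    s1: float  # EGTcopy, nucleus -> mitochondrion


# Values transcribed from the description of settings S1 to S5.
SETTINGS = {
    "S1": CostSetting(k=1.0, d=1.0, q0=1.0, q1=1.0, s0=1.0, s1=1.0),
    "S2": CostSetting(k=1.0, d=1.5, q0=2.0, q1=2.0, s0=2.0, s1=2.0),
    "S3": CostSetting(k=1.0, d=1.5, q0=2.0, q1=2.0, s0=3.0, s1=3.0),
    "S4": CostSetting(k=1.0, d=1.5, q0=2.0, q1=3.0, s0=3.0, s1=4.0),
    "S5": CostSetting(k=1.0, d=1.5, q0=2.0, q1=3.0, s0=2.0, s1=3.0),
}


def scenario_cost(c, losses=0, dups=0, cut_m2n=0, cut_n2m=0, copy_m2n=0, copy_n2m=0):
    """Total cost of a scenario given its event counts (illustrative helper)."""
    return (c.k * losses + c.d * dups
            + c.q0 * cut_m2n + c.q1 * cut_n2m
            + c.s0 * copy_m2n + c.s1 * copy_n2m)
```

With such a table, the same event counts can be scored under each setting, which makes it easy to see when a scenario that is cheap under S1 becomes expensive under S4.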
Applied to the 28 MitoCOGs trees, EndoRex infers the same DLE-Reconciliation with the five different settings for 21 of the 28 MitoCOGs.
All seven MitoCOGs with more than one inferred DLE-Reconciliation, depending on the considered setting, lead to two different DLE-Reconciliations: for MitoCOG0014, MitoCOG0051 and MitoCOG0053, setting S1 gives a DLE-Reconciliation different from the other settings; for MitoCOG0027, it is setting S3 that gives a different DLE-Reconciliation; for MitoCOG0005 and MitoCOG0039, it is setting S4; and finally for MitoCOG0072, settings S4 and S5 give a DLE-Reconciliation different from S1, S2 and S3. We analyzed the two DLE-Reconciliations of MitoCOG0014 (atp9), MitoCOG0027 (rpl2), MitoCOG0039 (rpl16) and MitoCOG0072 (rps10) to illustrate the effect of the cost settings (see Fig. 4).
According to these case studies, setting S1 appears inappropriate, as it leads to the prediction of a higher number of EGTs, which are rare evolutionary events (see MitoCOG0014 in Fig. 4, and MitoCOG0051 and MitoCOG0053 in Appendix Fig. A1). For MitoCOG0027, setting S3 leads to the prediction of numerous EGTs from the nucleus to the mitochondria, which is very unrealistic as very few gene movements from the nucleus to the mitochondria have been described in the literature.

Conclusion
Investigating the origin, evolution and characteristics of gene coding capacity of eukaryotes has been among the central themes in the Life Sciences. In this context, the endosymbiotic origin of mitochondrial genomes and the gradual integration of the mitochondrial gene content to the nucleus are important evolutionary parameters expected to shed light on features of eukaryotic gene evolution and function.
From a computational point of view, detecting the footprint of endosymbiosis in the gene repertoires of the mitochondrial and nuclear genomes of eukaryotes requires new evolutionary prediction methods. This article is a first effort toward developing the appropriate algorithmic tools for analyzing the movement of genes inside a gene family between the mitochondrial and nuclear genome of the same species. We presented a linear-time algorithm computing a most parsimonious history of Duplication, Loss and EGT (DLE) events explaining a gene tree with leaves identified as mitochondrial or nuclear genes. We also presented a general dynamic programming algorithm, implemented in the EndoRex software, to compute all optimal DLE-Reconciliations for any arbitrary cost scheme of operations.
Note: For the 'Nb of gene' column, the numbers of mitochondria-encoded (mito) and nucleus-encoded (nuc) genes are specified. The topology of the tree is based on Kannan et al. (2014).
By applying EndoRex to a plant dataset, we showed that it is well-designed to infer the evolutionary histories of EGT events, considering a variety of cost settings. Some reconciled trees (not shown) of the 11 plants dataset produced evolutionary histories that could be considered unrealistic, as they lead to an unexpectedly high number of gene duplications and losses. As our algorithm is exact and thus guaranteed to infer the minimum number of events given a gene tree, this is likely due to errors in protein sequence alignment and/or gene tree inference, leading to erroneous gene trees (Hahn, 2007). A better gene tree inference pipeline should be designed in the future to get more accurate gene trees. In particular, gene trees have been rooted according to the DL-distance using the default NOTUNG parameters. Instead, we could have rooted the trees according to our DLE-model, with the 5 considered cost settings. In addition, the obtained RAxML binary gene trees contain many weakly supported edges. Those edges may be contracted, and a polytomy resolution tool such as PolytomySolver (Lafond et al., 2016) may be used to better resolve multifurcations.
Simulation studies should also be conducted in the future to better evaluate the quality of the obtained solutions. Our method relies on a deterministic parsimony approach to compute all optimal DLE-reconciliations given a cost scheme for DLE events. This model has many limitations. In particular, parsimony cannot model multiple state changes along a branch of the phylogeny, or uncertainty in phylogenetic reconstructions. An alternative is to rely on approaches using stochastic state mapping models such as the mutational mapping approach (Bollback, 2006; Huelsenbeck et al., 2003). Since our method outputs all optimal DLE-reconciliations, it can also be used to compute the probabilities of all possible events over all optimal solutions.
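One simple way to read the last remark is to treat all optimal reconciliations as equally likely and report, for each gene tree node, the fraction of optimal solutions in which it carries a given event. The sketch below assumes exactly that; the dictionary representation of a reconciliation (node name mapped to event label) and the function name `event_support` are hypothetical and do not reflect EndoRex's output format.

```python
from collections import Counter


def event_support(reconciliations):
    """Fraction of optimal solutions assigning each (node, event) pair.

    `reconciliations` is a list of dicts mapping a gene tree node to its
    event label; all solutions are weighted equally (an assumption).
    """
    if not reconciliations:
        return {}
    counts = Counter()
    for rec in reconciliations:
        # Each (node, event) pair is counted once per solution containing it.
        counts.update(rec.items())
    total = len(reconciliations)
    return {pair: n / total for pair, n in counts.items()}
```

A node labeled identically in every optimal solution gets support 1.0, while a node whose label depends on the chosen solution gets a lower value.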
Future algorithmic extensions of the optimization problem considered in this article may concern extending the model to account for both EGT and HGT events, toward inferring a Duplication, Transfer (HGT), Loss and EGT (DTLE) evolutionary scenario for a gene family. Another direction would be to infer common episodes of EGT events for a set of gene families. This may be handled by generalizing the Super-Reconciliation model (Delabre et al., 2020) to account for segmental DLE events.
Future developments will define an EGT simulation model to provide EGT evolutionary histories for assessing the accuracy of our algorithm. Some efforts have been made to provide EGT simulation models. Brandvain and Wade (2009) provide a model to explore the influence of population-genetic parameters (such as selection, dominance, mutation rates and population size with a rate of self-fertilization) on the rate and probability of functional gene transfer from the mitochondrial genome (haploid) to the nuclear genome (diploid). Kelly (2020) defines an EGT simulation model based on the ATP biosynthesis cost of encoding a mitochondrial/chloroplast gene in the nuclear genome and importing the resulting protein into the organelle. These prior works provide useful insights for designing a model for the simulation of EGT evolutionary histories, which would be strongly inspired by existing models for the simulation of HGT evolutionary histories.
Future applications will also concern a thorough analysis of protein-coding genes involved in common metabolic pathways. As an example, oxidative phosphorylation (OXPHOS) involves a series of protein complexes (I, II, III, IV and V) leading to an electrochemical proton gradient activating the ATP synthase (complex V) that produces ATP. The protein-coding genes involved in OXPHOS are expected to share common mitochondrial-nuclear movements, as nucleus and mitochondria are two compartments with different biological dynamics.
Finally, the recent sequencing effort directed at the genomes of jakobid and malawimonad protists, known to have emerged close to the eukaryotic origin, will provide a valuable dataset that can be analyzed with the newly developed algorithms, helping to shed light on a number of important biological questions, among them resolving the root of the eukaryote tree. In fact, as EGTs are rare events, candidate topologies for which DLE-Reconciliations infer the lowest number of EGT events may provide evidence for a correct rooting.
Financial Support: Natural Sciences and Engineering Research Council of Canada; Fonds de recherche Nature et Technologie, Québec.
Conflict of Interest: none declared.

If present, we may assume without loss of generality that such an event occurs at $x_k$, the parent of $x_l$ in $R$, since the timing of the switch does not affect the reconciliation cost. In this case, $\tilde{s}(x_k) = \tilde{s}(x_l) = \mathrm{lca}_S(s(\mathcal{L}(T[x_l])))$. On the other hand, $\tilde{s}(x_1) = \tilde{s}(x) \neq \mathrm{lca}_S(s(\mathcal{L}(T[x])))$. This implies that $x_1 \neq x_k$, and thus $x_1$ is not an EGTcopy or an EGTcut. It follows that $x_1$ is a node inserted because of a grafted loss, and $\tilde{s}(x_2) = s'$. In $R'$, we can remove $x_1$ and its loss leaf, and by doing so, the left child of $x$ becomes $x_2$. This preserves all properties of a valid reconciliation because both $x$ and $x_2$ are mapped to $s'$. We can apply the same procedure on the path from $x$ to $x_r$.
In $R'$, we have created one loss above $x$, but have removed two losses, one on each side of $x$. No other event labeling has changed. Since we assume that losses have a non-zero cost, $R'$ has a strictly lower cost than $R$, a contradiction.
Proof of Lemma 2. We first show that the reconciliation $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ obtained from Algorithm 1 is a valid DLE-Reconciliation. Note that the tree $R$ returned by the algorithm is the same as $R_{DL}$, but with some grafted unary nodes for EGTcut events where needed. Consider some $x \in V(R_{DL})$. In $R$, we put $\tilde{e}(x) = \mathrm{Spe}$ if $\tilde{e}_{DL}(x) = \mathrm{Spe}$, and $\tilde{e}(x) \in \{\mathrm{Dup}, \mathrm{EGTcopy}\}$ if $\tilde{e}_{DL}(x) = \mathrm{Dup}$. If no additional node was grafted as a new child of $x$, all properties of reconciliation would be preserved since we keep $\tilde{s}$ as in $\tilde{s}_{DL}$. If some node $x'$ was grafted as a new child of $x$, we ensure that $\tilde{s}(x')$ is the same as that of the previous child of $x$, which ensures that we satisfy the properties of reconciliation. Therefore, we only need to check whether the tree $R_{DL}$ is modified in an appropriate way in the case of a different $\tilde{b}$ value for a node $x$ of $T$ and one of its two children $x_l$ or $x_r$.
Lines 2-8 first ensure that the starting tree $R$ is such that, for each node $x$ of $T$, $\tilde{b}(x) = b_T(x)$, and for any edge $(x, y)$ in $T$ such that $b_T(x) \neq b_T(y)$, the corresponding path $(x, v_1, v_2, \ldots, v_n, y)$ in $R$ is such that for all $i$, $\tilde{b}(v_i) = \tilde{b}(y)$. Subsequently, in the case of a different $\tilde{b}$ value for a node $x$ of $T$ and its child $y$, the node $x$ is either modified to an EGTcopy node, ensuring that the switch between $\tilde{b}(x)$ and $\tilde{b}(v_1)$ is correctly explained by this EGTcopy, or a new EGTcut node $v$ is grafted on the edge $(x, v_1)$, also correctly explaining the switch between $\tilde{b}(x)$ and $\tilde{b}(v_1)$.
We now show that the DLE-Reconciliation output by Algorithm 1 is of minimum cost. First note that, from the initialization done in Line 8, for each leaf $x$ which is in $R_{DL}$ but not in $T$ (lost gene), the algorithm ensures that $\tilde{b}(x) = \tilde{b}(p_x)$, where $p_x$ is $x$'s parent. Thus, grafted loss leaves never require an extra EGTcopy event on an 'inserted edge' of $R_{DL}$.
Assume another reconciliation $\langle R', \tilde{s}', \tilde{b}', \tilde{e}' \rangle$ has a strictly lower cost than the reconciliation $\langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ output by Algorithm 1. We first show that, for any node of $T$, the corresponding nodes in $R$ and $R'$ have the same event label. Assume this is not the case. Let $x$ be the lowest node of $T$ such that $\tilde{e}'(x) \neq \tilde{e}(x)$. Let $x_l$ and $x_r$ be its two children in $T$, and let $v_l$ and $v_r$ be the two non-unary descendants of $x$ in $R'$ closest to $x$. Note that $x_l$ and $x_r$ do not necessarily correspond to $v_l$ and $v_r$ in $R'$. Rather, they may be strict descendants of these nodes in $R'$.
1. If $\tilde{e}_{DL}(x) = \mathrm{Dup}$, then from Algorithm 1, $\tilde{e}(x) = \mathrm{Dup}$ if $\tilde{b}(x_l) = \tilde{b}(x)$ and $\tilde{b}(x_r) = \tilde{b}(x)$, and $\tilde{e}(x) = \mathrm{EGTcopy}$ otherwise. As $\tilde{e}'(x) \neq \tilde{e}(x)$, we should have $\tilde{e}'(x) \in \{\mathrm{Spe}, \mathrm{EGTcopy}\}$ in the first case, or $\tilde{e}'(x) \in \{\mathrm{Spe}, \mathrm{Dup}\}$ in the second case.
Assume $\tilde{e}'(x) = \mathrm{Spe}$. From Lemma 1, as $\langle R', \tilde{s}', \tilde{b}', \tilde{e}' \rangle$ is a reconciliation of minimum cost, $\tilde{s}'(x) = \mathrm{lca}_S(s(\mathcal{L}(T[x])))$, and as $x$ is a speciation node in $R'$, one of $v_l$ and $v_r$ should be mapped to $\tilde{s}(x)_l$ and the other to $\tilde{s}(x)_r$. Assume w.l.o.g. that $\tilde{s}'(v_l) = \tilde{s}(x)_l$ and $\tilde{s}'(v_r) = \tilde{s}(x)_r$. Now, as $x$ is a duplication node in $R_{DL}$, then $\tilde{s}(x_l) = \tilde{s}(x)$ or $\tilde{s}(x_r) = \tilde{s}(x)$. Assume w.l.o.g. that $\tilde{s}(x_l) = \tilde{s}(x)$. As $x_l$ is a node of the subtree of $R'$ rooted at $v_l$, by definition of a reconciliation, $\tilde{s}'(x_l)$ should be a descendant of $\tilde{s}(v_l)$, which is not the case as $\tilde{s}'(v_l) = \tilde{s}(x)_l$ is rather a strict descendant of $\tilde{s}(x) = \tilde{s}(x_l) = \tilde{s}'(x_l)$. Therefore, $x$ cannot be a speciation node in $\langle R', \tilde{s}', \tilde{b}', \tilde{e}' \rangle$. We deduce that $\tilde{e}'(x) \in \{\mathrm{Dup}, \mathrm{EGTcopy}\}$.
Now assume that $\tilde{b}(x_l) \neq \tilde{b}(x)$ or $\tilde{b}(x_r) \neq \tilde{b}(x)$. In this case, the algorithm puts $\tilde{e}(x) = \mathrm{EGTcopy}$ and, as $x$ is not a speciation, it should be a duplication node in $\langle R', \tilde{s}', \tilde{b}', \tilde{e}' \rangle$. But then a unary EGTcut node $v$ should be present in one of the two paths from $x$ to $x_l$ or from $x$ to $x_r$ in $R'$, contradicting the fact that $\langle R', \tilde{s}', \tilde{b}', \tilde{e}' \rangle$ is a reconciliation of minimum cost, since labeling $x$ as an EGTcopy node and removing $v$ would reduce the cost of the reconciliation by one.
Finally, assume that $\tilde{b}(x_l) = \tilde{b}(x)$ and $\tilde{b}(x_r) = \tilde{b}(x)$. In this case, the algorithm puts $\tilde{e}(x) = \mathrm{Dup}$ and, as $x$ is not a speciation, it should be an EGTcopy node in $\langle R', \tilde{s}', \tilde{b}', \tilde{e}' \rangle$, which implies, by definition of an EGTcopy event, that one of the two children $y$ of $x$ in $R'$ is such that $\tilde{b}(y) \neq \tilde{b}(x)$. Now, as $\tilde{b}(x) = \tilde{b}(x_l) = \tilde{b}(x_r)$, one unary EGTcut node $v$ should change the $\tilde{b}$ labeling of $y$ to the $\tilde{b}$ labeling of its descendant in $\{x_l, x_r\}$. But then relabeling $x$ as a duplication node would allow removing $v$ and thus reducing the cost of the reconciliation by one, contradicting the fact that $\langle R', \tilde{s}', \tilde{b}', \tilde{e}' \rangle$ is a reconciliation of minimum cost.
2. If $\tilde{e}_{DL}(x) = \mathrm{Spe}$, then from Algorithm 1, $\tilde{e}(x) = \mathrm{Spe}$. As $\tilde{e}'(x) \neq \tilde{e}(x)$, we should have $\tilde{e}'(x) = \mathrm{Dup}$ or $\tilde{e}'(x) = \mathrm{EGTcopy}$. In both cases, $\tilde{s}(v_l) = \tilde{s}(v_r) = \tilde{s}(x)$. This implies that $x_l \neq v_l$ and $x_r \neq v_r$, and thus $v_l$ and $v_r$ are grafted because of losses. Since $R'$ uses the LCA-mapping by Lemma 1, we can remove $v_l$, $v_r$ and their corresponding grafted loss leaves and make $x$ a speciation, while preserving a valid reconciliation. This saves a cost of three (two losses and a Dup or EGTcopy event). In the worst case, we had $\tilde{e}'(x) = \mathrm{EGTcopy}$, in which case we can add an EGTcut event on the appropriate branch to enforce the same switch.
Thus replacing the Dup or EGTcopy label of $x$ by a speciation reduces the cost of $R'$ by at least two, contradicting the fact that $R'$ is a reconciliation of minimum cost.
Since we have the same number of Dup and EGTcopy events as $R'$, it remains to show that we cannot graft fewer nodes than those induced by Algorithm 1. The grafted nodes are either binary nodes corresponding to losses, or EGTcut unary nodes. Suppose $R'$ has fewer grafted nodes than $R$. Then there is an edge $(x, y)$ in $T$ such that the corresponding path $P'_{x,y} = (x, v'_1, v'_2, \ldots, v'_{n'}, y)$ in $R'$ is shorter than the corresponding path $P_{x,y} = (x, v_1, v_2, \ldots, v_n, y)$ in $R$. We consider a lowest edge $(x, y)$ of $T$ verifying this condition, and we assume, without loss of generality, that $y = x_l$. Recall that by Lemma 1, $\tilde{s}(x) = \tilde{s}'(x)$ and $\tilde{s}(y) = \tilde{s}'(y)$.
• If $\tilde{e}_{DL}(x) = \mathrm{Dup}$, then $x$ is a duplication or an EGTcopy node in both $R$ and $R'$. Then, by definition of a reconciliation, $\tilde{s}(v_1) = \tilde{s}(x)$. Moreover, from the fact that $R$ is obtained from $R_{DL}$, Algorithm 1 leads to a path $P_{x,y}$ with as many nodes as the path from $\tilde{s}(x)$ to $\tilde{s}(x_l)$ in $S$ if $x$ is a duplication node, and an additional EGTcut node if $b_T(x) \neq b_T(x_l) = b_T(x_r)$. Moreover, it is easy to see that the number of losses grafted on $(x, y)$ must be equal to the number of nodes on the path from $\tilde{s}(x)$ to $\tilde{s}(y)$, excluding $\tilde{s}(y)$, either in $R$ or $R'$, and that the EGTcut event added by the algorithm cannot be avoided. Thus, the path $P'_{x,y}$ should be at least as long as $P_{x,y}$, contradicting the hypothesis that $P'_{x,y}$ is shorter than $P_{x,y}$.
• If $\tilde{e}_{DL}(x) = \mathrm{Spe}$, then $x$ is a speciation node in both $R$ and $R'$. Then, by definition of a reconciliation, $\tilde{s}(v_1) = \tilde{s}'(v_1) = \tilde{s}(x)_l$. Thus, from the fact that $R$ is obtained from $R_{DL}$, Algorithm 1 leads to a path $P_{v_1,y}$ with as many nodes as the path from $\tilde{s}(x)_l$ to $\tilde{s}(x_l)$ in $S$, with an additional EGTcut node if $\tilde{b}(x) \neq \tilde{b}(x_l)$. Moreover, it is easy to see that no other operation (Spe, Dup, EGTcopy or EGTcut) can allow making fewer losses or avoiding the EGTcut event. Thus, the path $P'_{v_1,y}$ should be at least as long as $P_{v_1,y}$, contradicting the hypothesis that $P'_{x,y}$ is shorter than $P_{x,y}$.
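The relabeling rule used throughout this proof can be summarized in a few lines. This is only a sketch of the rule as it is described here (Algorithm 1 itself is not reproduced in this appendix): it assumes the genome labels of a node and of its two children are already known, and the function name `relabel_node` and the string event labels are hypothetical.

```python
def relabel_node(e_dl, b_x, b_left, b_right):
    """Return (event at x, number of unary EGTcut nodes grafted below x).

    e_dl is the event of x in the DL reconciliation ("Spe" or "Dup");
    b_x, b_left, b_right are 0 (mitochondrion) or 1 (nucleus).
    """
    # Number of child edges on which the genome label changes.
    switches = int(b_left != b_x) + int(b_right != b_x)
    if e_dl == "Spe":
        # A speciation stays a speciation; each switch needs its own EGTcut.
        return "Spe", switches
    if e_dl == "Dup":
        if switches == 0:
            return "Dup", 0
        # The EGTcopy explains one switch; a second switch needs an EGTcut.
        return "EGTcopy", switches - 1
    raise ValueError(f"unexpected DL event: {e_dl!r}")
```

For instance, a DL duplication whose two children both lie in the other genome becomes an EGTcopy followed by one EGTcut, matching the case $b_T(x) \neq b_T(x_l) = b_T(x_r)$ above.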
Proof of Theorem 1. Let us first argue on the complexity. To compute $l_x$ for any $x$, we can preprocess $S$ by labeling each $v \in V(S)$ by its depth (i.e. its distance to the root). Then, $\mathrm{path}(\tilde{s}(x), \tilde{s}(x_l))$ is simply obtained from the difference in depth between $\tilde{s}(x)$ and $\tilde{s}(x_l)$ (because $\tilde{s}(x_l)$ must be a descendant of $\tilde{s}(x)$). This difference can be obtained in constant time, and it follows that $l_x$ can be obtained in $O(1)$. Therefore, each $D[x, b_x]$ entry takes $O(1)$ time to compute. Including the time to compute the preprocessing and the LCA-mapping, the total time of the algorithm is $O(|V(T)| + |V(S)|)$.
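The depth preprocessing can be sketched as follows. This is an illustration under two assumptions that are not stated explicitly in this section: the species tree is given as a hypothetical `children` dictionary, and $l_x$ is read as the total number of nodes on the two paths from $\tilde{s}(x)$ to $\tilde{s}(x_l)$ and to $\tilde{s}(x_r)$, endpoints included (which is what the loss counts $l_x - 4$ and $l_x - 2$ in Claim 1.2 suggest).

```python
def node_depths(children, root):
    """Depth (distance to the root) of every node, in one linear traversal."""
    depth = {root: 0}
    stack = [root]
    while stack:
        v = stack.pop()
        for c in children.get(v, ()):
            depth[c] = depth[v] + 1
            stack.append(c)
    return depth


def path_node_count(depth, ancestor, descendant):
    """Number of nodes on the ancestor-to-descendant path, endpoints included."""
    return depth[descendant] - depth[ancestor] + 1


def path_total(depth, s_x, s_left, s_right):
    """Assumed reading of l_x: nodes on both downward paths from s_x."""
    return (path_node_count(depth, s_x, s_left)
            + path_node_count(depth, s_x, s_right))
```

After the single traversal in `node_depths`, each call to `path_total` is two dictionary lookups and a subtraction per path, hence constant time per gene tree node.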
Let us now argue that the algorithm is correct. Let $x \in V(T)$, let $b_x \in \{0, 1\}$, and let $\mathcal{R} = \langle R, \tilde{s}, \tilde{b}, \tilde{e} \rangle$ be a DLE-Reconciliation of minimum cost between $T[x]$ and $S$ that satisfies $\tilde{b}(x) = b_x$. The proof is by induction on the height of $T[x]$. If $x$ is a leaf, it is easy to see that $D[x, b_x]$ is correct. Assume that $x$ is an internal node with children $x_l$ and $x_r$. We may inductively assume that $D[x_l, b_l]$ and $D[x_r, b_r]$ are computed correctly for $b_l, b_r \in \{0, 1\}$.
Figure: DLE-Reconciliations obtained for MitoCOG0005, MitoCOG0051 and MitoCOG0053 with the EndoRex score settings S1, S2, S3, S4 and S5. The blue part of the tree indicates that the genetic material is located in the mitochondrion, while the red part indicates location in the nucleus. The shape of an internal node represents its associated event, as represented in Figure 1 (circle for a speciation, rectangle for a duplication and triangle for an EGT event). Loss events are not represented. Genes are formatted as follows: [species name]__[gene-encoding location]__[gene id], where 0 indicates a location in the mitochondrion and 1 a location in the nucleus.
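In what follows, let $\mathcal{R}_l = \langle R_l, \tilde{s}_l, \tilde{b}_l, \tilde{e}_l \rangle$ be the reconciliation between $T[x_l]$ and $S$ obtained by taking $R[x_l]$, and restricting $\tilde{s}$, $\tilde{b}$ and $\tilde{e}$ to $V(R[x_l])$. Similarly, let $\mathcal{R}_r$ be the reconciliation of $T[x_r]$ with $S$ obtained by taking $R[x_r]$ and restricting $\tilde{s}$, $\tilde{b}$ and $\tilde{e}$ to $V(R[x_r])$.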
We show two useful claims, the first being that these sub-reconciliations must be optimal with respect to their subtrees.
Claim 1.1. $c(\mathcal{R}_l) = D[x_l, \tilde{b}(x_l)]$ and $c(\mathcal{R}_r) = D[x_r, \tilde{b}(x_r)]$.
Proof. By induction and by the definition of $D$, we have $D[x_l, \tilde{b}(x_l)] \leq c(\mathcal{R}_l)$. Moreover, in $\mathcal{R}$ we may replace the $R[x_l]$ subtree by the tree of a reconciliation of $T[x_l]$ of cost $D[x_l, \tilde{b}(x_l)]$, using that reconciliation's own mappings for its vertices. Since it maps $x_l$ to the same species $\tilde{s}(x_l)$ (by Lemma 1) and to the same genome $\tilde{b}(x_l)$, all conditions of a valid reconciliation are met after such a replacement. Furthermore, no additional loss, EGTcopy or EGTcut is required on the path from $x$ to $x_l$. If $D[x_l, \tilde{b}(x_l)] < c(\mathcal{R}_l)$ held, this transformation would yield a lower cost reconciliation and contradict the optimality of $\mathcal{R}$. Therefore, $D[x_l, \tilde{b}(x_l)] \geq c(\mathcal{R}_l)$. It follows that $D[x_l, \tilde{b}(x_l)] = c(\mathcal{R}_l)$. By a symmetric argument, $D[x_r, \tilde{b}(x_r)] = c(\mathcal{R}_r)$. □
Claim 1.2. If $\tilde{e}(x) = \mathrm{Spe}$, then there are at least $l_x - 4$ losses grafted on the $(x, x_l)$ and $(x, x_r)$ branches, and otherwise, there are at least $l_x - 2$ such grafted losses.
Proof. If $\tilde{e}(x) = \mathrm{Spe}$, in $R$ there must be a loss grafted on the $(x, x_l)$ (respectively $(x, x_r)$) branch for each node of $\mathrm{path}(\tilde{s}(x), \tilde{s}(x_l))$ (respectively $\mathrm{path}(\tilde{s}(x), \tilde{s}(x_r))$), excluding $\tilde{s}(x)$ and $\tilde{s}(x_l)$ (respectively $\tilde{s}(x_r)$). The number of such losses is $l_x - 4$, inducing a cost of $k(l_x - 4)$. If $\tilde{e}(x) \in \{\mathrm{Dup}, \mathrm{EGTcopy}\}$, the required losses are the same, except that we do not exclude $\tilde{s}(x)$ from both paths, and thus $l_x - 2$ losses are required for a cost of $k(l_x - 2)$. □
We now argue that $D[x, b_x] \leq c(\mathcal{R})$. First assume that $\tilde{e}(x) \in \{\mathrm{Spe}, \mathrm{Dup}\}$. We then consider the four possible $\tilde{b}$ labelings of $x_l$ and $x_r$.
• If $\tilde{b}(x) = \tilde{b}(x_l) = \tilde{b}(x_r)$, then no cost other than the losses is required on the $(x, x_l)$ and $(x, x_r)$ branches. Thus using claims 1.1 and 1.2,
$$c(\mathcal{R}) \geq \begin{cases} k(l_x - 4) + c(\mathcal{R}_l) + c(\mathcal{R}_r) & \text{if } \tilde{e}(x) = \mathrm{Spe} \\ d + k(l_x - 2) + c(\mathcal{R}_l) + c(\mathcal{R}_r) & \text{if } \tilde{e}(x) = \mathrm{Dup} \end{cases}$$
Since for both $\tilde{e}(x) \in \{\mathrm{Spe}, \mathrm{Dup}\}$, $D[x, b_x, \tilde{e}(x)]$ adds the losses, plus the minimum of $D[x', b_x]$ and $q^*_{b_x} + D[x', 1 - b_x]$ for each child $x' \in \{x_l, x_r\}$, we see that $D[x, b_x] \leq D[x, b_x, \tilde{e}(x)] \leq c(\mathcal{R})$.
• If $\tilde{b}(x) = \tilde{b}(x_l)$ and $\tilde{b}(x) = 1 - \tilde{b}(x_r)$, then no additional cost is required on the $(x, x_l)$ branch, but a switch is required on $(x, x_r)$. The minimum possible cost of such a switch is $q^*_{b_x}$, and thus using the two claims as in the previous case (we omit the step replacing $c(\mathcal{R}_l)$ by $D[x_l, b_x]$ and $c(\mathcal{R}_r)$ by $D[x_r, 1 - b_x]$, which is implicit by claim 1.1), if $\tilde{e}(x) = \mathrm{Spe}$, we have
$$c(\mathcal{R}) \geq k(l_x - 4) + D[x_l, b_x] + q^*_{b_x} + D[x_r, 1 - b_x]$$
and if $\tilde{e}(x) = \mathrm{Dup}$, we have
$$c(\mathcal{R}) \geq d + k(l_x - 2) + D[x_l, b_x] + q^*_{b_x} + D[x_r, 1 - b_x]$$
Again, the above expressions are considered by the minimization of $D[x, b_x, \tilde{e}(x)]$, and so $D[x, b_x] \leq D[x, b_x, \tilde{e}(x)] \leq c(\mathcal{R})$.
• If $\tilde{b}(x) = 1 - \tilde{b}(x_l)$ and $\tilde{b}(x) = \tilde{b}(x_r)$, this case is symmetric to the previous one.
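• If $\tilde{b}(x) = 1 - \tilde{b}(x_l)$ and $\tilde{b}(x) = 1 - \tilde{b}(x_r)$, then a switch with host $b_x$ is needed on both branches $(x, x_l)$ and $(x, x_r)$. Thus, if $\tilde{e}(x) = \mathrm{Spe}$, we have
$$c(\mathcal{R}) \geq k(l_x - 4) + q^*_{b_x} + D[x_l, 1 - b_x] + q^*_{b_x} + D[x_r, 1 - b_x]$$
and if $\tilde{e}(x) = \mathrm{Dup}$, we have
$$c(\mathcal{R}) \geq d + k(l_x - 2) + q^*_{b_x} + D[x_l, 1 - b_x] + q^*_{b_x} + D[x_r, 1 - b_x]$$
Again, these are considered in $D[x, b_x, \tilde{e}(x)]$, and we get $D[x, b_x] \leq D[x, b_x, \tilde{e}(x)] \leq c(\mathcal{R})$.
In all cases, $D[x, b_x] \leq c(\mathcal{R})$. It remains to show that this holds for $\tilde{e}(x) = \mathrm{EGTcopy}$. In this case, a cost of $s_{b_x}$ must be counted for the $x$ node, plus the cost for $l_x - 2$ losses by claim 1.2. Next, we consider all values of $\tilde{b}(x_l)$ and $\tilde{b}(x_r)$.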
• If $\tilde{b}(x_l) \neq \tilde{b}(x_r)$, then as we argued
$$c(\mathcal{R}) \geq s_{b_x} + k(l_x - 2) + c(\mathcal{R}_l) + c(\mathcal{R}_r) = s_{b_x} + k(l_x - 2) + D[x_l, \tilde{b}(x_l)] + D[x_r, \tilde{b}(x_r)]$$
The latter expression is among the expressions that $D[x, b_x, \mathrm{EGTcopy}]$ minimizes and thus $D[x, b_x] \leq D[x, b_x, \mathrm{EGTcopy}] \leq c(\mathcal{R})$.
• If $b_x = \tilde{b}(x_l) = \tilde{b}(x_r)$, then since $x$ is an EGTcopy event, one of the $(x, x_l)$ or $(x, x_r)$ branches must switch to $1 - b_x$, then switch back to $b_x$, implying an EGTcut from $1 - b_x$ to $b_x$ of cost $q^*_{1 - b_x}$. In this situation,
$$c(\mathcal{R}) \geq s_{b_x} + k(l_x - 2) + q^*_{1 - b_x} + D[x_l, b_x] + D[x_r, b_x]$$
which is considered among the expressions minimized by $D[x, b_x, \mathrm{EGTcopy}]$. Again, $D[x, b_x] \leq D[x, b_x, \mathrm{EGTcopy}] \leq c(\mathcal{R})$.
• If $b_x = 1 - \tilde{b}(x_l) = 1 - \tilde{b}(x_r)$, then one of the $(x, x_l)$ or $(x, x_r)$ branches stays in $b_x$, and thus must switch to $1 - b_x$ for a cost of $q^*_{b_x}$. In this situation,
$$c(\mathcal{R}) \geq s_{b_x} + k(l_x - 2) + q^*_{b_x} + D[x_l, 1 - b_x] + D[x_r, 1 - b_x]$$
which is considered among the expressions minimized by $D[x, b_x, \mathrm{EGTcopy}]$. Again, $D[x, b_x] \leq D[x, b_x, \mathrm{EGTcopy}] \leq c(\mathcal{R})$.
In every possible case, $D[x, b_x] \leq c(\mathcal{R})$. We must now prove the complementary bound, i.e. that $D[x, b_x] \geq c(\mathcal{R})$.
Let $e \in \{\mathrm{Spe}, \mathrm{Dup}, \mathrm{EGTcopy}\}$ be such that $D[x, b_x] = D[x, b_x, e]$. If $e = \mathrm{Spe}$, the expression $D[x, b_x, \mathrm{Spe}]$ corresponds to making $x$ a speciation (which is possible since we check that neither $\tilde{s}(x) = \tilde{s}(x_l)$ nor $\tilde{s}(x) = \tilde{s}(x_r)$ holds) and adding the minimum number of mandatory losses on $(x, x_l)$ and $(x, x_r)$. Let $b_l \in \{0, 1\}$ be the label attaining $\min(D[x_l, b_x], q^*_{b_x} + D[x_l, 1 - b_x])$, and define $b_r$ for $x_r$ analogously. Consider the reconciliation $\mathcal{R}'$ in which $x$ is a speciation, on which we graft the $l_x - 4$ mandatory losses on $(x, x_l)$ and $(x, x_r)$ and then, for each of $b_l$ or $b_r$ that differs from $b_x$, add an EGTcut on the corresponding branch. Then, for the $T[x_l]$ subtree, take an optimal reconciliation $\mathcal{R}_l$ for $T[x_l]$, and for the $T[x_r]$ subtree, take an optimal reconciliation $\mathcal{R}_r$ for $T[x_r]$. By induction, $\mathcal{R}_l$ and $\mathcal{R}_r$ are of costs $D[x_l, b_l]$ and $D[x_r, b_r]$ respectively. Since all optimal reconciliations use the LCA-mapping, such a reconciliation is valid and its cost is as defined in $D[x, b_x, \mathrm{Spe}]$. It follows that $D[x, b_x, \mathrm{Spe}] = c(\mathcal{R}') \geq c(\mathcal{R})$ (the latter inequality owing to the optimality of $\mathcal{R}$).
If $e = \mathrm{Dup}$, the argument is exactly the same, except that to construct $\mathcal{R}'$, we make $x$ a duplication and add $l_x - 2$ losses instead. Finally, assume that $e = \mathrm{EGTcopy}$. It is not hard to see that each expression that $D[x, b_x, \mathrm{EGTcopy}]$ may choose when minimizing corresponds to a valid reconciliation. Indeed, consider the reconciliation $\mathcal{R}'$ where $\tilde{e}(x) = \mathrm{EGTcopy}$ for a cost of $s_{b_x}$. We add $l_x - 2$ mandatory losses on the $(x, x_l)$ and $(x, x_r)$ branches. Then, the first two cases of the minimization in $D[x, b_x, \mathrm{EGTcopy}]$ correspond to having no additional switch needed, and hence we can use the optimal reconciliations for $T[x_l]$ and $T[x_r]$. The third case corresponds to having both $x_l$ and $x_r$ mapped to $b_x$, in which case we can choose to apply the EGTcopy on $(x, x_l)$, but need to switch back for a cost of $q^*_{1 - b_x}$. The last case corresponds to having both $x_l$ and $x_r$ mapped to $1 - b_x$, in which case the EGTcopy applies one switch, and we add an EGTcut for the other switch of cost $q^*_{b_x}$. Since each possible case represents the cost of a valid reconciliation $\mathcal{R}'$, we get $D[x, b_x, \mathrm{EGTcopy}] = c(\mathcal{R}') \geq c(\mathcal{R})$. Thus for every possible value of $e$, we have $D[x, b_x] = D[x, b_x, e] \geq c(\mathcal{R})$.
To conclude, the two complementary bounds show that $D[x, b_x] = c(\mathcal{R})$. □
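The recurrence that this proof walks through can be sketched as a small bottom-up computation. The code below is our reading of the expressions appearing in the proof, not the EndoRex implementation. Several points are assumptions: the switch costs $q^*_0, q^*_1$ are passed in as `q_star` because their definition is not given in this section; a leaf is assumed to cost 0 in its own genome and infinity in the other; $l_x$ is taken as the number of nodes on both species tree paths, endpoints included; and all data structures and names are hypothetical.

```python
import math

INF = math.inf


def dle_table(children, root, s_map, b_leaf, depth, k, d, s, q_star):
    """Fill D[x, b] for every gene tree node x and genome b in {0, 1}.

    children: internal gene node -> (left, right); s_map: gene node -> species
    (LCA-mapping, assumed precomputed); b_leaf: leaf -> 0/1; depth: species ->
    depth; s and q_star: pairs of EGTcopy and switch costs indexed by genome.
    """
    D = {}

    def solve(x):
        if x not in children:  # leaf: assumed base case
            for b in (0, 1):
                D[x, b] = 0.0 if b_leaf[x] == b else INF
            return
        xl, xr = children[x]
        solve(xl)
        solve(xr)
        sx = s_map[x]
        # Assumed l_x: nodes on both species paths, endpoints included.
        lx = (depth[s_map[xl]] - depth[sx] + 1) + (depth[s_map[xr]] - depth[sx] + 1)
        can_spe = sx != s_map[xl] and sx != s_map[xr]
        for b in (0, 1):
            # Each child either keeps genome b or is reached through a switch.
            below = sum(min(D[c, b], q_star[b] + D[c, 1 - b]) for c in (xl, xr))
            spe = k * (lx - 4) + below if can_spe else INF
            dup = d + k * (lx - 2) + below
            copy = s[b] + k * (lx - 2) + min(
                D[xl, b] + D[xr, 1 - b],
                D[xl, 1 - b] + D[xr, b],
                q_star[1 - b] + D[xl, b] + D[xr, b],
                q_star[b] + D[xl, 1 - b] + D[xr, 1 - b],
            )
            D[x, b] = min(spe, dup, copy)

    solve(root)
    return D
```

On a two-leaf gene tree whose leaves sit in sister species but in different genomes, this sketch returns a cost of one switch under unit costs, which matches the intuition that a single EGTcut explains the data. The recursion is used for brevity; a deep gene tree would call for an explicit post-order traversal.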