Improved inference of tandem domain duplications

Abstract Motivation Protein domain duplications are a major contributor to the functional diversification of protein families. These duplications can occur one at a time through single domain duplications, or as tandem duplications where several consecutive domains are duplicated together as part of a single evolutionary event. Existing methods for inferring domain-level evolutionary events are based on reconciling domain trees with gene trees. While some formulations consider multiple domain duplications, they do not explicitly model tandem duplications; this leads to inaccurate inference of which domains duplicated together over the course of evolution. Results Here, we introduce a reconciliation-based framework that considers the relative positions of domains within extant sequences. We use this information to uncover tandem domain duplications within the evolutionary history of these genes. We devise an integer linear programming approach that solves our problem exactly, and a heuristic approach that works well in practice. We perform extensive simulation studies to demonstrate that our approaches can accurately uncover single and tandem domain duplications, and additionally test our approach on a well-studied orthogroup where lineage-specific domain expansions exhibit varying and complex domain duplication patterns. Availability and implementation Code is available on github at https://github.com/Singh-Lab/TandemDuplications. Supplementary information Supplementary data are available at Bioinformatics online.

2. d uv for each pair of nodes u, v ∈ V (D).We fix d uv = 1 if u ≥ D v and 0 otherwise.That is, d uv encodes ancestry relationships between nodes in the gene tree.
3. e uv for each pair of nodes u, v ∈ V (D).We set e uv = 1 if u, v are eligible to be in the same tandem duplication set according to the eligibility matrix E.
4. dist T (u, v) for comparable nodes u, v ∈ V (T ), which corresponds to the distance in the tree T from node u to node v.We use distances for both G and D.
We encode the mapping and event assignments as a series of binary variables which the ILP optimizes over.Note that mapping variables are fixed for all u ∈ L(D).
1. M ui for each pair of nodes u ∈ V (D) and i ∈ V (G).The variables M ui define the mapping γ in the reconciliation and will be set to 1 if γ(u) = i and 0 otherwise.For u ∈ L(D), for all i ∈ V (G), M ui is fixed according to the input leaf mapping; that is, it is set as 1 if σ(u) = i and 0 otherwise.
2. Y uvij for nodes u, v ∈ V (D), i, j ∈ V (G).Y uvij is an indicator variable, and will be set to 1 if γ(u) = i and γ(v) = j.
3. B u for each node u ∈ I(D).The variable B u will be set to 1 if u is assigned to a tandem duplication set.
4. X uv for pair of nodes u, v ∈ I(D), u = v.X uv will be set to 1 if nodes u and v are part of the same tandem duplication; that is X uv = 1 if there is a tandem duplication set a ∈ ∆ such that u, v ∈ a.

5.
T uk for all nodes u ∈ I(D) and 1 ≤ k ≤ K max , where K max is the maximum allowed size of a tandem duplication.K max can be set to |L(D)/2| and this allows all possible tandem duplication sizes.T uk will be set to 1 if the size of the tandem duplication set that node u belongs to is k, and 0 otherwise.If u is not a member of a tandem duplication set, then ∀k, T uk = 0 Our ILP is shown in Figure S1.Our objective function and constraints are a direct translation of the cost function and constraints given in Section 2.2 of the main paper.

B Proofs
In this section, we provide proofs of the lemmas and theorems in Section 3 of the main paper.
Lemma 1.If P u < P v , and no losses occur to descendants of u, then under assumptions (1) and (2), Pu < Pv .
Proof.This is equivalent to showing that min ui∈L(Du) Pui < min vi∈L(Dv) Pvi .To do this, we examine the positions of u and v's descendants after duplication and loss events.Let u 1 and u 2 be the children of node u, and suppose that P u1 < P u2 .In other words, u 1 is the original domain while u 2 is the duplicated copy.From assumption 1, we see that immediately after duplication, P u1 = P u , and this implies P u1 < P v while P u2 > P u .We see that at least one child of u will occur before v, and by similar logic no child of v will have a position value less than P v .By assumption 2, domains may not swap position, so this will hold throughout the evolutionary history of the sequence of domains.So, if only duplications occur, then min ui∈L(Du) Pui < min vi∈L(Dv) Pvi .Even if losses occur, as long as no descendant of u is lost, the previous statement will hold as a loss in the subtree of v can only increase the value of Pv and losses of other domains cannot swap the relative order of any two domains.
Theorem 1.Let D be a domain tree containing all domains mapping to a single leaf gene.Let {u 1 , . . ., u k } be a set of internal nodes in D corresponding to domains involved in a tandem duplication, and suppose no losses occur among their descendants.Then, under assumptions (1) and (2), {u 1 , . . ., u k } is tandem duplication eligible.
Proof.To show this, we must show that any two nodes u i and u j in the set are pairwise eligible.Suppose w.l.o.g. that P ui < P uj , and let nodes a, b be children of u i with P a < P b and c, d be children of u j with P c < P d .By assumption 1, we know that immediately after the tandem duplication event, P a < P c < P b < P d .By lemma 1, this means that Pa < Pc < Pb < Pd .Therefore, u i and u j are pairwise eligible for all pairs, and {u 1 , . . ., u k } are tandem duplication eligible.
Theorem 2. Let D be a domain tree, and G be the containing gene tree.Let {u 1 , . . ., u k } be a set of internal nodes in D mapped to gene node g in G corresponding to domains involved in a tandem duplication.Suppose there exists some g i ∈ L(G g ) such that no losses occur in the domain subtrees of D[g i ] rooted at u 1 , . . ., u k .Then, under assumptions (1), ( 2) and (3), {u 1 , . . ., u k } is tandem duplication eligible.
Proof.First, note that if g is a leaf node, then Theorem 1 applies and {u 1 , . . ., u k } is tandem duplication eligible.If g is an internal node, then we examine the positions of domains in genes on the path from g to g i .Let u i , u j be any two domain nodes in the set of tandem duplication nodes such that P i < P j .We examine their subtrees in D[g i ].If no losses occur, then only duplication and co-duplication events are present in D[g i ].Because co-duplication events pass one copy of each domain present in a gene to each of its children, they appear as nodes with only a single child in By assumption 3, the order of domains is preserved in co-duplications, so these events can be effectively ignored.Without these events, tandem duplications in D[g i ] reduces to the single gene case, and by Theorem 1, {u 1 , . . ., u k } is tandem duplication eligible.
Theorem 3. Suppose nodes u and v in domain tree D are mapped to gene g and annotated by genes g 1 , . . ., g k , and no losses occur in their subtrees D u and D v .If u, v are marked as duplication nodes by the reconciliation framework and are pairwise eligible in D[g i ] for any g i , then u 1 , u 2 are pairwise eligible in D[g i ] for all g i .
Proof.First note that the children of u (and v) must also be mapped to g.Otherwise, either a co-duplication event would be lower cost or there would be at least one loss between u (v) and one of its children.This means that u, v, and their children u 1 , u 2 , v 1 , and v 2 are present in D[g i ]∀i.Position values for these nodes will be the same across all D[g i ], so if u, v are pairwise eligible in D[g i ] for any g i , then u 1 , u 2 are pairwise eligible in D[g i ] for all g i .

C Simulation Details
Our simulation consists of three components, a gene tree generator, a domain tree generator, and a sequence simulator.For the purpose of this work, we only use the first two components.Following the approach of other phylogenetic simulations [3], we generate gene trees and use these topologies to generate domain trees consistent with those gene level events.Our simulation adds the ability to have tandem duplications of multiple consecutive domains.

C.1 Gene Tree Generation
The inputs to our gene tree generation tool are the height h of the gene tree and the number of leaves n desired in the output topology.We use the createRandomTopology function from the Pyvolve [4] package to create a tree topology with n leaves.Branch lengths are assigned to branch in the gene tree such that the sum of distances on any root to leaf path is equal to h.Branch lengths are assigned top down, starting with the outgoing edges of the root, and ending with edges connected to the leaves of G.In the following, let D(u, v) be the sum of branch lengths of all branches on a path from node u to v, while d T (u, v) (as defined in the main paper) is the topological distance, or number of nodes, on the path from u to v. Let r be the root node of G.For any internal node u, the remaining branch length to be assigned to branches on a path from u to any of its leaf descendants is h − D(u, r).Let u 1 be a child of u, and v * be the child of u 1 that minimizes D(u 1 , v * ).If u 1 is a leaf node, then u 1 = v * .We assign branch (u, u 1 ) a length of h − D(u, r)/(1 + D(u 1 , v * )).After this process, we relax the branch lengths.For each branch we take the length l assigned by the process above and replace it with a value drawn from the normal distribution with mean l and standard deviation l/10.

C.2 Domain Tree Generation
Domain trees are generated with respect to a guest tree topology.We simulate duplication and loss events using a modified birth death process [5].A birth death process starts with a single progenitor node, and domain duplications and losses ("births" and "deaths") occur with rates λ and µ respectively.Along any branch (u, v) with length l of the gene tree, evolutionary events occur to the domains belonging to node v. Distances between evolutionary events are drawn from an exponential distribution with mean λ + µ.Let d be the "remaining distance" left on branch (u, v).Initially, d is set to l.A distance d i is drawn from the exponential distribution, and if d i ≤ d, then an evolutionary event occurs and d is set to d − d i .This process is repeated until at some step d i > d.At this point, all domains present in gene v undergo a co-duplication event, resulting in a bifurcation at each of these domain nodes.The children of these domains are the starting domains initially present in the children of v, and the process repeats until the domain tree covers all genes in the gene tree.
Loss events affected single domains, while duplications could be tandem duplications of multiple consecutive duplications.When a loss event occurred, an existing domain was chosen at random and removed from the domain tree.The parent of this node was also removed to ensure the domain tree remained a full binary tree.When a duplication event occurred, we first chose the size of the tandem duplication such that the probability of a duplication of size k occurring was 1  2 k .This allows for a variety of duplication sizes, while making shorter duplications more likely than longer ones.If a size k was drawn but the total number of duplications was less than k, then all domains on the gene duplicated simultaneously.Through this process, we keep track of the relative position of domains as follows: 1.The initial domain at the root of the gene tree has position 0 2. When the domain present in gene g at position i is lost, the positions of all domains in g with position greater than i are decremented by 1 3. If a tandem duplication of size k occurs in gene g affecting domains with positions i, i + 1, ..., i + k − 1, then one child of a duplicated domain with position j is assigned position j, while the other is assigned position j + k.For any domain that had a position l > i + k − 1 prior to the duplication, we assign it a position of l + k.
4. For any domain with position i involved in a co-duplication event, we assign both of its children the position i These rules are consistent with the tandem duplication assumptions laid out in Section 2.3 of the main paper.

C.3 Parameters Used in our Simulations
To test our methods, we generated 50 gene and domain tree pairs each at event distances of 0.01, 0.025, .05,.075,and 0.1.Our gene trees always had 8 leaves and a mean tree height of 0.3.Thus, the largest event distances resulted in very few domain level events, while the shortest ones frequently resulted in cascades of tandem duplications.For each event distance, we adjusted λ and µ based on the number of domains currently existing at the time of the event.We set these such that the probability of a duplication was higher when there were few domains, and the probabililty of a loss was higher once there were many (See Figure S6).This was done so that cascades of domains were more likely early one, followed by many losses.Due to the way position inference works, this represents the hardest scenario for our methods.
Figure S7 shows the sizes of the domain trees generated by our birth-death process at each event distance.At an event distance of 0.1, the average size of the domain tree is typically under 50 nodes, indicating that most genes contain no more than 2-3 domains total.Losses are rare at these event distances, so we see that very few duplications occur at this event distance.At larger event distances, we see much larger domain trees, indicating a much larger number of domains per gene.Here we expect many tandem duplication events across the tree, and several losses once the number of domains per gene crosses 10. Figure S8 looks specifically at the number of domains mapped to leaf genes.These represent the genes we see today, after evolutionary events have occurred across the gene tree.As we use only full binary trees, the number of leaf domains (representing contemporary domains in currently existing genes) per gene is at least half the total number of domains mapped to that gene.Again, we see that with a high event distance, the number of domains per leaf gene  showing all 18 species used in our analysis, along with the sequence used and the number of filamin domains present in each sequence.

Species
is typically only 1 or 2, while at the highest event distances over 20 domains can exist on a single gene.

D Filamin Dataset
We reanalyzed a filamin dataset from [2] to identify tandem duplications.18 sequences were taken from 18 species, shown in Table 1.A total of 435 domains were found across these sequences.A domain tree was built from these sequences using RAxML [6] with the PROTGAMMAJTT amino acid substitution model, and reconciled against the gene tree given in [2].As we have exactly one sequence from each species, the gene tree they give is identical to the species tree built from these species.This is shown in Figure S5.We take as input a domain tree and gene tree, represented as the red and black trees respectively.Leaf domains must be mapped to the genes they came from, and be annotated with their relative position on the gene with respect to other domains.In the figure, domains labelled X:Y are found in gene X with relative position Y.(b) We infer the position of each internal node in the domain tree based on the leaves in its subtree.For internal nodes whose subtrees contain leaf domains from multiple genes, we assign one position for each gene.Pairwise tandem duplication eligibility between any pair of internal nodes is then determined using the rules in Section 3 of the main paper.(c) In the first step of our heuristic, we create an initial mapping of internal domains to genes using a Lowest Common Ancestor (LCA) mapping.Although it is not guaranteed to be optimal for the TDL Reconciliation problem, in practice we expect very few nodes to be incorrectly mapped in this step.(d) For each gene node, a reduced ILP is run on the domain subtree(s) mapping to that node.If the mapping is correct, then this ILP will correctly identify a minimum cost solution to the TDL Reconciliation problem, including all tandem duplications and losses that occurred in that gene.Here, duplications are marked in green, while speciations/leaves are marked in black.(e) In the last step, we take the mapping previously found in step (c) and look for opportunities to remap tandem duplication sets to ancestral genes when doing so reduces overall cost.This can occur when two duplications can be merged into a single duplication, as in the figure.
Figure S3: Here we show an example of how a loss can result in the incorrect inference of tandem duplications.In row (a), we show an example of losses in domain evolution within a single gene.
The actual evolutionary events are shown in (a1).A single gene with two distinct domain instances (shown as the blue square and green circle) undergoes a tandem duplication event in which both domains are duplicated in tandem.Afterwards, the domain at position 3 is lost.(a2) shows the corresponding domain tree, and (a3) shows the duplication events inferred as green nodes.In this case, due to the loss of the second blue square domain, we no longer have the information required to identify the tandem duplication event, instead inferring that the ancestor of the domains in position 2 and position 3 underwent a single domain duplication.Row (b) shows how information from multiple genes can add robustness against these types of losses to our method.(b1) Here, a gene R initially contains two domains, which duplicate in tandem to result in 4 domains, two of each type.Through a speciation event, two copies of gene R are created, A and B. Afterwards, A independently loses domain 3, while B retains all four domains.(b2) shows the corresponding domain and gene trees in red and black respectively.Based on these trees, (b3) shows the reconciliation as well as the inferred duplication events (in green) and speciation events (in black).The gray box shows the inferred tandem duplication.In this case, because gene B retains the domains from the tandem duplication event, we are able to capture it even though gene A has a loss.Note that if the loss had occurred in gene R (and both genes A and B had inherited only 3 domains), then we would not be able to infer the existence of a tandem duplication event.For each gene/domain tree example at a given event distance, the number of domain nodes per leaf gene is counted.50 examples are present at each distance, and each gene tree has exactly 8 leaves, so the sizes of domain subtrees from 400 genes are shown at each distance.The total number of domains is typically between 1-3 at the highest event distance, while as many as 40 domains may be mapped to a single gene (indicating 20 coexisting domains) at the lowest event distance.

Figure S1 :
Figure S1: ILP Solution to the TDL Reconciliaton Problem

Figure S2 :
Figure S2: Full pipeline.(a)We take as input a domain tree and gene tree, represented as the red and black trees respectively.Leaf domains must be mapped to the genes they came from, and be annotated with their relative position on the gene with respect to other domains.In the figure, domains labelled X:Y are found in gene X with relative position Y.(b) We infer the position of each internal node in the domain tree based on the leaves in its subtree.For internal nodes whose subtrees contain leaf domains from multiple genes, we assign one position for each gene.Pairwise tandem duplication eligibility between any pair of internal nodes is then determined using the rules in Section 3 of the main paper.(c) In the first step of our heuristic, we create an initial mapping of internal domains to genes using a Lowest Common Ancestor (LCA) mapping.Although it is not guaranteed to be optimal for the TDL Reconciliation problem, in practice we expect very few nodes to be incorrectly mapped in this step.(d) For each gene node, a reduced ILP is run on the domain subtree(s) mapping to that node.If the mapping is correct, then this ILP will correctly identify a minimum cost solution to the TDL Reconciliation problem, including all tandem duplications and losses that occurred in that gene.Here, duplications are marked in green, while speciations/leaves are marked in black.(e) In the last step, we take the mapping previously found in step (c) and look for opportunities to remap tandem duplication sets to ancestral genes when doing so reduces overall cost.This can occur when two duplications can be merged into a single duplication, as in the figure.

Figure S4 :
FigureS4: Examples of pairwise eligibility computation on internal domain nodes with 0, 1, or multiple annotations.As in previous examples, each leaf domain is labelled by the gene it occurs in and its relative position on the sequence, separated by a colon.We also provide per-gene annotations for other nodes in the tree when those positions are necessary to compute pairwise tandem duplication eligibility.In all cases, we decide whether nodes X and Y (represented as the black dots in each figure) are pairwise eligible or not.At this stage in the pipeline, we do not know which gene an internal domain will be mapped to, so X and Y are names that do not correspond to a particular gene.(a) In this case, nodes X and Y are both annotated by a single gene, A. We see that the positions of the leaves of X and Y are interleaved, and therefore X and Y are tandem duplication eligible.(b) In this example, nodes X and Y share no leaf genes in common.Therefore, we have no way to determine whether they are tandem duplication eligible, so we assume that they are.(c) In this example, X and Y are annotated by genes A and B. According to gene A, X and Y are not pairwise eligible.On the other hand, they are pairwise eligible according to gene B.This could be due to losses in the domain tree, errors in tree reconstruction, or events such as domain swaps which are not considered by our model.In this case, we are permissive and label X and Y as pairwise eligible because they are according to at least one gene annotation.

Figure S5 :
FigureS5: Gene tree, consisting of one gene each from 18 species, that is used for reconciliation in the Filamin domain example.The number at each internal node represents the number of Filamin A domains inferred to be in that ancestral sequence.

Figure S6 :
FigureS6: Duplication/Loss ratio function used for our duplication and loss probabilities.The probability of a duplication (of any size) occurring in a gene with k domains is shown.With a low number of domains, the probability of a duplication is 0.7.With a high number of domains, the probability of a duplication drops to 0.1.This encourages a series of tandem duplications early on, followed by mostly single domain losses once several duplication events have taken place.

Figure S7 :
FigureS7: Boxplots of domain tree sizes.For each event distance, the corresponding boxplot shows the number of nodes in the domain tree for each example generated.A total of 50 gene/domain tree pairs were generated at each event distance.The size of the domain tree grows roughly linearly with decreasing event distance.

Figure S8 :
Figure S8: Boxplots of the number of domains in the domain subtrees mapped to leaf gene nodes.For each gene/domain tree example at a given event distance, the number of domain nodes per leaf gene is counted.50 examples are present at each distance, and each gene tree has exactly 8 leaves, so the sizes of domain subtrees from 400 genes are shown at each distance.The total number of domains is typically between 1-3 at the highest event distance, while as many as 40 domains may be mapped to a single gene (indicating 20 coexisting domains) at the lowest event distance.