Improved gene tree error correction in the presence of horizontal gene transfer

Motivation: The accurate inference of gene trees is a necessary step in many evolutionary studies. Although the problem of accurate gene tree inference has received considerable attention, most existing methods are only applicable to gene families unaffected by horizontal gene transfer. As a result, the accurate inference of gene trees affected by horizontal gene transfer remains a largely unaddressed problem. Results: In this study, we introduce a new and highly effective method for gene tree error correction in the presence of horizontal gene transfer. Our method efficiently models horizontal gene transfers, gene duplications and losses, and uses a statistical hypothesis testing framework [Shimodaira–Hasegawa (SH) test] to balance sequence likelihood with topological information from a known species tree. Using a thorough simulation study, we show that existing phylogenetic methods yield inaccurate gene trees when applied to horizontally transferred gene families and that our method dramatically improves gene tree accuracy. We apply our method to a dataset of 11 cyanobacterial species and demonstrate the large impact of gene tree accuracy on downstream evolutionary analyses. Availability and implementation: An implementation of our method is available at http://compbio.mit.edu/treefix-dtl/ Contact: mukul@engr.uconn.edu or manoli@mit.edu Supplementary information: Supplementary data are available at Bioinformatics online.

Note that the actual implementation of TreeFix-DTL generates the proposals and implements the local search step in a particular way to avoid getting caught in local minima. Specifically, during each search iteration, we start with the current gene tree and propose a new tree by performing a random NNI or SPR rearrangement; this proposal is always accepted if it is statistically equivalent to the input gene tree and has lower reconciliation cost than the current optimal gene tree, and accepted with some predefined probability otherwise. The next proposal is then generated based on the, potentially new, current gene tree.
Complexity analysis. Let s denote the number of search iterations performed by the algorithm. Each of these iterations involves evaluating a constant number (100 in the current implementation) of tree topologies. The time spent on each tree is dominated by two additive components. First, the time complexity of performing the SH-test and of computing the likelihood for a given tree topology. And second, the time required to perform DTL-reconciliation. Let m and n denote the sizes of the gene tree and species tree, respectively, and let a denote the alignment length. In our method, the calculations for the first component are performed heuristically by RAxML. We denote this cost as r(a, m), since it depends on the alignment length and on the size of the input gene tree. The second component has a time complexity of O(mn). The total time complexity of our algorithm is thus O(s(r(a, m) + mn)).

S.2 Choice of Parameters for Simulated Datasets
The low-DTL, medium-DTL and high-DTL gene trees had, on average, 52.3, 70.4, and 91.3 leaf nodes, 1.2, 2.8, and 5.0 duplications, 2.2, 5.5, and 9.9 transfers, and 2.1, 2.3, and 2.9 losses, respectively (Table S2). To generate these datasets using the probabilistic model of gene duplication, transfer, and loss (Tofigh, 2009;Tofigh et al., 2011), we used the duplication, transfer, and loss rates given in Table S1. These rates represent the probability of a particular event type occurring per gene lineage per unit branch length on the species tree, and our 50-taxon species trees had an average branch length of 11.24 per tree (and 0.115 per edge). The ratio of duplications, transfers, and losses that we used was based on our analysis of a 4736 gene tree, 100 species dataset from David and Alm (2011), which consists of predominantly prokaryotic species sampled broadly from across the tree of life. In general, we chose the duplication, transfer, and loss rates for which the reconstructed RAxML trees yielded duplication, transfer, and loss counts in a ratio roughly similar to those observed on that dataset. Specifically, the reconstructed RAxML trees for our low-DTL, medium-DTL, and high-DTL datasets showed, on average, 1. 25, 2.91, and 4.89 duplications, 5.88, 10.04, and 15.49 transfers, and 4.88, 5.86, and 7.10 losses, respectively (see also the actual number of implanted events as shown in Table S2). Likewise, the choice of 173 for the amino acid alignment length is based on an analysis of the same biological dataset, where we observed that the median alignment length was 173. The choice of the other alignment length, 333, is motivated by the fact that the typical prokaryotic gene length is approximately 1000 base pairs (Koonin and Wolf, 2008). Our choice of mutation rates is meant to span the range from slow evolving (or less diverged) gene families to fast evolving (or highly diverged) gene families. The average branch lengths for our low-DTL datasets, with mutation rates 1, 3, 5, and 10, were 0.11, 0.33, 0.55, and 1.1, respectively (in terms of average number of mutations per site). The corresponding mutation rates for the medium-DTL datasets were 0.10, 0.3, 0.5, and 1.0, respectively, and for the high-DTL datasets they were 0.095, 0.29, 0.48, and 0.95, respectively. The average branch length in the biological dataset was 0.264.

S.3 Choice of Parameters for RAxML, NOTUNG, TreeFix, MowgliNNI, and AnGST
For reconstructing gene trees using RAxML we used the following command: raxmlHPC -f a -x 12345 -p 12345 -s input.fasta -m PROTGAMMAJTT -#100 NOTUNG and MowgliNNI require the input gene tree to be labeled with bootstrap supports at each edge. We inferred these bootstrap support values based on the 100 rapid bootstraps from the RAxML runs. Both NOTUNG and MowgliNNI also require a bootstrap cutoff percentage which specifies the edges in the gene tree can be modified; for NOTUNG we used the default value of 90% of the maximum bootstrap, and for MowgliNNI we used a cutoff of 80% based on the author recommendation from Nguyen et al. (2012). In addition, MowgliNNI requires that the input gene tree be rooted and that the species tree be fully dated. Our simulated species trees were fully dated and were used as input for MowgliNNI, and we ran MowgliNNI on all possible rootings of the input gene tree. AnGST requires as input a set of bootstrap replicates for the gene tree. Since, AnGST does not specify any default settings for the number of bootstrapped gene trees to use, we tried two settings: 10 bootstraps as previously used by the authors of AnGST, and a more natural value of 100 bootstraps more likely to be used by most users. We discovered that performance was significantly improved by using 100 bootstraps instead of 10 (results not shown). Thus, we used the 100 rapid bootstraps from RAxML as input to AnGST. TreeFix was run using a thorough search setting ("long" version as defined by Wu et al. (2013)), that corresponds to the (default) search setting used for TreeFix-DTL.

S.4 Scalability and Speed
To study the scalability and performance of the methods on larger datasets (with more taxa), we created datasets with 100-and 200-taxon species trees using the same methodology we used to create the 50-taxon datasets. We created these larger datasets for medium-DTL, mutation rates 1 and 5, and sequence length 333. The gene trees for the 100-taxon datasets had, on average, 144.3 leaf nodes, 5.3 duplications, 11.3 transfers, and 4.3 losses, while the 200-taxon datasets had, on average, 290.7 leaf nodes, 10.1 duplications, 21.3 transfers, and 9.1 losses per gene tree. We observed that the error rates of the TreeFix-DTL trees on these 100-and 200-taxon datasets are generally similar to those observed on the corresponding 50-taxon datasets (Supplementary Figure S2). This suggests that the performance of the method does not deteriorate as the number of taxa in the input trees increases. Furthermore, TreeFix-DTL is fast enough to be easily applied to gene trees and species trees with hundreds of leaves (Supplementary Table S1). For instance, on the 100-and 200-taxon datasets, TreeFix-DTL required 13.8 and 43.5 hours per tree, respectively, on average (including the time to build the initial RAxML tree), when executed on a compute cluster with each node consisting of an 800 MHz AMD Opteron processor and 4 GB of RAM. This compares favorably to the 4.1 and 15.0 hours of average runtime required to build just the initial RAxML trees themselves (using the thorough search settings as previously described). Thus, gene tree inference using TreeFix-DTL is only about three times as slow as doing a thorough sequence-only reconstruction using RAxML. AnGST is very efficient in general (assuming the bootstrap trees have already been computed), especially for the smaller (50-and 100-taxon) input instances; but, due to excessive memory requirements, we found it hard to run AnGST on datasets with more than 200 taxa (on a computer with 4 GB of RAM) (Table S1). Runtimes for MowgliNNI were roughly twice as high as those for TreeFix-DTL.
TreeFix-DTL relies on a local search strategy to find more accurate gene trees, and its performance depends on the number of local search steps allowed during the search. By default, TreeFix-DTL executes 1000 local search steps per run, providing a good trade-off between running time and accuracy for a wide range of tree sizes. To study the impact of using more local search steps, we also ran TreeFix-DTL with 5000 local search steps per run on the 100-and 200-taxon datasets (results not shown). As expected, we observed that accuracy improved with the use of more local search steps, with the more exhaustive version having, on average, a 15% smaller error-rate than the default parameterization of TreeFix-DTL. This additional increase in accuracy, however, comes at the cost of a five fold increase in running time. Nonetheless, the number of local search steps can be increased to obtain even better accuracy whenever accuracy is paramount, or when the number of leaves in the gene trees or species trees involved exceeds a few hundred.

S.5 Additional Experiments
Robustness to duplication, transfer, and loss ratios. Since the datasets considered in the basic experimental setup all have more transfers than duplications (as learnt from the biological dataset from David and Alm (2011), we also tested the performance of AnGST and TreeFix-DTL on datasets in which there were more duplications than transfers. The idea is to test if the performance of TreeFix-DTL depends on the ratio of events in the dataset. Specifically, we created new gene trees that had, on average, 5.2 duplications, 4.7 transfers, and 3.9 losses. We created these datasets for mutation rates 1 and 5, and sequence length 333. We observed that TreeFix-DTL is just as effective at inferring gene trees on these datasets as in our previous experiments, decreasing the error rate of the RAxML trees by 70.5% (Supplementary Figure S4A).
Performance on short genes (alignment length 75 amino acids). We observed that over 10% of the 4736 gene trees in the biological dataset from David and Alm (2011) had a multiple sequence alignment length of less than 75 amino acids. Thus, to test the ability of the methods to infer accurate gene trees on short alignments, we created datasets, using the 50-taxon species trees and the same basic experimental setup described earlier but with sequence length 75. We created these datasets for medium-DTL and mutation rates 1 and 5. On these datasets with very short alignments, the error rates of the inferred gene trees were, unsurprisingly, substantially higher than for gene trees inferred using the length 173 or 333 alignments (Supplementary Figure S4B). For instance, for the dataset with mutation rate 1, RAxML, AnGST, and TreeFix-DTL had NRFD of 0.185, 0.075, and 0.064, respectively. In spite of these higher absolute error rates, the average error rate for the TreeFix-DTL trees (8.2%) is much smaller than the average error rate for the RAxML trees (21.2%).

S.6 Event Recovery
We used the strictest definition of an event in our analysis; that is, for a proposed event to be correct, it must be inferred at the correct location in both the gene tree and species tree. In particular, we require that • for duplications, the same gene duplicates in the same species. That is, two duplications are identical if the duplication nodes yield the same children subtrees and the duplication nodes are mapped to the same species.
• for transfers, the same gene transfers from the same species. That is, two transfers are identical if the transfer nodes yield the same children subtrees, the transfer edges are the same, and the transfer node is mapped to the same (donor) species. Note that the recipient species is not tracked because our simulator did not retain this information.
• For losses, the same gene is lost in the same species. That is, two losses are identical if they occur along the same branch of the gene tree (this branch is required to exist, else the proposed loss is incorrect), and the losses occur in the same species.
For each gene tree, we compared the estimated events to the true events (known through simulation), applying the above definitions to classify each estimated event as correct or incorrect. For each event type (duplication, transfer, loss), we then divided the number of correct events (of that type) by either the total number of true events (to determine sensitivity) or by the total number of estimated events (to determine precision).
We point out that, when multiple optimal DTL reconciliations existed, event inference accuracy was based on a single solution selected at random from the set of multiple optimal solutions. Also, since RANGER-DTL seeks optimal, but not necessarily time-consistent, DTL reconciliations, it is possible that some of the recovered reconciliations are time-inconsistent. However, note that neither random optima nor time-inconsistency are expected to bias our results one way or another since our results are averaged over 2400 50-taxon gene tree/species tree pairs. Finally, we note that applications exist for which relaxed event definitions may be more appropriate. For example, to differentiate between orthologs, paralogs, and xenologs, we are only interested in the gene tree topology and the location of (respectively, speciation, duplication, and transfer) events in the gene tree. That is, the species mappings can be ignored.

S.7 Analysis of Cyanobacterial Gene Families
To analyze the cyanobacteria gene families, we obtained a species tree, multiple sequence alignments, and PHYLIP (Felsenstein, 1989) NJ gene trees from Zhaxybayeva et al. (2006). We then used RAxML and TreeFix-TL to infer gene trees. Finally, we applied RANGER-DTL to the NJ, RAxML, and TreeFix-DTL gene trees to estimate events (and also to reroot the gene trees for the unrooted NJ and RAxML trees). For all programs, we used the same parameters as in our simulation study.
Next, we considered the analysis of Stolzer et al. (2012), which considered Transfer-Loss (TL) and Transfer-Loss-ILS (TLI) reconciliation models, as well as Duplication-Transfer-Loss (DTL) and Duplication-Transfer-Loss-ILS (DTLI) models, and filtered gene families by removing any families that contained either temporally infeasible or conflicting multiple optimal solutions. Using this process, they obtained a set of 314 families in which they compared estimated event counts. By considering only TL and TLI models and using the same filtering criteria, we obtain a set of 769 gene families, or 2.45× that considered by Stolzer et al. (2012). With this larger set of families, applying the TLI reconciliation model to the original NJ trees decreased the number of estimated transfers and losses by only 4.48% and 4.60%, respectively, over the TL model ( Figure S3); this is a far smaller reduction than that reported by Stolzer et al. (2012) (15-18% decrease in duplications + transfers, up to 20% decrease in losses), though we note that the larger differences remain if we relax the filtering criteria (Table S5). With the more accurate TreeFix-DTL gene trees, the similarity between estimated events using the two models is even more dramatic, with the TLI reconciliation model decreasing the inferred number of transfers and losses by only 1.52% and 0.94%, respectively, over the TL model.  Table S1: Parameters used for simulated datasets. The table shows the duplication, transfer, and loss rates used with the probabilistic model of gene tree evolution (Tofigh, 2009;Tofigh et al., 2011) to generate the low-, medium-, high-, and veryHigh-DTL gene trees used in the simulation study.   Table S3: Average runtimes for inferring gene trees. The results for each category are averaged over all datasets generated for that category. RAxML was run with the command line option that produces 100 rapid bootstraps and executes the search heuristic 10 times. AnGST was run with 100 boootstrap trees as input, and TreeFix-DTL was run using its default settings. RAxML and TreeFix-DTL were run on a compute cluster where each node had an 800 MHz AMD Opteron processor with 6 cores and 4 GB of RAM (each run used a single core) and the times reported are CPU time. AnGST was run on a desktop computer with a 3.2 GHz Intel Core i3 processor and 4 GB of RAM and the times reported are wall time. Due to the different hardware used, runtimes for AnGST are not directly comparable to those of RAxML and TreeFix-DTL and are included only for completeness.

Program
Average error rate (NRFD) RAxML 0.088 TreeFix-DTL on true species trees 0.027 AnGST on species trees with 1 NNI 0.036 AnGST on species trees with 3 NNIs 0.054 TreeFix-DTL on species trees with 1 NNI 0.034 TreeFix-DTL on species trees with 3 NNIs 0.053 Table S4: Impact of species tree error. Gene trees are inferred using topologically incorrect species trees (constructed through 1 NNI and 3 NNI operations). The results for each level of species tree error (number of NNI operations) is averaged over the four datasets (50-taxon, medium DTL, mutation rates 1 and 5, sequence lengths 333 and 173) with 100 input instances each.   Figure S1: Error rate of the different methods on simulated datasets of 50 taxa. Error rates in terms of the NRFD are shown for gene trees inferred using RAxML, NOTUNG, TreeFix, MowgliNNI, AnGST, and TreeFix-DTL on the 24 simulated datasets. Each plot shows the results for a specific rate of duplication, transfer and loss (low-, medium-, or high-DTL), a specific sequence alignment length (173 or 333 amino acids), and for all four chosen rates of mutation (Rates 1, 3, 5, and 10).   Figure S4: Error rate on additional simulated datasets of 50 taxa. Error rates in terms of the NRFD for the (A) two simulated datasets with a different ratio of duplication, transfer, and loss, (B) two simulated datasets with sequence length 75, and (C) two simulated datasets with very high rates of duplication, transfer, and loss.

RAxML
AnGST TreeFix-DTL true counts Figure S5: Number of evolutionary events estimated for simulated datasets of 50 taxa. Duplications, transfers, and losses are estimated using DTL-reconciliation. Each plot shows the results for a specific rate of duplication, transfer and loss (low-, medium-, or high-DTL), a specific sequence alignment length (173 or 333 amino acids), and averaged over the four chosen rates of mutation (Rates 1, 3, 5, and 10). The true (implanted) counts are also shown.