HGTree: database of horizontally transferred genes determined by tree reconciliation

The HGTree database provides putative genome-wide horizontal gene transfer (HGT) information for 2472 completely sequenced prokaryotic genomes. This task is accomplished by reconstructing approximate maximum likelihood phylogenetic trees for each orthologous gene and corresponding 16S rRNA reference species sets and then reconciling the two trees under parsimony framework. The tree reconciliation method is generally considered to be a reliable way to detect HGT events but its practical use has remained limited because the method is computationally intensive and conceptually challenging. In this regard, HGTree (http://hgtree.snu.ac.kr) represents a useful addition to the biological community and enables quick and easy retrieval of information for HGT-acquired genes to better understand microbial taxonomy and evolution. The database is freely available and can be easily scaled and updated to keep pace with the rapid rise in genomic information.


INTRODUCTION
Vertical inheritance refers to the transfer of genetic information from parents to offspring. Vertically inherited genes typically show higher degree of similarity between species that are closely related than those that are distantly related. This aids in reliable recognition of species and understanding their classification and evolution. For example, ribosomal RNA (rRNA) genes have been historically used to determine the taxonomic structure of cellular life (1). However, vertical signal can sometimes be confounded by ac-quisition of genes from other sources such as environment, viruses, or via direct interactions between organisms. Recent advances in genomics have confirmed the existence of 'foreign' genes embedded in cellular genomes. For example, mammalian genomes are enriched with viral-like genetic elements, constituting up to 8% of the human genome (2). Similarly, many microbial genomes possess genes acquired from multiple sources (3). This phenomenon is referred to as horizontal gene transfer (HGT), which is a natural outcome given the numerous ways species interact with each other and occupy common habitats.
HGT allows gain of novel molecular functions and can provide selective evolutionary advantages to species. For example, transfer of antibiotic resistance and virulence factor genes between bacterial species poses significant challenges to human health (4). Similarly, transfer of genes involved in response to heat and cold shock and heavy metal and ultra-violet resistance facilitates bacterial adaptation to certain environments. While HGT is an important force driving the evolution of (especially) microbial organisms (5), it can complicate interpreting the true evolutionary history of species and can lead to erroneous interpretations regarding their classification and community interactions (3). Therefore, it is crucial to distinguish between vertically and horizontally acquired genes in genomes, especially when studying deep evolutionary relationships.
Accurate detection of HGT however remains a computational and conceptual challenge. Existing databases such as HGT-DB (6) and DarkHorse HGT Candidate Resource (7) use genomic signatures (i.e. GC bias, nucleotide composition and codon usage) and implicit phylogenetic methods (i.e. comparing the evolutionary distance inferred from sequence similarity) to detect HGT. Because genomic signatures of transferred genes may lose their 'distinctiveness' over long periods of evolutionary time and tend to be highly similar to host genomes in cases of HGT between very closely related organisms, these methods likely have a high rate of false-positive and negative predictions (8,9). Moreover, GC composition within the same genome may fluctuate considerably for different genomic regions (10,11) and (even) for some vertically inherited genes (e.g. ribosomal proteins) (12). In turn, implicit phylogenetic methods are limited by their reliance on similarity scores and underlying phylogeny. This poses another problem since statistically significant sequence similarity is not necessarily a result of vertical evolution (13,14). Because genes acquired from foreign sources typically do not show congruence to species trees, one way to detect HGT would be to reconcile gene trees against reference species trees. This principle is based on an explicit evolutionary model and is generally considered to be a reliable alternative to detect HGT events (15). However, its practical use has remained limited because reconciling trees is computationally intensive (13) and because tree incongruence can also arise from processes other than HGT (16) (see Discussion).
Here, we introduce HGTree (http://hgtree.snu.ac.kr) that provides putative genome-wide HGT information for 2472 completely sequenced prokaryotic genomes. HGTree defines HGT by comparing the gene tree for each orthologous gene set to the reference species tree. Conflict between gene and species trees is taken as indication of non-vertical evolution. Specifically, different hypotheses regarding the evolution of gene sets are evaluated and only those corresponding to HGT are kept and stored in the database. Results are displayed graphically for quick understanding. The friendly user-interface allows quick retrieval of already processed results for HGT analysis. Currently, three major services are provided: (i) HGT browser to display the molecular functions, gene family and phylogenetic relationships of HGT-acquired genes for all the genomes in the database, (ii) HGT analysis between and within (user selected) genomes and (iii) HGT analysis of user submitted gene and genome sequences. For each service, donor and recipient genomes are also graphically labeled for quick understanding. The database is freely available, does not require registration or login credentials and can be easily scaled and updated to keep pace with the continuous rise in genomic information. Importantly, HGTree represents the most complete existing resource for HGT-related information built on an explicit evolutionary model of tree reconciliation.

Functional annotation
A total of 7 748 306 genes in 2472 genomes were scanned against Clusters of Orthologous Genes (COG) database (20) using HMMER (ver. 3.0) (E-value < 10 −3 ) (8). Protein family level assignments were calculated using the local installation of PfamScan (ver. 1.3) following default parameters (21). RNammer (ver.1.2) (22) was used to detect 16S rRNA sequences in each genome. The set of orthologous genes in each species was mapped to corresponding 16S rRNA sequence and this information was used to determine the conflict between gene and species trees during downstream processing.

Orthology assignment
Ensembl homology prediction pipeline (23) was implemented to define homologous gene sets ( Figure 1A). First, pairwise BLASTP (24) search was conducted on each protein from every genome against the total set of proteins (both self and non-self species). For this step, BLAST hits were required to have alignment coverage of at least 80% for both query and subject as well as stringent E-value cutoff of 10 −6 . Second, a sparse graph was built that described relationships between genes based on BLAST results. Third, homologous clusters were generated using hcluster sg (25) program (ver. 0.5.1) that clusters sequences in an hierarchical manner by considering the mean distance between sequences. Fourth, based on homology information, orthologous gene sets were predicted using a modified version of Mestortho orthology detection algorithm (ver. 2.0) (26) optimized to work with large data sets. To improve precision, we removed following orthologous groups from the analysis: (i) gene sets containing >50% of the total genomes since their inclusion contributed towards greater computational load, (ii) gene sets with less than four operational taxonomic units (OTUs) since it is the minimum requirement to build an un-rooted phylogenetic tree and (iii) gene sets consisting of only one species due to the presence of several typestrains that could not be distinguished by 16S rRNA analysis.

Tree reconstruction
Multiple sequence alignment (MSA) of orthologous gene sets was generated using CLUSTAL Omega (ver.1.2.1) (27) under default settings ( Figure 1A). 16S rRNA sequences extracted from each genome were also aligned in a similar way and then combined into a profile alignment along with 18S rRNA sequence from Saccharomyces cerevisiae. The eukaryotic rRNA sequence was treated as outgroup to root the species tree and was removed once Newick trees were produced. Pair-wise distance matrices were calculated for MSAs of both orthologous gene sets and corresponding 16S rRNA sets. Orthologous gene sets where all pair-wise distances between proteins were close to zero (< 0.0001) were removed, as they do not provide enough information for reliable estimation of phylogenetic relationships. FastTree (ver. 2.0) was used to reconstruct phylogenetic trees for each orthologous gene set and the corresponding species tree (28). FastTree calculates approximate maximum-likelihood (ML) trees by first building a starting neighbor-joining (NJ) tree and then refining it by a combination of minimum evolution and maximum-likelihood approaches (28). It is Nucleic Acids Research, 2016, Vol. 44, Database issue D613 much faster than the standard ML-based programs such as PhyML 3 (29) and RAxML (30) and is optimized to work with large data sets while ensuring high accuracy (28). Species tree was re-rooted by the yeast sequence a posteriori using the Newick Utility (ver. 1.6) (31). The reliability of splits in phylogenetic trees was evaluated by 'local support values' based on Shimodaira-Hasegawa (SH) test (32) similar to 'SH-like local support' values in PhyML 3. RANGER-DTL-U (ver.1.0) (33) was used to detect putative HGT events by reconciling gene trees against rooted 16S rRNA reference species tree and to distinguish HGT events from gene duplication and loss events ( Figure 1A). All HGT events except those between same species were stored along with species and gene information.

Processing of user queries
User submitted sequences are processed in the following manner: (i) Prodigal (ver. 2.6) is used to detect proteincoding genes; (ii) orthologous groups are assigned to predicted genes by searching against already constructed orthologous gene sets using reciprocal-BLAST search ( Figure  1B). Several measures are taken to ensure reliable assignment of orthologous groups to user-provided sequences including alignment coverage of at least 80% between query and subject, stringent E-value threshold of < 10 −6 and enabling soft-masking (18); (iii) orthologous groups that contain user queries form updated orthology sets; (iv) in parallel, Rnammer is used to detect 16S rRNA sequences; (v) user-provided 16S rRNA sequences are searched against the 16S rRNA database constructed previously from 2472 prokaryotic genomes to predict the taxonomic structure of user-provided data; (vi) MSA, distance matrices, filtering and phylogenetic trees are calculated as described above using the updated orthology set and 16S rRNA information. However, users may opt to select a different outgroup taxon for rooting the species tree depending upon their preferences. This can be accomplished by providing the new 16S (18S) rRNA sequence and selecting the appropriate option on the online menu. By default, user data is processed via FastTree. However, ML based processing of user queries can be up to three times slower than NJ processing (Table  2). Therefore, we provide an alternative option to quickly process user queries using NJ tree reconstruction from precomputed distance matrices and (vii) HGT events corresponding to user sequences are extracted and returned to user by E-mail.

Statistical test to detect HGT enriched phyla
Fisher's exact test was performed to test the significance of the null hypothesis stating that HGT events of a particular phylum were not greater compared to other phyla. For this purpose, 2 × 2 contingency tables for each phylum were analyzed. Specifically, the counts of HGT-related and total genes for each phylum were compared with the counts of HGT-related and total genes in all other phyla. The odds ratio greater than one favored the alternate hypothesis stating that HGT events of a particular phylum were significantly greater than the HGT events of all other phyla.

Database server and user interface
The database server was developed using MariaDB (ver.10.0.13) (http://mariadb.org/) management system. The database consists of four tables with more than 13 million records. The web-based user interface was written in HTML5, PHP, CSS and JavaScript. User interface widgets were implemented using jQWidgets (ver.3.8.1) (http://www.jqwidgets.com) and jQuery (ver.1.11) (http:// jquery.com). Circular phylogenetic trees were generated by jsPhyloSVG-1.55 (34) and two way HGT relationships (donors and recipients) were dynamically generated using the SVG JavaScript library, D3 (35).  (Figure 2A). For each selected organism, genome size, GC content (%), GenBank and BioProject IDs and complete taxonomic information are also displayed (see also Supplementary Table S2). In addition, we provide an HGT-index that is a quantitative indicator of HGT influence in each genome. The index simply represents the total number of HGT-related genes (both donor and recipient) divided by the total number of genes in a genome. The table directly below lists all HGT events detected in the selected genome(s) (Figure 2A). For each event, several links provide access to Pfam and COG classifications along with basic description of gene function. HGT events and phylogenetic relationships can be visualized graphically to explicitly highlight the conflict between gene and species trees. For example, clicking 'see graphics' under 'HGT Relationship' column will return graphical representation of HGT relationships with other microbial genomes ( Figure 2B). Plots show donor and recipient genomes involved in each HGT event as well as gene and species trees ( Figure 2C). Trees can be displayed either in circular or rectangular representation. The latter also displays the local support values to provide a quick estimate for the reliability of phylogenetic splits. By default, HGTree displays ML gene and species trees as inferred by FastTree (28). However, we also reconstructed NJ trees separately for each gene and corresponding species set. All HGT-related information can be downloaded from the Downloads menu. While HGT Browser can be used to find genome-wide HGT events of a genome against all other genomes, HGT Analysis within Selected Genomes utility can display HGT events that have occurred only within selected genomes. For this analysis, users are prompted to select at least two different species regardless of their phyla classification. This exercise can be useful to quantify gene flow between species that maybe engaged in symbiosis-like relationships. In turn, Between-group HGT Analysis tool enables users to customize two groups of organisms. Users may add organisms from different phyla in each group. The analysis option then displays HGT events that have occurred between the userdefined groups. For example, obtaining group-wise HGT information can be useful to understand interactions between different microbial phyla in environmental samples. HGTree also offers users to detect HGT events in their own gene or genome sequences ( Figure 2D). For this purpose, users may upload FASTA formatted DNA sequences that are scanned against the pre-compiled data sets (as described above) for fast NJ reconstruction. Alternatively, users may opt to process results using FastTree approximate ML trees (28), as we have done throughout the database. However, ML-based processing is about 2-3 times slower relative to NJ reconstruction (Table 2) because the distance matrices required for NJ reconstruction are already pre-computed and simply need to be updated with user data. In addition, analysis time depends upon genome length, total number of proteins and number of genes matched to orthologous gene sets. On average analyzing 1MB genome and 1000 pro-teins roughly takes 9 and 17 minutes under NJ and ML processing, respectively, on background computing server equipped with 8 CPU cores (16 processors @ 2.60 GHz and 128 GB RAM) ( Table 2). For example, it took less than 2 minutes to process the smallest genome in our data set (Candidatus Nasuia; 110KB) using both NJ and ML. In turn, it took 57 and 88 minutes to process the largest genome (Sorangium cellulosum; 13.03 MB) under NJ and ML environments respectively (Table 2). Therefore, users can opt for either option depending upon their convenience. Results are processed on our servers and returned to users via E-mail. Finally, users may opt to study the evolution of a particular gene family. This can be accomplished by typing either the gene name or its function (e.g. CRISPR) using the Gene or Keyword Search utility provided by the Search menu.

Initial insights into microbial evolution
HGTree already provides preliminary insights into microbial evolution. The data suggest abundance of genetic exchange among microbial species (Figure 3). The HGTindex ranges from 0.03 (Candidatus Hodgkinia and Mycoplasma haemofelis) to 0.59 (Borrelia garinii) (Supplementary Table S2). While most microbial genomes showed linear relationship between the total number of genes and total number of horizontally transferred genes, some outlier genomes with significantly (P < 0.05) lower or higher HGTindex were also detected. Specifically, we focused on the 5% upper (HGT-index < 0.16) and lower (HGT-index > 0.42) percentiles of HGT-index as shown in the red-dotted line in Figure 3A (see also Supplementary Table S2). Among 247 outlier genomes, Chlamydia, Rickettsia and Mycobacterium genera belonged to the upper percentile, while Mycoplasma to the lower. Interestingly, these organisms are notable parasites of other species suggesting that symbiosis and parasitism lead to significant increase/decrease in horizontal exchange (36). Figure 3B gives a breakdown of HGT influence in each major microbial phylum. We observed that five phyla (Euryachaeota, Actinobacteria, Cyanobacteria, Proteobacteria and Chlorobi) and one unclassified archaeon had relatively higher HGT-index than the global median value of 0.3 (red dashed line in Figure 3B). Fisher's exact test confirmed that Euryachaeota (101 members), Actinobacteria (255), Cyanobacteria (68) and Proteobacteria (1041) were significantly (P-value < 0.05) enriched by HGT. In turn, Chlorobi (9 members) and the unclassified archaeon (1) were deemed 'statistically insignificant' despite HGTindex above the global median. Because the Fisher's test calculates HGT enrichment by relating the number of HGT-genes and total number of genes for one phylum against all other phyla (the background), poorly sampled lineages likely contributed little information relative to the background. Similarly, the HGT-index must also be interpreted with caution as it depends upon the total number of 'known' genes in each phylum that could be matched to orthologous gene sets. Because poorly sampled lineages likely contain many 'rare' genes with either no or few orthologs in sequence databases, their HGT-index was lower relative to well-studied lineages in our data set (albeit with some exceptions; Figure 3B). Importantly, the two most well sampled bacterial phyla (Proteobacteria and Firmicutes) had median HGT-index of 0.32 and 0.29, respectively. These numbers suggest that while HGT was likely underestimated for poorly sampled lineages, its median upper-bound still lies somewhere around 0.3 and 0.4. Previously, Dagan et al. (2008) estimated that on average 81 ± 15% genes in the genomes of 181 prokaryotic species had participated in horizontal exchange (37). In turn, our results reveal that HGTindex in most microbial phyla, especially those that are well studied and sampled, did not reach extremely high levels. In fact, HGT-index suggests that about 10-35% of genes in most microbial phyla are subject to horizontal exchange ( Figure 3B). The differences between two studies are likely due to two main reasons: (i) increased sampling of microbial genomes in this study (2472 versus 181) and (ii) an explicit evolutionary model backs the detection of HGTrelated genes. In turn, Dagan et al. (2008) did not consider phylogenetic discordance. This should be kept in mind when comparing the two studies. The results however confirm current understanding that HGT plays significant roles in the evolution of microbial organisms and must be closely monitored for both medical and economical purposes.

DISCUSSION
HGTree is based on an explicit evolutionary model i.e. conflict between gene and species trees is taken as indication of non-vertical evolution. In general, evaluating incongruence between gene and species trees holds promise to reliably detect HGT events (e.g. see (15)). However, its practical use has remained limited because, (i) the choice of tree reconstruction method (e.g. NJ, ML, parsimony) can influence HGT detection, (ii) accurate detection of orthology remains a challenge, (iii) conflicts between gene and species trees may also arise from processes other than HGT such as reductive evolution (16) and (iv) tree reconstruction followed by reconciliation are computationally intensive tasks. These considerations make it technically and conceptual challenging to globally infer HGT events (i.e. by reconciling trees for all gene families in hundreds of organisms). Below, we describe measures taken to ensure that HGTree was minimally affected from these challenges.
To ensure high speed and optimal accuracy in tree reconstruction, we implemented FastTree program to infer approximate ML phylogenies for each orthologous gene set and its corresponding species tree (28). FastTree is more than 100 times faster than the standard ML programs (PhyML 3.0 and RAxML 7) and is significantly more accurate than distance and parsimony based methods of tree reconstruction (28). It even outperforms the default implementation of PhyML 3 but is less accurate than PhyML and RaxML ran with subtree-pruning-regrafting (SPR) options. However, this is more than offset by the speedier execution of FastTree in handling large alignments containing hundreds of taxa. Moreover, disagreements between FastTree and SPR-based ML programs tend to be poorly supported (28). FastTree also provides local support values based on SH test (32) to quickly evaluate the reliability of obtained trees. These values correlate well with the SH-like support values provided by PhyML 3 (28) and can be used to quickly determine the reliability of inferred phylogenetic splits. In turn, running traditional bootstrap would considerably increase the processing time plus adding the time for tree rec-onciliation. These features identified FastTree as the optimal choice to rapidly and accurately reconstruct hundreds of phylogenetic trees in our data set.
To accurately define orthologs, we incorporated Mestortho, which is an orthology detection algorithm based on minimum evolution (26). To improve precision in orthology estimation, we filtered out gene sets exhibiting either high or low complexity (see Materials and Methods). To evaluate conflicting hypotheses regarding the evolution of gene sets, RANGER-DTL-U was used to reconcile unrooted gene trees against rooted species trees and to postulate gene duplication, transfer and loss events (commonly known as DTL reconciliation) (see (33) and references therein). The algorithm works by embedding each possible rooted version of gene tree inside the species tree and selecting the most parsimonious reconciliation amongst all rootings (i.e. explain the transformation of gene tree into species tree with minimum overall cost). Thus, RANGER-DTL is built on parsimony principle similar to most existing algorithms of tree reconciliation (e.g. (38)(39)(40)(41)), except (42) and (43) that utilize probabilistic framework. However, RANGER-DTL significantly outperforms others when dealing with huge data sets containing trees of hundreds of taxa (33). In a comparative exercise, it was sometimes 100 000 times faster than Mowgli (41) and AnGST (40), two other widely used advanced algorithms for DTL reconciliation. An alternative version of the program (RANGER-DTL-D) requires dated species trees (i.e. chronogram) for reconciliation. While, the alternative is biologically well founded and considers HGT to only occur between co-existing species, accurate estimation of dates for each and every phylogenetic tree currently remains challenging, especially for large trees (44). Moreover, it is relatively much slower for large data sets (33). In turn, most other available reconciliation algorithms consider duplication and loss but not transfers (e.g. (38,45,46)) and hence are not suitable for large-scale analyses of prokaryotic gene phylogenies. Therefore, RANGER-DTL-U is implemented in the current version of HGTree due to its speed, accuracy and compatibility in handling large data sets.
HGTree is a non-commercial public database developed to support various fields of research. It has a user-friendly interface allowing easy access to large amount of HGT information. To our knowledge, it is the most comprehensive available resource of HGT-related information generated by large-scale phylogenetic analyses. However, not all transfers can be detected by tree reconciliation. For example, transfers that occur between sister taxa do not yield topological incongruence. Similarly, an HGT-acquired gene in the common ancestor of two (or more) species may later be lost in only one (or more) of the descendants. Here, favoring either transfer or loss can be conceptually challenging because the gene has experienced both events. The decision to record the event either as HGT or to ignore (i.e. treat as loss) is based on the most parsimonious embedding of gene tree inside the species tree. In our opinion, recording such events as transfers is more appropriate because HGT followed by loss nullifies the first gain and restores the original state. Technically, such events could still be recorded as losses if they yield the most parsimonious reconciliation D618 Nucleic Acids Research, 2016, Vol. 44, Database issue and hence will be excluded from the HGTree repository. In other words, some true positives have likely been missed and false-positives included thanks to the biological complexity of the HGT-detection problem. We expect to quantify these rates in a future version. Because tree reconciliation method is susceptible to topologies of both species and gene trees, short branch lengths of species tree may also sometimes lead to incorrect estimation of HGT. However, no method for HGT detection can be 100% accurate. Therefore, genes output by HGTree should be taken as putative HGT-genes to aid further downstream analysis. Moreover, results are dependent upon the choice and accuracy of existing programs for tree reconstruction and reconciliation and will no doubt improve in precision with the availability of better alternatives in future.

Future work
The precision and use of HGTree can be improved with additional upgrades. In the present version, we removed orthologous gene sets containing >50% of total microbial genomes. While some widely distributed gene families may also be subject to HGT (e.g. aminoacyl-tRNA synthetase), their accurate detection via phylogenetic inferences can be more challenging. In turn, these transfers can perhaps be better detected via comparative genomics approaches, as shown in (47). Moreover, large gene families contributed to maximum computational load in HGT detection. Therefore, we plan to equip HGTree with surrogate measures of HGT detection to take care of these issues. Similarly, when there is consistent HGT signal between donor and recipient lineages, concatenating such genes may give better resolution. However, concatenated genes can be subject to other artefacts. For example, genes are composed of protein domains that can be gained, lost or rearranged in genes (48). Their inclusion in sequence alignments can increase the number of gaps and thus artificially influence phylogenetic inferences. We expect to reconcile concatenated gene phylogenies against individual gene phylogenies in the subsequent releases to better address this issue. Another issue related to reconciling trees is the existence of multiple optimal reconciliations that may be equally good. The similarities and differences between multiple optimal solutions were recently explored on a biological data set of ≈4700 gene trees reconciled against species tree (49). The authors confirmed that despite the existence of multiple optimal solutions, event assignments to gene nodes and mappings were fairly conserved across all optimal solutions (e.g. 93.1% and 73.15% chances for events and mappings respectively) (49). Unfortunately, exploring optimal search space and listing percentages of conserved events is not part of the current release of RANGER-DTL but an update is expected soon. Therefore, we expect to provide numeric confidence to each event assignment in the future releases of HGTree provided that search space can be explored in reasonable amount of time. The precision will also improve with the availability of high quality genome assemblies and sequencing of novel organisms. The future versions will also focus on detection of HGT-derived gene clusters in microbial genomes since transfer of gene clusters is a frequent event in microbial evolution (50). Viral genomes will also be added in the sub-sequent releases, as viruses often exchange/transfer genes between microbial species (51). Finally, HGT contribution of microbial species that are part of normal human microbiota will also yield useful insights into the complex ways organisms interact with each other (19). We expect to update HGTree at least twice a year to keep pace with the rising genomic information.