-
PDF
- Split View
-
Views
-
Cite
Cite
Walker Pett, Marcin Adamski, Maja Adamska, Warren R Francis, Michael Eitel, Davide Pisani, Gert Wörheide, The Role of Homology and Orthology in the Phylogenomic Analysis of Metazoan Gene Content, Molecular Biology and Evolution, Volume 36, Issue 4, April 2019, Pages 643–649, https://doi.org/10.1093/molbev/msz013
Close - Share Icon Share
Abstract
Resolving the relationships of animals (Metazoa) is crucial to our understanding of the origin of key traits such as muscles, guts, and nerves. However, a broadly accepted metazoan consensus phylogeny has yet to emerge. In part, this is because the genomes of deeply diverging and fast-evolving lineages may undergo significant gene turnover, reducing the number of orthologs shared with related phyla. This can limit the usefulness of traditional phylogenetic methods that rely on alignments of orthologous sequences. Phylogenetic analysis of gene content has the potential to circumvent this orthology requirement, with binary presence/absence of homologous gene families representing a source of phylogenetically informative characters. Applying binary substitution models to the gene content of 26 complete animal genomes, we demonstrate that patterns of gene conservation differ markedly depending on whether gene families are defined by orthology or homology, that is, whether paralogs are excluded or included. We conclude that the placement of some deeply diverging lineages may exceed the limit of resolution afforded by the current methods based on comparisons of orthologous protein sequences, and novel approaches are required to fully capture the evolutionary signal from genes within genomes.
Introduction
Resolving the phylogenetic relationships among animal lineages is crucial for understanding the evolutionary origin of key animal traits, such as a true gut, muscles, and a nervous system. With advancements in sequencing technology, and the associated increase in the availability of genomic data from these groups, the phylogeny of early animals has attracted renewed attention and has been the focus of many recent phylogenomics studies (summarized in Dunn et al. 2014; King and Rokas 2017). In particular, the phylogenetic position of Ctenophora has been disputed by numerous authors in recent years (Philippe et al. 2009; Pick et al. 2010; Nosenko et al. 2013; Pisani et al. 2015; Whelan et al. 2015; Feuda et al. 2017; Simion et al. 2017; Whelan et al. 2017). Previous studies focusing on deep metazoan phylogeny have relied mostly on the analysis of amino acid sequences from a large number of proteins, either concatenated into a single data matrix (Philippe et al. 2009; Ryan et al. 2013; Moroz et al. 2014; Whelan et al. 2015; Shen et al. 2017; Simion et al. 2017), or separately as individual gene family alignments (Arcila et al. 2017). These methods generally rely on the identification of sequences that are related by orthology (orthologs), whose common ancestor diverged as a result of speciation, rather than a duplication event (Fitch 2000). Genes arising from duplication events (paralogs) complicate the inference of a species tree. This is because the gene tree describing the relationships among paralogs may differ markedly from the species tree (Martin and Burg 2002; Struck 2013). Including paralogous sequences in a phylogenetic analysis of based concatenated sequence alignments can therefore lead to artifacts in reconstructing a species phylogeny, and various methods aimed at removing paralogs from amino acid data sets have been developed (Li et al. 2003; Fulton et al. 2006; van der Heijden et al. 2007; Gabaldón 2008; Pereira et al. 2014; Struck 2014). One of the most popular ortholog selection approaches is OrthoMCL (Li et al. 2003), which uses a reciprocal best hits (RBH) BLAST algorithm for ortholog identification, combined with the Markov clustering algorithm (MCL) (Van Dongen 2001; Enright et al. 2002) to find clusters of orthologous genes.
Besides amino acid sequences, the presence or absence of genes in genomes has been suggested as an alternative source of phylogenetically informative genomic characters (Fitz-Gibbon and House 1999; Lake and Rivera 2004; Ryan et al. 2013; Pisani et al. 2015). However, the problem of identifying genes related by paralogy and orthology is still an important consideration for the analysis of gene content, since the presence of a gene family in a species does not necessarily imply that all orthologous subfamilies are also present. Thus, we may consider independently both homolog (i.e., both paralogs and orthologs) and ortholog content in phylogenetic reconstruction.
Here, we present a phylogenetic analyses of animals using gene content inferred from complete genomes. We analyze data sets scoring both orthology groups and homologous protein families. We show that relationships inferred from both types of data are highly congruent, only differing in the position of Ctenophora, which emerges as the sister group of all animals other than sponges (Porifera-sister hypothesis) when analyzing presence/absence of orthology groups, or as the sister of Cnidaria (Coelenterata Hypothesis) when analyzing presence/absence of homologous protein families. Our results provide insights into the behavior of different gene content data sets that may help direct future investigations using this type of data.
Results and Discussion
Ortholog Content versus Homolog Content in Phylogenetic Reconstruction
Using a RBH algorithm, OrthoMCL aims to identify pairs of proteins related by either orthology, co-orthology, or in-paralogy. This is accomplished using a similarity graph based on BLASTP e-values, from which edges that OrthoMCL identifies as representing pairwise paralogous relationships are removed. Thus, since most pairs of homologous sequences are related by paralogy, the vast majority of edges may be removed from the full similarity graph using this algorithm. Indeed, in our analysis, the OrthoMCL RBH algorithm resulted in the removal of 91% of edges from both graphs (table 1). Moreover, removing these edges had a major impact on the granularity of the resulting MCL clusterings, with nearly twice as many clusters identified after paralogous edges had been removed (table 1). In addition, these clusterings differed by only ∼3% from a subclustering of the one obtained on the full graph (“% Sub. Diff.” in table 1). This is indeed the goal of OrthoMCL: to identify subclusters of orthologous sequences, which by definition outnumber clusters of sequence that are merely homologous.
Clustering Statistics for Homologous and Orthologous Gene Family Predictions in Two Data Sets.
| Data Set . | Method . | Edges . | Orphans . | Clusters . | Singletons . | N . | n1 . | % Sub. Diff. . |
|---|---|---|---|---|---|---|---|---|
| meta23 | Homologs | 50,096,494 | 74,021 | 22,679 | 10,596 | 12,083 | 84,617 | NA |
| meta36 | Homologs | 73,588,623 | 116,566 | 35,244 | 17,513 | 17,731 | 134,079 | NA |
| meta23 | Orthologs | 4,468,075 | 118,103 | 39,240 | 16,955 | 22,285 | 135,058 | 0.029 |
| meta36 | Orthologs | 6,445,128 | 175,913 | 57,056 | 25,458 | 31,598 | 201,371 | 0.033 |
| Data Set . | Method . | Edges . | Orphans . | Clusters . | Singletons . | N . | n1 . | % Sub. Diff. . |
|---|---|---|---|---|---|---|---|---|
| meta23 | Homologs | 50,096,494 | 74,021 | 22,679 | 10,596 | 12,083 | 84,617 | NA |
| meta36 | Homologs | 73,588,623 | 116,566 | 35,244 | 17,513 | 17,731 | 134,079 | NA |
| meta23 | Orthologs | 4,468,075 | 118,103 | 39,240 | 16,955 | 22,285 | 135,058 | 0.029 |
| meta36 | Orthologs | 6,445,128 | 175,913 | 57,056 | 25,458 | 31,598 | 201,371 | 0.033 |
Clustering Statistics for Homologous and Orthologous Gene Family Predictions in Two Data Sets.
| Data Set . | Method . | Edges . | Orphans . | Clusters . | Singletons . | N . | n1 . | % Sub. Diff. . |
|---|---|---|---|---|---|---|---|---|
| meta23 | Homologs | 50,096,494 | 74,021 | 22,679 | 10,596 | 12,083 | 84,617 | NA |
| meta36 | Homologs | 73,588,623 | 116,566 | 35,244 | 17,513 | 17,731 | 134,079 | NA |
| meta23 | Orthologs | 4,468,075 | 118,103 | 39,240 | 16,955 | 22,285 | 135,058 | 0.029 |
| meta36 | Orthologs | 6,445,128 | 175,913 | 57,056 | 25,458 | 31,598 | 201,371 | 0.033 |
| Data Set . | Method . | Edges . | Orphans . | Clusters . | Singletons . | N . | n1 . | % Sub. Diff. . |
|---|---|---|---|---|---|---|---|---|
| meta23 | Homologs | 50,096,494 | 74,021 | 22,679 | 10,596 | 12,083 | 84,617 | NA |
| meta36 | Homologs | 73,588,623 | 116,566 | 35,244 | 17,513 | 17,731 | 134,079 | NA |
| meta23 | Orthologs | 4,468,075 | 118,103 | 39,240 | 16,955 | 22,285 | 135,058 | 0.029 |
| meta36 | Orthologs | 6,445,128 | 175,913 | 57,056 | 25,458 | 31,598 | 201,371 | 0.033 |
However, by subdividing clusters of homologous sequences into orthologous groups, some phylogenetic signal regarding the presence or absence of homologous gene families may be removed from the final presence/absence matrix. Indeed, our estimates for the expected number of gain/loss events per gene family across metazoa (the tree length) were smaller in the analysis of homologs (posterior mean = 0.065) compared with orthologs (posterior mean = 0.172), confirming the expectation that homolog content is more strongly conserved than ortholog content in animals.
We began with a phylogenetic analysis of the reconstructed data set of Ryan et al. (2013) using OrthoMCL (Meta23-Ortho), and then compared it to the data set of homologous gene family content based on BLASTP similarity only (Meta23-Homo). Phylogenies constructed based on ortholog and homolog content differed only in the position of Ctenophora. As in Pisani et al. (2015), when the data matrix represented the presence or absence of orthogroups identified by OrthoMCL, strong support was obtained for the a tree with sponges as the sister group of all the other animals. In this tree, Ctenophora was found to represent the sister group of all the animals other than sponges (Porifera-sister hypothesis; supplementary fig. S1, Supplementary Material online). When the data represented presence or absence of homologous clusters (paralogous relationships had not been removed by OrthoMCL), sponges were still found to represent the sister of all other animals. However, this analysis recovered Ctenophora as the sister group of Cnidaria plus Bilateria, albeit with relatively weak support (supplementary fig. S2, Supplementary Material online).
We then expanded the taxon sampling of our data set to incorporate 13 additional genomes, including 5 nonbilaterian animals representing each major nonbilaterian phylum, as well as 8 nonmetazoan outgroup species, ranging in evolutionary distance from fungi to choanoflagellates. Similarly to Meta23-Ortho, this data set (Meta36-Ortho) gave strong support for sponges as the sister group of all other animals and Ctenophora as sister to all non-Poriferan animals (Porifera-sister hypothesis; supplementary fig. S3, Supplementary Material online). Differently, analysis of homolog content (Meta36-Homo) gave strong support for Ctenophora as the sister group of Cnidaria (the Coelenterata hypothesis; fig. 1). This result was not sensitive to the choice of outgroup, and was strongly supported even when six species of fungi were included.
Phylogenetic tree reconstructed from Bayesian analysis of homologous gene family content in 36 species of opisthokonts, including 26 animals (Meta36-Homo). Singletons were excluded, and an ascertainment bias correction was applied for the absence of singletons and gene families lost in all species. Numbers above nodes indicate posterior probabilities distinguishable from 1.0. Convergence statistics are listed in supplementary table S1, Supplementary Material online.
The placement of Ctenophora outside of Eumetazoa is found only by the analysis of ortholog content, which is consistent with other researchers’ previous findings that many eumetazoan-specific orthologs are absent in ctenophores (Ryan et al. 2010; Ryan et al. 2013; Moroz et al. 2014). Furthermore, the support for Coelenterata based on homolog content suggests that while many eumetazoan and coelenterate orthologs may have been lost in ctenophores, related paralogs specific to Ctenophora may have been retained in the same gene families. Alternatively, these putative paralogs may represent the missing eumetazoan orthologs which have evolved to such an extent that OrthoMCL cannot reliably identify them as orthologs anymore. This apparent absence of Eumetazoa-specific orthologs in Ctenophora may help to explain the great difficulty that previous phylogenomic studies based on concatenated amino acid sequences of orthologous proteins have had in resolving the position of ctenophores. If eumetazoan gene families are indeed represented in ctenophores mostly by sequences that are, or appear to be, paralogous to other eumetazoans, then these gene families will be systematically removed during the construction of orthologous amino acid sequence alignments, resulting in relatively little signal for a eumetazoan affinity of Ctenophora.
Singleton Gene Families
The issue of whether to include singleton gene families in our analysis of gene content is complicated by the fact that the number of observed singletons is directly correlated with the stringency of our definition of homology. By increasing the BLAST e-value or OrthoMCL percent match length cutoffs, an arbitrary number of orphan protein sequences can be excluded as lacking significant similarity to any other sequences, leading to undersampling of singletons in the presence/absence matrix. On the other hand, coding orphan sequences as singletons could lead to an overestimate of the actual number of singleton gene families. To demonstrate this effect, we estimated the posterior predictive distribution of the number of singleton gene families predicted by the binary substitution model after either including singletons and orphans or excluding both (figs. 2 and 3 and supplementary figs. S7 and S8, Supplementary Material online). In both cases, the predicted number of singleton families was much smaller than the observed number (coded singletons + orphans), suggesting that the number of orphans is indeed an overestimate of the actual number of uncoded singleton gene families. In fact, including orphans appears to introduce an upward bias in the estimation of the gene family loss rate, inflating the predicted number of gene families lost in all species (fig. 2) such that the predicted number of singletons actually decreases when singletons and orphans are included in the analysis (fig. 3). Faced with this inherent bias in estimating the number of singletons, it is perhaps unsurprising that our phylogenetic analysis including singletons recovered an unconventional tree with Placozoa as the sister group to all other animals (supplementary figs. S4 and S5, Supplementary Material online). We conclude that simply excluding all singletons and applying an appropriate correction provides less biased estimates of gene gain and loss rates, and consequently the tree topology.
Posterior predictive distribution of the number of homologous gene families lost in all species (ñ0), inferred from Meta36-Homo using a reversible binary substitution model, either including singletons (black) or excluding them (white). Note that the true value of n_0 cannot be observed.
Posterior predictive distribution of the number of homologous gene families present in only a single species (ñ1), inferred from Meta36-Homo using a reversible binary substitution model, either including singletons (black) or excluding them (white).
Weak Support for Irreversible Gene Content Evolution in Animals
We found that a reversible binary substitution model (Felsenstein 1992), in which a gene family may be gained more than once on the tree, provided a much better fit, based on the marginal likelihoods reported in table 2, to the Meta36-Homo data set than an explicitly irreversible Dollo-like model (Nicholls and Gray 2006; Alekseyenko et al. 2008), in which each gene family may be gained only once. Previous authors have interpreted similar results as evidence for horizontal gene transfer (HGT) in prokaryotes (Zamani-Dahaj et al. 2016). However, only a few cases of HGT among animals have been reported (Jackson et al. 2011; Boto 2014). Therefore, we find horizontal gene transfer an unlikely explanation for the stronger fit of a reversible model in our analysis of animal genomes. Furthermore, both the reversible and Dollo models recovered identical tree topologies (fig. 1 and supplementary fig. S6, Supplementary Material online, respectively). This suggests that the source of the reversible model’s improved fit is not underlying phylogenetic signal, but may instead represent noise related to errors in the prediction of gene family clusters, or in the prediction and assembly of protein sequence databases.
Marginal Likelihoods Estimated for the Meta36-Homo Data Set.
| Model . | Marginal Likelihood . |
|---|---|
| Reversible | −174,066 |
| Dollo | −179,849 |
| Model . | Marginal Likelihood . |
|---|---|
| Reversible | −174,066 |
| Dollo | −179,849 |
Marginal Likelihoods Estimated for the Meta36-Homo Data Set.
| Model . | Marginal Likelihood . |
|---|---|
| Reversible | −174,066 |
| Dollo | −179,849 |
| Model . | Marginal Likelihood . |
|---|---|
| Reversible | −174,066 |
| Dollo | −179,849 |
Materials and Methods
Proteome Data Acquisition
We began by reconstructing the gene content data set of Ryan et al. (2013), which included 23 complete genomes. Twenty one were from across metazoa, including one ctenophore (Mnemiopsis leidyi) and one sponge (Amphimedon queenslandica), as well as two complete genomes from unicellular relatives of animals (Capsaspora owczarzaki and Monosiga brevicollis). We obtained predicted complete proteomes for each of these genomes either from the Ensembl Metazoa database (Cunningham et al. 2015), or from the Origins of Multicellularity Database hosted by the Broad Institute (Ruiz-Trillo et al. 2007). In addition to this data set of 23 species (Meta23), we constructed an expanded data set which included 13 additional proteome predictions based on complete genomic data (Meta36), including the ctenophore Pleurobrachia bachei from the Neurobase genome database (Moroz et al. 2014), the homoscleromorph sponge Oscarella carmela from the Compagen database (Hemmrich and Bosch 2008), as well as four fungal species from the Ensembl database (Cunningham et al. 2015) and two species of fungi and two unicellular relatives of animals from the Origins of Multicellularity Database (Ruiz-Trillo et al. 2007).
Proteome Prediction
In addition, we included new proteome assemblies predicted from complete genomic and transcriptomic data for the calcareous sponge Sycon ciliatum, as well as the newly described placozoan species Hoilungia hongkongensis (Eitel et al. 2018). Details on Hoilungia hongkongensis genome sequencing and annotation can be found in Eitel et al. (2018).
The Sycon ciliatum transcriptome was assembled de novo from a comprehensive set of Illumina RNA-Seq libraries using Trinity pipeline (Grabherr et al. 2011). The libraries (PE, poly-A) were generated from S. ciliatum larvae, various developmental stages, and regenerating adult specimens, ENA submissions ERA295577 and ERA295580 (Fortunato et al. 2014; Leininger et al. 2014). Protein-coding sequences (CDS’es) were detected with Transdecoder (Haas et al. 2013) using Pfam database (Finn et al. 2016) with minimum length cutoff of 300 nucleotides. To remove potential microbial contaminations, the amino acid translations of the CDS’es were BLASTP searched against NCBI NR protein database. All hits with better score to archaea, bacteria, or viruses than to eukaryotes were removed. Remaining CDS’es were clustered with CDHIT (Li et al. 2001) with parameters -G 1 -c 0.75 -aL 0.01 -aS 0.5.
Gene Family Prediction
For each data set, we computed two clusterings based on either homology (as defined by a sequence similarity threshold) or orthology (as defined by OrthoMCL). This resulted in four final clusterings (Meta23-Homo, Meta36-Homo, Meta23-Ortho, and Meta36-Ortho). To build the clusterings, we performed an all versus all BLASTP query with all protein sequences, with a minimum e-value cutoff of 1e-5, keeping all hits and high-scoring segment pairs (HSPs) for each query sequence. These BLAST results were used directly as input to OrthoMCL (Li et al. 2003) for the prediction of orthologous gene clusters.
For the prediction of homologous gene clusters, we directly transformed the BLASTP output into an edge-weighted similarity graph, using a weighting algorithm identical to that used by OrthoMCL, with the only difference being that the RBH and normalization steps were skipped, the functions of which are to discriminate orthologous and paralogous pairwise relationships. We implemented this modified OrthoMCL algorithm in C++, the source code for which has been deposited on GitHub (http://www.github.com/willpett/homomcl). Exactly as in OrthoMCL, we computed edge weights as the average -log10 e-values of each BLAST query-target sequence pair, where e-values equal to 0.0 were set to the minimum e-value exponent observed across all pairs. Also as in OrthoMCL, we computed the “percent match length” (PML) for each pair as the fraction of residues in the shorter sequence that participate in the alignment, and used the default PML cutoff of 50% for each edge (see the OrthoMCL algorithm document for more details). Using these edge-weighted similarity graphs, clusterings were computed using MCL (Van Dongen 2001) with an inflation parameter of 1.5, which is the default suggested by OrthoMCL. The structures of resulting clusterings were compared using the clm info and clm dist commands in the mcl package. The percent divergence of each orthology clustering from a subgraph of the corresponding homology clustering is reported in table 1 (% Sub. Diff.).
Phylogenetic Analysis
For each of the four clusterings, a presence/absence matrix was constructed where each species was coded as present or absent in a cluster depending on whether at least one protein sequence from that species was contained in that cluster. Phylogenetic trees were reconstructed from the resulting binary data matrices in RevBayes (Höhna et al. 2015) using the reversible binary substitution model of (Felsenstein 1992; Ronquist et al. 2012). We also used an irreversible Dollo substitution model (Nicholls and Gray 2006; Alekseyenko et al. 2008), in which each gene family may be gained only once, and thereafter follows a pure-loss process. We computed the marginal likelihood for each model using stepping-stone sampling as implemented in RevBayes. In all cases, we used four discrete categories for gamma-distributed rates across sites, and we used a hierarchical exponential prior on the branch lengths. RevBayes scripts for these analyses have been deposited on GitHub (http://www.github.com/willpett/metazoa-gene-content). Convergence statistics were computed using the programs bpcomp and tracecomp in the PhyloBayes package (Lartillot et al. 2013), and are reported in supplementary table S1, Supplementary Material online.
Correcting for Unobserved Losses
Many sequences did not have significant BLAST similarity with any other sequences in the data set, which we define as “orphans.” Orphans were not represented in the all versus all BLAST results, and were therefore not considered when coding presence/absence of genes. Most gene families were coded as present in only a single species, which we define as “singletons.” In order to avoid introducing an ascertainment bias by including only the coded subset of singletons (Pisani et al. 2015; Tarver et al. 2018), we removed all singletons from each presence/absence matrix prior to phylogenetic analysis, and applied a correction for the removal of singletons in RevBayes (coding=nosingletonpresence). To further evaluate the impact of including/excluding singletons, in a separate phylogenetic analysis we coded orphans as additional singletons, and did not apply the “nosingletonpresence” ascertainment bias correction. Finally, in all analyses we applied a correction for the fact that genes absent in all of our species cannot be observed (coding=noabsencesites).
Posterior Predictive Simulations
Conclusion
The construction of any phylogenomic data set entails considerable difficulties in the identification of a sufficient number of homologous characters to infer a well-resolved species tree. Methods that utilize a single concatenated alignment of multiple loci to infer a species tree are further limited by the requirement for orthologous sequences, as this is the only way to ensure that a single tree describes their evolutionary history. Using gene content, or the presence or absence of gene families, as character data can circumvent this orthology requirement and may therefore allow inferences at deeper time-scales. Our phylogenetic analysis of homologous, but not orthologous gene family content data from 26 metazoan species and 10 nonmetazoans strongly support the classical view of animal phylogeny, with sponges as the sister group to all other animals. We conclude that the placement of some deeply diverging lineages, such as ctenophores, may exceed the limit of resolution afforded by traditional methods that use concatenated alignments of individual genes, and novel approaches are required to fully capture the evolutionary signal from genes within genomes. Along these lines, future studies may benefit from the use of methods that model features of gene content evolution beyond mere presence or absence, for example, by accounting for large-scale or whole genome duplications (Rabier et al. 2014), by allowing gain and loss rates to varying according to gene family size (Csűös 2010), or by reconstructing gene content evolution in a gene tree–species tree framework (Szöllősi et al. 2015). In addition, methods comparing the impact of different ortholog identification algorithms (Emms and Kelly 2015; Miller et al. 2018), will help shed light on the differing amounts of phylogenetic signal contained in ortholog versus homolog content data. Finally, our analysis shows that accounting for ascertainment bias in gene family size is of general importance in the future development of probabilistic models of gene family evolution.
Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.
Acknowledgments
W.P. was supported by the French National Research Agency (ANR) grant Ancestrome (ANR-10-BINF-01-01), and by grants from the National Science Foundation (DEB-1556615, DEB-1256993). D.P. was supported by a NERC BETR (NE/P013643/1) grant. G.W. was supported by LMU Munich’s Institutional Strategy LMUexcellent within the framework of the German Excellence Initiative, and the German Research Foundation (DFG) grant Wo896/19-1, and from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 764840 (ITN IGNITE).


