Abstract

Resolving the relationships of animals (Metazoa) is crucial to our understanding of the origin of key traits such as muscles, guts, and nerves. However, a broadly accepted metazoan consensus phylogeny has yet to emerge. In part, this is because the genomes of deeply diverging and fast-evolving lineages may undergo significant gene turnover, reducing the number of orthologs shared with related phyla. This can limit the usefulness of traditional phylogenetic methods that rely on alignments of orthologous sequences. Phylogenetic analysis of gene content has the potential to circumvent this orthology requirement, with binary presence/absence of homologous gene families representing a source of phylogenetically informative characters. Applying binary substitution models to the gene content of 26 complete animal genomes, we demonstrate that patterns of gene conservation differ markedly depending on whether gene families are defined by orthology or homology, that is, whether paralogs are excluded or included. We conclude that the placement of some deeply diverging lineages may exceed the limit of resolution afforded by the current methods based on comparisons of orthologous protein sequences, and novel approaches are required to fully capture the evolutionary signal from genes within genomes.

Introduction

Resolving the phylogenetic relationships among animal lineages is crucial for understanding the evolutionary origin of key animal traits, such as a true gut, muscles, and a nervous system. With advancements in sequencing technology, and the associated increase in the availability of genomic data from these groups, the phylogeny of early animals has attracted renewed attention and has been the focus of many recent phylogenomics studies (summarized in Dunn et al. 2014; King and Rokas 2017). In particular, the phylogenetic position of Ctenophora has been disputed by numerous authors in recent years (Philippe et al. 2009; Pick et al. 2010; Nosenko et al. 2013; Pisani et al. 2015; Whelan et al. 2015; Feuda et al. 2017; Simion et al. 2017; Whelan et al. 2017). Previous studies focusing on deep metazoan phylogeny have relied mostly on the analysis of amino acid sequences from a large number of proteins, either concatenated into a single data matrix (Philippe et al. 2009; Ryan et al. 2013; Moroz et al. 2014; Whelan et al. 2015; Shen et al. 2017; Simion et al. 2017), or separately as individual gene family alignments (Arcila et al. 2017). These methods generally rely on the identification of sequences that are related by orthology (orthologs), whose common ancestor diverged as a result of speciation, rather than a duplication event (Fitch 2000). Genes arising from duplication events (paralogs) complicate the inference of a species tree. This is because the gene tree describing the relationships among paralogs may differ markedly from the species tree (Martin and Burg 2002; Struck 2013). Including paralogous sequences in a phylogenetic analysis based on concatenated sequence alignments can therefore lead to artifacts in reconstructing a species phylogeny, and various methods aimed at removing paralogs from amino acid data sets have been developed (Li et al. 2003; Fulton et al. 2006; van der Heijden et al. 2007; Gabaldón 2008; Pereira et al. 2014; Struck 2014). One of the most popular ortholog selection approaches is OrthoMCL (Li et al. 2003), which uses a reciprocal best hits (RBH) BLAST algorithm for ortholog identification, combined with the Markov clustering algorithm (MCL) (Van Dongen 2001; Enright et al. 2002) to find clusters of orthologous genes.

Besides amino acid sequences, the presence or absence of genes in genomes has been suggested as an alternative source of phylogenetically informative genomic characters (Fitz-Gibbon and House 1999; Lake and Rivera 2004; Ryan et al. 2013; Pisani et al. 2015). However, the problem of identifying genes related by paralogy and orthology is still an important consideration for the analysis of gene content, since the presence of a gene family in a species does not necessarily imply that all orthologous subfamilies are also present. Thus, we may consider independently both homolog (i.e., both paralogs and orthologs) and ortholog content in phylogenetic reconstruction.

Here, we present phylogenetic analyses of animals using gene content inferred from complete genomes. We analyze data sets scoring both orthology groups and homologous protein families. We show that relationships inferred from both types of data are highly congruent, only differing in the position of Ctenophora, which emerges as the sister group of all animals other than sponges (Porifera-sister hypothesis) when analyzing presence/absence of orthology groups, or as the sister of Cnidaria (Coelenterata hypothesis) when analyzing presence/absence of homologous protein families. Our results provide insights into the behavior of different gene content data sets that may help direct future investigations using this type of data.

Results and Discussion

Ortholog Content versus Homolog Content in Phylogenetic Reconstruction

Using an RBH algorithm, OrthoMCL aims to identify pairs of proteins related by either orthology, co-orthology, or in-paralogy. This is accomplished using a similarity graph based on BLASTP e-values, from which edges that OrthoMCL identifies as representing pairwise paralogous relationships are removed. Thus, since most pairs of homologous sequences are related by paralogy, the vast majority of edges may be removed from the full similarity graph using this algorithm. Indeed, in our analysis, the OrthoMCL RBH algorithm resulted in the removal of 91% of edges from both graphs (table 1). Moreover, removing these edges had a major impact on the granularity of the resulting MCL clusterings, with nearly twice as many clusters identified after paralogous edges had been removed (table 1). In addition, these clusterings differed by only ∼3% from a subclustering of the one obtained on the full graph (“% Sub. Diff.” in table 1). This is indeed the goal of OrthoMCL: to identify subclusters of orthologous sequences, which by definition outnumber clusters of sequences that are merely homologous.

Table 1.

Clustering Statistics for Homologous and Orthologous Gene Family Predictions in Two Data Sets.

Data Set  Method     Edges       Orphans  Clusters  Singletons  N       n1       % Sub. Diff.
meta23    Homologs   50,096,494  74,021   22,679    10,596      12,083  84,617   NA
meta36    Homologs   73,588,623  116,566  35,244    17,513      17,731  134,079  NA
meta23    Orthologs  4,468,075   118,103  39,240    16,955      22,285  135,058  0.029
meta36    Orthologs  6,445,128   175,913  57,056    25,458      31,598  201,371  0.033

However, by subdividing clusters of homologous sequences into orthologous groups, some phylogenetic signal regarding the presence or absence of homologous gene families may be removed from the final presence/absence matrix. Indeed, our estimates for the expected number of gain/loss events per gene family across Metazoa (the tree length) were smaller in the analysis of homologs (posterior mean = 0.065) than in the analysis of orthologs (posterior mean = 0.172), confirming the expectation that homolog content is more strongly conserved than ortholog content in animals.

We began with a phylogenetic analysis of the reconstructed data set of Ryan et al. (2013) using OrthoMCL (Meta23-Ortho), and then compared it to the data set of homologous gene family content based on BLASTP similarity only (Meta23-Homo). Phylogenies constructed based on ortholog and homolog content differed only in the position of Ctenophora. As in Pisani et al. (2015), when the data matrix represented the presence or absence of orthogroups identified by OrthoMCL, strong support was obtained for a tree with sponges as the sister group of all the other animals. In this tree, Ctenophora was found to represent the sister group of all the animals other than sponges (Porifera-sister hypothesis; supplementary fig. S1, Supplementary Material online). When the data represented presence or absence of homologous clusters (paralogous relationships had not been removed by OrthoMCL), sponges were still found to represent the sister of all other animals. However, this analysis recovered Ctenophora as the sister group of Cnidaria plus Bilateria, albeit with relatively weak support (supplementary fig. S2, Supplementary Material online).

We then expanded the taxon sampling of our data set to incorporate 13 additional genomes, including 5 nonbilaterian animals representing each major nonbilaterian phylum, as well as 8 nonmetazoan outgroup species, ranging in evolutionary distance from fungi to choanoflagellates. Similarly to Meta23-Ortho, this data set (Meta36-Ortho) gave strong support for sponges as the sister group of all other animals and Ctenophora as sister to all non-Poriferan animals (Porifera-sister hypothesis; supplementary fig. S3, Supplementary Material online). In contrast, analysis of homolog content (Meta36-Homo) gave strong support for Ctenophora as the sister group of Cnidaria (the Coelenterata hypothesis; fig. 1). This result was not sensitive to the choice of outgroup, and was strongly supported even when six species of fungi were included.

Fig. 1.

Phylogenetic tree reconstructed from Bayesian analysis of homologous gene family content in 36 species of opisthokonts, including 26 animals (Meta36-Homo). Singletons were excluded, and an ascertainment bias correction was applied for the absence of singletons and gene families lost in all species. Numbers above nodes indicate posterior probabilities distinguishable from 1.0. Convergence statistics are listed in supplementary table S1, Supplementary Material online.

The placement of Ctenophora outside of Eumetazoa is found only by the analysis of ortholog content, which is consistent with other researchers’ previous findings that many eumetazoan-specific orthologs are absent in ctenophores (Ryan et al. 2010; Ryan et al. 2013; Moroz et al. 2014). Furthermore, the support for Coelenterata based on homolog content suggests that while many eumetazoan and coelenterate orthologs may have been lost in ctenophores, related paralogs specific to Ctenophora may have been retained in the same gene families. Alternatively, these putative paralogs may represent the missing eumetazoan orthologs which have evolved to such an extent that OrthoMCL cannot reliably identify them as orthologs anymore. This apparent absence of Eumetazoa-specific orthologs in Ctenophora may help to explain the great difficulty that previous phylogenomic studies based on concatenated amino acid sequences of orthologous proteins have had in resolving the position of ctenophores. If eumetazoan gene families are indeed represented in ctenophores mostly by sequences that are, or appear to be, paralogous to other eumetazoans, then these gene families will be systematically removed during the construction of orthologous amino acid sequence alignments, resulting in relatively little signal for a eumetazoan affinity of Ctenophora.

Singleton Gene Families

The issue of whether to include singleton gene families in our analysis of gene content is complicated by the fact that the number of observed singletons is directly correlated with the stringency of our definition of homology. By increasing the stringency of the BLAST e-value or OrthoMCL percent match length cutoffs, an arbitrary number of orphan protein sequences can be excluded as lacking significant similarity to any other sequences, leading to undersampling of singletons in the presence/absence matrix. On the other hand, coding orphan sequences as singletons could lead to an overestimate of the actual number of singleton gene families. To demonstrate this effect, we estimated the posterior predictive distribution of the number of singleton gene families predicted by the binary substitution model after either including singletons and orphans or excluding both (figs. 2 and 3 and supplementary figs. S7 and S8, Supplementary Material online). In both cases, the predicted number of singleton families was much smaller than the observed number (coded singletons + orphans), suggesting that the number of orphans is indeed an overestimate of the actual number of uncoded singleton gene families. In fact, including orphans appears to introduce an upward bias in the estimation of the gene family loss rate, inflating the predicted number of gene families lost in all species (fig. 2) such that the predicted number of singletons actually decreases when singletons and orphans are included in the analysis (fig. 3). Faced with this inherent bias in estimating the number of singletons, it is perhaps unsurprising that our phylogenetic analysis including singletons recovered an unconventional tree with Placozoa as the sister group to all other animals (supplementary figs. S4 and S5, Supplementary Material online). We conclude that simply excluding all singletons and applying an appropriate correction provides less biased estimates of gene gain and loss rates, and consequently of the tree topology.

Fig. 2.

Posterior predictive distribution of the number of homologous gene families lost in all species (ñ0), inferred from Meta36-Homo using a reversible binary substitution model, either including singletons (black) or excluding them (white). Note that the true value of n0 cannot be observed.

Fig. 3.

Posterior predictive distribution of the number of homologous gene families present in only a single species (ñ1), inferred from Meta36-Homo using a reversible binary substitution model, either including singletons (black) or excluding them (white).

Weak Support for Irreversible Gene Content Evolution in Animals

We found that a reversible binary substitution model (Felsenstein 1992), in which a gene family may be gained more than once on the tree, provided a much better fit, based on the marginal likelihoods reported in table 2, to the Meta36-Homo data set than an explicitly irreversible Dollo-like model (Nicholls and Gray 2006; Alekseyenko et al. 2008), in which each gene family may be gained only once. Previous authors have interpreted similar results as evidence for horizontal gene transfer (HGT) in prokaryotes (Zamani-Dahaj et al. 2016). However, only a few cases of HGT among animals have been reported (Jackson et al. 2011; Boto 2014). Therefore, we find horizontal gene transfer an unlikely explanation for the stronger fit of a reversible model in our analysis of animal genomes. Furthermore, both the reversible and Dollo models recovered identical tree topologies (fig. 1 and supplementary fig. S6, Supplementary Material online, respectively). This suggests that the source of the reversible model’s improved fit is not underlying phylogenetic signal, but may instead represent noise related to errors in the prediction of gene family clusters, or in the prediction and assembly of protein sequence databases.

Table 2.

Marginal Likelihoods Estimated for the Meta36-Homo Data Set.

Model       Marginal Likelihood
Reversible  −174,066
Dollo       −179,849
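As a worked reading of table 2, the difference in fit between the two models can be expressed as a log Bayes factor. This assumes the tabulated values are natural-log marginal likelihoods, which is the usual output of stepping-stone sampling, although the table itself does not state the scale. The symbols D (the Meta36-Homo matrix) and M (the model) are introduced here for illustration only.

```latex
\ln \mathrm{BF}_{\mathrm{rev},\,\mathrm{Dollo}}
  = \ln p(D \mid M_{\mathrm{rev}}) - \ln p(D \mid M_{\mathrm{Dollo}})
  = -174{,}066 - (-179{,}849)
  = 5{,}783
```

Under that assumption, the reversible model is favored by several thousand log units.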

Materials and Methods

Proteome Data Acquisition

We began by reconstructing the gene content data set of Ryan et al. (2013), which included 23 complete genomes. Twenty-one were from across Metazoa, including one ctenophore (Mnemiopsis leidyi) and one sponge (Amphimedon queenslandica), as well as two complete genomes from unicellular relatives of animals (Capsaspora owczarzaki and Monosiga brevicollis). We obtained predicted complete proteomes for each of these genomes either from the Ensembl Metazoa database (Cunningham et al. 2015), or from the Origins of Multicellularity Database hosted by the Broad Institute (Ruiz-Trillo et al. 2007). In addition to this data set of 23 species (Meta23), we constructed an expanded data set which included 13 additional proteome predictions based on complete genomic data (Meta36), including the ctenophore Pleurobrachia bachei from the Neurobase genome database (Moroz et al. 2014), the homoscleromorph sponge Oscarella carmela from the Compagen database (Hemmrich and Bosch 2008), as well as four fungal species from the Ensembl database (Cunningham et al. 2015) and two species of fungi and two unicellular relatives of animals from the Origins of Multicellularity Database (Ruiz-Trillo et al. 2007).

Proteome Prediction

In addition, we included new proteome assemblies predicted from complete genomic and transcriptomic data for the calcareous sponge Sycon ciliatum, as well as the newly described placozoan species Hoilungia hongkongensis (Eitel et al. 2018). Details on Hoilungia hongkongensis genome sequencing and annotation can be found in Eitel et al. (2018).

The Sycon ciliatum transcriptome was assembled de novo from a comprehensive set of Illumina RNA-Seq libraries using the Trinity pipeline (Grabherr et al. 2011). The libraries (PE, poly-A) were generated from S. ciliatum larvae, various developmental stages, and regenerating adult specimens (ENA submissions ERA295577 and ERA295580; Fortunato et al. 2014; Leininger et al. 2014). Protein-coding sequences (CDSs) were detected with TransDecoder (Haas et al. 2013) using the Pfam database (Finn et al. 2016) with a minimum length cutoff of 300 nucleotides. To remove potential microbial contamination, the amino acid translations of the CDSs were searched with BLASTP against the NCBI NR protein database. All hits with a better score to archaea, bacteria, or viruses than to eukaryotes were removed. The remaining CDSs were clustered with CD-HIT (Li et al. 2001) with parameters -G 1 -c 0.75 -aL 0.01 -aS 0.5.

Gene Family Prediction

For each data set, we computed two clusterings based on either homology (as defined by a sequence similarity threshold) or orthology (as defined by OrthoMCL). This resulted in four final clusterings (Meta23-Homo, Meta36-Homo, Meta23-Ortho, and Meta36-Ortho). To build the clusterings, we performed an all versus all BLASTP query with all protein sequences, with an e-value cutoff of 1e-5, keeping all hits and high-scoring segment pairs (HSPs) for each query sequence. These BLAST results were used directly as input to OrthoMCL (Li et al. 2003) for the prediction of orthologous gene clusters.

For the prediction of homologous gene clusters, we directly transformed the BLASTP output into an edge-weighted similarity graph, using a weighting algorithm identical to that used by OrthoMCL, with the only difference being that the RBH and normalization steps were skipped, the functions of which are to discriminate orthologous and paralogous pairwise relationships. We implemented this modified OrthoMCL algorithm in C++, the source code for which has been deposited on GitHub (http://www.github.com/willpett/homomcl). Exactly as in OrthoMCL, we computed edge weights as the average -log10 e-values of each BLAST query-target sequence pair, where e-values equal to 0.0 were set to the minimum e-value exponent observed across all pairs. Also as in OrthoMCL, we computed the “percent match length” (PML) for each pair as the fraction of residues in the shorter sequence that participate in the alignment, and used the default PML cutoff of 50% for each edge (see the OrthoMCL algorithm document for more details). Using these edge-weighted similarity graphs, clusterings were computed using MCL (Van Dongen 2001) with an inflation parameter of 1.5, which is the default suggested by OrthoMCL. The structures of resulting clusterings were compared using the clm info and clm dist commands in the mcl package. The percent divergence of each orthology clustering from a subgraph of the corresponding homology clustering is reported in table 1 (% Sub. Diff.).
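The edge-weighting step described above can be sketched in a few lines of Python. This is an illustrative sketch, not the deposited C++ implementation: the function name, the input layout (one record per directed query-target pair, with the number of residues of the shorter sequence covered by the alignment already computed), and the requirement that a pair has hits in both directions are all assumptions made for the example.

```python
import math

def similarity_edges(hits, lengths, pml_cutoff=0.5):
    """Build an edge-weighted similarity graph from directed BLAST hits.

    hits    : list of (query, target, evalue, matched) tuples; `matched` is the
              number of residues of the shorter sequence covered by the
              alignment (hypothetical, assumed precomputed from the HSPs).
    lengths : dict mapping sequence id -> sequence length.
    Returns a dict mapping (a, b) with a < b to the edge weight.
    """
    # e-values reported as 0.0 are replaced by the smallest exponent observed.
    exponents = [math.log10(e) for _, _, e, _ in hits if e > 0.0]
    if not exponents:
        raise ValueError("no nonzero e-values to derive a floor exponent from")
    floor_exp = min(exponents)

    scores = {}
    for query, target, evalue, matched in hits:
        if query == target:
            continue
        # Percent match length: fraction of the shorter sequence in the alignment.
        pml = matched / min(lengths[query], lengths[target])
        if pml < pml_cutoff:
            continue
        log_e = math.log10(evalue) if evalue > 0.0 else floor_exp
        scores[(query, target)] = -log_e

    edges = {}
    for (query, target), score in scores.items():
        back = scores.get((target, query))
        # Assumption: keep a pair only if both directions passed the filters.
        if back is not None and query < target:
            edges[(query, target)] = (score + back) / 2.0
    return edges
```

The resulting dictionary can be written out as a three-column edge list (two sequence identifiers and a weight), which is the general kind of input MCL clusters; no RBH filtering or normalization is applied, matching the homology clustering rather than the orthology clustering.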

Phylogenetic Analysis

For each of the four clusterings, a presence/absence matrix was constructed where each species was coded as present or absent in a cluster depending on whether at least one protein sequence from that species was contained in that cluster. Phylogenetic trees were reconstructed from the resulting binary data matrices in RevBayes (Höhna et al. 2015) using the reversible binary substitution model (Felsenstein 1992; Ronquist et al. 2012). We also used an irreversible Dollo substitution model (Nicholls and Gray 2006; Alekseyenko et al. 2008), in which each gene family may be gained only once, and thereafter follows a pure-loss process. We computed the marginal likelihood for each model using stepping-stone sampling as implemented in RevBayes. In all cases, we used four discrete categories for gamma-distributed rates across sites, and we used a hierarchical exponential prior on the branch lengths. RevBayes scripts for these analyses have been deposited on GitHub (http://www.github.com/willpett/metazoa-gene-content). Convergence statistics were computed using the programs bpcomp and tracecomp in the PhyloBayes package (Lartillot et al. 2013), and are reported in supplementary table S1, Supplementary Material online.
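The coding of a clustering into a binary presence/absence matrix can be illustrated with a minimal Python sketch. The function name, the input layout (a list of clusters of protein identifiers plus a protein-to-species lookup), and the toy identifiers are hypothetical and are not taken from the deposited scripts.

```python
def presence_absence_matrix(clusters, species_of):
    """Code each species as present (1) or absent (0) in each cluster.

    clusters   : list of clusters, each a list of protein identifiers.
    species_of : dict mapping protein identifier -> species name.
    Returns (species, matrix), where matrix[i][j] is 1 if species i has
    at least one protein in cluster j.
    """
    species = sorted(set(species_of.values()))
    # Species represented in each cluster, computed once per cluster.
    members = [{species_of[p] for p in cluster} for cluster in clusters]
    matrix = [[1 if sp in present else 0 for present in members]
              for sp in species]
    return species, matrix


# Toy example with three clusters and three species.
toy_clusters = [["hs1", "mm1"], ["hs2"], ["mm2", "ml1"]]
toy_species = {"hs1": "Hsap", "hs2": "Hsap", "mm1": "Mmus",
               "mm2": "Mmus", "ml1": "Mlei"}
species, matrix = presence_absence_matrix(toy_clusters, toy_species)
```

Each row of the matrix is then one taxon and each column one gene family, which is the layout expected for binary character data.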

Correcting for Unobserved Losses

Many sequences did not have significant BLAST similarity with any other sequences in the data set, which we define as “orphans.” Orphans were not represented in the all versus all BLAST results, and were therefore not considered when coding presence/absence of genes. Many gene families were coded as present in only a single species, which we define as “singletons.” In order to avoid introducing an ascertainment bias by including only the coded subset of singletons (Pisani et al. 2015; Tarver et al. 2018), we removed all singletons from each presence/absence matrix prior to phylogenetic analysis, and applied a correction for the removal of singletons in RevBayes (coding=nosingletonpresence). To further evaluate the impact of including/excluding singletons, in a separate phylogenetic analysis we coded orphans as additional singletons, and did not apply the “nosingletonpresence” ascertainment bias correction. Finally, in all analyses we applied a correction for the fact that genes absent in all of our species cannot be observed (coding=noabsencesites).
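The removal of singletons, and the bookkeeping that relates singletons and orphans to the counts in table 1, can be sketched as follows. The function names and the matrix layout (rows are species, columns are gene families) are assumptions made for illustration; the ascertainment bias correction itself is applied inside RevBayes and is not reproduced here.

```python
def drop_singletons(matrix):
    """Remove gene families present in exactly one species.

    matrix : list of rows (one per species) of 0/1 values, one column per
             gene family.
    Returns (filtered_matrix, n_singletons).
    """
    n_families = len(matrix[0]) if matrix else 0
    # Number of species in which each gene family is present.
    counts = [sum(row[j] for row in matrix) for j in range(n_families)]
    keep = [j for j, c in enumerate(counts) if c > 1]
    n_singletons = sum(1 for c in counts if c == 1)
    filtered = [[row[j] for j in keep] for row in matrix]
    return filtered, n_singletons


def observed_n1(n_singletons, n_orphans):
    """Observed n1: coded singletons plus orphans."""
    return n_singletons + n_orphans


# Toy matrix: the middle family is present in one species only.
toy = [[1, 1, 0],
       [0, 0, 1],
       [1, 0, 1]]
filtered, n_single = drop_singletons(toy)

# Check against the meta23 Homologs row of table 1.
n1_meta23_homo = observed_n1(10596, 74021)
```

For the meta23 Homologs row of table 1 this gives 10,596 + 74,021 = 84,617, the value listed in the n1 column.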

Posterior Predictive Simulations

We evaluated the impact of including singletons in our phylogenetic analysis by simulating n0, the number of gene families present in zero species (lost in all species), and n1, the number of gene families present in one species, from their respective posterior predictive distributions, either conditioned on the inclusion or exclusion of singletons. Specifically, for each sample θ from the posterior, and given the number of gene families coded as present in more than one species N, we simulated n0 and n1 from the predictive distribution:
\[ \tilde{n}_0, \tilde{n}_1 \mid N, \theta \sim \mathrm{NegativeMultinomial}\big(N,\; p(k=0 \mid \theta),\; p(k=1 \mid \theta)\big) \]
where p(k|θ) is the binary substitution model likelihood of observing a gene family present in k species. The observed values of n1 and N for each analysis are listed in table 1. We estimated the observed value of n1 as the number of coded singletons plus the number of orphans. Note that n0 cannot be observed. Simulations were implemented in biphy, a software package for phylogenetic analysis of binary character data (http://www.github.com/willpett/biphy).
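A negative multinomial draw of this kind can be simulated directly by generating gene families one at a time until N families present in more than one species have been seen, and counting how many fell into the k = 0 and k = 1 classes along the way. The sketch below does this with the Python standard library. It is not the biphy implementation; the function names are hypothetical, and the probabilities p0 and p1 are placeholders for values that would in practice come from the binary substitution model for each posterior sample θ.

```python
import random

def simulate_n0_n1(N, p0, p1, rng):
    """Draw (n0, n1) from a negative multinomial with stopping count N.

    N  : number of gene families present in more than one species.
    p0 : probability a family is present in zero species (hypothetical input).
    p1 : probability a family is present in one species (hypothetical input).
    """
    if p0 < 0.0 or p1 < 0.0 or p0 + p1 >= 1.0:
        raise ValueError("need p0, p1 >= 0 and p0 + p1 < 1")
    n0 = n1 = observed = 0
    while observed < N:
        u = rng.random()
        if u < p0:
            n0 += 1          # family lost in all species (unobservable)
        elif u < p0 + p1:
            n1 += 1          # singleton family
        else:
            observed += 1    # family present in more than one species
    return n0, n1


def posterior_predictive(N, posterior_probs, seed=0):
    """One (n0, n1) draw per posterior sample, given (p0, p1) for each sample."""
    rng = random.Random(seed)
    return [simulate_n0_n1(N, p0, p1, rng) for p0, p1 in posterior_probs]
```

The expected value of each count under this scheme is N times its class probability divided by 1 − p0 − p1, which gives a quick sanity check on simulated output.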

Conclusion

The construction of any phylogenomic data set entails considerable difficulties in the identification of a sufficient number of homologous characters to infer a well-resolved species tree. Methods that utilize a single concatenated alignment of multiple loci to infer a species tree are further limited by the requirement for orthologous sequences, as this is the only way to ensure that a single tree describes their evolutionary history. Using gene content, or the presence or absence of gene families, as character data can circumvent this orthology requirement and may therefore allow inferences at deeper time-scales. Our phylogenetic analysis of homologous, but not orthologous, gene family content data from 26 metazoan species and 10 nonmetazoans strongly supports the classical view of animal phylogeny, with sponges as the sister group to all other animals. We conclude that the placement of some deeply diverging lineages, such as ctenophores, may exceed the limit of resolution afforded by traditional methods that use concatenated alignments of individual genes, and novel approaches are required to fully capture the evolutionary signal from genes within genomes. Along these lines, future studies may benefit from the use of methods that model features of gene content evolution beyond mere presence or absence, for example, by accounting for large-scale or whole genome duplications (Rabier et al. 2014), by allowing gain and loss rates to vary according to gene family size (Csűrös 2010), or by reconstructing gene content evolution in a gene tree–species tree framework (Szöllősi et al. 2015). In addition, methods comparing the impact of different ortholog identification algorithms (Emms and Kelly 2015; Miller et al. 2018) will help shed light on the differing amounts of phylogenetic signal contained in ortholog versus homolog content data. Finally, our analysis shows that accounting for ascertainment bias in gene family size is of general importance in the future development of probabilistic models of gene family evolution.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

W.P. was supported by the French National Research Agency (ANR) grant Ancestrome (ANR-10-BINF-01-01), and by grants from the National Science Foundation (DEB-1556615, DEB-1256993). D.P. was supported by a NERC BETR (NE/P013643/1) grant. G.W. was supported by LMU Munich’s Institutional Strategy LMUexcellent within the framework of the German Excellence Initiative, by the German Research Foundation (DFG) grant Wo896/19-1, and by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 764840 (ITN IGNITE).

References

Alekseyenko
AV
,
Lee
CJ
,
Suchard
MA.
2008
.
Wagner and Dollo: a stochastic duet by composing two parsimonious solos
.
Syst Biol
.
57
(
5
):
772
784
.

Arcila
D
,
Ortí
G
,
Vari
R
,
Armbruster
JW
,
Stiassny
MLJ
,
Ko
KD
,
Sabaj
MH
,
Lundberg
J
,
Revell
LJ
,
Betancur-R
R.
2017
.
Genome-wide interrogation advances resolution of recalcitrant groups in the tree of life
.
Nat Ecol Evol.
1
:
0020.

Boto
L.
2014
.
Horizontal gene transfer in the acquisition of novel traits by metazoans
.
Proc Biol Sci
.
281
(
1777
):
20132450.

Csűös
M.
2010
.
Count: evolutionary analysis of phylogenetic profiles with parsimony and likelihood
.
Bioinformatics
26
:
1910
1912
.

Cunningham
F
,
Amode
MR
,
Barrell
D
,
Beal
K
,
Billis
K
,
Brent
S
,
Carvalho-Silva
D
,
Clapham
P
,
Coates
G
,
Fitzgerald
S
, et al. .
2015
.
Ensembl 2015
.
Nucleic Acids Res.
43
(
D1
):
D662
D669
.

Dunn
CW
,
Giribet
G
,
Edgecombe
GD
,
Hejnol
A.
2014
.
Animal phylogeny and its evolutionary implications
.
Annu Rev Ecol Syst
.
45
(
1
):
371
395
.

Eitel
M
,
Francis
WR
,
Varoqueaux
F
,
Daraspe
J
,
Osigus
H-J
,
Krebs
S
,
Vargas
S
,
Blum
H
,
Williams
GA
,
Schierwater
B
, et al. .
2018
.
Comparative genomics and the nature of placozoan species
.
PLoS Biol
.
16
(
7
):
e2005359
.

Emms
DM
,
Kelly
S.
2015
.
OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy
.
Genome Biol
.
16
:
157.

Enright
AJ
,
Van Dongen
S
,
Ouzounis
CA.
2002
.
An efficient algorithm for large-scale detection of protein families
.
Nucleic Acids Res
.
30
(
7
):
1575
1584
.

Felsenstein
J.
1992
.
Phylogenies from restriction sites: a maximum-likelihood approach
.
Evolution
46
(
1
):
159
173
.

Feuda
R
,
Dohrmann
M
,
Pett
W
,
Philippe
H
,
Rota-Stabelli
O
,
Lartillot
N
,
Wörheide
G
,
Pisani
D.
2017
.
Improved modeling of compositional heterogeneity supports sponges as sister to all other animals
.
Curr Biol
.
27
(
24
):
3864
3870.e4
.

Finn
RD
,
Coggill
P
,
Eberhardt
RY
,
Eddy
SR
,
Mistry
J
,
Mitchell
AL
,
Potter
SC
,
Punta
M
,
Qureshi
M
,
Sangrador-Vegas
A
, et al. .
2016
.
The Pfam protein families database: towards a more sustainable future
.
Nucleic Acids Res.
44
(
D1
):
D279
D285
.

Fitch
WM.
2000
.
Homology: a personal view on some of the problems
.
Trends Genet
.
16
(
5
):
227
231
.

Fitz-Gibbon
ST
,
House
CH.
1999
.
Whole genome-based phylogenetic analysis of free-living microorganisms
.
Nucleic Acids Res
.
27
(
21
):
4218
4222
.

Fortunato
SAV
,
Adamski
M
,
Ramos
OM
,
Leininger
S
,
Liu
J
,
Ferrier
DEK
,
Adamska
M.
2014
.
Calcisponges have a ParaHox gene and dynamic expression of dispersed NK homeobox genes
.
Nature
514
(
7524
):
620
623
.

Fulton
DL
,
Li
YY
,
Laird
MR
,
Horsman
BGS
,
Roche
FM
,
Brinkman
FSL.
2006
.
Improving the specificity of high-throughput ortholog prediction
.
BMC Bioinformatics
7
:
270.

Gabaldón
T.
2008
.
Large-scale assignment of orthology: back to phylogenetics?
Genome Biol
.
9
(
10
):
235.

Grabherr
MG
,
Haas
BJ
,
Yassour
M
,
Levin
JZ
,
Thompson
DA
,
Amit
I
,
Adiconis
X
,
Fan
L
,
Raychowdhury
R
,
Zeng
Q
, et al. .
2011
.
Full-length transcriptome assembly from RNA-Seq data without a reference genome
.
Nat Biotechnol
.
29
(
7
):
644
652
.

Haas
BJ
,
Papanicolaou
A
,
Yassour
M
,
Grabherr
M
,
Blood
PD
,
Bowden
J
,
Couger
MB
,
Eccles
D
,
Li
B
,
Lieber
M
, et al. .
2013
.
De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis
.
Nat Protoc
.
8
(
8
):
1494
1512
.

Hemmrich
G
,
Bosch
TCG.
2008
.
Compagen, a comparative genomics platform for early branching metazoan animals, reveals early origins of genes regulating stem-cell differentiation
.
Bioessays
30
(
10
):
1010
1018
.

Höhna
S
,
Landis
MJ
,
Heath
TA
,
Boussau
B
,
Lartillot
N
,
Moore
BR
,
Huelsenbeck
JP
,
Ronquist
F.
2015
.
RevBayes: Bayesian Phylogenetic Inference Using Graphical Models and an Interactive Model-Specification Language
.
Syst Biol
.
65
(
4
):
726
736
.

Jackson
DJ
,
Macis
L
,
Reitner
J
,
Worheide
G.
2011
.
A horizontal gene transfer supported the evolution of an early metazoan biomineralization strategy
.
BMC Evol Biol
.
11
:
238.

King
N
,
Rokas
A.
2017
.
Embracing uncertainty in reconstructing early animal evolution
.
Curr Biol
.
27
(
19
):
R1081
R1088
.

Lake
JA
,
Rivera
MC.
2004
.
Deriving the genomic tree of life in the presence of horizontal gene transfer: conditioned reconstruction
.
Mol Biol Evol
.
21
(
4
):
681
690
.

Lartillot
N
,
Rodrigue
N
,
Stubbs
D
,
Richer
J.
2013
.
PhyloBayes MPI: phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment
.
Syst Biol
.
62
(
4
):
611
615
.

Leininger
S
,
Adamski
M
,
Bergum
B
,
Guder
C
,
Liu
J
,
Laplante
M
,
Bråte
J
,
Hoffmann
F
,
Fortunato
S
,
Jordal
S
, et al. .
2014
.
Developmental gene expression provides clues to relationships between sponge and eumetazoan body plans
.
Nat Commun
.
5
:
3905.

Li L, Stoeckert CJ Jr, Roos DS. 2003. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13(9):2178–2189.

Li W, Jaroszewski L, Godzik A. 2001. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics. 17(3):282–283.

Martin AP, Burg TM. 2002. Perils of paralogy: using HSP70 genes for inferring organismal phylogenies. Syst Biol. 51(4):570–587.

Miller JB, Pickett BD, Ridge PG. 2018. JustOrthologs: a fast, accurate, and user-friendly ortholog identification algorithm. Bioinformatics. 35(4):546–552.

Moroz LL, Kocot KM, Citarella MR, Dosung S, Norekian TP, Povolotskaya IS, Grigorenko AP, Dailey C, Berezikov E, Buckley KM, et al. 2014. The ctenophore genome and the evolutionary origins of neural systems. Nature. 510(7503):109–114.

Nicholls GK, Gray RD. 2006. Quantifying uncertainty in a stochastic model of vocabulary evolution. In: Phylogenetic methods and the prehistory of languages (eds P. Forster & C. Renfrew), pp. 161–171. Cambridge, UK: McDonald Institute for Archaeological Research.

Nosenko T, Schreiber F, Adamska M, Adamski M, Eitel M, Hammel J, Maldonado M, Müller WEG, Nickel M, Schierwater B, et al. 2013. Deep metazoan phylogeny: when different genes tell different stories. Mol Phylogenet Evol. 67(1):223–233.

Pereira C, Denise A, Lespinet O. 2014. A meta-approach for improving the prediction and the functional annotation of ortholog groups. BMC Genomics. 15(Suppl 6):S16.

Philippe H, Derelle R, Lopez P, Pick K, Borchiellini C, Boury-Esnault N, Vacelet J, Renard E, Houliston E, Quéinnec E, et al. 2009. Phylogenomics revives traditional views on deep animal relationships. Curr Biol. 19(8):706–712.

Pick KS, Philippe H, Schreiber F, Erpenbeck D, Jackson DJ, Wrede P, Wiens M, Alié A, Morgenstern B, Manuel M, et al. 2010. Improved phylogenomic taxon sampling noticeably affects nonbilaterian relationships. Mol Biol Evol. 27(9):1983–1987.

Pisani D, Pett W, Dohrmann M, Feuda R, Rota-Stabelli O, Philippe H, Lartillot N, Wörheide G. 2015. Genomic data do not support comb jellies as the sister group to all other animals. Proc Natl Acad Sci U S A. 112(50):15402–15407.

Rabier C-E, Ta T, Ané C. 2014. Detecting and locating whole genome duplications on a phylogeny: a probabilistic approach. Mol Biol Evol. 31(3):750–762.

Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Höhna S, Larget B, Liu L, Suchard MA, Huelsenbeck JP. 2012. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol. 61(3):539–542.

Ruiz-Trillo I, Burger G, Holland PWH, King N, Lang BF, Roger AJ, Gray MW. 2007. The origins of multicellularity: a multi-taxon genome initiative. Trends Genet. 23(3):113–118.

Ryan JF, Pang K, NISC Comparative Sequencing Program, Mullikin JC, Martindale MQ, Baxevanis AD. 2010. The homeodomain complement of the ctenophore Mnemiopsis leidyi suggests that Ctenophora and Porifera diverged prior to the ParaHoxozoa. Evodevo. 1(1):9.

Ryan JF, Pang K, Schnitzler CE, Nguyen A-D, Moreland RT, Simmons DK, Koch BJ, Francis WR, Havlak P, Smith SA, et al. 2013. The genome of the ctenophore Mnemiopsis leidyi and its implications for cell type evolution. Science. 342(6164):1242592.

Shen X-X, Hittinger CT, Rokas A. 2017. Contentious relationships in phylogenomic studies can be driven by a handful of genes. Nat Ecol Evol. 1(5):126.

Simion P, Philippe H, Baurain D, Jager M, Richter DJ, Di Franco A, Roure B, Satoh N, Quéinnec É, Ereskovsky A, et al. 2017. A large and consistent phylogenomic dataset supports sponges as the sister group to all other animals. Curr Biol. 27(7):958–967.

Struck TH. 2013. The impact of paralogy on phylogenomic studies – a case study on annelid relationships. PLoS One. 8(5):e62892.

Struck TH. 2014. TreSpEx-detection of misleading signal in phylogenetic reconstructions based on tree information. Evol Bioinform Online. 10:51–67.

Szöllősi GJ, Tannier E, Daubin V, Boussau B. 2015. The inference of gene trees with species trees. Syst Biol. 64(1):e42–e62.

Tarver JE, Taylor RS, Puttick MN, Lloyd GT, Pett W, Fromm B, Schirrmeister BE, Pisani D, Peterson KJ, Donoghue PCJ. 2018. Well-annotated microRNAomes do not evidence pervasive miRNA loss. Genome Biol Evol. 10(6):1457–1470.

van der Heijden R, Snel B, van Noort V, Huynen M. 2007. Orthology prediction at scalable resolution by phylogenetic tree analysis. BMC Bioinformatics. 8:83.

Van Dongen SM. 2000. Graph clustering by flow simulation. PhD thesis, University of Utrecht.

Whelan NV, Kocot KM, Moroz LL, Halanych KM. 2015. Error, signal, and the placement of Ctenophora sister to all other animals. Proc Natl Acad Sci U S A. 112:201503453.

Whelan NV, Kocot KM, Moroz TP, Mukherjee K, Williams P, Paulay G, Moroz LL, Halanych KM. 2017. Ctenophore relationships and their placement as the sister group to all other animals. Nat Ecol Evol. 1(11):1737–1746.

Zamani-Dahaj SA, Okasha M, Kosakowski J, Higgs PG. 2016. Estimating the frequency of horizontal gene transfer using phylogenetic models of gene gain and loss. Mol Biol Evol. 33(7):1843–1857.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Joel Dudley

Supplementary data