-
PDF
- Split View
-
Views
-
Cite
Cite
Benjamin J. Liebeskind, Claire D. McWhite, Edward M. Marcotte, Towards Consensus Gene Ages, Genome Biology and Evolution, Volume 8, Issue 6, June 2016, Pages 1812–1823, https://doi.org/10.1093/gbe/evw113
Close -
Share
Abstract
Correctly estimating the age of a gene or gene family is important for a variety of fields, including molecular evolution, comparative genomics, and phylogenetics, and increasingly for systems biology and disease genetics. However, most studies use only a point estimate of a gene’s age, neglecting the substantial uncertainty involved in this estimation. Here, we characterize this uncertainty by investigating the effect of algorithm choice on gene-age inference and calculate consensus gene ages with attendant error distributions for a variety of model eukaryotes. We use 13 orthology inference algorithms to create gene-age datasets and then characterize the error around each age-call on a per-gene and per-algorithm basis. Systematic error was found to be a large factor in estimating gene age, suggesting that simple consensus algorithms are not enough to give a reliable point estimate. We also found that different sources of error can affect downstream analyses, such as gene ontology enrichment. Our consensus gene-age datasets, with associated error terms, are made fully available at so that researchers can propagate this uncertainty through their analyses ( geneages.org ).
Introduction
From their inception, whole genome datasets have been used to infer the evolutionary history of gene families ( Mushegian and Koonin 1996 ). The age of a gene family, its provenance, and its evolutionary history, such as loss and duplication events, can inform us about its function ( Capra et al. 2013 ). For instance, gene age has been found to correlate with disease association ( Domazet-Lošo and Tautz 2008 ; Maxwell et al. 2014 ), evolutionary rate ( Wolf et al. 2009 ), and the number of associated protein-interaction partners ( Kim and Marcotte 2008 ), and a gene’s phylogenetic distribution can be used to infer aspects of its function ( Pellegrini et al. 1999 ). Gene ages can also be used to estimate the gene content of ancient organisms, such as the last universal common ancestor (LUCA, Mushegian and Koonin 1996 ), or the last eukaryotic common ancestor (LECA, Thiergart et al. 2012 ; Koumandou et al. 2013 ). Accordingly, an analysis of gene family ages on a genomic scale can inform the phylogenetic history of important phenotypes, such as eyes or the nervous system ( Rivera et al. 2010 ; Liebeskind et al. 2016 ). In more recent years, gene age has been used to annotate systems biology datasets ( Conaco et al. 2012 ; Alié et al. 2015 ; Wan et al. 2015 ), with the promise of elucidating the evolutionary history of core cellular machinery.
Such studies rely first and foremost upon the correct identification of homologs and/or orthologs. These two relationships form the basis of the gene-age determination in nearly all studies, with orthology being the more common criterion ( Gabaldón 2008 ; Maxwell et al. 2014 ). Orthology is a pairwise relationship between two genes that occurs when their most recent common ancestor (MRCA) lies at a speciation event in a phylogenetic gene tree. This is in contrast to paralogs, whose MRCA lies at a gene duplication event (nodes on gene trees represent either speciation or gene duplication events, barring horizontal gene transfer) ( Fitch 1970 , 2000 ). Orthologs tend to display higher functional conservation than paralogs (although perhaps only weakly, Chen and Zhang 2012 , – see Gabaldón and Koonin 2013 , for a review), hence their use as a basis of cross-species comparison. Typically, studies of gene age will consider an orthologous group to be all the descendent lineages of the deepest speciation node, or the divergence between the two most distant homologs, if that is the criterion being used, as in “phylostratigraphy” ( Domazet-Lošo and Tautz 2008 ). Then, the age of the gene group is defined as the MRCA of the species found in that group.
Inferring a gene family’s age thus relies on the accuracy of orthology assignment, but inferring correct orthologs is notoriously difficult, with no one of the more than 30 algorithms out-performing all others ( Sonnhammer et al. 2014 ; Altenhoff et al. 2016 ). In particular, algorithms differ strongly in the tradeoff between recall and precision ( Altenhoff et al. 2016 ). And many, perhaps all, algorithms may underestimate the size of orthologous groups due to inherent sensitivity limits in homology searches, thereby underestimating gene age ( Moyers and Zhang 2015 , 2016 ). Yet most studies on gene age rely on only one kind of algorithm, either using a pre-existing method or establishing an ad hoc protocol, most of which resemble one of the pre-existing algorithms ( Maxwell et al. 2014 ). Although methods for probabilistic orthology assignment do exist ( Ullah et al. 2015 ), available methods are not currently scalable to large genomic datasets using protein sequences, and at any rate still rely on a preliminary clustering step to infer gene families. Consensus algorithms also exist, some of which seem to substantially improve performance on established benchmarks ( Pereira et al. 2014 ; Maher and Hernandez 2015 ). However, these methods still give only a point estimate. Another approach is to propagate the uncertainty that necessarily arises in orthology inference through subsequent analyses. However, it is unclear what the relevant sources of uncertainty are in orthology inference, and most consensus algorithms do not keep track of the different sources of error.
To remedy this situation, we characterized the error structure of gene-age estimation using 13 popular orthology inference algorithms. In doing so, we identify common types of errors and, after correcting these, present consensus gene-age calls for several model organisms ( table 1 ). We provide these gene-age estimates along with a detailed analysis and annotation of the uncertainty associated with each age call so that this uncertainty can be propagated through future analyses, as we show for functional term enrichment. The consensus gene ages we calculate can be used for annotating genomic datasets in a variety of fields, and the analysis of error will help to prioritize genes for manual annotation and aspects of orthology inference for future study.
Species for which final consensus tables were constructed. Tables are available at https://github.com/marcottelab/Gene-Ages
| Common name . | Uniprot ID . | False-negative analysis . |
|---|---|---|
| Anopheles gambiae (Mosquito) | ANOGA | No |
| Bos taurus (Cattle) | BOVIN | No |
| Branchiostoma floridae (Lancelet) | BRAFL | No |
| Caenorhabditis elegans (Worm) | CAEEL | Yes |
| Candida albicans | CANAL | Yes |
| Canis lupus familiaris (Dog) | CANFA | No |
| Gallus gallus (Chicken) | CHICK | Yes |
| Ciona intestinalis (Tunicate) | CIOIN | No |
| Cryptococcus neoformans | CRYNJ | No |
| Danio rerio (Zebrafish) | DANRE | Yes |
| Drosophila melanogaster (Fly) | DROME | Yes |
| Homo sapiens (Human) | HUMAN | Yes |
| Ixodes scapularis (Tick) | IXOSC | No |
| Macaca mulatta (Rhesus macaque) | MACMU | No |
| Monosiga brevicollis (Choanoflagellate) | MONBE | No |
| Monodelphis domestica (Opossum) | MONDO | No |
| Mus musculus (Mouse) | MOUSE | Yes |
| Nematostella vectensis (Sea anemone) | NEMVE | No |
| Neurospora crassa | NEUCR | No |
| Ornithorhynchus anatinus (Platypus) | ORNAN | No |
| Pan troglodytes (Chimp) | PANTR | No |
| Phaeosphaeria nodorum | PHANO | No |
| Rattus rattus (Rat) | RAT | Yes |
| Saccaromyces cerevisiae (Budding yeast) | YEAST | Yes |
| Schistosoma mansoni (Blood fluke) | SCHMA | No |
| Schizosaccharomyces pombe (Fission yeast) | SCHPO | Yes |
| Sclerotinia sclerotiorum | SCLS1 | No |
| Takifugu rubripes (Pufferfish) | TAKRU | No |
| Ustilago maydis (Corn smut/Huitlacoche | USTMA | No |
| Xenopus tropicalis (Frog) | XENTR | No |
| Yarrowia lipolytica | YARLI | No |
| Common name . | Uniprot ID . | False-negative analysis . |
|---|---|---|
| Anopheles gambiae (Mosquito) | ANOGA | No |
| Bos taurus (Cattle) | BOVIN | No |
| Branchiostoma floridae (Lancelet) | BRAFL | No |
| Caenorhabditis elegans (Worm) | CAEEL | Yes |
| Candida albicans | CANAL | Yes |
| Canis lupus familiaris (Dog) | CANFA | No |
| Gallus gallus (Chicken) | CHICK | Yes |
| Ciona intestinalis (Tunicate) | CIOIN | No |
| Cryptococcus neoformans | CRYNJ | No |
| Danio rerio (Zebrafish) | DANRE | Yes |
| Drosophila melanogaster (Fly) | DROME | Yes |
| Homo sapiens (Human) | HUMAN | Yes |
| Ixodes scapularis (Tick) | IXOSC | No |
| Macaca mulatta (Rhesus macaque) | MACMU | No |
| Monosiga brevicollis (Choanoflagellate) | MONBE | No |
| Monodelphis domestica (Opossum) | MONDO | No |
| Mus musculus (Mouse) | MOUSE | Yes |
| Nematostella vectensis (Sea anemone) | NEMVE | No |
| Neurospora crassa | NEUCR | No |
| Ornithorhynchus anatinus (Platypus) | ORNAN | No |
| Pan troglodytes (Chimp) | PANTR | No |
| Phaeosphaeria nodorum | PHANO | No |
| Rattus rattus (Rat) | RAT | Yes |
| Saccaromyces cerevisiae (Budding yeast) | YEAST | Yes |
| Schistosoma mansoni (Blood fluke) | SCHMA | No |
| Schizosaccharomyces pombe (Fission yeast) | SCHPO | Yes |
| Sclerotinia sclerotiorum | SCLS1 | No |
| Takifugu rubripes (Pufferfish) | TAKRU | No |
| Ustilago maydis (Corn smut/Huitlacoche | USTMA | No |
| Xenopus tropicalis (Frog) | XENTR | No |
| Yarrowia lipolytica | YARLI | No |
Species for which final consensus tables were constructed. Tables are available at https://github.com/marcottelab/Gene-Ages
| Common name . | Uniprot ID . | False-negative analysis . |
|---|---|---|
| Anopheles gambiae (Mosquito) | ANOGA | No |
| Bos taurus (Cattle) | BOVIN | No |
| Branchiostoma floridae (Lancelet) | BRAFL | No |
| Caenorhabditis elegans (Worm) | CAEEL | Yes |
| Candida albicans | CANAL | Yes |
| Canis lupus familiaris (Dog) | CANFA | No |
| Gallus gallus (Chicken) | CHICK | Yes |
| Ciona intestinalis (Tunicate) | CIOIN | No |
| Cryptococcus neoformans | CRYNJ | No |
| Danio rerio (Zebrafish) | DANRE | Yes |
| Drosophila melanogaster (Fly) | DROME | Yes |
| Homo sapiens (Human) | HUMAN | Yes |
| Ixodes scapularis (Tick) | IXOSC | No |
| Macaca mulatta (Rhesus macaque) | MACMU | No |
| Monosiga brevicollis (Choanoflagellate) | MONBE | No |
| Monodelphis domestica (Opossum) | MONDO | No |
| Mus musculus (Mouse) | MOUSE | Yes |
| Nematostella vectensis (Sea anemone) | NEMVE | No |
| Neurospora crassa | NEUCR | No |
| Ornithorhynchus anatinus (Platypus) | ORNAN | No |
| Pan troglodytes (Chimp) | PANTR | No |
| Phaeosphaeria nodorum | PHANO | No |
| Rattus rattus (Rat) | RAT | Yes |
| Saccaromyces cerevisiae (Budding yeast) | YEAST | Yes |
| Schistosoma mansoni (Blood fluke) | SCHMA | No |
| Schizosaccharomyces pombe (Fission yeast) | SCHPO | Yes |
| Sclerotinia sclerotiorum | SCLS1 | No |
| Takifugu rubripes (Pufferfish) | TAKRU | No |
| Ustilago maydis (Corn smut/Huitlacoche | USTMA | No |
| Xenopus tropicalis (Frog) | XENTR | No |
| Yarrowia lipolytica | YARLI | No |
| Common name . | Uniprot ID . | False-negative analysis . |
|---|---|---|
| Anopheles gambiae (Mosquito) | ANOGA | No |
| Bos taurus (Cattle) | BOVIN | No |
| Branchiostoma floridae (Lancelet) | BRAFL | No |
| Caenorhabditis elegans (Worm) | CAEEL | Yes |
| Candida albicans | CANAL | Yes |
| Canis lupus familiaris (Dog) | CANFA | No |
| Gallus gallus (Chicken) | CHICK | Yes |
| Ciona intestinalis (Tunicate) | CIOIN | No |
| Cryptococcus neoformans | CRYNJ | No |
| Danio rerio (Zebrafish) | DANRE | Yes |
| Drosophila melanogaster (Fly) | DROME | Yes |
| Homo sapiens (Human) | HUMAN | Yes |
| Ixodes scapularis (Tick) | IXOSC | No |
| Macaca mulatta (Rhesus macaque) | MACMU | No |
| Monosiga brevicollis (Choanoflagellate) | MONBE | No |
| Monodelphis domestica (Opossum) | MONDO | No |
| Mus musculus (Mouse) | MOUSE | Yes |
| Nematostella vectensis (Sea anemone) | NEMVE | No |
| Neurospora crassa | NEUCR | No |
| Ornithorhynchus anatinus (Platypus) | ORNAN | No |
| Pan troglodytes (Chimp) | PANTR | No |
| Phaeosphaeria nodorum | PHANO | No |
| Rattus rattus (Rat) | RAT | Yes |
| Saccaromyces cerevisiae (Budding yeast) | YEAST | Yes |
| Schistosoma mansoni (Blood fluke) | SCHMA | No |
| Schizosaccharomyces pombe (Fission yeast) | SCHPO | Yes |
| Sclerotinia sclerotiorum | SCLS1 | No |
| Takifugu rubripes (Pufferfish) | TAKRU | No |
| Ustilago maydis (Corn smut/Huitlacoche | USTMA | No |
| Xenopus tropicalis (Frog) | XENTR | No |
| Yarrowia lipolytica | YARLI | No |
Methods
Data Collection
In order to fairly consider the range of orthology algorithms, we took advantage of the reference datasets managed by the Quest for Orthologs (QFO) consortium. QFO researchers have established community standards and benchmarks for orthology inference and have made their benchmarking results publicly available ( Sonnhammer et al. 2014 ; Altenhoff et al. 2016,) . Fifteen algorithms have submitted their orthology estimates on 66 reference proteomes ( http://www.ebi.ac.uk/reference_proteomes , last accessed 19 May 2016) to QFO’s benchmarking tool ( http://orthology.benchmarkservice.org/cgi-bin/gateway.pl , last accessed 19 May 2016) ( Altenhoff et al. 2016 ). Importantly, these algorithms are widely used and capture a variety of methods commonly used in the literature to infer orthology and gene age ( Vilella et al. 2009 ; Huerta-Cepas et al. 2008 ; Pryszcz et al. 2011 ; DeLuca et al. 2012 ; Mi et al. 2013 ; Altenhoff et al. 2015 ; Huerta-Cepas et al. 2016 ; Sonnhammer and Östlund 2015 ; Linard et al. 2015 ). It is, therefore, expected that nearly every study of gene age, regardless of the method used, will closely resemble the results of at least one of the algorithms we explore here. We omitted two of these because they either did not have full taxon coverage (RBH), or their results were so different from all the others that it dominated the variance in all downstream analyses (OMA_GETHOGS). Pairwise orthology calls for the 13 remaining algorithms were converted into tables for each gene, which were then used for subsequent analyses. The reference species tree was downloaded from SwissTree ( Boeckmann et al. 2015 ) on 2015 Jun 15 ( ftp://ftp.lausanne.isb-sib.ch/databases/SwissTree/speciestree.nhx , last accessed 19 May 2016) and was pruned to match the taxa in the Quest for Orthologs reference proteomes ( http://www.ebi.ac.uk/reference_proteomes , last accessed 19 May 2016). The results below are with reference to the human proteome, but the same methods were applied to a variety of model organism proteomes ( table 1 ).
Custom programs were written to perform the analyses below, and these are publicly available, as are iPython notebooks used for plotting. These, and the datasets supporting the conclusions in this article, are available at geneages.org, and the associated GitHub repository ( https://github.com/marcottelab/Gene-Ages , last accessed 19 May 2016) with the following commit id: fee8d009d4d5ee24c3ae3cb0763439c48d4705e6. Scripts relied heavily on the python packages dendropy ( Sukumaran and Holder 2010 ), BioPython ( Cock et al. 2009 ), and pandas ( McKinney 2013 ).
Protein Age Calls
We inferred the ages for each gene by mapping the species in that gene’s ortholog group onto a reference species tree from SwissTree, which was derived from a consensus of trees found in the literature ( Boeckmann et al. 2015 ). The age of a protein is calculated on the species tree by finding the MRCA node of the taxa that have orthologs of that protein. This node is the “node age.” We use the node age to calculate a simple statistic that captures the uncertainty around the age-call, called the “node-error.” This is the average number of branches (patristic distance, with equal weights of 1 for all branches) between the age calls any two algorithms. We also used the average node-error between pairs of algorithms as input for a heuristic search in PAUP ( Swofford 2003 ) to cluster algorithms by similarity (see below).
To simplify the comparison of algorithm performance on human genes, we broke the reference species tree into eight age categories – the “binned age” ( fig. 1 ). These categories form nested clades, with the exception of the category “Euk + Bacteria.” This non-phylogenetic category captures the substantial number of eukaryotic genes that were horizontally transferred from bacteria after eukaryotes diverged from the rest of archaea ( Méheust et al. 2015 ; Pittis and Gabaldón 2016 ), and is defined as genes present in eukaryotes and bacteria but not archaea. These “binned ages” conform to the interior labels given by SwissTree.
The reference species tree and age categories used for gene-age inference. This tree is based on SwissTree ( Boeckmann et al. 2015 ) and reflects a consensus of recent large-scale phylogenies. Tip names are Uniprot species identifiers.
The reference species tree and age categories used for gene-age inference. This tree is based on SwissTree ( Boeckmann et al. 2015 ) and reflects a consensus of recent large-scale phylogenies. Tip names are Uniprot species identifiers.
Filtering False Positives and Negatives
Before calculating a consensus, we flag algorithms that may have committed false-positive or false-negative errors on a per-gene basis. These algorithms are then removed from consideration of that gene’s age. False positives are orthology calls that are substantially more distant than orthology calls by other algorithms, and have the effect of driving age deeper in the tree. These are found as follows. For each algorithm and each protein: (1) the node age is calculated, (2) the number of taxa in the species tree descended from this node is found, and (3) the number of taxa containing orthologs of the focal protein is subtracted from the number of descendant taxa. This number is the number of taxa without the orthogroup that are descended from an ancestor that putatively had the orthogroup, and is, therefore, proportional, but not identical, to the number of inferred losses of the orthogroup. For each algorithm and each protein, if this number is two standard deviations above the pooled algorithm mean for the focal protein, that algorithm’s age call is considered a false positive and is thrown out.
False negatives are cases where an algorithm fails to make an orthology call, driving the inferred age to shallower nodes in the species tree. We identify one possible cause of this, which we call “over-splitting.” This is when a group of co-orthologs is not correctly recognized by an algorithm and only one or a few of its members are found as orthologs to a more distant species, while the others are split off into their own orthogroups. The members that are split off would then be called at an incorrectly young age. To identify these errors, we used PhylomeDB’s ( Huerta-Cepas et al. 2008 ) orthogroups as a standard. For each protein and each algorithm (except for PhylomeDB), if the focal algorithm called a younger age than PhylomeDB and a co-ortholog of the focal protein could be found where the focal algorithm called the same node age as PhylomeDB did on the focal protein, then this algorithm was considered to be over-splitting the focal protein, and was not considered in this protein’s age call. This error calculation was not performed on proteins where PhylomeDB was flagged as a false positive.
We also investigated the correlation between several intrinsic properties of proteins and their node error statistic. These included protein length, number of domains (as annotated by HMMER, Eddy 1998 ), and evolutionary rate. Rate was calculated for each protein by performing pairwise Needleman–Wunsch alignments for all one-to-one orthologs (annotated by PhylomeDB) between human and mouse, and human and yeast.
Consensus Ages
We generated consensus binned ages after removing algorithms flagged with false positives and negatives as described above. The number of algorithms favoring each binned age is counted and then normalized by the number of contributing algorithms to give a distribution over age calls. For subsequent analyses, we used the mode of this distribution as the consensus age.
Results
The Effect of Algorithm Choice on the Distribution of Human Gene Ages
We investigated the effect of algorithm choice on gene-age estimation by assigning each human gene to the “binned” age category in which the MRCA of the species in its orthogroup fall. We then calculated the distributions over the different age categories for the human proteome inferred by each algorithm. The algorithms fell into two distinct groups with respect to the distribution of age classes ( fig. 2 ). Hierarchically clustering the algorithms by the average number of branches between their per-gene age calls recapitulated this grouping, and we define the two groups based on the midpoint root of this tree. One group tended to find that most orthogroups could be traced to the MRCA of vertebrates, whereas the other group found a much older mode age dating back to LECA. We call these two groups, the “young” and the “old” group, respectively, although, of course, there are many more subtle and interesting distinctions between the algorithms.
Distribution of age categories in the human proteome inferred by 13 different orthology inference algorithms. Algorithms were clustered according to the average pairwise distance between their age-calls, counted in units of braches (patristic distance with equal weights of 1 for all branches). The distance tree is rooted at the midpoint. Algorithms are colored by the methods they use to infer orthology. They either use a graph-based or a gene tree-based strategy, either with, or without, the use of a species tree (dotted outline).
Distribution of age categories in the human proteome inferred by 13 different orthology inference algorithms. Algorithms were clustered according to the average pairwise distance between their age-calls, counted in units of braches (patristic distance with equal weights of 1 for all branches). The distance tree is rooted at the midpoint. Algorithms are colored by the methods they use to infer orthology. They either use a graph-based or a gene tree-based strategy, either with, or without, the use of a species tree (dotted outline).
Orthology inference algorithms are typically classed into graph-based and tree-based methods ( Sonnhammer et al. 2014 ). However, we found that even though tree-based methods tended to fall in the “old” group, this was not universally the case, or were all graph-based methods found in the “young” group. The use of species tree information was not a determining factor either ( fig. 2 ). The bimodal nature of the age calls, either “young” or “old”, is, therefore, not simply a reflection of the graph/tree distinction, although it is clearly correlated. What is the source of this bimodality? One obvious answer is systematic error in the “young” group algorithms, the “old” group, or both. Systematic error in the young group would be equivalent to false negatives, that is, missing orthology assignments, whereas systematic error in the old group is equivalent to false positives, or spurious orthology assignments. This would have the effect of pushing the age of the group away from or towards the root of the tree, respectively.
Identifying Systematic Error
We first investigated whether the bimodality of age-calls played out on the single gene level or whether the two groups apparent in figure 2 were due to the effects of averaging across genes, with error being randomly distributed among proteins. To do so, we calculated a simple statistic that captured how bimodal a protein’s age calls were between the two groups of algorithms (“old” and “young”). This statistic, which we call bimodality, is the difference between node-error within the two groups and between them, with more highly bimodal proteins having more variation between groups. Over 80% of proteins had some degree of bimodality corresponding to these two age groups, or none, as is expected given the hierarchical clustering in figure 1 . The remaining genes were anti-correlated with the “old”/“young” groupings. Furthermore, the degree of bimodality between the “young” and “old” algorithm groups correlates well with the amount of error associated with each protein (Spearman’s ρ : 0.69) ( fig. 3 ). That is, proteins with a large amount of error tend to be more bimodal. The bimodality between algorithms is, therefore, a systematic phenomenon and a major source of error in these datasets. Unfortunately, in the case of highly polarized genes, we cannot know a priori whether the “old” or “young” age is the correct one. It is, therefore, important to propagate this uncertainty through further analyses, and the bimodality statistic is included with our consensus age estimates.
Error statistics. (A) The distribution of average node error, a measure of disagreement among the algorithms for a given gene, is given, along with a plot of the average bimodaliy in each bin. Genes with more error tend to be more bimodal between “old” and “young” algorithms. (B) Example of a strongly bimodal and weakly bimodal gene with a few representative algorithms. The ages are given as categories for clarity, but the bimodality statistic is calculated according to the number of branches between node age-calls (see the Methods section).
Error statistics. (A) The distribution of average node error, a measure of disagreement among the algorithms for a given gene, is given, along with a plot of the average bimodaliy in each bin. Genes with more error tend to be more bimodal between “old” and “young” algorithms. (B) Example of a strongly bimodal and weakly bimodal gene with a few representative algorithms. The ages are given as categories for clarity, but the bimodality statistic is calculated according to the number of branches between node age-calls (see the Methods section).
We also investigated whether aspects of the individual proteins contributed to systematic error. For instance, it may be difficult to infer correct evolutionary relationships for small proteins, or those with many domains. At least one orthology inference algorithm uses this idea to “correct” for protein length ( Emms and Kelly 2015 ). However, we found that protein length has a weak positive correlation with age-call error, and that the number of domains also correlates weakly (Spearman’s ρ : < 0.2 in both cases). Several simulation studies have found that increased evolutionary rate can lead to false-negatives in phylostratigraphic studies ( Moyers and Zhang 2015 , 2016 ). We found that similarity of one-to-one orthologs and node error have a negative correlation, as predicted by these studies, but weakly for human–mouse orthologs (Spearman’s ρ = −0.13) and only slightly stronger for human–yeast orthologs (−0.36). We did find, however, that proteins in the youngest age class, “Mammalia,” tended to have a higher rate of evolution than older age-classes. We cannot, therefore, rule out the possibility that some of the “youngest” proteins may be due to the limited sensitivity of the homology searches (see also supplemental iPython notebook “errorCorrelation_plotting” ( Supplementary Material online) at https://github.com/marcottelab/Gene-Ages/tree/master/Notebooks ).
Systematic False Negatives
What are the causes of systematic false negatives and can we identify them without a priori knowledge of the true orthogroup? One clue comes from the different age-category distributions between PANTHER8_all and PANTHER8_LDO ( Mi et al. 2013 ). These two sets of orthology calls are based on the same set of gene trees, but differ in their definition of orthology. “LDO” stands for “least diverged ortholog,” and only considers the least diverged among a set of co-orthologs to be the true ortholog of an outgroup. This can be contrasted to the traditional phylogenetic definition of orthology where all co-orthologs are equally orthologous to the outgroup ( fig. 4 ; Fitch 2000 ). Although it may be useful to split co-orthologous groups, as the LDO definition does, in cases where orthology is being used for, for example, gene function annotation, it is inappropriate for defining the age of a gene or gene family because the age must be in reference to the topology of the phylogenetic tree. The fact that PANTHER8_LDO’s age category distribution resembled that of several graph-based methods, and the fact that it clustered with them based on its per-gene age calls ( fig. 2 ), suggests that these methods may be splitting up co-orthologous groups as well.
Determination of false negatives due to co-ortholog over-splitting. This tree compares the ages given by least derived orthology (LDO) and traditional, phylogenetic orthology (Phylo.). Given a group of co-orthologs in Species 1, LDO will give only the co-ortholog with the shortest distance to an outgroup (gene in Species 2) the status of ortholog to this outgroup (red box). All others are put in separate orthogroups. Hence, LDO produces more genes that are mapped (incorrectly) to a younger age (Y), whereas traditional, phylogenetic orthology (blue box) includes all co-orthologs to the orthogroup, thereby mapping more genes to the older age (O).
Determination of false negatives due to co-ortholog over-splitting. This tree compares the ages given by least derived orthology (LDO) and traditional, phylogenetic orthology (Phylo.). Given a group of co-orthologs in Species 1, LDO will give only the co-ortholog with the shortest distance to an outgroup (gene in Species 2) the status of ortholog to this outgroup (red box). All others are put in separate orthogroups. Hence, LDO produces more genes that are mapped (incorrectly) to a younger age (Y), whereas traditional, phylogenetic orthology (blue box) includes all co-orthologs to the orthogroup, thereby mapping more genes to the older age (O).
There is no gold standard set of co-orthologs in this dataset, so we used the database PhylomeDB as a reference for identifying co-ortholog over-splitting. PhylomeDB was chosen because it infers gene trees under Maximum-Likelihood with well-characterized models, and is, therefore, similar to how most researchers infer orthologs when analyzing one gene family at a time. We nevertheless recognize the limitation inherent in choosing a single, imperfect set as a reference. PhylomeDB summary files for 10 species in PhylomeDB’s model species collection (PhyC2) that overlapped with species in our tree ( table 1 ) were downloaded, and we determined groups of co-orthologs that were then used for the analysis. Briefly, for protein (A), if an algorithm called a younger age (Y) and PhylomeDB an older age (O), and if in the co-orthologs of (A) we could find a protein (B) which that algorithm called at age (O), then (B) was identified as the LDO, age (O) was assumed to be the true age, and that algorithm was determined to be over-splitting the co-ortholog group ( fig. 4 ). This was not carried out for proteins on which PhylomeDB’s age call was determined to be a false positive (see below). We note that this method for identifying co-ortholog over-splitting is not ideal, because it relies on a single, imperfect algorithm (PhylomeDB). It is conservative, however, because algorithms will only be trimmed if they give a member of the co-orthologs the exact same age on the species tree as that called by PhylomeDB on the focal gene. More thorough analyses of whether graph-based methods are consistently missing co-orthologs will be necessary in the future.
Identifying False Positives
If genes of distant organisms are incorrectly inferred to be part of an orthology group, it will drive the age of the orthogroup towards the root of the tree. Recent HGT events are a biological source of such errors, but some algorithmic error is expected to play a role as well. Such problems are perhaps more likely to occur in tree-based algorithms, where slight re-arrangements that do not strongly affect the likelihood of the tree can have an outsized effect on the inference of gene gains and losses ( Hahn 2007 ). In such cases, the large number of taxa that fall between the true in-group taxa and the false positive out-group taxa will be inferred to have lost the orthogroup. We used this criterion on a per-gene basis to identify algorithms that were likely to have false positives and genes that were likely to be the result of recent HGT events. Algorithms that had an outsized number of taxa missing from an orthogroup were considered false positives and removed from downstream analysis of that orthogroup’s age. After trimming these outliers, genes that were in the 95th percentile of inferred losses were flagged as being potential recent HGT events (i.e., horizontally transferred long after LECA). These potential HGT genes are an interesting set in themselves: 66% are from the Euk + Bacteria category, they are hugely enriched for metabolic genes (gProfiler P value = 9.08e −116 ), and several are associated with human diseases.
We found that, as expected, algorithms in the “old” group tended to commit more false positive errors, and algorithms in the “young” group committed more false negative errors ( fig. 5 ). Because PhylomeDB was used as a basis for identifying false negatives, its false negative rate could not be quantified.
Errors committed by different algorithms. False positives and negative are defined in the text. PhylomeDB was used as a standard for false negative, so its false negative count could not be determined.
Errors committed by different algorithms. False positives and negative are defined in the text. PhylomeDB was used as a standard for false negative, so its false negative count could not be determined.
Consensus
These analyses suggested a way to more robustly estimate consensus gene ages and to calculate a posterior distribution over the estimate. We used the methods described above to identify algorithms that may have committed false positive or false negative errors and then removed these algorithms from the consideration on a per-protein basis. After doing so, we generated consensus tables based on the remaining algorithms for the human proteome and for a number of other model eukaryotes ( table 1 ), and we make these tables available. Because our tree is best sampled within the opisthokonts (fungi, animals, and closely related protists), we restricted our analyses to this lineage. These tables contain a consensus age category for each protein based on the mode age call of non-trimmed algorithms. Older genes were found to be involved in key components of cell biology. Genes in the Euk + Bac group were found to be highly enriched for mitochondrial function, and genes that date back to the Euk_Archaea node were enriched for translational machinery, as has been shown previously ( Thiergart et al. 2012 ; Koumandou et al. 2013 ). Many of these older genes are also associated with hereditary diseases that represent a deficiency in a cell function associated with that evolutionary epoch. For instance, the cytoskeletal system and cilium date to LECA ( Koumandou et al. 2013 ), and genes in this age category are enriched for diseases affecting the cilium, such as primary ciliary dyskinesia and Bardet–Biedl syndrome ( fig. 6 ).
Enrichment of gene ontology terms and human disease terms (OMIM) in the different consensus age classes for human and budding yeast ( Saccharomyces cerevisiae ). The distribution of age classes are shown for each species. Older genes tend to be enriched for core cellular machinery and heritable diseases. Newer genes are associated with lineage-specific function, such as nervous system development and olfaction ( via G-protein coupled receptors) in mammals, and DNA integration in yeast. P values are derived from g:Profiler ( Reimand et al. 2007 ).
Enrichment of gene ontology terms and human disease terms (OMIM) in the different consensus age classes for human and budding yeast ( Saccharomyces cerevisiae ). The distribution of age classes are shown for each species. Older genes tend to be enriched for core cellular machinery and heritable diseases. Newer genes are associated with lineage-specific function, such as nervous system development and olfaction ( via G-protein coupled receptors) in mammals, and DNA integration in yeast. P values are derived from g:Profiler ( Reimand et al. 2007 ).
These enrichment terms are derived from the point estimates of consensus ages, but we also provide other data that can be used to propagate uncertainty to downstream analyses. For each gene, the distribution over age-calls from the non-trimmed algorithms is given, as well as the number of contributing algorithms and the entropy of the age call distribution. About 87% of human proteins had at least five algorithms contributing after trimming, and 59% had at least 10 out of a total of 13 original algorithms. In addition, the tables contain information on whether the protein was flagged as being a potential horizontal gene transfer event. Finally, we include the node error and bimodality statistics, both of which are measures of uncertainty that reference the reference species tree.
We note that in several cases, we have made ad hoc decisions during the building of the consensus. For instance, algorithms were flagged as false positives if the number of taxa inferred to have lost the orthogroup was two standard deviations above the mean of all algorithms. These decisions were informed by the underlying distributions of values. Nevertheless, we supply the source data files, scripts used for these analyses, as well as interactive iPython notebooks, and we invite researches to explore and change parameters if they desire ( https://github.com/marcottelab/Gene-Ages ).
Error Propagation
How can our error annotations be used in downstream analyses? Here we give an example of a simple stability analysis for gene ontology enrichment that uses these error terms. It has previously been shown that eukaryotic genes vertically acquired from archaea are enriched for translation and RNA processing, whereas genes acquired horizontally from bacteria at the root of eukaryotes are enriched for metabolic processes ( fig. 6;Thiergart et al. 2012 ; Koumandou et al. 2013 ). This conclusion relies on functional term enrichment, but what is the effect of different sources of error on these sorts of enrichment analyses? To investigate the robustness of this conclusion to different sources of error, we used the program g:Profiler ( Reimand et al. 2007 ) to perform functional enrichment analysis on the two age classes “Euk_Archaea” and “Euk + Bacteria” after filtering the datasets at varying levels of stringency ( fig. 7 A ). We found that removing genes that were flagged as a possible late HGT event had a strong effect on the average P values of functional annotation terms in the Euk + Bacteria age class but not the Euk_Archaea class ( fig. 7 B ). This may be due to these genes being more commonly lost or to many bacterial genes being more recent HGT events (and hence being filtered out). The latter possibility would mean that many genes in this age category could be misidentified as being present in LECA, so these genes are good candidates for manual curation. Notably, filtering on different error terms can increase or decrease the significance of different terms, and, depending on the filtering strategy, the significance ranking of terms can be switched ( fig. 7 C and D ). Analyses that rely on smaller test-sets of genes are likely to be much more strongly affected that these proteome-wide searches.
Effect of filtering on functional term enrichment analysis. (A) Datasets 1–6 were trimmed based on four sources of error: entropy of age-calls, whether the genes were flagged as potential horizontal gene transfer (HGT) events, the number of algorithms contributing to the final age call (after filtering algorithms, as described in the Methods and Results section), and the polarization of each gene. Dataset 6 was filtered on all four criteria (B) Negative log 10P values for the five datasets for two age categories (Euk_Archaea, and Euk + Bacteria) and two gene ontology terms (biological process, and cellular compartment). (C and D) Negative log 10P values across datasets for eight functional terms in the two age categories. These terms show a variety of ways that significance can be affected by filtering.
Effect of filtering on functional term enrichment analysis. (A) Datasets 1–6 were trimmed based on four sources of error: entropy of age-calls, whether the genes were flagged as potential horizontal gene transfer (HGT) events, the number of algorithms contributing to the final age call (after filtering algorithms, as described in the Methods and Results section), and the polarization of each gene. Dataset 6 was filtered on all four criteria (B) Negative log 10P values for the five datasets for two age categories (Euk_Archaea, and Euk + Bacteria) and two gene ontology terms (biological process, and cellular compartment). (C and D) Negative log 10P values across datasets for eight functional terms in the two age categories. These terms show a variety of ways that significance can be affected by filtering.
Discussion
Most studies of gene age use a single point estimate arising from one of a variety of methods. Given our analysis of some the most popular orthology inference algorithms, we find that point estimates of gene age will be wrong for (at least) thousands of genes in a human-sized proteome ( fig. 4 ). More troubling is the fact that algorithms appear to fall into two classes, each of which presumably has a systematic bias towards either false positives (“old group”) or false negatives (“young group”). This systematic bias happens on a per-gene basis, meaning that simple voting methods will not be able to resolve conflicts. Even with the ideal sampling of algorithms, which we approximate here by exploring a wide diversity of popular algorithms, the effective voting population will still drop to two on highly polarized genes.
Many areas of computational biology have faced a similar problem, namely, the need to keep track of error in several components of a workflow, and to correctly propagate this error through the whole analysis ( Guang et al. 2016 ). One illustrative example is multiple sequence alignment and phylogenetic inference. The former is a necessary precursor to the latter, and each involves estimation error. Methods have been developed to infer the posterior distributions of both steps simultaneously ( Suchard and Redelings 2006 ), which is computationally intractable for all but the smallest datasets, or to perform each step iteratively in a maximum likelihood framework ( Liu et al. 2009 ). We argue that, eventually, such steps will have to be taken with orthology inference and gene-age estimation. Using a point estimate at each step in the analysis makes the assumption that each inference step has no uncertainty associated with it, which we can clearly reject in the case of gene-age estimation.
Some methods for probabilistic orthology inference do exist ( Ullah et al. 2015 ). These use gene tree models with free parameters for gene duplication, loss, and sometimes HGT, which then contribute to the likelihood along with the multiple sequence alignment. However, these methods are in their infancy, and are not usually scalable to large datasets or widely used. In the meantime, it is important to have an understanding of common sources of error in gene-age estimation. We provide that information along with consensus age calls for a variety of model organisms so that researchers can incorporate error propagation into their analyses in a way that is appropriate to their question of interest.
Several error terms are likely to be important for a broad range of analyses. The first and most straightforward is the entropy of the age-call estimate after filtering false positives and negatives. This statistic gives a quick idea of how certain an age-call is, with higher entropies being less certain. It is defined with reference to our age categories, so if researchers need to use other age categories, they must use the node age of the gene, which we also provide. HGT events are also likely to affect some datasets, especially when genes originating in Bacteria are involved ( fig. 6 ). A large number of eukaryotic genes are likely transfers from Bacteria ( Thiergart et al. 2012 ), but these may have been transferred at any point on the phylogeny. We define one age category, Euk + Bacteria, to describe all genes transferred before LECA, with later transfers hopefully being caught by our flag. If researchers are primarily interested in HGT, we suggest a much fuller analysis, as our simple method is likely to miss many HGT events. Finally, the bimodality of the age-call between “young” and “old” algorithm types is a key statistic. The systematic biases in the different algorithm types mean that many datasets will be radically different and difficult to compare, and it may account for some of the differences between studies of ancient gene repertoires that used either graph or tree-based methods. Genes that are highly polarized are good candidates for manual curation, because it is unlikely that any ad hoc algorithm will differ substantially enough from those we sampled here to be decisive.
Our analysis also identifies several areas for further study. In particular, our analysis of systematic false-negatives suggests that the way in which orthology inference algorithms handle co-orthology is a major source of difference among them. We relied on co-ortholog groups from PhylomeDB because no gold-standard sets are available, and PhylomeDB’s method of Maximum-Likelihood gene tree inference without the use of reference species tree is the most similar to how researchers infer phylogenies on single gene families. However, a fuller analysis of co-ortholog oversplitting is clearly wanting, and we also provide final datasets without the false-negative filter that relies on PhylomeDB.
Although we have characterized only two components of a typical computational biology workflow, orthology inference, and gene-age estimation, it would be ideal to characterize error distributions for all the steps in an analysis, which has not been done with gene age data to our knowledge (but see Thompson et al. (2014) for an interesting example on gene-expression data, and Guang et al. (2016) for a general review). The datasets we provide here will hopefully help guide future research efforts aimed at a more formal, probabilistic way to handle error in gene-age estimation, perhaps even in the context of an entire workflow. Until such methods are available, we advocate using our error annotations or a similar analysis in any study incorporating gene-age data.
Supplementary Material
Supplementary material is available at Genome Biology and Evolution online ( http://www.gbe.oxfordjournals.org/ ).
Acknowledgments
The authors would especially like to acknowledge the Quest for Orthologs consortium and those who contributed their algorithms to the benchmarking tool for making their data freely available. B.J.L. was funded by NIH fellowship 1F32GM112504-01A1. E.M.M. acknowledges funding from the NIH, NSF, CPRIT, ARO (61789-MA-MUR), and Welch Foundation (F1515).
Literature Cited
Author notes
Associate editor: Eugene Koonin
Data deposition: This project has been deposited at Zenodo under the accession http://dx.doi.org/10.5281/zenodo.51708 .







