The evolutionary traceability of proteins

Orthologs document the evolution of genes and metabolic capacities encoded in extant and ancient genomes. Orthologous genes that are detected across the full diversity of contemporary life allow reconstructing the gene set of LUCA, the last universal common ancestor. These genes presumably represent the functional repertoire common to – and necessary for – all living organisms. Design of artificial life has the potential to test this. Recently, a minimal gene (MG) set for a self-replicating cell was determined experimentally, and a surprisingly high number of genes have unknown functions and are not represented in LUCA. However, as similarity between orthologs decays with time, it becomes insufficient to infer common ancestry, leaving ancient gene set reconstructions incomplete and distorted to an unknown extent. Here we introduce the evolutionary traceability, together with the software protTrace, that quantifies, for each protein, the evolutionary distance beyond which the sensitivity of the ortholog search becomes limiting. We show that the LUCA set comprises only high-traceable proteins most of which have catalytic functions. We further show that proteins in the MG set lacking orthologs outside bacteria mostly have low traceability, leaving open whether their eukaryotic orthologs have just been overlooked. On the example of REC8, a protein essential for chromosome cohesion, we demonstrate how a traceability-informed adjustment of the search sensitivity identifies hitherto missed orthologs in the fast-evolving microsporidia. Taken together, the evolutionary traceability helps to differentiate between true absence and non-detection of orthologs, and thus improves our understanding about the evolutionary conservation of functional protein networks.


Introduction
'How old is a gene?' is one of the fundamental questions in functional and evolutionary genetics (Capra, et al. 2013). The age of a gene is tightly linked to many of its functional properties.
Proteins encoded by old genes tend to evolve slightly slower than younger genes (Wolf, et al. protein sequence evolution, (3) the calculation of the traceability, and optionally (4) the display of the traceabilities on a reference tree. The detailed workflow is represented in fig. 1 and in supplementary fig. S1A.
Step 1 -Parameterization of the evolutionary process. First, ProtTrace infers the evolutionary characteristics of the seed-protein. We compile a group of orthologs, Oseed for the seed-protein. ProtTrace facilitates the use of pre-compiled orthologs from OMA (Altenhoff, et al. 2015), InParanoid (Ostlund, et al. 2010) and from orthoDB (Zdobnov, et al. 2017). Optionally, a targeted ortholog search with HaMStR (Ebersberger, et al. 2009) can be employed. In the next step, the orthologous sequences are aligned with MAFFT v7.304 (Katoh and Toh 2008), and a maximum likelihood tree, Tseed, is computed with RAxML v8 (Stamatakis 2014). The resulting tree and the multiple sequence alignment (MSA) are then used to determine the evolutionary parameters of the proteins. A maximum parsimony algorithm infers the seed protein-specific insertion and deletion (indel) rates. The procedure is In a phylogenomic setting the evolutionary parameters are inferred for many seed-proteins, e.g. all proteins encoded in a species' genome. To account for different absolute substitution rates between the individual seed-proteins, we introduce a rate scaling factor κseed (Eq. 1). We compute κseed for each seed-protein as = Median ( ≠ ) { seed ( , ) ̅ species ( , ) } (1) Steps 2 -3 -Simulation of protein sequence evolution and calculation of the traceability curve. Once the evolutionary model is fully parameterized, ProtTrace uses REvolver (Koestler, et al. 2012) to simulate the evolution of the seed protein in time steps of 0.1 substitutions per site. After each step, the simulated sequence serves as a query for a BLASTP (Altschul, et al. 1997) search against the full protein set of the species the seed-protein was derived from (seed species). If the seed-protein sequence is identified as one of the top five hits, the success is marked with a '1', otherwise a '0' is noted. Repeating the simulation 100 times yields for each time step ti a fraction of successes that approximates Ti(ti), the traceability index of the seed-protein as a function of evolutionary time t. We fit the inverse of a non-linear least square logistic growth curve to this data (Eq. 2) using the non-linear least square (nls) package in R.
Step 4 -Tree display. Ti(t) describes the traceability index of the seed protein as a function of time. To provide an intuitive overview for each seed protein, in which species the sensitivity of an ortholog search could become an issue, protTrace can display the traceability information along a species phylogeny (see supplementary fig. S3). Here, the colour of the leaf labels indicates the traceability of the seed-protein in the respective species.

Results and Discussion
The evolutionary traceability of the yeast gene set    (Thomarat, et al. 2004), than in human and Arabidopsis that belong to different kingdoms. This is an effect of the extraordinarily high substitution rate in the microsporidian lineage, which is among the highest across all eukaryotes (Slamovits, et al. 2004).
We next calibrated the traceability index such that it informs in real data about the evolutionary distance beyond which orthologs are too diverged to be detected with BlastP based ortholog predictors. We searched for orthologs in the 232 target species using each of the 6,352 proteins as a query and tabulated the number of query-species pairs in which at least one ortholog was found. In 95% of the cases where an ortholog was detected, the traceability was at least 0.75 (Fig. 2B). This observation leads us to hypothesize that, when the traceability is below 0.75, an ortholog search is bound to fail. If an ortholog exists, it has likely diverged beyond recognition. On the basis of this hypothesis, we distinguish two scenarios for the cases where no ortholog was identified (Fig. 2C). For the 53% of the cases where the traceability is 0.75 or above, we conclude that the ortholog is genuinely absent, as we should be able to detect it otherwise. For the remaining 47%, the traceabilities do not reach the threshold of 0.75, and these cases occur in almost all target species (Fig. 2D). In other words, in almost half of the cases where we do not find an ortholog for a yeast protein, we cannot distinguish between true absence or an insufficient search sensitivity.
We are aware of one previous study that performed an in-silico evolution of the yeast protein set (Moyers and Zhang 2016). In this study, the authors inferred their constraints on the evolutionary process for each yeast protein from the alignment of orthologs of five sensu stricto yeast species. Unfortunately, Moyers and Zhang (2016) did not link their findings to the actual phylogenetic profiles of the yeast proteins, making a comparison to our study hard.
We therefore reproduced their analysis in part. Moyers and Zhang (2016) used site specific substitution rate scaling factors inferred with TreePuzzle (Schmidt and von Haeseler 2007) as information to constrain the evolutionary process in a site-specific manner. We recreated these constraint vectors, once with the original approach by Moyers and Zhang (2016) using the five sensu stricto yeast sequences, and once with an alignment using orthologs selected from the full diversity of fungi. This revealed that the phylogenetic diversity of the input alignment has a strong effect on the constraint pattern. When using the sensu stricto yeast orthologs, on average 80 % of the alignment sites are assigned a relative rate of 0. Such positions remain unchanged during evolution. In contrast, when using the phylogenetically diverse training data, on average only about 15% of the alignment sites get assigned a relative rate of 0 (see supplementary figure S6). Thus, the evolutionary constraint information -and as a consequence the simulated change of the protein over time-changes with the underlying training data. This raises the question about the optimal collection of training data. In the particular case of the simulated yeast protein evolution (Moyers and Zhang 2016), it appears that the use of the closely related yeast sequences for inferring the site-specific rate scaling factors puts a too harsh constraint on the evolutionary process (see supplementary figure S6).
Using our terminology, this is bound to result in an overestimated traceability, an aspect that the authors have noted themselves (Moyers and Zhang 2017).

Unobserved domain constraints result in underestimated traceabilities
The integration of traceability and ortholog search for the yeast proteins reveales that we sometimes (5%) detect an ortholog although the traceability of the seed-protein predicts that we should not. Reducing the traceability cut-off has little effect on this number ( fig. 2B). The underlying causes that can explain such a discrepancy between the traceability estimate and the outcome of an ortholog search are diverse (see supplementary text). On the one hand, overestimates of the protein-specific evolutionary rates can artificially decrease the traceabilitiesalthough protTrace is considerably robust with respect to variation in the rate estimates (see supplementary fig. S7). On the other hand, spurious ortholog assignments can mimic the presence of an ortholog where there is none, an artefact that is obviously hard to control. One main factor determining a protein's traceability, however, is its Pfam domain content (Finn, et al. 2016), as protTrace exploits the characteristic sequence features of Pfam domains to deduce functional constraints on the evolutionary process (Koestler, et al. 2012).
In the yeast data, 1,255 out of 6,352 proteins do not harbour any Pfam domain. In the simulations that underlie the traceability estimates, these proteins evolve without positionspecific constraint, and correspondingly have low traceabilities (see supplementary fig. S8).
This implies that protTrace, if information concerning local constraints on the sequencespecific evolutionary process is not provided, can underestimate the traceability of a protein.

Traceability and sub-cellular localization are linked
Protein traceability informs whether or not the sensitivity of an ortholog search is sufficient to accurately determine the phylogenetic profile of a protein even in distantly related species.
ofInitial evidence that this measure can provide an alternative view on the interpretation of conservation patterns of orthologs across species comes from the analysis of proteins with different sub-cellular localization. It was recently reported that membrane proteins, and even more so extracellular proteins, have sparser phylogenetic profiles and fewer detected orthologs than intracellular proteins (Sojo, et al. 2016). Both was taken as evidence for a rapid evolutionary turnover of membrane and extracellular proteins. We addressed the same issue from a view point of protein traceability. We classified the yeast proteins into three groupsmembrane proteins, extracellular proteins and intracellular proteinsas described previously (Sojo, et al. 2016). For every protein, we computed the mean traceability across the 232 target species and found that extracellular proteins have overall lower traceabilities compared to the membrane-bound and intracellular proteins ( fig. 4). For the latter two groups the differences in the mean traceabilities were less pronounced. Yet, the distribution for the membrane proteins displays a higher fraction of proteins with mean traceabilities below 0.5. In the light of these results we expect that an ortholog search more often misses a distantly related ortholog for extracellular and membrane proteins than for intracellular proteins. This is perfectly in line with the differences in their observed evolutionary conservation of the three groups as reported by Sojo et al (2016). We therefore conclude that the particular evolutionary behaviour of proteins with different sub-cellular localizations, and their differences in the traceability together account for the observed differences in their evolutionary conservations. (Supek, et al. 2011)  Whatever essential tasks these 60 proteins have, it may be premature to mark them as Mycoplasma-specific inventions. Instead, we hypothesize that their low traceability blurs the evolutionary link to related proteins with the same function in other organisms. Given their participation in fundamental cellular functioning, it is tempting to speculate that these proteins can provide relevant hints towards the nature of the 'software' that appears missing in the current reconstructions of the LUCA gene set.

Protein traceability limits ortholog identification in the fast-evolving microsporidia
Microsporidia, intracellular parasites closely related to fungi (Corradi and Keeling 2009) are a hallmark example that a low traceability can indeed result in essential genes being overlooked. All microsporidia analysed so far share two characteristics: First, their genomes harbour between 2,000 and 4,000 genes, due to an ancient radical reduction in genome size (Slamovits, et al. 2004), and second their proteins evolve extraordinary fast. While the first characteristic makes it tempting to generally equate a non-detection of an ortholog to a yeast protein with a gene loss, the high evolutionary rate of microsporidia indicates that a low traceability may be another reason for the lack of orthologs. Katinka et al (2001) and Cuomo et al. (2012) showed that key metabolic functions, e.g. the fof1-ATPase complex, fatty acid synthesis, the tricarboxylic acid (TCA) cycle and the formation of peroxisomes are absent in microsporidia (Katinka, et al. 2001;Cuomo, et al. 2012). We determined the phylogenetic profiles for the corresponding yeast proteins and could confirm that for many proteins no ortholog was detectable in our microsporidian representatives ( fig. 6A and see supplementary   table S5). For most of these proteins, the traceabilities in microsporidia are in the range of 0.9 and above. This indicates that the corresponding genes have indeed been lost on the microsporidian lineage.
The situation is different for proteins involved in meiosis and recombination. Yeast, as well as most other eukaryotes, share a conserved set of 29 proteins involved in this process (Malik, et al. 2007 (Klein, et al. 1999). Thus, the identification of two microsporidian proteins with the Rad21_Rec8_N domains resembles the situation generally seen in fungi. However, at this step of the analysis, the precise identity of the microsporidian proteins remains unclear.
In the next step, we reconstructed the evolutionary relationships of a subset of fungal and nonfungal REC8 and MCD1 (SCC1) orthologs together with the microsporidian candidates ( fig.   6C). Although this tree is not well resolved and renders, for example, the fungal REC8 proteins paraphyletic, it already supports a grouping of the microsporidian sequences with fungal and animal REC8 orthologs. Subsequently, we rearranged the tree topology to reflect the accepted evolutionary relationships of fungi, microsporidia and animals. A topology test revealed that the likelihood of the rearranged tree is with a LogLikelihood = 25.7 not significantly worse than the maximum likelihood tree (SH test: p>0.05 Shimodaira and Hasegawa (1999)). The data is therefore compatible with the hypothesis that microsporidian In summary, the REC8 example shows that missing orthologs in the quickly evolving microsporidia are not always an effect of the rampant gene loss that is characteristic for this taxonomic group (Corradi and Slamovits 2011). Here, we provide for the first time convincing evidence that REC8 orthologs are widespread among microsporidia. The meiotic cohesin complex might therefore function in microsporidia as described for yeast. It should be noted, however, that we find no trace of MCD1 (SCC1), the mitotic counterpart of REC8. As this protein has a high traceability in the microsporidia, we propose a genuine gene loss of the Table 6). In this context, it is intriguing that we observe two paralogous REC8 proteins in the microsporidia, whose emergence via a gene duplication can be dated to the last common ancestor of the microsporidia. Notably, six out of ten microsporidian species harbour both paralogs. It is tempting to speculate that the apparent loss of the Mcd1 (Scc1) gene on the microsporidian lineage was compensated by a duplication of Rec8.

Conclusion
Orthologs form the essential basis to propagate functional annotations between proteins of different species and to reconstruct the evolutionary past. So far, it has largely remained a matter of speculation to what extent limitations in the sensitivity of ortholog searches have influenced insights gained from these reconstructions. Here, we have presented a software, protTrace, facilitating a simulation-based procedure to assess the evolutionary traceability of a seed protein over time when using standard ortholog searches. In contrast to existing approaches, protTrace can infer constraints on the evolutionary sequence change of the seed protein from the presence of Pfam domains. This has two main advantages: The constraint estimates are independent from the availability and the phylogenetic diversity of orthologs to the seed protein; and the constraint pattern for a protein depends only on its Pfam domain composition and not on the species it was derived from. On the example of MRS2 we have shown that relying on Pfam for inferring the evolutionary constraints bears also a certain risk.
Those sequences harbouring evolutionarily conserved regions that are not (yet) represented by a Pfam model will assume to evolve free of constraint. However, as Pfam becomes more comprehensive, the problem will ameliorate. Moreover, protTrace facilitates the inference of custom constraint patterns from a user-provided alignment of orthologs to the seed protein.   . S5).

Maximum likelihood distance estimation.
We computed pairwise maximum likelihood distances between proteins using TreePuzzle v5.225 (Schmidt, et al. 2002). To arrive at an average maximum likelihood genetic distance between any pair of species, we extracted and aligned all pairwise orthologs for the two species from the OMA database (Altenhoff, et al. 2015). In the case of 1 to many ortholog groups, we considered all induced pairwise orthology relationships. The alignments were then concatenated and served as input for TreePuzzle to compute an average maximum likelihood distance. The procedure was repeated for all species pairs in the reference tree to obtain an all-against-all maximum likelihood distance matrix. set. An E-value cut-off of 10E-3 was applied. Significantly enriched GO terms were then visualized using Revigo (Supek, et al. 2011). Data availability All data that supports the finding of this study are available from the corresponding author upon request.
Supplementary Information is available in the online version of the paper.