RNA metabolism, broadly defined as the compendium of all processes that involve RNA, including transcription, processing and modification of transcripts, translation, RNA degradation and its regulation, is the central and most evolutionarily conserved part of cell physiology. A comprehensive, genome-wide census of all enzymatic and non-enzymatic protein domains involved in RNA metabolism was conducted by using sequence profile analysis and structural comparisons. Proteins related to RNA metabolism comprise from 3 to 11% of the complete protein repertoire in bacteria, archaea and eukaryotes, with the greatest fraction seen in parasitic bacteria with small genomes. Approximately one-half of protein domains involved in RNA metabolism are present in most, if not all, species from all three primary kingdoms and are traceable to the last universal common ancestor (LUCA). The principal features of LUCA’s RNA metabolism system were reconstructed by parsimony-based evolutionary analysis of all relevant groups of orthologous proteins. This reconstruction shows that LUCA possessed not only the basal translation system, but also the principal forms of RNA modification, such as methylation, pseudouridylation and thiouridylation, as well as simple mechanisms for polyadenylation and RNA degradation. Some of these ancient domains form paralogous groups whose evolution can be traced back in time beyond LUCA, towards low-specificity proteins, which probably functioned as cofactors for ribozymes within the RNA world framework. The main lineage-specific innovations of RNA metabolism systems were identified. The most notable phase of innovation in RNA metabolism coincides with the advent of eukaryotes and was brought about by the merge of the archaeal and bacterial systems via mitochondrial endosymbiosis, but also involved emergence of several new, eukaryote-specific RNA-binding domains. 
Subsequent, vast expansions of these domains mark the origin of alternative splicing in animals and probably in plants. In addition to the reconstruction of the evolutionary history of RNA metabolism, this analysis produced numerous functional predictions, e.g. of previously undetected enzymes of RNA modification.
Received November 1, 2001; Revised December 20, 2001; Accepted January 2, 2002.
All cells synthesize a vast array of RNAs, using DNA or RNA templates, through a nucleoside polymerization reaction catalyzed by RNA polymerases ( 1 ). The mRNAs are templates for the ribosomal synthesis of proteins. Ribosomal RNAs are central to the functions of the ribosome, such as recognition and positioning of the mRNA and formation of the peptide bond during protein synthesis, whereas tRNAs are adaptors that deliver aminoacyl units to the site of protein synthesis and read the genetic code during translation through complementary pairing with codons in mRNA. In addition to these ubiquitous RNAs that are embedded in the Central Dogma of molecular biology ( 2 ), there is a plethora of other RNAs whose occurrence ranges from universality to presence in only one of the terminal lineages of life. These include, among others, the ubiquitous signal recognition particle RNA involved in secretion, the nearly universal RNase P ribozyme, the small guide RNAs of eukaryotes and archaea that participate in processing and modification events to produce mature mRNAs, rRNAs and tRNAs, the bacterial tmRNA involved in protein degradation, the telomerase RNA from eukaryotes that acts as the template for the synthesis of chromosomal termini, the guide RNAs of trypanosomes involved in RNA editing, the small temporal (st) RNAs, such as Lin-4, implicated in post-transcriptional regulation in animals, and the animal RoX1/2 and XIST RNAs, which contribute to chromosomal organization ( 1 , 3 – 8 ). From the time an RNA chain is elongated by the RNA polymerase to its ultimate destruction by ribonucleases, it interacts with numerous proteins that either form a variety of ribonucleoprotein (RNP) complexes or catalyze various reactions that modify the RNA’s composition or structure. This complex set of processes centered around RNA–protein and RNA–RNA interactions constitutes what can be termed ‘RNA metabolism’.
Thus defined, RNA metabolism is an integral part of the basal processes of molecular biology, namely transcription, translation and secretion, as well as numerous other cellular systems that employ RNAs in various capacities. The diversity of functional contexts notwithstanding, a number of computational analyses of proteins involved in RNA metabolism have brought out several unifying themes in the form of domains that bind RNA molecules and/or catalyze reactions of RNA processing and modification across these different contexts. This justifies the above definition of RNA metabolism and calls for a synthetic treatment of the cellular processes that involve RNA. Several previous computational analyses have considered specific aspects of RNA metabolism and concentrated on the identification of previously undetected domains in proteins involved in these processes ( 9 – 23 ). The results from such studies cast light on the early evolution of life, the last universal common ancestor (LUCA), the events surrounding the divergence of the major lineages of life, and potentially even the transition from the ancient RNA world to the modern-type, protein-dominated cellular systems.
In order to obtain a comprehensive view of the evolution of RNA metabolism, we conducted a large-scale computational analysis of the proteins involved in RNA metabolism. This analysis was chiefly based on detection of statistically significant similarities through sequence and structure comparisons, determination of orthologous and paralogous relationships between proteins, and utilization of contextual information derived from domain fusions, operon organization and phyletic patterns. This allowed us to define the major transitions and relative temporal order in the evolution of the principal branches of RNA metabolism and to gain some insights into the earliest phases of life’s evolution. Using the parsimony principle, we reconstructed the probable repertoire of genes and functions related to RNA metabolism that were present in LUCA. The analysis also enabled systematization of the vast amounts of information on RNA metabolism that have become available through genome sequencing, and produced structural and functional predictions that might facilitate further experimental studies on RNA metabolism.
MATERIALS AND METHODS
Eighteen complete bacterial genomes [Aquifex aeolicus (Aae), Bacillus subtilis (Bs), Borrelia burgdorferi (Bb), Campylobacter jejuni (Cj), Chlamydia trachomatis (Ct), Deinococcus radiodurans (Dr), Escherichia coli (Ec), Haemophilus influenzae (Hi), Helicobacter pylori (Hp), Mycobacterium tuberculosis (Mt), Neisseria meningitidis (Nm), Pseudomonas aeruginosa (Pa), Rickettsia prowazekii (Rp), Synechocystis PCC6803 (Ssp), Thermotoga maritima (Tm), Treponema pallidum (Tp), Ureaplasma urealyticum (Uu) and Xylella fastidiosa (Xf)], seven complete archaeal genomes [Aeropyrum pernix (Ap), Archaeoglobus fulgidus (Af), Halobacterium sp. NRC-1 (Hsp), Methanobacterium thermoautotrophicum (Mta), Methanococcus jannaschii (Mj), Pyrococcus horikoshii (Ph) and Thermoplasma acidophilum (Ta)] and six (nearly) complete eukaryotic genomes [Arabidopsis thaliana (At), Caenorhabditis elegans (Ce), Drosophila melanogaster (Dm), Homo sapiens (Hs), Saccharomyces cerevisiae (Sc) and Schizosaccharomyces pombe (Sp)] were investigated. The sequence data for all genomes were obtained using the Genome Division of the Entrez system at the National Center for Biotechnology Information (NCBI) ( http://www.ncbi.nlm.nih.gov/Entrez/Genome/main_genomes.html ).
The domains listed in Table 1 were included in this study. A set of representative sequences was chosen for each domain, and position-specific scoring matrices (PSSMs or profiles) were generated by running the PSI-BLAST program ( 24 – 26 ) against the non-redundant protein sequence database at the NCBI, with the expectation ( E ) value of 0.01 typically used as the cut-off for including sequences into the profile. PSI-BLAST searches ( E value = 0.01) using the constructed profiles were run against the protein sets from each of the genomes included in the study, and lists of all proteins containing the given domain were compiled. After verifying the presence of the target domain through examination of the conservation of the salient amino acid sequence and structural motifs, other domains present in the respective proteins were identified by running PSI-BLAST searches with these sequences as queries and by comparing them with libraries of domain-specific profiles using the NCBI CD-search ( http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml ) and an additional profile collection (L.Aravind, unpublished data) or by using hidden Markov models (HMMs) for conserved domains ( 27 ). The proteins were then clustered by sequence similarity using the BLASTCLUST program (I.Dondoshansky, Y.I.Wolf and E.V.Koonin, unpublished data). Multiple alignments, whenever deemed necessary, were constructed using the T_coffee program ( 28 ) and phylogenetic trees were constructed using the PHYLIP and MOLPHY packages ( 29 – 31 ). Protein secondary structure was predicted using the PHD program ( 32 ) and structure coordinate files were handled using the Swiss-PDB Viewer program ( 33 ).
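The compilation step described above (profile searches against each genome, followed by per-domain lists of matching proteins) can be illustrated with a minimal sketch. It assumes the search results have already been reduced to simple records; the record layout, the function name `compile_domain_lists`, the protein identifiers `protA` and `protB`, and all E-values are hypothetical and are not taken from the study. Only the 0.01 cut-off comes from the text.

```python
from collections import defaultdict

E_CUTOFF = 0.01  # E-value threshold quoted in the text


def compile_domain_lists(hits, cutoff=E_CUTOFF):
    """Group proteins by (domain, genome), keeping hits at or below the cutoff.

    `hits` is an iterable of (domain, genome, protein, evalue) tuples;
    this record layout is an assumption made for the sketch.
    """
    table = defaultdict(set)
    for domain, genome, protein, evalue in hits:
        if evalue <= cutoff:
            table[(domain, genome)].add(protein)  # a set removes duplicate hits
    return {key: sorted(proteins) for key, proteins in table.items()}


# Hypothetical hit records; E-values are invented for illustration only.
example_hits = [
    ("THUMP", "Mj", "MJ0438", 1e-30),
    ("THUMP", "Ec", "protA", 2e-12),
    ("THUMP", "Ec", "protB", 0.5),    # above the cutoff, discarded
    ("PUA", "Mj", "MJ1653", 3e-8),
    ("PUA", "Mj", "MJ1653", 1e-5),    # second hit to the same protein
]
domain_lists = compile_domain_lists(example_hits)
```

In the actual procedure each list was then checked manually for conservation of the salient sequence and structural motifs, a step that a threshold filter of this kind does not replace.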
RESULTS AND DISCUSSION
Scope and approach
A meaningful computational analysis of proteins involved in RNA metabolism requires a clear definition of the components under investigation so that it does not draw in every other cellular protein based on an indirect connection with such a pivotal class of molecules as RNA. We restrict our scope essentially to those proteins that more or less directly interact with every known type of RNA from the time it is synthesized by the RNA polymerase until its ultimate degradation by nucleases. Briefly, this includes (i) certain components of the transcription machinery itself that directly interact with the transcript in the process of elongation and termination; (ii) proteins involved in the processes that, at least in eukaryotes, occur shortly after transcription, namely polyadenylation and capping; (iii) various complexes and enzymes involved in the maturation of RNAs, including numerous enzymes that catalyze covalent modification of RNA (e.g. methylation) and the complex splicing machinery involved in pre-mRNA processing in eukaryotes; (iv) translation and its regulation; (v) post-transcriptional gene regulation (PTGR), which, in its simplest form, involves various RNAses that catalyze mRNA degradation [a more complex form of such regulation in eukaryotes is post-transcriptional gene silencing (PTGS)]; (vi) proteins that interact with diverse RNAs during maturation of functional complexes, such as the ribosome, the signal recognition particle, RNase P, SnuRPs, SnoRPs, telomerase, the SsrA particle and the Ro-yRNA particles. Not included in this analysis are proteins that regulate transcription through interaction with DNA and generic structural proteins of various complexes, such as those containing WD40 or tetratricopeptide (TPR) repeats, which have similar roles in protein–protein interaction in both RNA metabolism and other systems.
Proteins involved in RNA metabolism were collected through a systematic survey of the literature. This was followed by profile sequence analysis using PSI-BLAST to identify the domains present in these proteins. Once identified, these domains were categorized into two principal classes: (i) enzymatic domains and (ii) interaction domains. The latter class mainly consists of non-catalytic, RNA-binding domains (RBDs) and some protein–protein interaction domains that are predominantly associated with the formation of multisubunit complexes involved in RNA metabolism. Table 1 shows the list of the principal domains included in this analysis. One or more PSI-BLAST PSSMs that ensured complete coverage without inclusion of false positives were prepared for each of the domains and a representative set of complete proteomes (see Materials and Methods) sampled across the three primary kingdoms (bacteria, archaea and eukaryotes) was searched with these profiles to detect all occurrences of each domain. The proteins recovered from all the proteomes were then pooled together and potential orthologous sets were delineated by clustering with BLASTCLUST. These groups of orthologs were corrected and optimized using the symmetry of recovery in single-pass BLAST searches ( 34 ) and phylogenetic tree construction and analysis using the minimum evolution (least squares) and maximum likelihood methods ( 29 – 31 ). The domain architecture of each individual protein was then determined by using libraries of PSSMs and HMMs. Finally, we attempted to reconstruct the conservation patterns of functional complexes and pathways across the entire phyletic range of analyzed genomes by combining the results of protein domain analysis with experimental evidence extracted from the literature.
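The "symmetry of recovery" criterion used to refine the orthologous groups is commonly implemented as a reciprocal best-hit test, and the sketch below shows that idea in its simplest form. It is an illustration under stated assumptions rather than the procedure actually used: the score dictionaries, the protein labels (`a1`, `b1`, ...) and the bit scores are hypothetical, and ties are resolved in favor of the first hit encountered.

```python
def best_hits(scores):
    """Return the top-scoring subject for each query.

    `scores` maps (query, subject) -> similarity score from a search
    run in one direction only.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Hypothetical scores between proteins of genome A and genome B.
a_vs_b = {("a1", "b1"): 250, ("a1", "b2"): 90, ("a2", "b2"): 120, ("a3", "b1"): 80}
b_vs_a = {("b1", "a1"): 245, ("b2", "a2"): 118, ("b2", "a1"): 95}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here `a3` finds `b1` as its best hit, but `b1` points back to `a1`, so `a3` is left out of the candidate orthologous pairs; such asymmetric cases are the ones that phylogenetic tree analysis is needed to resolve.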
The most likely points of origin of domains and individual protein families involved in RNA metabolism were inferred from the patterns of phyletic distribution and phylogenetic tree topology and on the basis of the parsimony principle. If a particular domain or protein family is widely represented in all three primary kingdoms, bacteria, archaea and eukaryotes, the most parsimonious scenario of evolution points to its presence in LUCA. This conclusion is reinforced when the phylogenetic tree for the family in question conforms to the ‘standard model’ topology, with bacterial and archaeo-eukaryotic primary clades ( 35 , 36 ). Conversely, the derivation of a family in LUCA or earlier was considered less likely when a fundamentally different topology was observed, such as grouping of bacteria with eukaryotes. In such a case, a (pre)LUCA origin of the given family would require the extra assumption of displacement of the ancestral form with the bacterial one in eukaryotes, which makes a bacterial or archaeal origin with subsequent dissemination by horizontal gene transfer a viable alternative. Along similar lines, the parsimony principle dictates that, for example, when a domain or a family is widely represented in bacteria and eukaryotes, but is only sporadically encountered in archaea, the most likely scenario involves derivation within the bacterial kingdom and independent acquisitions by eukaryotes and archaea via horizontal gene transfer. Below, in the discussion of the evolution of domains and protein families, we follow these principles of phylogenetic inference, not necessarily referring to them explicitly. All conclusions arrived at with this approach are necessarily probabilistic, but this appears to be the best we can do when reconstruction of ancient evolutionary events is concerned.
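The inference rules in the preceding paragraph can be written down as a small decision procedure. The sketch below is a deliberate simplification and not the analysis actually performed: the genome counts per kingdom come from Materials and Methods, whereas the 0.5 cut-off for "widely represented", the function names and the returned labels are assumptions introduced here. The archaeo-eukaryotic rule is added by analogy with the bacterial one.

```python
# Numbers of genomes analyzed per kingdom (from Materials and Methods).
GENOMES = {"bacteria": 18, "archaea": 7, "eukaryotes": 6}
WIDE = 0.5  # hypothetical cutoff for "widely represented"


def spread(kingdom, count):
    """Classify how widely a family occurs within one kingdom."""
    if count / GENOMES[kingdom] >= WIDE:
        return "wide"
    return "sporadic" if count > 0 else "absent"


def infer_origin(bacteria, archaea, eukaryotes, standard_topology=True):
    """Most parsimonious point of origin for a family, given genome counts.

    `standard_topology` states whether the tree shows bacterial and
    archaeo-eukaryotic primary clades.
    """
    b = spread("bacteria", bacteria)
    a = spread("archaea", archaea)
    e = spread("eukaryotes", eukaryotes)
    if (b, a, e) == ("wide", "wide", "wide"):
        # A non-standard topology weakens the case for a LUCA origin.
        return "LUCA" if standard_topology else "uncertain"
    if b == "wide" and e == "wide":
        return "bacterial"           # spread to other kingdoms by transfer
    if a == "wide" and e == "wide":
        return "archaeo-eukaryotic"
    return "unresolved"
```

For example, `infer_origin(16, 1, 6)` returns `"bacterial"`, matching the case of a family that is common in bacteria and eukaryotes but only sporadic in archaea. Real assignments also weigh tree topology in more detail and remain probabilistic, as noted above.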
In a number of cases, detection of homologs of proteins involved in RNA metabolism required additional correction because a subset of RNA-interacting domains are also involved in DNA binding. We utilized a variety of inputs from experimental studies, phylogenetic relationships and contextual information to exclude those domains and proteins that are primarily involved in DNA binding and metabolism. Nevertheless, a relatively small fraction of the detected domains and proteins either are indeed bifunctional, being involved in both RNA and DNA metabolism, or cannot be assigned specific function with confidence due to insufficient information; such proteins were included in the present analysis for the sake of completeness.
Phyletic patterns and genome-wide demography of protein domains involved in RNA metabolism
We delineated domains involved in RNA metabolism as described above and conducted a survey of their demography across the genomes of representative organisms considered in this study. This overall demographic survey revealed a number of general trends in the evolution of these domains (Fig. 1 A–D). The most notable, if not unexpected, feature was the separation of the three primary kingdoms by specific phyletic patterns of many domains. A large set of enzymes and interaction domains are present universally and, in all likelihood, are part of the LUCA inheritance (Fig. 1 A and C). However, another substantial fraction of the domains involved in RNA metabolism appear to have evolved in a particular superkingdom or lineage, with the greatest number of lineage-specific inventions found in eukaryotes (Fig. 1 D). Many eukaryote-specific domains belong to ancient folds, but acquired their RNA‐related function only in the eukaryotic lineage. Examples of such exaptation of ancient domains for functions in RNA metabolism include the mRNA-capping enzyme that was derived, at the onset of eukaryotic evolution, from the more ancient DNA ligases ( 37 – 39 ), and the lariat-debranching enzyme that was derived from the ubiquitous calcineurin-like phosphoesterases ( 40 , 41 ). Similarly, superfamily (SF)-I helicases were recruited for important RNA-related functions, such as nonsense-mediated decay, only in eukaryotes, although several such helicases function in bacterial DNA recombination and repair. Some eukaryote-specific enzymes, such as the RNA‐dependent RNA polymerase involved in PTGS and the Kem1/Rat1 family of 5′→3′ nucleases, have large, complex catalytic domains that so far could not be traced to any ancient enzymatic fold. Although structural innovation is less common in prokaryotes than it is in eukaryotes, there are a few enzymes, for example, the RNase domains of the RNaseE/G superfamily, that appear to be innovations of the bacterial lineage.
The interaction domains also show a strong trend of eukaryote-specific innovation, the most prominent one being the RNA recognition motif (RRM), which apparently was derived from a more ancient nucleic acid-binding fold with a characteristic four-stranded core found in diverse DNA- and RNA-binding domains (Table 1 ). Another theme seen in eukaryotes is the recruitment of α-helical superstructures, such as the TPR-like fold (the HAT repeat module found in RNA processing proteins), the pumilio (PUM) repeat ( 42 , 43 ), and the NIC domains ( 16 ) for functions in RNA metabolism. This parallels the widespread utilization of these α-helical repeat modules in a number of other contexts in eukaryotes. Many of the distinct, small RBDs that evolved in eukaryotes, such as CCCH, Zn knuckle, C2H2-, LRP1- and C4-Little fingers utilize the common theme of stabilization through metal chelating cysteines and histidines (Fig. 1 D). This type of structure is ancient, with numerous Zn-ribbon modules found in archaea ( 44 ), but many of these metal- and RNA-binding domains seem to have evolved de novo in eukaryotes, given that utilization of metal coordination to stabilize the core of a domain requires relatively few evolutionary changes, namely the emergence of a strategically placed set of metal-chelating residues.
Another major pattern in the phyletic distribution is the presence of numerous catalytic and interaction domains that are shared by eukaryotes and bacteria, to the exclusion of archaea (Fig. 1 A–D). Another distinct set of domains is solely shared by archaea and eukaryotes, which supports the chimeric origin of the eukaryotic systems of RNA metabolism. A subset of proteins containing domains shared by eukaryotes and bacteria function in the mitochondria and chloroplasts that have descended from endosymbiotic bacteria. This is reflected in the larger average number of proteins with such a phyletic pattern in plants that have two distinct endosymbiont organelles, mitochondria and chloroplasts. However, several domains with a bacterio-eukaryotic distribution pattern function in non-organellar contexts, such as cytoplasmic RNA degradation. Enzymes of apparent bacterial origin recruited for cytoplasmic functions include several superfamilies of RNases, such as the 3′→5′ exonucleases ( 45 ). Of the domains with an archaeo-eukaryotic phyletic pattern, several are involved in core processes, such as RNA maturation, e.g. the tRNA endonucleases, and translation, e.g. PIWI ( 14 ), pelota and SUI1 domains ( 9 ).
Most of the domains involved in ancient functions, such as RNA modification enzymes and RBDs associated with RNA modification, translation and transcription (Table 1 and Fig. 1 ), are present in nearly constant numbers in all life forms, except that eukaryotes often have more paralogs, partly owing to the presence of organelles derived from bacteria. Eukaryotes show a striking expansion of ancient SFII RNA helicases and, to a lesser extent, of other ancient catalytic domains, such as SFI helicases, GTPases, Rossmann-fold methylases, 3′→5′ exonucleases, RNase III and deaminases. A corresponding expansion of non-catalytic domains is mainly restricted to those newly invented or recruited in eukaryotes, including RRM, CCCH, Zn-Knuckle and G-patch. The advent of these RBDs correlates with the emergence of eukaryote-specific functional systems, such as pre-mRNA splicing, PTGR, and mRNA editing and modification (Fig. 1 ).
These observations indicate that 40–45 of the approximately 100 principal domains associated with RNA metabolism originated at early stages of evolution, prior to LUCA. These domains were associated with the most ancient and conserved cellular functions, such as translation, transcription and some forms of RNA modification. The next phase of innovation marked the separation of the bacterial and archaeo-eukaryotic lineages and saw the origin of some proteins, which are involved in basic cellular functions, but are specific to one of these lineages. Finally, with the emergence of the chimeric eukaryotic lineage, domains from both the bacterial and the archaeo-eukaryotic precursor were incorporated into the eukaryotic RNA metabolism pathways. In addition, eukaryotes also ‘invented’ several new domains and recruited or expanded preexisting ones, concomitant with the origin of new RNA processing systems that were largely absent in prokaryotes. No archaea-specific domains involved in RNA metabolism were identified. This might reflect the retention of most core archaeal systems in eukaryotes, which makes the corresponding domains archaeo-eukaryotic in distribution. In addition, archaea could possess some distinct domains that were not detectable through homology and remain unknown due to the paucity of experimental studies in archaeal systems.
The surveyed organisms dedicate, approximately, between 3 and 11% of their proteomes to RNA metabolism, with the highest fraction, predictably, seen in parasitic bacteria with small genomes and the lowest fraction in multicellular eukaryotes and complex bacteria. Generally, this seems to reflect (i) the central place that RNA metabolism systems occupy in all cells, compared with the substantially more variable systems of transcription, replication or DNA repair, and (ii) a more or less linear growth of the number of proteins involved in RNA metabolism with the increase of the total number of encoded proteins in free-living organisms. Below we discuss in detail specific trends in evolution of catalytic and interaction domains involved in RNA metabolism.
Evolutionary histories of catalytic domains involved in RNA metabolism
RNA modifying enzymes. Cellular RNAs are subject to a number of post-transcriptional modifications that involve modification of the bases and sugars or synthesis of non-canonical bases or nucleotides ( 46 – 48 ). The direct nucleotide modifications include methylation of bases and sugars on N, C or O atoms, deamination and demethylation, whereas formation of non-canonical bases includes thiouridylation, pseudouridylation, thioadenylation, dihydrouridylation, and synthesis of archaeosine and queuine.
Methylases. The most common among RNA modifications are the numerous methylations of all types of RNA molecules ( 46 ). The RNA methylases come in two major classes (Table 1 ): (i) the Rossmann-fold methylases, which include the majority of N-, C- and O-methylases that modify both sugars and bases in RNA, and (ii) the recently described SPOUT ( 49 ) superfamily, which consists of the m1G-specific methylase TrmD ( 50 , 51 ), the 2′-O-methylguanosine-specific methylase SpoU ( 52 – 54 ), and several other poorly characterized predicted RNA methylases. The SPOUT superfamily is traceable to LUCA, but the evolution of these methylases is not considered here in detail because it has recently been described elsewhere ( 49 ).
The methylases of the Rossmann-fold class share a six-stranded Rossmann-fold core with the dinucleotide-binding dehydrogenases and are distinguished from them by a methylase-specific 7th strand ( 20 , 55 ). This class contains the great majority of the known methylases that participate in almost every conceivable methylation reaction in biological systems, and RNA specificity appears to have emerged on multiple occasions among them. We sought to resolve the evolutionary relationships among Rossmann-fold RNA methylases using a combination of conventional phylogenetic trees and cladistic analysis based on specific shared sequence motifs (Fig. 2 ). Several distinct lineages of dedicated RNA methylases can be detected; some of the corresponding protein families also include related DNA methylases. The RNA methylases, typically, are highly conserved and are often associated with specific RBDs, which distinguish them from the DNA methylases; many of the latter are large proteins occurring in restriction-modification operons with a sporadic phyletic distribution. The largest monophyletic superfamily of nucleic acid methylases are the base N-methylases (the BNM superfamily). These methylases are characterized by a shared derived character, the [N/D]PP[Y/F] motif at the end of strand 4, which is associated with base specificity (Fig. 2 ). Phylogenetic analysis helped in identifying several distinct families within the BNM superfamily, and most of these families can be distinguished by specific derived characters in the above motif. Within the BNM superfamily, two families, namely the HemK family ( 19 ) and the MJ0438 family of predicted methylases containing the RNA-binding THUMP domain ( 12 ), are represented in all three primary kingdoms and are thus traceable to LUCA. Along with several other related families with more restricted phyletic patterns, these families form a large assemblage of (predicted) purine N-methylases with the NPP[Y/F] motif associated with strand 4. 
Some of the smaller families appear to be more closely related to either the HemK or the MJ0438 family and might have emerged from them through duplications much later in evolution. The RsmC family methylases that methylate G1207 in 16S rRNA ( 56 ) and the RsmD, YfiC and YbiN families are bacteria-specific elaborations that are related to the HemK family, whereas the MJ0046 family apparently was derived from the HemK family in the archaeo-eukaryotic lineage. The MJ0438-related elaborations, namely the MJ0710 and MJ0284 lineages, are present in archaea and eukaryotes. The YhhF and MJ1273 families, which are restricted in their distribution to bacteria and archaea, respectively, also belong to this assemblage, but do not show a specific relationship with either the HemK or the MJ0438 family. The functions of the HemK and MJ0438 families are poorly characterized, but their nearly universal conservation pattern suggests a role in purine methylation in rRNA. In Rickettsia, the HemK methylase is fused with another methyltransferase of a different family, MicA (Fig. 2 ). This suggests that these two methylases function coordinately in rRNA methylation.
The next major assemblage within the BNM superfamily is distinguished by the motif DPP followed by a polar residue (typically R) after strand 4. One of the main families within this assemblage is the Trm2 family, which is involved in methylation of U54 in tRNA at the 5 position ( 57 ). This family with its pan-bacterial distribution appears to have emerged early in bacterial evolution and apparently was subsequently transferred to the eukaryotic lineage through the mitochondrial symbiosis. Certain bacteria encode an additional methylase family of this assemblage, TrmA, which has the same specificity ( 58 ), and appears to have branched off the more widespread Trm2 family. Similarly, eukaryotes have their own, specific methylase family related to the Trm2 family proteins and typified by CG3808 from Drosophila . Another prominent group within this assemblage is the MJ1653 family that shows a fusion to the RNA-binding PUA domain and is widespread in both archaea and bacteria. Families with a more restricted distribution, which are probably more recent offshoots of this lineage, include the YcbY family seen only in some bacteria and the archaeal MJ1233 family (Fig. 2 ).
The last major group of the BNM assemblage are the methylases with a circularly permuted methylase domain. All members of this group that are widespread in prokaryotes are DNA adenine methylases associated with restriction–modification systems. In eukaryotes, this group diversified into three distinct families of adenine mRNA methylases ( 59 ) typified by the yeast proteins Kar4p and Ime4p, and Drosophila CG14906 (lost in S.cerevisiae ), respectively (Fig. 2 ). In these families, the motif associated with strand 4 assumes the form [D/E]PPW, which is shared with DNA adenine methylases, such as MunI.
The SUN superfamily is the next major assemblage of Rossmann-like fold RNA methylases, which is the sister group of the BNM superfamily (Fig. 2 ) and has the diagnostic motif DAPC associated with strand 4. The Sun family enzymes, which methylate rRNA at the cytosine 5 position ( 23 ), are represented in all three primary kingdoms, consistent with their presence in LUCA. The SUN superfamily has undergone extensive radiation in archaea and eukaryotes, giving rise to two distinct families prior to the separation of eukaryotes and archaea and the eukaryote-specific Nop1 family involved in rRNA and snU RNA methylation ( 60 ).
The Erm1/KsgA family that has the motif NLP[Y/F] associated with strand 4 is another close sister group of the BNM superfamily (Fig. 2 ). These methylases are conserved in all life forms and are responsible for diadenine 2-methylation in rRNA ( 61 ), which suggests the presence of this modification in LUCA. The archaeo-eukaryotic Trm5 tRNA methylase family and the archaea-specific MJ1557 family also have a similar form of the strand-4 motif, suggesting that these families form a monophyletic superfamily with the KsgA family (Fig. 2 ).
Generically related to the BNM, SUN and KsgA-Trm5-like superfamilies are two methylase groups with a more restricted distribution. One of these is the bacterial YqlF family, which has an N-terminal S4 domain and a strand-4 motif of the form D[V/L]DF. Thus, this family shares the conserved D or N followed by two small residues and the predicted base-interacting aromatic or hydrophobic residue with the former superfamilies. The second group, the Uvi22 superfamily, also has a similar strand-4 motif, but has a unique insert of two small amino acids prior to the conserved D at the end of strand 4. While none of the members of this superfamily has been experimentally characterized as an RNA methylase, the presence of the characteristic form of the above-mentioned strand-4 motif supports this function. Additionally, one of the yeast members of this superfamily is fused to an RNA deaminase (see below), suggesting a role in RNA modification (Fig. 2 ). Among bacteria, this superfamily is restricted to the proteobacteria (it is conserved in all α‐proteobacteria), whereas in eukaryotes it has vastly expanded into several distinct families. This pattern, taken together with phylogenetic analysis results (data not shown), suggests an origin from the mitochondrial endosymbiont. Members of this superfamily might represent a major, as yet unexplored group of eukaryotic nucleic acid methylases.
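The strand-4 motif variants quoted in the preceding paragraphs lend themselves to a simple pattern lookup, sketched below with regular expressions. This is only an illustration of how the motif forms differ: the assignments in the text rest on alignments, phylogenetic trees and structural context, not on motif matching alone. The function name, the group labels and, in particular, the residue set used for "polar residue" after DPP are assumptions made for this sketch.

```python
import re

# Strand-4 motif forms quoted in the text, checked in order.
# The polar set after DPP is an assumption; the text says "typically R".
STRAND4_MOTIFS = [
    ("purine N-methylase assemblage (HemK, MJ0438)", r"NPP[YF]"),
    ("Trm2-like assemblage", r"DPP[RKDENQHST]"),
    ("circularly permuted adenine methylases", r"[DE]PPW"),
    ("SUN superfamily", r"DAPC"),
    ("Erm1/KsgA family", r"NLP[YF]"),
    ("YqlF family", r"D[VL]DF"),
    ("BNM superfamily (general motif)", r"[ND]PP[YF]"),
]


def classify_strand4(motif):
    """Return the first group whose pattern matches the 4-residue motif."""
    for group, pattern in STRAND4_MOTIFS:
        if re.fullmatch(pattern, motif.upper()):
            return group
    return None  # no quoted motif form matches


labels = {m: classify_strand4(m) for m in ("NPPY", "DPPR", "EPPW", "DAPC", "NLPF", "DVDF")}
```

The general [N/D]PP[Y/F] pattern is placed last so that the more specific forms take precedence; a motif such as DPPF, which fits none of the specific forms, falls through to the general BNM label.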
Sequence evidence and the distinct form of the strand-4 motif suggest that all methylase superfamilies described above descended from a common RNA-methylating ancestor well before the emergence of LUCA. Structural comparisons reveal even deeper links, suggesting that these methylases, in turn, form a higher-order monophyletic group with the FtsJ superfamily of methylases involved in 2′-O-methylation of uridine in LSU rRNA ( 62 ) (Fig. 2 ). The FtsJ/RrmJ family proper is represented in all three primary kingdoms, which points to its presence in LUCA. Several other related families, such as YgdE in bacteria and at least four distinct eukaryotic families, including two animal-specific ones, were derived at various later points in evolution, probably from an FtsJ-like precursor. Some of these, e.g. the Spb1 family, might methylate snoRNAs ( 63 ), suggesting that other, unexplored specificities exist within this family of methylases. Structural comparisons indicate that the group of RNA methylases closest to the FtsJ superfamily is the Fibrillarin/Nop1 family, which is involved in snoRNA methylation ( 64 ). This family is restricted to the archaeo-eukaryotic lineage and might have been derived from the FtsJ superfamily through extreme divergent evolution. The archaeo-eukaryotic Trm1 methylase family and the MicA family shared by bacteria and eukaryotes appear to comprise another monophyletic group, which appears to be a sister group of all of the rRNA methylases described above (Fig. 2 ). Both these families share a similar form of the strand-4 motif with the signature DP followed by an aromatic and then by a small residue. Trm1 functions as a tRNA N2,N2‐dimethylguanosine-26 methyltransferase ( 65 , 66 ) and MicA probably performs a similar, although not identical, role in bacteria and eukaryotic mitochondria. These two families might represent the archaeo-eukaryotic and bacterial branches, respectively, of an ancestral methylase that was represented in LUCA.
All the other groups of RNA methylases appear to have been derived, independently, on more than one occasion in evolution, from within the vast assemblage of small molecule and protein methylases. None of these families is traceable to LUCA; instead, they are restricted in their distribution to only one or two of the primary kingdoms. Two of these families, the Abd1p family that methylates the eukaryotic mRNA cap, and Yml014w family that is fused, in some cases, to the AlkB domain (see below), have a dyad of aromatic residues in the 4th position after the end of strand 4. This feature suggests their derivation from within the vast class of small molecule methylases. The Yml014w family has additionally lost the polar residue (D/N) at the end of strand 4. Also derived from within this small molecule methylase assemblage is the family typified by the plant Corymbosa2/Hen-1 protein. Predicted methylases of this family are present in the crown group eukaryotes and in some bacteria, such as Streptomyces and Nostoc , and retain a single aromatic residue in the 4th position after the end of strand 4. The plant representatives of this family are fused to an N-terminal RNA-binding LA domain and a double-stranded RNA-binding domain (dsRBD) (Table 1 and Fig. 2 ), which suggests that these proteins are RNA methylases that probably methylate substrates containing double-stranded regions (see below). The GCD14 family of methylases ( 67 , 68 ), which methylate A58 of tRNAs in position 1, was derived in the archaeo-eukaryotic lineage and is more closely related to protein arginine and carboxyl group methylases than to other RNA methylases. These methylases have been sporadically transferred to bacteria, such as M.tuberculosis and A.aeolicus . They are distinguished by the presence of a distinct C-terminal domain similar to the transcript cleavage factor GreA ( 69 ). 
This family appears to have undergone a duplication in eukaryotes, giving rise to a paralog, GCD10, whose methylase domain shows a disruption of the Rossmann-fold loop and the strand-4 region. The RrmA family that methylates G745 in position 1 in LSU rRNA ( 70 ) is another family that appears to have been derived from the small molecule methylases late in bacterial evolution, followed by inter-bacterial dispersion via horizontal transfer.
Thus, Rossmann-fold methylases appear to have been recruited for RNA methylation at an early stage of evolution, well before LUCA. From this ancient, ancestral methylase, the significant majority of the RNA methylases, including the five to six aforementioned methylase families that were probably already present in LUCA, were derived. Extensive duplication, later in evolution, particularly in eukaryotes, resulted in the formation of several more families within this large, monophyletic assembly of RNA methylases. Additionally, lineage-specific RNA methylases were apparently derived independently, on multiple occasions, from within the small molecule and protein methylase clade. At early stages of their evolution, RNA methylases formed stable fusions with several distinct RBDs, such as the S4, PUA ( 9 ), TRAM ( 11 ), THUMP ( 12 ), NusB and a potential OB-fold domain (in Trm5) ( 71 ) (Fig. 2 ). In addition, in eukaryotes, fusions of RNA methylases to eukaryote-specific RBDs, including RRM and CCCH domains in the TrmA-family methylases and a G-patch domain ( 18 ) in the FtsJ family, were detected. These fusions appear to have emerged relatively late in eukaryotic evolution and probably participate in the methylation of eukaryote-specific snRNAs. Most of these pan-bacterial families of methylases appear to have been horizontally transferred to the eukaryotic genomes as a consequence of organellar endosymbiosis, resulting in a bacterial–eukaryotic distribution pattern. The identification of several uncharacterized RNA methylase groups in this analysis (Table 1 ) may help in further investigations of the diversity of this crucial RNA modification.
Pseudouridine synthases. The modified base pseudouridine is synthesized by pseudouridine synthases via in situ isomerization of uridines in tRNAs, rRNAs and eukaryotic snRNAs, such as U5 and U3 ( 46 , 72 ). Pseudouridine synthases belong to two apparently unrelated superfamilies, one of which (Type I PSUS) includes the four principal ancient families, RluD, RsuA, TruB and MJ0041, whereas the other superfamily (Type II PSUS) consists of a single ancient lineage typified by TruA ( 22 , 73 , 74 ) (Fig. 3 ). Type II PSUS are present in a single copy in all proteomes, except for eukaryotes that have at least three enzymes of this superfamily. Within the Type I PSUS superfamily, the TruB family is traceable to LUCA; several members of this family are fused to a PUA domain, suggesting that this was the ancestral PSUS Type I domain architecture. The RluD and RsuA families originated in bacteria; each family includes several members containing the S4 RBD ( 9 ), which was probably present in the ancestor of these families, but was subsequently lost on multiple secondary occasions. Conversely, the THUMP-domain-containing MJ0041 family of PSUS appears to be an innovation specific to the archaeal lineage. The RluD family has been secondarily transferred to the eukaryotes, probably via the pro-mitochondrial endosymbiont. Type I PSUS are predicted to adopt an α+β fold; the crystal structure of the Type II PSUS shows the presence of a core RRM-like fold common to several ancient nucleic acid-binding domains ( 75 ). This, taken together with the use of guide RNAs by the eukaryotic PSUS, suggests that Type II PSUS might have evolved from an ancient RBD that functioned in conjunction with a ribozyme, with a gradual shift of the active site from the RNA to the protein component.
Enzymes involved in base thiolation. A variety of thio-bases are represented in cellular RNAs, the most common ones being 2- or 4-thiouridines and their derivatives, and 2-methylthioadenine derivatives. The methylthioadenines are typically additionally modified with bulky adducts, such as threonine or 4-hydroxyisopentene in the N6 position. Recently, the enzyme responsible for adenine thiolation in E.coli , MiaB ( 76 ), has been identified and shown to consist of a C-terminal RNA-binding TRAM domain and an N-terminal biotin synthase-like, metal cluster-containing catalytic domain that is predicted to catalyze sulfur insertion via SAM-dependent organic radical generation ( 11 , 77 , 78 ). MiaB-like proteins are universally present in all life forms, indicating their origin prior to LUCA. Several organisms encode more than one version of this enzyme, which appear to have diversified through early duplications; these multiple forms might differentially function in the synthesis of different 2-methylthioadenine derivatives, such as 2-methylthio-N6-threonyl carbamoyladenosine and 2‐methylthio-N6-methyladenosine ( 46 ).
Thiouridine synthase (ThioUS; ThiI protein in E.coli ) is involved in the synthesis of 4-thiouridine in tRNAs and has a core PP-ATPase domain ( 79 ), which catalyzes adenylylation of the 4-carbonyl group of uridine, followed by sulfur insertion catalyzed by a rhodanese-like enzyme ( 80 , 81 ). This rhodanese-like enzyme either comprises a distinct domain of the ThiI protein or functions as a stand-alone protein. 2-Thiouridine is universally present in tRNA, and 2-thiouridine derivatives, typically containing an additional modification of a methyl or aminomethyl group in position 5, are also common. One of the enzymes involved in 2-thiouridine synthesis, TrmU, has been identified ( 82 ). This protein contains a PP-ATPase domain with an unusual conserved cysteine dyad inserted after strand 3 in the PP-loop domain. This suggests that syntheses of 2-thiouridine and 4-thiouridine follow similar biochemistry, which involves activation of the carbonyl group by adenylylation. In TrmU‐like enzymes, the internal conserved cysteines might directly participate in sulfur insertion as a functional counterpart of the separate rhodanese-like domain, which is required for 4-thiouridine formation.
Previously, we predicted that the MJ0066 family represents a novel family of archaeal ThioUS, on the basis of the fusion of a PP-ATPase domain with a PUA domain ( 9 ). Here, we systematically investigated other PP-ATPase families that potentially could be involved in thiouridine or thiocytidine synthesis by examining fusions with RBDs, association with the ribosomal super-operon and conserved phyletic patterns typical of RNA metabolism proteins. As a result, the MTH271-MJ1157 family, which showed fusions with the KH and Zn‐ribbon domains, and the MJ0690 family, which is associated with the ribosomal super-operon in different archaeal genomes, emerged as candidates for these functions (Fig. 3 ). Furthermore, the MesJ family, which is closely related to the TrmU family, is universally conserved in all bacteria and potentially also could be involved in base thiolation.
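The screening logic described above combines three independent lines of contextual evidence: fusion to an RBD, association with the ribosomal super-operon, and a conserved phyletic pattern. The following is a minimal sketch of how such evidence could be tabulated, assuming a simple record per family; the class name, field names and the final "unlinked" entry are hypothetical, and the evidence flags merely restate what the text reports for each family rather than reproducing the underlying analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyEvidence:
    """Contextual evidence linking a PP-ATPase family to RNA metabolism (illustrative)."""
    name: str
    rbd_fusions: tuple = ()                   # RNA-binding domains fused to the PP-ATPase domain
    ribosomal_superoperon: bool = False       # gene-neighbourhood association
    conserved_phyletic_pattern: bool = False  # conservation typical of RNA metabolism proteins

    def supporting_lines(self):
        """List the lines of evidence that are present for this family."""
        lines = []
        if self.rbd_fusions:
            lines.append("RBD fusion: " + ", ".join(self.rbd_fusions))
        if self.ribosomal_superoperon:
            lines.append("ribosomal super-operon association")
        if self.conserved_phyletic_pattern:
            lines.append("conserved phyletic pattern")
        return lines


# Flags restate the observations given in the text; the last entry is a
# hypothetical negative control with no supporting evidence.
families = [
    FamilyEvidence("MJ0066", rbd_fusions=("PUA",)),
    FamilyEvidence("MTH271-MJ1157", rbd_fusions=("KH", "Zn-ribbon")),
    FamilyEvidence("MJ0690", ribosomal_superoperon=True),
    FamilyEvidence("MesJ", conserved_phyletic_pattern=True),
    FamilyEvidence("hypothetical_unlinked_family"),
]

# A family is retained as a candidate if at least one line of evidence supports it.
candidates = {f.name: f.supporting_lines() for f in families if f.supporting_lines()}
for name, lines in candidates.items():
    print(name, "->", "; ".join(lines))
```

Treating any single line of evidence as sufficient is a deliberately permissive choice for this sketch; such contextual evidence yields functional predictions that still require experimental confirmation.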
The ThiI-family proteins contain an N-terminal THUMP domain and are bifunctional proteins that additionally participate in thiamin biosynthesis ( 80 , 81 , 83 ). These proteins are ubiquitous in archaea, but sporadic in bacteria, suggesting that they originated in archaea, with several subsequent horizontal transfers to bacteria. In contrast, the TrmU family is absent in archaea, but is nearly universal in bacteria and eukaryotes, suggesting origin in the bacterial lineage, followed by an early transfer to the eukaryotes, probably via the pro-mitochondrial route. These phyletic patterns do not seem to be consistent with the universal distribution of the simple 2-thiouridine modification in tRNAs ( 46 , 84 ). The only predicted universal ThioUS are the members of the MJ1157 subfamily (Fig. 3 ) of the MTH271-MJ1157 family containing N- and C-terminal Zn‐ribbon domains. The universal distribution, taken together with the distinct bacterial and archaeo-eukaryotic clades detected during the phylogenetic analysis of this family (data not shown), suggests that these enzymes are the 2-thiouridine synthases, whereas TrmU is likely to be specifically involved in 5-methyl-2-thiouridine biosynthesis. The presence of a conserved cysteine dyad insert in the PP-ATPase domain of the MTH271-MJ1157 family, similar to TrmU, might indicate an analogous catalytic mechanism. Archaea and eukaryotes have additional subfamilies of the MTH271-MJ1157 family (Fig. 3 ) that, along with the MJ0066 family, could compensate for the absence of TrmU or ThiI in some of these lineages. The sporadic presence of ThiI in bacteria suggests that it might be substituted by a more widespread, but so far unidentified 4‐thiouridine synthase specific to bacteria; the conserved MesJ protein could be one candidate for this function.
Queuosine and archaeosine synthases. The 7-deazaguanosines, queuosine and archaeosine, found, respectively, in bacteria and eukaryotes, and in archaea, are incorporated into tRNA through transglycosylation ( 46 , 84 ). The queuosine transglycosylase, Tgt ( 85 , 86 ), is present in a single copy in most bacteria, whereas eukaryotes, with the exception of yeast, encode two forms of this enzyme. Archaea have two distinct proteins of the archaeosine transglycosylase family, which are distantly related to Tgt ( 87 ). The MJ1022 subfamily so far has been found only in Euryarchaea; A.fulgidus , in addition, has a single copy of queuosine transglycosylase, apparently a lineage-specific acquisition from bacteria (Fig. 3 ). The complementary distribution of queuosine and archaeosine transglycosylases suggests that they originally diverged from a common ancestor with a TIM barrel fold ( 88 ), concomitantly with the split of the bacterial and archaeo-eukaryotic lineages. In archaea, the catalytic domain was fused with the RNA-binding PUA domain and this form of archaeosine transglycosylase underwent a duplication in Euryarchaea (Fig. 3 ). In eukaryotes, acquisition of the bacterial queuosine synthase through horizontal transfer from the pro-mitochondrion probably resulted in displacement of the ancestral archaeo-eukaryotic archaeosine synthase, with a further duplication leading to the forms involved in modification of organellar and cytoplasmic tRNAs.
RNA deaminases. RNA deaminases are responsible for the synthesis of certain modified nucleosides, such as inosine, and for base conversions during various RNA editing reactions. The cytidine deaminase family includes generic enzymes that catalyze generation of uridine from cytidine. In yeast, these enzymes are responsible for C→U editing ( 89 ), suggesting that they might perform a similar function in many, if not all, eukaryotes. Plants show an expansion of a specialized form of this family, with an N-terminal inactive deaminase domain, in addition to the C-terminal active one; conceivably, these proteins might be involved in a plant-specific form of regulated RNA editing. Deaminases of the Tad2p family, which generate uracil from cytosine and inosine from adenosine in the wobble position of tRNAs ( 90 , 91 ), are present in most bacteria and all eukaryotes, but not in archaea (Fig. 3 ). The Tad3p family, which comprises the second subunit of the inosine-generating deaminase, is eukaryote specific. The combination of Tad2p and Tad3p probably confers the specificity that differentiates this enzyme from generic cytosine deaminases. The eukaryote-specific Tad1p family of deaminases ( 92 ) is involved in inosine generation at A37 of tRNA Ala and in adenine editing of mRNAs in animals ( 93 , 94 ). The animal versions typically have the characteristic dsRBD fused to the catalytic domain, whereas one of the vertebrate paralogs contains a winged helix–turn–helix domain (Fig. 3 ). Cytosine deaminases of the vertebrate-specific APOBEC family are involved in C→U editing and are represented by at least eight paralogs in mammals ( 95 ). These enzymes appear to have been recently derived from the cytidine deaminases through rapid divergent evolution. 
The deaminases related to the RibD protein, which is involved in riboflavin biosynthesis, are fused to a Type I PSUS in S.cerevisiae and to a potential RNA methylase in S.pombe , suggesting that, similarly to cytidine deaminases, they might be involved in specific editing processes (Figs 2 and 3 ).
Specific RNA deaminases of known families are nearly absent in archaea. The corresponding functions might have been taken over by unrelated, still unknown enzymes or, at least in some cases, could be provided by related enzymes of the deoxycytidine deaminase family that are present in some archaea. This phyletic pattern suggests a bacterial origin for at least two of the major deaminase lineages, cytidine deaminases and cytosine deaminases. Following their acquisition by eukaryotes from the bacterial endosymbiont, cytosine deaminase underwent duplication to give rise to the two A→I deaminases involved in wobble-specific inosine synthesis. Additionally, members of both the cytidine and cytosine deaminase lineages were independently recruited for mRNA editing in vertebrates and possibly in other eukaryotic lineages (Fig. 3 ).
Dihydrouridine synthases. Dihydrouridine synthases are poorly characterized enzymes that synthesize dihydrouridine through the reduction of the aromatic ring of uracil. This base is widely found in tRNAs from all three primary kingdoms and in LSU rRNA from prokaryotes ( 96 , 97 ). The yeast dihydrouridine synthase Dus1p belongs to the superfamily of FAD-binding TIM barrel oxidoreductases typified by dihydroorotate dehydrogenase ( 98 ). This enzyme is universally represented in eukaryotes and bacteria, but completely missing in archaea. Eukaryotes have four main lineages within this family, which are typified by the yeast proteins Dus1p, Smm1p, Ylr405wp and Ylr401cp. The members of the first three families typically show fusions with the LRP1 Zn-finger, dsRBD and CCCH RBDs, respectively (Fig. 3 ); these RBDs probably target dihydrouridine synthases to specific sites in the substrate RNAs. Bacteria have at least three principal lineages of dihydrouridine synthases typified by the YhdG, YohI and YjbN proteins from E.coli (Fig. 3 ). The phyletic pattern of dihydrouridine synthases suggests that this enzyme emerged early in bacterial evolution and was transferred to eukaryotes, probably via the endosymbiotic route. The diversification of dihydrouridine synthases into multiple forms apparently occurred independently in bacteria and eukaryotes. Dihydrouridine has been detected in tRNAs of T.acidophilum and M.thermoautotrophicum , but appears to be missing in other archaea studied to date ( 99 , 100 ). Hence, at least in those archaea that appear to contain this modification, an alternative as yet undiscovered enzyme is likely to be present.
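The reasoning applied repeatedly in this section, from the distribution of a family across the three primary kingdoms to its probable point of origin, can be summarized as a coarse decision rule. The following is a minimal sketch of that rule under strong simplifying assumptions; the function name and the returned labels are hypothetical. It ignores sporadic occurrences, horizontal transfer between prokaryotes and tree topology, all of which were taken into account in the parsimony-based reconstruction and phylogenetic analyses on which the conclusions in the text rest.

```python
KINGDOMS = ("bacteria", "archaea", "eukaryotes")


def interpret_phyletic_pattern(present_in):
    """Map the set of kingdoms in which a family is widespread to a coarse scenario."""
    present = frozenset(present_in)
    unknown = present - frozenset(KINGDOMS)
    if unknown:
        raise ValueError(f"unknown kingdom(s): {sorted(unknown)}")
    if present == frozenset(KINGDOMS):
        return "candidate LUCA component"
    if present == {"archaea", "eukaryotes"}:
        return "archaeo-eukaryotic innovation"
    if present == {"bacteria", "eukaryotes"}:
        return "bacterial origin, possible endosymbiotic transfer to eukaryotes"
    if present == {"bacteria", "archaea"}:
        return "prokaryotic distribution, origin unresolved"
    if len(present) == 1:
        return f"lineage-specific innovation in {next(iter(present))}"
    return "no data"


# Distributions as stated in the text for three families.
examples = {
    "MiaB": {"bacteria", "archaea", "eukaryotes"},
    "Fibrillarin/Nop1": {"archaea", "eukaryotes"},
    "dihydrouridine synthase (Dus1p-like)": {"bacteria", "eukaryotes"},
}
for family, kingdoms in examples.items():
    print(family, "->", interpret_phyletic_pattern(kingdoms))
```

The rule reproduces the interpretations given in the text for these three cases, but it is only a first-pass filter: a bacterial–eukaryotic pattern, for instance, is also compatible with loss in archaea, and distinguishing the alternatives requires phylogenetic analysis.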
NTP-dependent enzymes involved in RNA metabolism
In addition to the PP-loop ATPases discussed above in the context of base modification, a variety of ATP- and GTP-utilizing enzymes of the P-loop NTPase fold are involved in RNA modification, processing and splicing and especially in translation itself. In addition, aminoacyl-tRNA synthetases (aaRS), which belong to two other distinct, ancient classes of ATP-utilizing enzymes, are central to the translation process. Evolutionary relationships of aminoacyl-tRNA synthetases have been examined in detail in several recent studies ( 10 , 36 , 101 , 102 ). Here, we briefly summarize the evolutionary history of the vast class of P-loop NTPases in the context of their repeated utilization in RNA metabolism.
GTPases. P-loop GTPases are among the central, most ancient components of RNP complexes and at least nine distinct GTPases associated with different aspects of translation are traceable to LUCA. These include the four translation factors involved in initiation and elongation, two distinct versions of the OBG family of GTPases containing the RNA-binding TGS domain, the circularly permuted YlqF-like GTPases, and two GTPases associated with the signal recognition particle and its receptor. The first seven of these families belong to a large assemblage of GTPases related to the translation factors (the TRAFAC class), whereas the remaining two are members of the signal recognition/MinD/BioD (SIMIBI) class of GTPases and related ATPases ( 103 ). These two classes correspond to the first fundamental split in the evolution of GTPases and, because both classes include proteins involved in translation, it appears likely that the primordial GTPase was a component of an ancient RNP complex that functioned as a generic regulator of translation. Even prior to LUCA, the GTPases had diversified through several duplications to perform more specific, essential functions in translation and secretion. After the radiation of the major lineages of life, many GTPases were recruited for specific functions within the translation system, such as translation termination and RNA modification and processing. The Era family GTPases ( 104 ), which contain a C-terminal domain that is a topologically rearranged version of the KH domain, the PseudoKH domain ( 105 ), and the TrmE (ThdF) family were derived in bacteria within the TRAFAC class of GTPases and participate in rRNA and tRNA modification. TrmE is involved in the synthesis of the modified nucleotide 5‐methylaminomethyl-2-thiouridine in tRNAs ( 106 ). The archaeo-eukaryotic Clp1 GTPase family of the SIMIBI class was recruited to participate in polyadenylation site selection ( 107 ). 
In eukaryotes, a distinct paralogous derivative of the universal translation factor EF-2, typified by Snu114p, acquired a new function in splicing as a component of the U5 RNP ( 103 ). Further details of GTPase evolution are presented elsewhere ( 108 ).
RNA helicases. The next major class of P-loop NTPases that are associated with RNA metabolism are RNA helicases and related ATPases. The known RNA helicases of cellular life forms belong to two major superfamilies, SFI and SFII, that descend from an ancient common ancestor antedating LUCA. This ancestral helicase contained two distinct α/β domains that are present in both SFI and SFII ( 109 ). The N-terminal domain is a classic P-loop ATPase domain that belongs to the RecA-like subclass of P-loop domains ( 110 , 111 ). The C-terminal domain appears to represent an extremely divergent P-loop domain that might have evolved through an ancient duplication of the N-terminal domain, followed by extreme sequence divergence, which probably accompanied a functional shift to single-strand nucleic acid binding. The extant lineages of SFI and SFII helicases include both DNA and RNA helicases, and other nucleic acid-dependent ATPases. Among the helicases involved in RNA metabolism, SFII occupies a more prominent position than SFI; SFII helicases are much more prevalent in eukaryotes than in bacteria (Fig. 4 ). Seven major families of SFII helicases have experimentally characterized or clearly predicted roles in RNA metabolism. Two of these, namely the eIF4A-DeaD family (with the classic DEAD motif in the Walker B site) and the Ski2p-Lhr family, are widespread in all three primary kingdoms, which points to their presence in LUCA. Within the eIF4A-DeaD family, the orthologous group typified by the bacterial DeaD protein, which is involved in translation regulation ( 112 ), is widely represented in bacteria and archaea and might be the form closest to the ancestor of this family. In eukaryotes, this family has vastly expanded to include at least 30 distinct lineages, with almost 25 of them traceable to the common ancestor of the crown group (Fig. 4 ). 
Most members of this expanded helicase subfamily are subunits of pre-mRNA splicing complexes, whereas some others, such as Rrp3p ( 113 ), function in other RNA processing pathways, and Upf1p is involved in mRNA degradation ( 114 ). The pan-eukaryotic translation initiation factor eIF4A appears to be the direct equivalent of the prokaryotic DeaD-like helicases, and its function in eukaryotes might be an extension of the ancient role of these helicases in regulatory unwinding of mRNA secondary structure. Proteobacteria have a lineage-specific expansion of the DeaD lineage, with additional orthologous groups, such as RhlE and RhlB ( 115 ), whereas most of the other bacteria have only a single member.
The Ski2p-LHR family is a much smaller family whose ancestral form probably was involved in RNA degradation and processing ( 116 , 117 ). Archaea typically have three distinct helicases of this family, whereas eukaryotes have four members of the Ski2p-Mtr4p-like subfamily, all of which apparently function in conjunction with the exosomal nucleases in RNA degradation (Fig. 4 ). Another eukaryote-specific orthologous group within this family includes Brr2p-like proteins, which contain two helicase and sec63 domains and are involved in both cytoplasmic RNA processing and splicing as a component of U5 snRNP ( 118 ). One orthologous group within the Ski2p-LHR family, which is typified by mus308 of D.melanogaster and MJ1401 of M.jannaschii , appears to have been recruited for DNA-related functions in the archaeo-eukaryotic lineage and, in eukaryotes, shows a fusion to a DNA polymerase domain ( 119 , 120 ).
The remaining families of SFII helicases involved in RNA metabolism show purely eukaryotic, bacterial or bacterio-eukaryotic distribution. The Suv3 family involved in mitochondrial RNA degradation ( 121 ) and the CAF family involved in PTGS are small groups that are restricted to eukaryotes and appear to function in eukaryote-specific regulatory processes (see below). The Prp2p-Mle subfamily is found in both bacteria and eukaryotes. Eight distinct orthologous groups can be delineated within this family in eukaryotes, with the majority involved in splicing, including Prp2p, Prp16p, Prp43p, Prp22p and Mle ( 122 ). The HrpA/B proteins are bacterial representatives of the Prp2p-Mle family that are present only in proteobacteria, spirochetes and Deinococcus , which suggests dissemination via horizontal transfer among bacteria, although the initial direction of horizontal transfer responsible for the bacterio-eukaryotic distribution remains uncertain. The SecA family proteins are ubiquitous in bacteria and plants and have been shown to possess RNA helicase activity ( 123 – 125 ). However, the role of this activity in vivo remains unclear because SecA also has a well characterized function as an ATP-dependent translocase involved in protein secretion. The RecQ family of SFII helicases is unusual in that these proteins have functions in both DNA repair and RNA metabolism. This family is represented only in bacteria and eukaryotes, with a single horizontal transfer into the crenarchaeon A.pernix . This distribution suggests that the RecQ family originally evolved in bacteria and was subsequently acquired by eukaryotes from the pro-mitochondrial endosymbiont, which was followed by extensive diversification into at least five distinct orthologous, eukaryote-specific groups. 
Many members of this family share a predicted RBD, the HRDC domain ( 126 ), with the RNase D family of nucleases, suggesting that the ancestral function of the RecQ family helicases might have been in RNA metabolism, with a subsequent shift to DNA-related functions. A member of this family from Neurospora has been shown to have a role in RNA metabolism, in particular PTGS ( 127 ). Orthologs of this protein are present in other eukaryotes; furthermore, fusion of the RecQ family helicases with the Zn-knuckle and the F-box domains in plants and animals (see Figs 4 and 6 ) indicates that this family might have more extensive RNA-related functions than presently conceived.
Several SFI helicases are implicated in RNA-related functions in eukaryotes; they all belong to the Smubp-Sen1p family, which is conserved throughout the archaeo-eukaryotic lineage and in a few bacteria ( 128 ). This family includes both DNA and RNA helicases and probably emerged early during the evolution of the archaeo-eukaryotic clade, rather than in LUCA. All archaeal members of the Smubp-Sen1p family are orthologs of the eukaryotic Smubp, which is a DNA-binding protein ( 129 ). However, the presence of the single-stranded nucleic acid-binding R3H domain ( 15 ) in some of the eukaryotic members of this family might point to an undiscovered role in RNA metabolism (see Figs 4 and 6 ). All known eukaryotic SFI RNA helicases of the Smubp-Sen1p family were derived after the divergence of eukaryotes from the common ancestor with archaea (Fig. 4 ). Five distinct lineages of RNA helicases of this family emerged prior to the divergence of the crown group eukaryotes and include proteins involved in a variety of functions, such as snoRNA maturation [Sen1p ( 130 )], mRNA degradation [Nam7p ( 131 )] and PTGS [Sde3 ( 132 )] (Fig. 4 ). One of the eukaryotic SFI lineages, represented by the S.pombe SPCC1739.03 and its orthologs, is closely related to the NAM7p subfamily and is an uncharacterized group of predicted RNA helicases, which, on the basis of their phyletic pattern ( 133 ), are likely to participate in PTGS (Fig. 4 ). Another distinct pan-eukaryotic family, typified by the Aquarius protein ( 134 ), is predicted to include inactive helicases as indicated by the disruption of the P-loop and Walker B motifs; these proteins probably function as RNA-binding regulators, rather than as enzymes. Two small, lineage-specific expansions of these helicases were detected in Arabidopsis and C.elegans , typified by the F1E22.14 (eight members) and K08D10.5 (six members) proteins, respectively; these might represent specific adaptations for antiviral response or related processes. 
Most of the other SFI families, such as the RecD family, appear to have evolved in bacteria and are known to be involved only in DNA repair and recombination ( 119 ).
The PhoH family of ATPases ( 135 ) evolved in bacteria, apparently through the loss of the C-terminal α/β domain that was present in the common ancestor of SFI and SFII helicases. A role in RNA metabolism is strongly suggested by the presence of RNA-binding PIN and KH domains in different members of the PhoH family (Fig. 4 ). There are two orthologous groups of PhoH-like ATPases, typified by PhoH and YlaK, respectively, that evolved as a result of an early duplication in the bacterial lineage. The PhoH proteins could either function as helicases or could be involved in ATP‐dependent dynamics of as yet uncharacterized RNP complexes in bacteria.
Miscellaneous P-loop NTPases involved in RNA metabolism. In addition to the above, well characterized classes of P-loop NTPases involved in RNA metabolism, several others have less common and less thoroughly understood RNA-related functions. The most notable of such groups includes the PilT ATPases, which form a distinct class within the P-loop fold and appear to be a sister group to the ABC class (D.D.Leipe, E.V.Koonin and L.Aravind, unpublished data). The PilT ATPases implicated in RNA metabolism appear to be predominantly an archaeal innovation and are typified by MJ1533 and its orthologs that are highly conserved in archaea ( 136 ). These proteins combine the PilT ATPase domain with RNA-binding PIN and KH domains. In bacteria, a group of PilT ATPases is present sporadically in Bacillus and Synechocystis and form fusions with the RNA-binding R3H domain. These ATPases might represent a novel class of RNA helicases or could participate in other ATP-dependent reactions of RNA metabolism.
Some kinases of the P-loop fold, such as polynucleotide kinases, also participate in RNA metabolism. A generic polynucleotide kinase that probably acts on both DNA and RNA seems to be conserved in all eukaryotes except for S.cerevisiae ( 137 – 139 ). Additionally, some lineage-specific P-loop kinases are implicated in RNA metabolism on the basis of suggestive domain fusions, including the kinase fused to yeast RNA ligase ( 140 , 141 ) and the animal-specific hnRNP-U (SAF-A) proteins, which contain a SAP domain, and might function as chromatin-bound polynucleotide kinases in pre-mRNA splicing ( 142 , 143 ). P-loop kinase domains are also fused to the ligase-related nucleotidyltransferase domains of the capping enzyme in trypanosomes ( 144 ).
The P-loop proteins of the MiaA family modify adenines, chiefly in tRNAs, through the addition of bulky adducts, such as isopentene, in position 6, using organic phosphates, e.g. dimethylallyl diphosphate, as donors of the modifying groups ( 145 , 146 ). These enzymes are distantly related to the AAA+ class of P-loop ATPases and are nearly ubiquitous in bacteria and eukaryotes, which is consistent with the phyletic pattern of 6-isopentenyl adenines in tRNA. MiaA probably evolved in the common ancestor of bacteria and was acquired by eukaryotes from the promitochondrial endosymbiont. On the basis of operon organization, it can be predicted that, at least in certain bacteria, such as proteobacteria, Aquifex and Synechocystis , MiaA utilizes the Hfq protein (the bacterial homolog of the eukaryotic SM proteins) as an RNA-binding subunit.
Other enzymes of RNA metabolism
At least 15 superfamilies of RNases are involved in a variety of processes, such as maturation of tRNAs and rRNAs, polyadenylation site-specific cleavage of mRNAs, and RNA degradation in various contexts and cellular compartments. A detailed evolutionary classification of RNases has been published recently ( 45 ), and therefore individual groups of these enzymes are not discussed here in detail. However, we cover some specific aspects of their evolution when reconstructing the evolution of individual functional systems in RNA metabolism (see below).
In addition, a number of other enzymes that form relatively small families, sometimes with restricted phyletic distribution, are involved in RNA metabolism. One such group is the RNA ligases that are related to the DNA ligases and appear sporadically in cellular life forms. The fungi possess an RNA ligase, which is required for tRNA maturation and for non-spliceosomal mRNA maturation ( 147 ), whereas in trypanosomes RNA ligases participate in mRNA editing ( 38 , 148 ). Homologs of these RNA ligases are encoded by several DNA viruses, including phage T4, baculoviruses and entomopox viruses ( 38 ). This observation, together with the sporadic distribution of RNA ligases, might suggest that cellular organisms acquired these enzymes independently from DNA viral sources. Additionally, a variety of other nucleotidyltransferases are involved in non-templated polymerization of ribonucleotides during polyadenylation of mRNAs, CCA addition in tRNAs and RNA editing. All these enzymes have the DNA polymerase β-fold ( 149 ) and are considered in greater detail below in the context of evolution of the capping and polyadenylation systems.
Cyclic phosphodiesterases of the LigT superfamily hydrolyze 2′-5′ phosphoesters in various contexts in RNA metabolism. The most conserved of these enzymes form the core LigT family, which apparently evolved in the archaeo-eukaryotic lineage, with a few transfers into bacteria; the animal members of this family have a fusion with the RNA-binding KH domain. They apparently catalyze hydrolysis of ADP-ribose 1″,2″-cyclic phosphate that is formed as an intermediate in tRNA processing ( 150 , 151 ). Additional members of this superfamily, which are not orthologs of LigT, were identified as fusions with RNA ligases in yeast, in RNA viral polyproteins, and as stand-alone proteins in Arabidopsis (L.Aravind and E.V.Koonin, unpublished observations); these proteins might have related phosphodiesterase activities in RNA metabolism.
The Macro domain (first detected in vertebrate macrohistone 2) is another highly conserved phosphoesterase that is involved in Appr-1″-p-processing ( 152 ), as part of tRNA maturation. Macro domain phosphoesterases are conserved across the three superkingdoms of life, which is compatible with the presence of such an enzyme in LUCA. Finally, several families of enzymes, such as the enigmatic RNA-dependent RNA polymerases ( 153 – 156 ), and AlkB-like oxoglutarate-dependent dioxygenases ( 157 ), show a limited phyletic distribution. Most of these are known or predicted components of the eukaryotic post-transcription regulatory systems and are further explored below in the context of evolution of these functional systems.
Evolutionary history and trends of non-catalytic domains involved in RNA metabolism
Approximately 50 major superfamilies of non-catalytic domains, primarily RNA-binding ones, are implicated in RNA metabolism (Fig. 1 A and B and Table 1 ). In addition, several conserved domains are found exclusively in ribosomal proteins. Below we consider some of the general and specific features of the natural history of these domains that emerge from a detailed analysis of their phyletic patterns combined with attempts at evolutionary classification.
Evolutionary mobility of domains. RBDs show remarkable diversity in terms of domain architectures. Several RBDs, such as ribosomal protein L30 and the SRP14-domain, typically occur as stand-alone proteins and in a single copy per genome. At the other end of the spectrum are ‘promiscuous’ domains, such as RRM, which display over 35 distinct multidomain architectures and are found in combination with up to 20 different domains (Figs 5 – 7 ). These observations suggest major differences in evolutionary mobility among RBDs. Certain highly conserved, ancient RBDs, such as L30, S6 and SmpB, appear to have largely stabilized in specific functional niches in the ribosome or in lineage-specific RNP complexes and are not typically recruited to roles in more general contexts related to RNA metabolism. In contrast, some other conserved domains found in ribosomal proteins, such as S1 ( 158 ), KOW ( 13 ) and S4 domains ( 9 ), have been recruited for a variety of other functions that involve RNA binding. Some of these domains (KOW, S4), along with other mobile RBDs, such as EMAP, PUA, PIN, TRAM, THUMP, TGS, N-OB, NusB ( 9 – 12 , 71 , 136 ) and several conserved domains found in aaRS ( 10 ), form a group of moderately mobile, ancient domains. The majority of the fusions that involved these domains appear to have evolved close to the origin of one of the superkingdoms or, in some cases, even in or prior to LUCA. Most of these architectures show remarkable parallelism of fusions of different RBDs to various RNA modification and processing enzymes. It appears that these RBDs emerged at early stages of evolution and, shortly after their origin, formed fusions that facilitated the delivery of diverse catalytic activities to RNA and hence were maintained in most lineages.
These moderately mobile domains formed lineage-specific fusions on relatively rare occasions, such as those of N-OB and EMAP to the C-termini of plant and vertebrate TyrRS, respectively ( 10 ), or the fusion of TRAM to a FtsJ-like methylase in Thermoplasma ( 11 ).
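The contrast between stand-alone and ‘promiscuous’ domains can be made concrete by counting, for each domain, the distinct architectures in which it occurs and the distinct partner domains with which it is combined. The following is a minimal Python sketch of such a tally. The function name `mobility_profile` is our own, and the input architectures are toy examples chosen for illustration, not data from this study.

```python
from collections import defaultdict

def mobility_profile(architectures):
    """Return {domain: (n_distinct_architectures, n_distinct_partners)}.

    `architectures` is an iterable of domain architectures, each given as a
    sequence of domain names in N- to C-terminal order. Stand-alone
    (single-domain) proteins count as an architecture with no partners.
    """
    archs = defaultdict(set)     # domain -> set of distinct architectures
    partners = defaultdict(set)  # domain -> set of co-occurring domains
    for arch in architectures:
        arch = tuple(arch)
        for domain in set(arch):
            archs[domain].add(arch)
            partners[domain].update(d for d in arch if d != domain)
    return {d: (len(archs[d]), len(partners[d])) for d in archs}

# Hypothetical toy input, for illustration only.
toy = [
    ("L30",),
    ("RRM", "RRM"),
    ("RRM", "Zn-knuckle"),
    ("RRM", "G-patch"),
    ("S1", "KH"),
    ("RRM", "Zn-knuckle"),  # repeated architecture, counted once
]
profile = mobility_profile(toy)
```

In this toy set, RRM appears in three distinct architectures with two partner domains, whereas L30 appears only as a stand-alone protein. Applied to a real architecture catalog, the same two counts would rank domains along the mobility spectrum described above.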
The next major phase of domain mobility coincided with the emergence of eukaryotes and continued through the divergence of the major eukaryotic lineages. This burst of mobility correlates with the origin of splicing and other post-transcriptional regulatory mechanisms in eukaryotes. Some of these domains, such as S1, dsRBD ( 159 ) and KH ( 160 , 161 ), were already present in LUCA as parts of ubiquitous ribosomal proteins or enzymes. These domains went through an initial phase of moderate evolutionary mobility, but experienced a new spurt of mobility in eukaryotes, each giving rise to several new architectures associated with splicing and other post-transcriptional regulatory processes. However, most of the domain shuffling events in eukaryotes involve relatively new, eukaryote-specific domains, such as RRM, Zn-Knuckle, CCCH, Little Finger, G-patch, SWAP and PWI.
Differential utilization of some ancient RBDs and high mobility of the eukaryote-specific domains point to two distinct evolutionary forces involved in the emergence of the complexity of eukaryotic RNA metabolism. First, it appears that most of the ancient, moderately mobile RBDs were not sufficiently versatile to occupy the new functional niches, such as splicing and PTGR. Exceptions include several ancient mobile domains, such as S1 and KH; proteins containing these domains in eukaryote-specific architectures have undergone lineage-specific expansions, which indicates greater functional versatility and adaptation to some of the new functional niches. These domains, however, largely formed combinations among themselves or with catalytic domains, akin to their more ancient versions, rather than with more recently invented domains. Secondly, the newly invented domains appear to have been recruited en masse to the new, eukaryote-specific functions close to the points of origin of these functions. Thus, through an evolutionary feedback process driven by duplication and repeated selection for the same set of newly derived domains, they started rapidly colonizing the new functional niches to the exclusion of the older, moderately mobile RBDs. This strong selection favoring the proliferation of the recently evolved, mobile domains also appears to have resulted in architectures that most frequently involved combinations among themselves rather than with the less common, ancient RBDs.
A brief history of major families of RBDs. The specific evolutionary histories of the common RBDs are important for understanding the emergence of the functional systems that comprise cellular RNA metabolism. Below we briefly consider the main events in the diversification of major RBD families.
OB-fold and other all-β strand domains. The OB-fold is a five‐stranded β-barrel, which is common to several superfamilies of nucleic acid-binding domains. Among the domains involved in RNA metabolism, the S1, S1-like, EMAP, N-OB and thermonuclease domains adopt the OB fold ( 55 , 71 , 158 ). Most of these domains were already represented in LUCA, which indicates that a major phase of divergent evolution of OB-fold domains took place at even earlier stages of evolution. Several of the OB-fold domains are seen in proteins that have been conserved throughout evolution as central components of the translation system. Ribosomal protein S12 and initiation factor IF1/eIF1A are the most conserved orthologous groups of S1-domain proteins, each traceable to LUCA. In addition, several conserved versions of the S1 domain are present in ribosomal protein S1, RNase E, RNase II, polynucleotide phosphorylase, the circularly permuted GTPases of the YjeQ family, Tex and NusA, all of which are (nearly) ubiquitous in bacteria and probably evolved at the onset of bacterial evolution. Conversely, the forms of the S1 domain present in eIF2-α, RpoE and Rrp4p/Rrp40p exosomal subunits go back to the base of the archaeo-eukaryotic clade. The Rrp5p and Prp22 lineages of S1 domains evolved in eukaryotes, whereas the SPT6p family appears to have evolved in eukaryotes, from a Tex-like ancestor that was acquired from bacteria. ‘S1-like’ domains belong to a lineage that is of bacterial origin and is represented by orthologous groups, such as the major cold shock protein (CspA), RNase II and transcription terminator Rho. Another OB-fold domain related to the S1 domains is the C-terminal domain of the universal translation factor EF-P/eIF5A, which appears to have branched off from all the other S1 domains prior to LUCA and has not shown any evolutionary mobility ever since.
The most ancient form of the EMAP domain seems to be the one in methionyl-tRNA synthetase ( 10 ), which is widely distributed throughout all three primary kingdoms. Additionally, a duplication at the base of the bacterial clade gave rise to the EMAP domain in the β-subunit of PheRS. Similarly, the most ancient lineage of N-OB domains ( 162 ) is the one that is present in AspRS; this domain underwent duplications to give rise to the forms present in LysRS and AsnRS in bacteria and eukaryotes, respectively. Other N-OB domains appear to have been recruited widely in various DNA metabolism enzymes, which suggests exaptation of an ancient RBD for DNA binding ( 162 ).
The SH3-like barrel ( 163 ) is another all-β fold, which is present in several non-catalytic domains involved in RNA metabolism, such as the KOW, SM, L21E, L2 and tudor domains. The KOW domain present in the ribosomal protein L24, NusG/Spt5 and EF-P/eIF5A evolved prior to LUCA and the KOW-domain-containing proteins have largely retained their architectures ever since. The eukaryotic ortholog of NusG, Spt5, contains four or five divergent copies of the KOW domain, apparently resulting from a previously undetected amplification. The SM domain ( 164 – 166 ) also appears to have been present in LUCA, although it seems to have been subsequently lost in several bacterial lineages. This domain is unusual in that it always occurs as a stand-alone protein, suggesting selection against the formation of multidomain architectures, the underlying cause of which remains unclear. Prokaryotes encode one or two SM-domain proteins, whereas, in eukaryotes, 16 distinct orthologous groups of SM proteins already evolved prior to the radiation of the crown group, which is consistent with large-scale recruitment of this domain to snRNP complexes involved in splicing. The L2 domain seen in the universal ribosomal protein L2 is an orphan version of the SH3-like barrel fold that might have been derived from the ancient KOW superfamily, with subsequent extreme sequence divergence. Similarly, the L21E domain of the archaeo-eukaryotic lineage might be a divergent derivative of the more universal superfamilies of the SH3-like fold. The TUDOR domain ( 167 ) is also related to the SM and L2 domains and appears to have been derived from one of them in eukaryotes. Several members of the tudor superfamily appear to have lost the RNA-binding function and participate in protein–protein interactions in the splicing snRNP complexes ( 168 ); some divergent versions even function in chromatin structure maintenance (L.Aravind, unpublished observations).
At least four distinct orthologous groups of proteins containing TUDOR domains with functions in RNA, such as Drosophila TUDOR itself and its orthologs and the splicing factor SPF30, had evolved prior to the radiation of the eukaryotic crown group. A few additional TUDOR-containing proteins emerged in animals, such as the Drosophila RNA helicase Homeless and the SMN‐like proteins.
The PUA domain and the TRAM domain are two other RBDs that are confidently predicted to have an all-β fold ( 9 , 11 ); however, they cannot be classified with any of the folds discussed above in the absence of a 3D structure. Both PUA and TRAM are ancient, moderately mobile domains that typically are associated with various RNA-modifying enzymes (Table 1 and Figs 2 and 3 ). Eukaryotes and bacteria have some lineage-specific architectures of PUA domain-containing proteins, such as the fusion with another RBD, SUI1, in eukaryotes and an unexpected, conserved fusion with glutamate kinase in bacteria. The TRAM domain is principally associated with the ancient MiaB-like enzymes (see above) and is also fused to the predominantly archaeo-eukaryotic TRM2-like methylases.
Major RBDs with α/β and α + β folds. The RRM, which is the most prevalent RBD in eukaryotes and is involved in all aspects of RNA metabolism, is a eukaryotic invention. At least 40 distinct orthologous groups of RRM-containing proteins appear to have emerged prior to the radiation of the eukaryotic crown group and, additionally, several more orthologous groups are confined to animals, plants or fungi. This explosive diversification of the RRM domain is surprising given the absence of this domain in archaea and bacteria, except for a few occurrences, which probably are horizontal transfers from eukaryotes. The RRM domain belongs to an ancient fold of nucleic acid-binding domains, which is present, for example, in ribosomal protein S6 ( 75 ) and also in the catalytic domains of a variety of enzymes, including RNA and DNA polymerases and Type II PSUS ( 55 ) (L.Aravind, unpublished data). It appears most likely that the RRM domain proper has been derived from an S6-like ancestor at an early stage of eukaryotic evolution.
Several other α/β and α + β domains, such as KH, dsRBD and THUMP (Table 1 ), have ancient representatives among ribosomal proteins or RBDs of conserved RNA-modifying enzymes. The lineage-specific orthologous groups of proteins containing these domains appear to have evolved through duplication and diversification of these ancient lineages. The TGS domain and the S4 domain that have a distinct α + β fold, called the α-L fold ( 169 ), appear to have diverged from a common ancestor and become distinct lineages prior to LUCA.
All α-helical domains. A distinct version of the helix–hairpin–helix (HhH) domain, which is typified by the RBD of ribosomal proteins S13/S18, is ubiquitous in all three primary kingdoms and may represent one of the most ancient lineages of the HhH domains ( 170 ). This domain was subsequently sporadically recruited to RNA metabolism, e.g. in the NusA and Tex-Spt6 families, but is far more prevalent in DNA-binding contexts. Thus, this might be another case of an ancient RBD, which diversified extensively only after recruitment for DNA binding.
The PIN domain is another predominantly α-helical domain found in proteins, which, in eukaryotes, are associated with PTGR and RNA degradation ( 136 , 171 ). Stand-alone PIN-domain proteins are widespread across all three primary kingdoms, with distinct architectures in the form of fusions with PilT and PhoH ATPases conserved, respectively, in archaea and in bacteria. A protein containing a PIN domain and a Zn-ribbon domain (human ART-4 orthologs) is conserved in the archaeo-eukaryotic lineage, whereas eukaryotes additionally have a unique architecture of PIN fused to RNase II and TPR repeats. These domain fusions suggest that PIN domains perform a wide range of functions, and experimental analysis of PIN‐domain proteins might unravel new facets of RNA metabolism. An enigmatic aspect of the evolution of the PIN domains is the expansion of the stand-alone versions of these domains in archaea, such as Archaeoglobus and Aeropyrum , and bacteria, such as Mycobacterium and Synechocystis . These PIN domains potentially might be involved in some unusual regulatory mechanism or in defense against RNA viruses. It has been hypothesized, on the basis of limited similarity to 3′‐5′ exonucleases of the RNase H fold, that PIN domains, particularly those involved in RNA degradation in eukaryotes, might have exonuclease activity ( 171 ). However, the proposed catalytic residues are not conserved in all PIN domains and a nuclease activity appears unlikely at least for the expanded prokaryotic forms.
The translin domain is an α-helical RBD that is found in a single copy in archaea and in two copies in eukaryotes. The eukaryotic translin protein might be part of a cytoplasmic RNP complex that mediates localization or tethering of mRNAs ( 172 ). Given the conservation of this protein in archaea, it seems that these RNP complexes have an ancient function in maintaining RNA stability. As discussed above, several α‐helical superstructure-forming domains, such as PUM, HAT [a specific version of the TPR repeat ( 173 )] and NIC, have been recruited for functions related to RNA metabolism in eukaryotes.
Metal-chelating domains. Of the large number of mobile metal-chelating domains that are utilized in RNA metabolism, only the Zn-ribbon ( 44 ) (ZNR) is of ancient provenance. The ZNR is a four-stranded domain stabilized by a metal atom typically chelated by four cysteine side chains (sometimes replaced by histidines). The ZNRs function as RBDs and DNA-binding domains and as cofactors in redox reactions, and are also involved in structural stabilization of various proteins ( 44 ). The ZNRs in MetRS, IleRS and ribosomal protein S14 are traceable to LUCA. Several ZNRs in translation-associated proteins, such as L40A, L36AE, S27, eIF5 and eIF5β, are conserved throughout the archaeo-eukaryotic lineage, whereas many others are specific to archaea and some to bacteria. This is indicative of massive recruitment of ZNRs during the emergence of the archaeal clade, which might correlate with the iron respiration typical of archaea ( 174 ).
The Zn-chelating RBDs that evolved in eukaryotes include the Zn-knuckle with a C2HC pattern of metal ligands, the CCCH domain (named after its conserved chelating cysteines), the little finger with a C4 metal-binding pattern and characteristic conserved tryptophan, the LRP1 finger and the classic C2H2 Zn-finger. There are approximately 12 orthologous groups of proteins containing Zn-knuckles, 13 groups of proteins with CCCH domains and three groups with Little Fingers that are conserved throughout the eukaryotic crown group. All these domains are highly mobile and several lineage-specific fusions of ancient or recently derived proteins to these domains were detected. This suggests a burst of proliferation in early eukaryotes resulting in the establishment of the major orthologous groups, followed by sporadic duplications in individual lineages.
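A ligand pattern such as C2HC can be used as a crude first-pass filter for candidate domains. The Python sketch below scans a protein sequence for the commonly cited Zn-knuckle spacing C-X2-C-X4-H-X4-C. This spacing is an assumption added here, since only the C2HC ligand composition is stated above; the function name and the test peptides are likewise invented for illustration. A match on ligand spacing alone does not establish the presence of a genuine domain, which in practice requires profile-based sequence analysis.

```python
import re

# Assumed consensus spacing for the C2HC Zn-knuckle (C-X2-C-X4-H-X4-C).
# The spacing is not given in the text; it is the commonly cited form.
ZN_KNUCKLE = re.compile(r"C.{2}C.{4}H.{4}C")

def find_zn_knuckle_candidates(seq):
    """Return (start, matched_segment) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in ZN_KNUCKLE.finditer(seq.upper())]

# Synthetic peptides, invented for illustration only.
hit = find_zn_knuckle_candidates("MKRCYNCGKPGHFARDCPS")   # all four ligands present
miss = find_zn_knuckle_candidates("MKRCYNCGKPGHFARDPS")   # final cysteine missing
```

Here `hit` contains a single candidate starting at position 3 (0-based), and `miss` is empty because the fourth ligand is absent.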
The LRP1 finger is a previously undetected domain that we identified as part of this study. LRP1 has a C6H ligand pattern, which suggests chelation of two metal ions. In animals, this domain is fused to the dihydrouridine synthase Dus1p (Fig. 3 ), whereas in plants, it has undergone a lineage-specific expansion, with at least 10 stand-alone members, including the namesake LRP1 protein ( 175 ). The classic C2H2 Zn-finger is typically associated with DNA binding in eukaryotes and is part of numerous transcription factors and chromatin-associated proteins. However, several members of this family are associated with known or predicted RBDs, e.g. in the experimentally confirmed RNA-binding proteins TFIIIA and dsRBP-Zfa (JAZ) ( 176 – 178 ). No distinct sequence features or specific phylogenetic relationships of the RNA-binding versions of this domain have been detected so far, which makes it impossible to predict the fraction of C2H2 fingers in eukaryotic proteomes that have RNA-related functions. We only documented those occurrences where the evidence was sufficiently clear from either experimental data or association with other specific RBDs. This set is therefore likely to represent a lower bound on the number of C2H2 fingers involved in RNA metabolism.
Evolutionary history of RNA metabolism systems and reconstruction of their ancestral states
Analysis of evolution of individual domain families, a summary of which is presented above, provided a means of reconstructing the evolutionary history and probable ancestral states of the numerous functional systems, pathways and protein complexes that comprise RNA metabolism. We summarize below the results of this reconstruction, which is based on the data gathered for principal conserved domains involved in RNA metabolism. Figure 5 is a Venn diagram that shows the numbers of conserved orthologous groups of proteins shared by various lineages across the entire phylogenetic spectrum that we sampled, for various functional systems.
The ancient core: translation, transcription and RNA modification. Comparative genomics showed that the basic translation apparatus contains the largest number of (nearly) universally conserved proteins. The set of translation-associated proteins whose origin is traceable to LUCA and possibly beyond includes 15 proteins associated with the small subunit of the ribosome, 18 proteins associated with the large subunit, nine class I aaRS, seven class II aaRS, seven GTPases associated with various aspects of translation, and at least two other translation factors. Other ancient proteins associated with translation are the glutamate (aspartate) amidating enzyme subunits, which are necessary for glutamine (and in some cases, asparagine) incorporation into proteins in most bacteria and archaea ( 179 ), the signal recognition particle GTPases that form the link between translation and secretion, possibly a SFII helicase associated with translation regulation or initiation, and a variety of RNA-modifying enzymes (Table 2 ). The modification enzymes that could be confidently traced back to LUCA include two distinct classes of methyltransferases with six to seven representatives altogether, two classes of pseudouridine synthases, and enzymes involved in the synthesis of thiouridine and thioadenine derivatives and 7-deazaguanosines. Thus, LUCA possessed an abbreviated protein core of the modern ribosome and the basic repertoire of accessory proteins required for translation. From this pivotal point, it is possible both to track back the early, pre-LUCA stages in the evolution of RNA metabolism and to examine its elaborations in the major clades of life.
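The logic of tracing a protein family to LUCA from its phyletic pattern can be sketched with a simple Dollo-type rule: a single gain is placed at the last common ancestor of all lineages that carry the family, and absences are explained by loss. The Python sketch below is illustrative only. It assumes a three-clade topology (bacteria versus an archaeo-eukaryotic clade), ignores horizontal transfer, and uses simplified presence patterns paraphrased from statements made elsewhere in this article; the function names are our own.

```python
# Assumed three-clade topology: bacteria versus an archaeo-eukaryotic clade.
TREE = ("Bacteria", ("Archaea", "Eukaryota"))

def leaves(node):
    """Return the set of leaf names under a node (a name or a tuple of children)."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result

def dollo_origin(node, present):
    """Return the smallest clade (as a frozenset of leaves) holding all carriers."""
    present = set(present)
    if not present or not present <= leaves(node):
        raise ValueError("pattern must be a non-empty subset of the tree's leaves")
    if not isinstance(node, str):
        for child in node:
            if present <= leaves(child):
                return dollo_origin(child, present)
    return frozenset(leaves(node))

LUCA = frozenset(leaves(TREE))

# Simplified presence patterns, paraphrased from the text for illustration.
patterns = {
    "Macro": {"Bacteria", "Archaea", "Eukaryota"},
    "L21E": {"Archaea", "Eukaryota"},
    "RRM": {"Eukaryota"},
    "MiaA": {"Bacteria", "Eukaryota"},
}
origins = {name: dollo_origin(TREE, p) for name, p in patterns.items()}
traced_to_luca = sorted(name for name, clade in origins.items() if clade == LUCA)
```

With these inputs, Macro and MiaA both map to the root. The MiaA result shows the limitation of the naive rule: this enzyme is interpreted in this article as a bacterial innovation acquired by eukaryotes from the promitochondrial endosymbiont, so phyletic patterns have to be weighed against evidence of horizontal transfer before a family is assigned to LUCA.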
As pointed out above for individual domains, many components of the ribosome, translation factors and RBDs of RNA-modifying enzymes, which are traceable to LUCA, descended from even more ancient common ancestors. Numerous ribosomal proteins and other translation/modification-associated RBDs in the ancestral set belong to a small number of folds, such as OB-fold, SH3-like barrel and the α-L fold. Thus, prior to the divergence of the S1, N-OB and EMAP domains, or the KOW, SM and L2 domains, or the TGS and S4 domains, their respective ancestors probably functioned as RBDs with generic properties. The same logic applies to enzymes of RNA metabolism. The case is particularly clear for aaRS, which are indispensable components of the modern translation machinery responsible for the specificity and efficacy of amino acid incorporation into protein. Since most of class I and class II aaRS were already present in LUCA, there is obviously a history of pre-LUCA duplications in each of the classes ( 102 ). The ancestral aaRS of each class, which functioned in the primitive translation system, most likely was a non-specific amino acid-activating enzyme, with the specificity determined by tRNAs themselves. This type of translation system appears to be a transition state between a primordial machinery based entirely on RNA catalysts and the modern, largely protein-based system. Furthermore, the catalytic domains of both classes of aaRS are homologous to certain other NTPases and nucleotidyl transferases, whose functions are unrelated to translation; some of these, for example, are enzymes of coenzyme biosynthesis, such as NAD synthase in the case of class I ( 102 ) and biotin synthase for class II ( 180 ). Thus, the progenitors of the two classes of aaRS, which evolved from within the primitive RNA world, probably were non-specific nucleotidyl transferases, which combined functions in translation with those in other branches of metabolism. 
Similarly, at this stage of evolution, the individual translation factors and RNA-modifying enzymes, such as methyltransferases, had probably not yet differentiated into their specific versions, but were represented by the corresponding ancestral forms, which functioned in multiple contexts with a low specificity.
Looking forward from LUCA, it is immediately apparent that several major additions to the translation apparatus and its accessories map to the point of divergence of the two principal branches of life, the bacterial and the archaeo-eukaryotic clades. Approximately 28 proteins were added to the ancestral ribosomal core in the archaeo-eukaryotic lineage and, conversely, 21 ribosomal proteins are specific to the bacterial lineage, which results in the profound differences in the ribosomal superstructure between the two clades. The translation termination factors and several initiation factors also were added to the conserved set as these major lineages diverged. Eukaryotes showed a further development in the complexity of the translation initiation system: several new translation regulators emerged in the eukaryotic lineage, some of which consist of the RRM domain or newly derived α-helical domains, such as NIC, MI and W2 ( 16 ), whereas others have new combinations of ancient RNA-binding and enzymatic domains, such as PUA, SUI1 and SFII helicases. The complexity of RNA modification also increased during the post-LUCA phase of evolution as a result of several duplications within various enzyme families and the origin of several new enzymes, such as dihydrouridine synthetase and MiaA (Figs 2 and 3 ). Most of the RNA modification enzyme superfamilies, in addition to the highly conserved groups of orthologs, include many smaller groups, which are restricted to a specific lineage or show a sporadic distribution (Figs 2 and 3 ). Thus, a subset of RNA modifications, while not universally essential, are likely to have specific adaptive value for particular organisms in their ecological niches. These adaptations might include tolerance to extreme environmental conditions, such as high temperature or osmolarity, or resistance to anti-translation antibiotics or particular xenobiotics. 
The relatively late emergence of many RNA modifications suggests that the RNA modification state in LUCA and especially at earlier stages of evolution was relatively simple and therefore these modifications might not have been a major factor in modulation of the catalytic activities of primordial ribozymes.
Several RNA-binding proteins contribute to transcription. The best-studied proteins in this category are the transcription elongation/antitermination factors that include the universally conserved NusG-Spt5p family of KOW-domain proteins ( 181 ). Bacteria additionally possess several distinct subunits of the transcription antitermination complex, including NusB, which contains the prototype of the α-helical NusB domain, ribosomal protein S4 and the S1 and KH domain-containing protein NusA ( 182 – 184 ). The functionally equivalent eukaryotic transcription elongation complex contains Spt6 ( 185 , 186 ), which is the ortholog of the bacterial Tex protein ( 187 ). Similarly to NusA, this protein contains an S1 domain and is likely to be the functional counterpart of NusA. In animals, this complex additionally contains the RRM-containing RD protein ( 188 ). The ancestral form of the transcription elongation/antitermination complex, which was present in LUCA, might have consisted of a single KOW-domain protein and perhaps the ribosomal protein S4. This was followed by accretion of additional subunits, at least in bacteria. Bacteria also evolved transcription antiterminators containing the α-helical AmiR domain that relieve specific mRNAs from termination in response to stimulation of specific signaling pathways that lie upstream of them (Table 1 ) ( 189 ). The corresponding additions in archaea, if any, remain unknown, but in eukaryotes, SPT6, apparently acquired from bacteria via horizontal transfer, was recruited to this complex, followed by other lineage-specific additions.
The archaeo-eukaryotic RNA polymerase E1 subunit containing the S1 domain and eukaryotic transcription factors EWS/TAF68 and TAF II 250 containing the Zn-knuckle domain are other transcription-related RNA-binding proteins. Fusion of the SAP domain with RBDs ( 143 ) suggests that eukaryotes might have as yet uncharacterized RNP complexes, which could couple nuclear RNA processing with transcription. Finally, in animals, several chromosomal RNAs, such as RoX1/2 and XIST, have been described that have a role in regulating chromosomal structure, and thereby transcription, on a global scale. A specific class of chromodomains typified by the MSL proteins ( 190 ) and other proteins, such as the SFII helicase Mle ( 122 ), interact with these RNA molecules.
Polyadenylation and capping. Polyadenylation occurs in all three primary kingdoms. Prokaryotic poly(A) tails are short (~30 nt) compared with the eukaryotic ones, which extend to several hundred nucleotides ( 191 ). Bacterial poly(A) polymerases also have CCA-adding activity and are often fused to HD or DHH phosphohydrolase domains ( 149 ). The eukaryotic poly(A) polymerases are only distantly related to the bacterial versions and, instead, are more closely related to the Trf4/5 family of eukaryotic DNA polymerases and archaeal CCA-adding enzymes ( 149 ), suggesting that these archaeal enzymes probably have a second function as poly(A) polymerases. In eukaryotes, the free 3′ end for the poly(A) polymerase is generated by a predicted nuclease of the metallo-β-lactamase fold, CPSF-I ( 192 – 194 ). This enzyme is conserved throughout the archaeo-eukaryotic lineage and is also present in many bacteria. Thus, LUCA probably had a polyadenylation system that consisted, at least, of a CPSF-I-like enzyme that cleaved the transcript and a polymerase β family nucleotidyltransferase that added the adenylates. The reasons for the rapid evolution of the poly(A) polymerases in each of the three primary kingdoms are unclear. It seems plausible that, in eukaryotes, the displacement of the CCA-adding function by a horizontally transferred bacterial enzyme resulted in the divergence of the poly(A) polymerase from the ancestral, bifunctional form seen in the archaea. Eukaryotes additionally recruited to the CPSF complex several new RNA-binding proteins containing eukaryote-specific domains, such as RRM, CCCH and Zn‐knuckle. Furthermore, RRM and NIC-domain-containing proteins were recruited to form a eukaryote-specific poly(A) tail-binding complex.
The cap is a unique structure present in eukaryotic mRNAs; the minimal form of the cap is synthesized through the following steps: (i) removal of the terminal phosphate of the triphosphate at the 5′ end of mRNA, (ii) guanylylation of the 5′ diphosphate and (iii) methylation of the guanine at the N-7 position ( 195 ). The first two steps are catalyzed by the capping enzyme, which consists of a triphosphatase and a nucleotidyltransferase, whereas the N-7 methylation is catalyzed by methylases of the Abd1p family ( 196 ). The enzymes that catalyze the latter two capping reactions appear to be conserved throughout the eukaryotes. The capping guanylyl transferase apparently was derived from the more ancient ATP-dependent DNA ligase ( 38 , 39 ), whereas the capping methylase probably evolved from within the vast small-molecule methylase class, rather than from the regular, monophyletic RNA N-methylases (see above). The capping triphosphatase, however, shows great variability among eukaryotes. Animals and plants share a triphosphatase of the tyrosine phosphatase superfamily that is fused to the N-terminus of the guanylyl transferase ( 197 , 198 ). The fungi and Plasmodium falciparum contain a distinct phosphoesterase of an all-β fold, which occurs as a stand-alone subunit and is also present in large DNA viruses, such as PBCV ( 199 ). The earlier branching trypanosomes have a phosphoesterase domain of the P-loop-containing adenylate kinase family fused to the N-terminus of the guanylyl transferase ( 144 ). This unusual diversification of the triphosphatase domain suggests that, whereas the capping methylase and guanylyl transferase were derived early in eukaryotic evolution, there was no specific triphosphatase at the corresponding stage of evolution. Instead, the triphosphatase reaction might have been performed by a non-specific phosphatase. Subsequently, in each lineage, an independent triphosphatase appears to have been recruited for this function. 
We found that the animal-specific CG6379 family of methylases of the FtsJ-like superfamily has a divergent, catalytically inactive version of the capping enzyme nucleotidyltransferase domain fused to the methylase domain. These RNA methylases might function as regulators of the capping process that bind the cap through the inactive capping enzyme domain.
The principal proteins of the nuclear and cytoplasmic cap-binding complexes, CBP80 and eIF4G, respectively, appear to have diverged from an NIC-domain-containing ancestor, which was probably the core subunit of the ancestral cap-binding complex ( 16 , 17 ). After the divergence of these central components, new subunits, such as CBP20 ( 200 ), a RRM domain protein, and eIF4E ( 201 ), appear to have been independently recruited to the respective complexes, at least prior to the divergence of the eukaryotic crown group. eIF4E also has a core RRM-like fold, although no sequence similarity to RRM domains is detectable; this domain might have been derived from a common precursor with the RRM.
Post-transcriptional regulatory mechanisms. Mechanisms of PTGR that act directly on the transcript and affect its stability or association with the ribosome are common in both bacteria and eukaryotes. At the core of these mechanisms are the ribonucleases that mediate RNA degradation; these enzymes are conserved in all three primary kingdoms ( 45 ). Eukaryotes evolved a specific elaboration of this system whereby a whole class of dedicated proteins and RNAs lend specificity to the degradation system with respect to the transcripts that are regulated ( 202 – 205 ). This phenomenon has been termed PTGS and, in many eukaryotes, depends on the amplification of small regulatory RNAs by an RNA-dependent RNA polymerase ( 153 – 156 ). Additionally, while distinct from the chromatin-level transcriptional silencing, the PTGS system appears to interact with it ( 133 , 206 ).
The most ancient PTGR systems consist of RNases and helicases that unwind RNA secondary structures to aid degradation or regulate translation (Fig. 6 ). Many, if not all, of the nucleases implicated in PTGR appear to be involved also in the processing of RNA precursors. The RNA degradation enzymes that can be traced back to LUCA are RNase HII and RNase PH, of which the former is responsible for the removal of the RNA primer during DNA replication and apparently has no direct role in PTGR. In contrast, RNase PH is one of the principal RNA degradation enzymes, along with RNase P. RNase P is present in all extant organisms, but its protein subunits are not homologous in bacteria and archaea-eukaryotes, which suggests that, in LUCA, RNase P existed as a pure ribozyme. RNase PH and the bacterial RNase P protein subunit have a common nucleic acid-binding domain of the S5 fold ( 207 , 208 ). This suggests an evolutionary scenario whereby the S5 domain was recruited by a common ribozyme ancestor of RNases PH and P and, during the subsequent evolution, the ribozyme was gradually replaced entirely by a protein catalytic scaffold in RNase PH-like enzymes, whereas RNase P retained the ribozyme and the RNA-binding subunit. This scenario implies that the protein subunit of the bacterial RNase P retains the ancestral state and probably has been displaced by unrelated proteins in the archaeo-eukaryotic lineage. The primitive RNA degradation system of LUCA might also have included a LHR-Ski2p family helicase and, possibly, a generic thermonuclease-like protein of the OB fold and RNA-binding PIN domains. Another component that might have been represented in LUCA is the SM domain. In prokaryotes, SM domain-containing proteins bind numerous specialized small RNAs, such as the DsrA/RprA RNA, and regulate mRNA stability and association with the ribosome ( 209 ). It remains to be seen if any of the small RNAs bound by the SM proteins possess ribozyme activities.
With the separation of the archaeo-eukaryotic and bacterial lineages, several distinct superfamilies of nucleases were independently recruited in each of them for RNA degradation and processing [see the recent detailed evolutionary classification of RNases ( 45 )]. The most important innovations in bacteria included 3′→5′ exoRNases, RNase E/G, RNase II and RNase III. In the archaeo-eukaryotic lineage, a 3′→5′ RNA degradation and processing complex, the exosome, has evolved. The eukaryotic exosome has been extensively characterized experimentally ( 116 , 210 , 211 ), whereas the existence of the archaeal counterpart and, by inference, the presence of the exosome in the common ancestor of archaea and eukaryotes, have been postulated through comparative analysis of archaeal genomes. Genes for predicted exosomal components form some of the most conspicuously conserved gene strings (probable operons) in archaea ( 212 ). The exosome consists of Rrp41p- and Rrp42p-like RNase PH family nucleases, RNA-binding proteins containing S1 domains combined with KH or Zn‐ribbon domains, such as Rrp4p and Csl4p, PIN domain proteins, a LHR/Ski2p-like helicase and, possibly, also RNase P as predicted during archaeal genome analysis.
The archaea also evolved a distinct RNase of the DHH hydrolase family, which contains S1 and ZnR domains and, as suggested by the comparative genome analysis, might interact with the exosome ( 45 ). In addition to these conserved complexes involved in RNA degradation, other RNA-binding complexes, which might contribute to PTGR by affecting mRNA stability and association with the ribosome, evolved after the split of the primary lineages. Cold shock proteins (CspA) containing S1-like domains are among such bacterial regulatory RNA-binding proteins ( 213 ). Additionally, proteins such as Hsp15, with a stand-alone S4 domain, which bind RNA and regulate translation, point to the existence of diverse PTGR systems in bacteria ( 214 ). Some of the RNA-binding proteins predicted during this study, e.g. a protein that combines a PIN and a TRAM domain, could provide leads for discovery and investigation of poorly understood PTGR systems in prokaryotes (Fig. 6 ).
The emergence of eukaryotes was accompanied by several major elaborations of the PTGR systems, which involved several types of evolutionary processes. One of the major factors was the merger of the archaeal and bacterial inheritances that gave rise to more complex forms of ancient PTGR systems. A case in point are nucleases, such as 3′→5′ exoRNases (e.g. Rrp6p) and RNase II (e.g. Rrp44p), which apparently were acquired by eukaryotes from bacteria, probably via the pro-mitochondrial endosymbiont, and added to the exosome whose core was inherited from the archaeo-eukaryotic ancestor. The large-scale, intra-familial duplication, e.g. among helicases such as Mtr4p and Ski2p (Fig. 4 ), was the second major evolutionary phenomenon that contributed to the elaboration of the eukaryotic exosome complex. The third trend in the evolution of these complexes was the recruitment of pan-eukaryotic, superstructure-forming domains, such as WD40 and TPR, which probably provided scaffolding for the enlarged eukaryotic complexes.
The eukaryote-specific mRNA degradation system, which destroys both nonsense codon-containing (nonsense-mediated decay or NMD) and normal mRNAs, appears to have been assembled, in part, from various translation-related components. Among these components, NMD3p appears to have emerged in the archaeo-eukaryotic lineage and functions in ribosomal assembly ( 215 , 216 ). The other components of this system are eukaryote-specific innovations that mimic the set of similar components that have been added to the exosome. NMD2p ( 217 ) contains an NIC domain and shares a common ancestor with the translation factor eIF4G. NMD4p and its metazoan equivalents, such as SMG6 ( 171 , 218 ), contain PIN domains and might ultimately have descended from the stand-alone PIN-domain proteins detected in archaea. NMD5p is a HEAT repeat protein and UPF1p is a SFII RNA helicase ( 217 ). The poly(A)-degrading complex also appears to have emerged prior to the divergence of the major eukaryotic lineages and contains at least three conserved nuclease components, namely Pan1p, Pop2p and DAN-like nucleases, which belong to the 3′→5′ exonuclease family, and CCR4, which is a derivative of the DNase I superfamily ( 45 , 219 , 220 ).
The eukaryote-specific PTGS system is present throughout the crown group, at least. Recent experimental results combined with computational predictions based on phyletic patterns resulted in the identification of a complex PTGS apparatus that can be traced back to the common ancestor of the eukaryotic crown group. The core of this system includes a SFII helicase–RNaseIII fusion protein of the carpel factory (CAF, also called DICER) family, which generates small, 21–25 nt RNAs [small interfering RNAs (siRNAs)] used as guides to promote degradation of specific RNAs by a nuclease complex ( 133 , 221 – 223 ). Additionally, the DICER helicase–nuclease appears to be involved in the processing of numerous other small regulatory RNAs, including the stRNAs, such as Lin-4 and Let-7, which regulate specific transcripts through antisense interactions ( 224 ). A LIN-28-like RNA-binding protein containing an S1-like domain and homologous to bacterial Csp ( 225 ), which binds these small RNAs, probably is another ancestral component of the PTGS system. The siRNAs function as primers in an amplificatory degradative PCR-like reaction that generates dsRNA and is catalyzed by a specialized RNA-dependent RNA polymerase that is thus far traceable to the base of the eukaryotic crown group ( 153 – 156 ). Proteins of the PIWI-argonaute family, which combine PIWI and PAZ domains ( 14 ), also probably participated in the ancestral PTGS as siRNA-binding components ( 226 ). The actual RNA destruction apparently depends on several other components, including a RecQ-like helicase ( 127 ) and RNase D family 3′→5′ nucleases, such as Mut-7 and Egl ( 227 ). From the time of its emergence, the PTGS system probably closely interacted with the more generic RNA degradation systems, including the exosome, NMD and the poly(A)-tail degradation system.
A substantial part of the PTGS system, including the progenitor of most of the 3′→5′ exonucleases, RNase III, the RecQ-like helicase and the RNA-binding CSP proteins, is part of the bacterial inheritance of the eukaryotes. The 3′→5′ exonucleases and RNase III, after their acquisition by eukaryotes, each underwent a series of duplications to give rise to several distinct groups of orthologs and also formed new architectures through domain fusions. The Mut-7 proteins contain a module, C-terminal to the 3′→5′ exonuclease domain, which consists of a unique α/β domain fused to a Zn-ribbon, which might bind RNA ( 45 ). This Mut-7C module appears as a stand-alone protein in archaea and bacteria and potentially might interact with a 3′→5′ nuclease already in prokaryotes, followed by the fusion in eukaryotes. The Argonaute-like proteins are represented in archaea and Aquifex; one of the eukaryotic members of this family has been described as translation initiation factor eIF2C ( 228 ). These ancient versions contain only a PIWI domain and their phyletic pattern is typical of translation machinery components, suggesting that their original function was related to translation. Prior to the divergence of the eukaryotic crown group, the PIWI domain combined with a predicted RBD, PAZ, which is also fused to the helicase and nuclease domains in the CAF family proteins (Fig. 6 ). The PAZ domain, which might bind the small RNAs that are generated as part of PTGS, evolved in eukaryotes with the emergence of this system.
Within the crown group, PTGS shows considerable variability, with extensive gene loss completely or partially eliminating the system in various lineages. In yeast S.cerevisiae , the entire system appears to have been lost ( 133 ), whereas in Drosophila and humans, the apparent loss is restricted to the RdRp and the Mut-7 nuclease. However, the detection of a functional PTGS system in Drosophila ( 229 ) suggests that the role of the RNA polymerase may have been taken over by other enzymes, such as the DNA-dependent RNA polymerase or a reverse transcriptase-like enzyme, which are known to possess similar activities in vitro . In contrast, plants and Dictyostelium show expansions of the RdRp family, with at least six and four distinct members, respectively. Furthermore, the architectures of the proteins involved in PTGS show lineage-specific variability, e.g. fusion of RRM domains to the RdRp in plants and a duplication of the RdRp within a single protein in C.elegans . Several eukaryotic proteins were identified that, on the basis of their domain architectures, seem to be likely candidates for participation in PTGS. Examples include a nuclease of the RNase II family that is fused to a Sen1p-like SFI helicase in humans and a family of plant 3′→5′ exonucleases fused to the RRM domain ( 45 ). Analysis of phyletic patterns and domain architectures also resulted in the identification of several novel candidates, which could be parts of a more extended PTGS network ( 133 ) (Fig. 6 ). The most notable of these include an orthologous group of predicted adenine methylases (the CG14906 group) related to the Kar4-Ime-4 family of mRNA methylases (Fig. 2 ). Another group of predicted RNA methylases with a similar phyletic pattern are the Corymbosa2/Hen1 family of methylases that are predicted to be dsRNA methylases (see above). These enzymes could specifically regulate the stability of dsRNA regions formed by pairing of mRNAs with anti-sense RNAs (Figs 2 and 6 ). 
Homologs of the DNA repair protein AlkB fused to the RRM domain might be involved in RNA modification (Fig. 6 ). It has been predicted that this subfamily of AlkB proteins, similarly to their homologs involved in DNA repair, possess iron- and 2-oxoglutarate-dependent oxidative demethylating activity ( 157 ). Consistent with this prediction, these AlkB homologs, in addition to the RRM domain fusion, also show fusions to a distinct family of methylases. Taken together with the widespread distribution of these enzymes in the crown group, with the exception of S.cerevisiae [a phyletic pattern typical of other PTGS components ( 133 )], these observations suggest that an mRNA methylation–demethylation circuit might be another component of PTGS.
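The use of phyletic patterns to nominate candidate PTGS components can be illustrated with a minimal Python sketch. It is not the procedure used in this study, which relied on curated orthologous groups and domain architectures. The species list, the group names and all presence/absence values below are hypothetical, chosen only to show how a profile such as 'present in the crown group but absent in S.cerevisiae' can be matched.

```python
# Toy phyletic-profile matching; every name and value here is hypothetical.
SPECIES = ["human", "Drosophila", "C.elegans", "Arabidopsis", "S.pombe", "S.cerevisiae"]

# Reference profile: present (1) in all listed species except S.cerevisiae (0).
REFERENCE = (1, 1, 1, 1, 1, 0)

# Hypothetical presence/absence calls for three made-up orthologous groups.
PROFILES = {
    "group_A": (1, 1, 1, 1, 1, 0),
    "group_B": (1, 1, 1, 1, 1, 1),
    "group_C": (1, 0, 1, 1, 1, 0),
}

def hamming(p, q):
    """Number of species in which two presence/absence profiles differ."""
    return sum(a != b for a, b in zip(p, q))

def candidates(profiles, reference, max_mismatch=0):
    """Names of groups whose profile differs from the reference in at most max_mismatch species."""
    return sorted(name for name, prof in profiles.items()
                  if hamming(prof, reference) <= max_mismatch)

exact = candidates(PROFILES, REFERENCE)
relaxed = candidates(PROFILES, REFERENCE, max_mismatch=1)
print(exact)    # groups matching the reference profile exactly
print(relaxed)  # groups allowing one lineage-specific difference
```

Allowing a small number of mismatches is one simple way to accommodate lineage-specific losses of the kind described above for Drosophila and humans. A pattern match of this sort only nominates candidates and says nothing about function by itself.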
Finally, numerous other uncharacterized eukaryotic RNA-binding proteins were predicted, which could point to still unknown PTGR systems and complexes. For example, Ro protein, which shares the RNA-binding ROT domain with telomerase subunits ( 230 ), binds small RNAs called Y RNAs in animals and the resulting RNPs might be involved in several poorly characterized regulatory functions, such as RNA quality control ( 231 ). Ro protein homologs are also present in certain bacteria, such as Deinococcus and Streptomyces , probably as a result of horizontal gene transfer from eukaryotes and it has been shown that, in Deinococcus , the Ro homolog binds several small RNAs and belongs to a PTGR system that regulates radiation resistance ( 232 ).
RNA processing and splicing. In both eukaryotes and prokaryotes, rRNAs and tRNAs are released from larger precursors through RNA processing events mediated by the same nucleases that are involved in RNA degradation, such as RNase PH and RNase P. As discussed above, the presence of distinct nuclease families in the archaeo-eukaryotic and bacterial lineages suggests that many of these processing systems evolved only after the separation of these primary lineages, with the eukaryotic processing machinery combining the archaeal and bacterial inheritances. Archaea-eukaryotes evolved a specific system of tRNA processing, which removes an intron present in the middle of the tRNA precursor ( 233 ). The tRNA splicing endonuclease is a distinct member of the restriction endonuclease fold ( 234 ), which might have been derived from an ancient, restriction enzyme-like genomic parasite. This is consistent with the mobile parasitic behavior of several members of the restriction endonuclease superfamily ( 235 , 236 ). In eukaryotes, this enzyme underwent a tetraplication followed by inactivation of two of the copies, resulting in a heterotetrameric functional complex ( 21 , 45 ). The U3 RNP complex is involved in rRNA processing, which involves chiefly rRNA modifications guided by the associated small RNAs ( 237 ). This complex consists of, at least, Imp4p, Prp31p and the methylase fibrillarin and evolved in the common ancestor of archaea and eukaryotes; archaeal genome comparisons suggest that it might functionally interact with the exosome ( 212 ). In eukaryotes, some of the components of this complex, e.g. Prp31p ( 238 ), appear to have been additionally recruited for pre-mRNA splicing.
The most distinctive RNA-processing pathway is mRNA splicing, which, in its entirety, is seen only in eukaryotes (Fig. 7 ). Eukaryotic spliceosomal mRNA introns share with Type II self-splicing introns the intermediate step of lariat formation. This observation prompted the hypothesis that Type II introns, which existed as parasitic retroelements in the genomes of the organellar precursors, invaded the eukaryotic nucleus, giving rise to the spliceosomal introns ( 239 – 241 ). The analysis of the spliceosomal components that we present here suggests that a version of this hypothesis is plausible and argues against the competing ‘introns early’ hypothesis, which postulates extensive presence of introns in LUCA ( 242 – 244 ).
The eukaryotic splicing apparatus consists of five principal snRNP particles, U1, U2, U4, U5 and U6 ( 245 – 248 ), which contain their namesake small RNAs (Fig. 7 ). Many specialized spliceosomal particles, especially in multicellular eukaryotes, contain alternative counterparts of these main U RNAs and are dedicated to the processing of special (non-canonical) splice junctions ( 249 ). The components that are common to all five spliceosomal U RNP particles can be traced back to the common ancestor of the eukaryotic crown group, suggesting that the core of the spliceosomal machinery was firmly established by the time the crown-group eukaryotes radiated. Examination of the inferred domain composition of the ancestral spliceosomal machinery shows marked enrichment of several conserved domains (Fig. 7 ). These include SFII helicases and RBDs, namely RRM, SM, Zn-knuckle, CCCH, G-patch, SWAP and PWI. Thus, the spliceosomal particles are largely made up of paralogous forms of a relatively small set of domains. It appears that the ancestral spliceosome was assembled mainly from eukaryote-specific domains and its elaboration resulting in the origin of the five principal spliceosomal particles had occurred largely through the proliferation and shuffling of just these few domains that, in the early spliceosome, were represented by their common ancestors. Common to all these U snRNPs are small, stand-alone SM proteins, which belong to a class of RNA-binding SH3-fold β-barrel domains (see above); this RBD probably bound small RNAs already at a pre-LUCA stage of evolution.
The expansion of the SM family from a single ancestral form found in archaea to the numerous lineages seen in eukaryotes suggests that the SM protein formed the ancestral core of the splicing complex by acting as a protein cofactor for the self-splicing Type II introns that invaded eukaryotes. This could have increased the efficiency of splicing of the Type II introns and diminished their deleterious effects, thereby contributing to their spread. At this point, proteins containing some of the newly emerged eukaryote-specific domains, such as RRM, Zn-knuckle and CCCH, and RNA helicases of the eIF4A-DEAD and Maleless families, might have been added to the set of protein cofactors of the Type II introns. Additionally, some proteins that were initially associated with exosomal function, such as helicases of the Ski2p-Lhr family, also might have been recruited to the emerging spliceosome. The next stage of evolution probably involved partial degeneration of the introns themselves and the emergence of distinct intron fragments as precursors of the U RNAs, which possess ribozyme activity and appear to be the primary catalysts of splicing ( 250 ). Simultaneous evolution of eukaryotic chromatin allowed the major increase in genome size in eukaryotes and thus provided the niche for selectively neutral or advantageous (although the nature of these potential advantages is not clear) expansion of the introns throughout eukaryotic evolution. This expansion probably was accompanied by a feedback loop that selected for the proliferation and diversification of the original protein cofactors recruited for splicing, causing an explosive expansion of RRM, SFII helicase and other eukaryote-specific domains involved in splicing.
Genome sequences of early-branching eukaryotes might provide the details of the actual temporal order of the duplications in the evolution of the splicing system, but some inferences on the relative branching pattern already can be drawn from the currently available eukaryotic genome sequences. At least 70–80 orthologous lineages of proteins containing one or more of the common RBDs mentioned above, 15 or more lineages of SFII helicases ( 249 ), and several other single-copy proteins with no mobile domains, such as PRP38 or Snu66, are traceable to the ancestor of the crown group. Among the RRM-domain proteins, the most common architectures include the single- and multi-RRM proteins, followed by fusions to the G‐patch, CCCH and Zn-knuckle domains (Fig. 7 ). From this ancestral state that existed prior to the radiation of the crown group, several lineage-specific developments ensued, which correlate with the origin of alternative splicing in multicellular eukaryotes. The common ancestor of animals apparently had approximately 40 orthologous groups of splicing-related proteins that evolved after the divergence of the major crown group lineages (Figs 5 and 7 ). However, the most striking development is seen in vertebrates, which have at least 30 distinct RRM-domain proteins with no orthologs in arthropods or nematodes and several vertebrate-specific expansions within other ancient ortholog groups of RRM proteins. This diversity of RRM proteins correlates with and is probably functionally linked to the extensive utilization of alternative splicing as a means of generating protein diversity ( 251 , 252 ). A similar situation seems to exist in plants because over 50 plant-specific RRM proteins were detected in Arabidopsis ; however, the exact point of origin of this diversity is currently unclear, given the absence of other plant genomes. 
In contrast, in yeast, the U2 and, to a lesser extent, U5 snRNPs show extensive degeneration, which correlates with the near-complete elimination of spliceosomal introns ( 133 , 253 ).
Links between molecular chaperones, protein degradation, the ubiquitin system and RNA metabolism. Several deep evolutionary links seem to exist between RNA metabolism, protein degradation and ubiquitin signaling pathways, suggesting that these cellular systems have a long history of interactions. The earliest of these links appears to be the potential functional coupling of the RNA-degrading exosome, the protein-degrading proteasome and co-translational protein folding facilitated by prefoldins, as indicated by the juxtaposition of the corresponding genes within a superoperon, which is conserved in most archaeal genomes ( 212 ). Such functional coupling can be rationalized in terms of coupled pre- and post-translational regulation of the protein level through mRNA and protein stability, respectively. This type of interaction appears to have extended into eukaryotes as suggested, in particular, by the presence of the shared Sec63 domain in chaperones involved in endoplasmic protein translocation and degradation and the exosome/splicing-related helicase Brr2p ( 118 ), and by the presence of the Little Finger domain in the animal versions of the Npl4p (suppressor of Sec63p) protein (Fig. 8 ). A pan-eukaryotic SPBC17G9.05-like protein containing a cyclophilin-like PPIase fused to a RRM domain (with an additional Zn-knuckle in plants) might be another component of such a system, through coupling protein unfolding to RNA metabolism. Furthermore, animals possess another distinct cyclophilin–RRM fusion (Fig. 8 ) that might also perform a similar function.
Another ancient link between RNA metabolism and protein degradation is suggested by the domain architecture of the prokaryotic protease HypF, which is involved in hydrogenase maturation and assembly. The HypF protein consists of a dsRNA-binding Sua5 domain ( 254 ), an OSGP metallo-protease domain of the Hsp70 fold ( 38 ) and an acyl phosphatase domain; this domain architecture is suggestive of complex regulation of specific protein processing events through interaction with RNA (Fig. 8 ).
In eukaryotes, the elaborate ubiquitin signaling system has a central role in targeting proteins for degradation ( 255 , 256 ). Ubiquitin also acts as a signaling moiety to direct specific protein–protein interactions. A number of domain architectures seen among eukaryotic proteins involved in RNA metabolism suggest close interactions with the ubiquitin system. These include numerous fusions of RING-finger domains, which function as ubiquitin E3 ligases, with RBDs, such as KH, Little Finger and CCCH; these domain combinations are present in MDM2, Makorin and several other proteins (Fig. 8 ). These proteins might function as E3 ligases that specifically tag certain splicing or other RNP complexes with ubiquitin or ubiquitin-like molecules and thereby target them for degradation or regulate their assembly. Furthermore, fusions of other domains involved in ubiquitin signaling, such as ubiquitin itself, UBA, F-box and CUE, with various domains involved in RNA metabolism are also seen in several proteins, such as PRP21, TAF II 250 and TAB2 (Fig. 8 ). These architectures are again suggestive of a role in bringing the ubiquitination machinery to the RNA-binding complexes in which these proteins reside.
A protease of the ubiquitin C-terminal hydrolase family, Sad1p, is required for the assembly of the U4/U6.U5 tri-snRNP and might act as a protease in processing of some of the subunits of this complex ( 257 , 258 ). Eukaryotes have re-used inactive versions of a predicted ancient hydrolytic enzyme, the JAB domain, which is also found in several components of the proteasome/signalosome, in two distinct RNA-associated complexes ( 259 , 260 ). Specifically, the JAB-domain protein PRP8 is a subunit of the U5 and U6 snRNP complexes, and the translation-related eIF3 complex also contains several JAB-domain subunits. In the context of potential links between RNA and protein degradation, among the most interesting architectures are the fusions of the Little Finger domain with three distinct protease domains (Fig. 8 ), namely an inactive version of the Otu-A20 family protease ( 261 ) in TRABID, a calpain protease in Small optic lobes ( 262 ), and a metalloprotease domain of the WSS1p family in Arabidopsis and Trypanosoma F14J16.17. This, taken together with the fusion of the Little Finger with E3 ubiquitin ligases in certain proteins, such as MDM2 ( 263 ), suggests that this domain might provide a specific link between RNA metabolism and protein degradation. The exact nature of this connection is unclear, but it seems plausible that still uncharacterized, small RNAs regulate the function of the protein degradation complexes. Alternatively or additionally, the Little Finger might function as a tether to target proteolytic machinery to proteins associated with specific cellular RNAs. Most of these architectures are restricted to a few eukaryotic lineages, suggesting the existence of numerous lineage-specific mechanisms for modulation of RNA metabolism.
The significance of the phyletic patterns of conserved proteins in RNA metabolism systems for inferring evolutionary relationships between major taxa. Examination of the phyletic patterns of the conserved proteins in the RNA metabolism systems potentially could help in testing phylogenetic hypotheses regarding the relationships between major lineages. At the deepest level, the presence of a distinct archaeo-eukaryotic lineage is supported by approximately 60 conserved orthologous groups that are shared exclusively by archaea and eukaryotes. This is contrasted by a mere 20 or so orthologous groups common exclusively to archaea and bacteria and approximately 39 bacterial–eukaryotic groups. This pattern is consistent with the domain distribution data and supports a model whereby eukaryotes are a chimeric lineage, which combines archaeal and bacterial inheritances. This massive chimerism in the eukaryotic inheritance most likely reflects the endosymbiotic interaction between the pro-mitochondrial α-proteobacterium and an archaeon. The evidence for the presence of an ancestral mitochondrion in all, including the earliest branching eukaryotes ( 264 – 267 ), and the extensive bacterial contribution that can be seen in the available genomic data from early-branching eukaryotes ( 268 – 270 ) supports this model. There has been a smaller, but noticeable gene flow between the two prokaryotic superkingdoms, apparently driven by the regular process of horizontal gene transfer rather than large-scale chimerism; however, in some cases, such as the bacterial hyperthermophiles, this gene transfer probably made a much greater contribution ( 271 , 272 ).
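The kind of tally behind these numbers can be sketched in a few lines of Python. The example below assumes a hypothetical table that maps each orthologous group to the primary kingdoms in which it occurs; the group identifiers and their assignments are invented for illustration and are not data from this study. It counts only the groups found in exactly two kingdoms.

```python
from collections import Counter

# Hypothetical orthologous groups mapped to the kingdoms in which they occur.
GROUPS = {
    "og1": {"archaea", "eukaryotes"},
    "og2": {"archaea", "eukaryotes"},
    "og3": {"bacteria", "eukaryotes"},
    "og4": {"archaea", "bacteria"},
    "og5": {"archaea", "bacteria", "eukaryotes"},  # universal, not exclusive
    "og6": {"eukaryotes"},                         # lineage-specific
}

def exclusive_pair_counts(groups):
    """Count groups present in exactly two kingdoms, keyed by the sorted pair."""
    counts = Counter()
    for kingdoms in groups.values():
        if len(kingdoms) == 2:
            counts[tuple(sorted(kingdoms))] += 1
    return counts

counts = exclusive_pair_counts(GROUPS)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

With real data, the presence call for each kingdom would need a threshold (for example, presence in most sequenced species of that kingdom), and that choice is an assumption that affects the resulting counts.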
Within the eukaryotes, the observed phyletic distribution of domains and proteins involved in RNA metabolism seems to conflict with two well-known phylogenetic hypotheses. The number of orthologous groups of proteins shared exclusively by animals and plants is approximately 41, in contrast to just 15 that are exclusively shared by fungi and animals. At face value, this contradicts the currently accepted phylogeny, in which fungi and animals are sister groups ( 273 ). A possible explanation for this pattern, however, is a massive loss of ancestral genes in the currently available fungal genomes, those of two yeasts. Comparative genomics indeed provides support for large-scale gene loss in the yeasts ( 133 , 274 ). However, in some cases, such as the capping enzyme, TAF II 250, eIF4G and Whi3p, the yeast versions have domain architectures distinct from those that are shared by their orthologs in animals and plants. Thus, the topology of the primary branches within the eukaryotic crown group probably should be considered unresolved, emphasizing the need for further investigation from the comparative genomics angle, in addition to individual phylogenies of multiple proteins.
The second piece of evidence that contradicts a popular phylogenetic hypothesis is the presence of 24 exclusive orthologous groups shared by arthropods and vertebrates as opposed to only three that are shared by arthropods and nematodes. A similar phyletic pattern has been reported in the case of orthologous groups shared by arthropods and vertebrates in other functional systems, such as chromatin structure and organization and the apoptosis apparatus (275, 276). These observations are not consistent with the existence of a nematode–arthropod clade, which is favored by the ecdysozoan model of eukaryotic evolution (277). Although some gene loss in C.elegans is a possibility, the minimal animal proteomes are of approximately the same size (once lineage-specific family expansions are factored in), and therefore it appears less likely that the specific link between vertebrates and arthropods can be attributed to massive gene loss in nematodes. This suggests that the traditional model of a coelomate clade (278), as opposed to an ecdysozoan clade (277), could be a more accurate representation of animal phylogeny.
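The comparisons above rest on counting orthologous groups whose phyletic pattern is confined to exactly two lineages. A minimal Python sketch of such a count is given below, assuming a presence/absence table that maps each orthologous group to the set of lineages in which it occurs. The function name, the group identifiers (OG1 to OG5) and the toy data are hypothetical and serve only as illustration; they are not the data or the procedure used in this study.

```python
from collections import Counter
from itertools import combinations


def exclusive_pair_counts(phyletic_patterns, lineages):
    """Count orthologous groups present in exactly two of the given lineages.

    phyletic_patterns: dict mapping a group identifier to the set of
        lineages in which the group is present (hypothetical input format).
    lineages: iterable of lineage names to compare.
    Returns a Counter keyed by frozenset({lineage_a, lineage_b}).
    """
    lineage_set = frozenset(lineages)
    # Start every pair at zero so that pairs with no exclusive groups are reported.
    counts = Counter({frozenset(pair): 0 for pair in combinations(sorted(lineage_set), 2)})
    for present in phyletic_patterns.values():
        present = frozenset(present)
        # "Exclusive" means the group occurs in these two lineages and nowhere else.
        if len(present) == 2 and present <= lineage_set:
            counts[present] += 1
    return counts


# Toy presence/absence data (hypothetical, for illustration only).
toy_patterns = {
    "OG1": {"archaea", "eukaryotes"},
    "OG2": {"archaea", "eukaryotes"},
    "OG3": {"bacteria", "eukaryotes"},
    "OG4": {"archaea", "bacteria", "eukaryotes"},  # universal, not exclusive to any pair
    "OG5": {"bacteria"},                           # lineage-specific, not shared
}

counts = exclusive_pair_counts(toy_patterns, ["archaea", "bacteria", "eukaryotes"])
for pair, n in sorted(counts.items(), key=lambda kv: sorted(kv[0])):
    print(" + ".join(sorted(pair)), n)
```

Such raw counts do not distinguish shared ancestry from differential gene loss or horizontal transfer, which is why the patterns are interpreted with caution in the text.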
The RNA metabolism system includes approximately 80 orthologous groups of proteins traceable to LUCA, which makes it the most evolutionarily conserved system among all cellular functional systems. This simple observation is consistent with the idea of a primordial ‘RNA world’ wherein RNA-related functions had a dominant role. Even before the radiation of the bacterial and archaeo-eukaryotic clades from LUCA, RNA metabolism had already differentiated into several distinct functional complexes: the ribosome involved in protein synthesis, the accessory apparatus of protein synthesis, which includes aaRS and translation factors, a battery of RNA-modifying enzymes involved in production of functional RNAs, an RNA degradation system with nucleases involved in both recycling and maturation of RNAs, and complexes with more specialized functions, such as transcription elongation and polyadenylation (Table 2). The majority of these proteins can be dissected into a limited set of about 40–50 principal domains, including several paralogous versions, which were present already in LUCA. This observation points to a pre-LUCA phase of evolution, with an even more limited set of RNA-associated proteins. More specifically, comparisons of the paralogous domains/proteins traceable to LUCA indicate that, at this early stage of evolution, the primitive organisms had a single ancestral GTPase, methylase, helicase (the ancestor of both SFI and SFII) and several other enzymes, as well as single versions of proteins containing RBDs, such as the progenitors of the α-L domains, the OB fold, the SH3-like barrels, KH, dsRBD and ZnR. Each of these ancestral proteins probably performed a wide range of functions, albeit with low specificity. The inevitable corollary of this notion is that, unlike the modern systems of RNA metabolism, the primitive system relied primarily on RNA for specificity of interactions and even catalysis, with proteins functioning largely as co-factors.
Thus, these reconstructions seem to provide support for an ancient RNA world in which simple proteins with generic functions facilitated catalysis and specific interactions that were primarily mediated by RNAs. With the gradual increase in the number of proteins interacting with RNAs, as a result of multiple duplications, proteins gradually evolved greater diversity to occupy most functional niches that, in the primordial organisms, belonged to RNAs. This led to the gradual displacement of the ribozymes, while leaving behind remnants, such as RNase P, the guide RNAs involved in RNA modifications, the spliceosomal U RNAs and, most prominently, the 23S rRNA. Most but not all of these displacements appear to have already taken place prior to LUCA. Previous studies on the evolution of DNA replication systems suggested that LUCA most likely did not possess a modern-type DNA genome, but instead had a mixed RNA–DNA genetic system (279). Thus, as long as the nature of the genetic material can be considered a criterion, LUCA itself probably still was one of the terminal stages of the evolution of the RNA world.
As discussed above with regard to numerous protein families, evolution of the RNA metabolism system involved multiple horizontal gene transfers which, in principle, could jeopardize the use of the parsimony principle for evolutionary reconstructions (see Materials and Methods). Furthermore, backward extrapolation suggests that horizontal transfer was even more rampant early during evolution, which could potentially refute the very concept of a single LUCA (280). However, the (near) omnipresence of numerous translation components and a substantial set of RNA modification enzymes, together with the fact that most of them conform to the standard model of evolution, indicates that reconstruction of LUCA, although necessarily probabilistic, is feasible (Table 2). These reconstructions indicate that LUCA probably was an organism or, more precisely, a population of organisms, certain major characteristics of which were very different from those of modern organisms. In particular, LUCA’s genome probably consisted of multiple RNA and DNA segments (279), which led to extreme genome fluidity (279, 280). Nevertheless, the previous and present evolutionary reconstructions show that many functional systems, including, above all, the RNA metabolism system, had already ‘crystallized’ in this organism (281).
During the post-LUCA phase of evolution, RNA metabolism followed an evolutionary course essentially similar to that of other biological systems, but showed a strong tendency toward conservation of its ancient components. While some novelties evolved in both archaeal and bacterial lineages, the emergence of eukaryotes was marked by the most remarkable burst of innovation. Many of these innovations can be traced to the ‘cross-fertilization’ between the archaeal and bacterial inheritances of the eukaryotic protein complement, whereas others involve new, eukaryote-specific domains. These innovations led to the origin and development of new functional systems, such as splicing, PTGS and other forms of post-transcriptional regulation; via a feedback loop, the evolution of these systems apparently stimulated lineage-specific expansion of numerous domains, particularly eukaryote-specific ones, through multiple rounds of duplication. This phase of eukaryotic evolution culminated in the extensive expansion of RBDs in the vertebrate and plant lineages, which seems to correlate with the advent of alternative splicing as a major force in the diversification of the functional potential of an organism.
Availability of complete results
An annotated list of all detected proteins from completely sequenced genomes that are known or predicted to be involved in RNA metabolism is available at ftp://ncbi.nlm.nih.gov/pub/aravind/RNA
NOTE ADDED IN PROOF
After this paper was submitted, the crystal structure of Type I pseudouridine synthase TruB was published (282). Like Type II pseudouridine synthases, TruB has an RRM-like fold, which indicates that all known pseudouridine synthases have evolved from a common ancestor, which originally probably was derived from a primitive RNA-binding protein. Two members of the HemK family of predicted methylases, HemK itself and YfcB, have been shown to methylate a specific glutamine residue in bacterial class 1 peptide release factors and in ribosomal protein L3, respectively (283). The role of HemK in release factor methylation was also demonstrated in an independent study (284). Thus, although the HemK family belongs to the BNM superfamily, which consists predominantly of RNA- and DNA-methylases, these proteins turned out to be protein N-methylases specific for proteins with fundamental roles in translation. This specificity is compatible with the universal presence of the HemK family in all forms of life. It appears likely that these protein methylases evolved from RNA methylases at an early, pre-LUCA stage of evolution, in a fundamental switch of specificity, which resembles similar transitions in other methylases, e.g. the origin of cap methylases from small-molecule methylases. Finally, it has been shown that yeast Mrm2, a member of the FtsJ family of RNA methylases, is responsible for the 2’-O-ribose methylation of two nucleotides in the peptidyltransferase center of yeast mitochondrial 21S rRNA (285).
We thank J. Bujnicki for helpful discussions on RNA methylases. We gratefully acknowledge all researchers who contributed to the current understanding of diverse aspects of RNA metabolism and apologize for inevitable omissions in citation of their work due to space considerations.
To whom correspondence should be addressed. Tel: +1 301 435 5913; Fax: +1 301 435 7794; Email: firstname.lastname@example.org
a α/β, regular alternating αβ units with a typically parallel β sheet; α + β, domains with isolated α and β elements and a typically antiparallel β sheet.
b Classification of protein functions involved in RNA metabolism. S, splicing and processing; P, PTGR; C, capping and polyadenylation; Tl, translation; Tc, transcription; M, modification; U, miscellaneous. The numbers after the function designations indicate the number of orthologous groups of proteins containing the given domain for each function.
c The number of paralogs is indicated in parentheses whenever a lineage-specific expansion of a protein family is mentioned.