DNA replication is central to all extant cellular organisms. There are substantial functional similarities between the bacterial and the archaeal/eukaryotic replication machineries, including but not limited to defined origins, replication bidirectionality, RNA primers and leading and lagging strand synthesis. However, several core components of the bacterial replication machinery are unrelated or only distantly related to the functionally equivalent components of the archaeal/eukaryotic replication apparatus. This is in sharp contrast to the principal proteins involved in transcription and translation, which are highly conserved in all divisions of life. We performed detailed sequence comparisons of the proteins that fulfill indispensable functions in DNA replication and classified them into four main categories with respect to the conservation in bacteria and archaea/eukaryotes: (i) non-homologous, such as replicative polymerases and primases; (ii) containing homologous domains but apparently non-orthologous and conceivably independently recruited to function in replication, such as the principal replicative helicases or proofreading exonucleases; (iii) apparently orthologous but poorly conserved, such as the sliding clamp proteins or DNA ligases; (iv) orthologous and highly conserved, such as clamp-loader ATPases or 5′→3′ exonucleases (FLAP nucleases). The universal conservation of some components of the DNA replication machinery and enzymes for DNA precursor biosynthesis but not the principal DNA polymerases suggests that the last common ancestor (LCA) of all modern cellular life forms possessed DNA but did not replicate it the way extant cells do. We propose that the LCA had a genetic system that contained both RNA and DNA, with the latter being produced by reverse transcription. Consequently, the modern-type system for double-stranded DNA replication likely evolved independently in the bacterial and archaeal/eukaryotic lineages.
DNA replication is an essential, central feature of cellular life. There are many important functional parallels among all known cellular systems of DNA replication. These common features can be roughly summarized as follows: (i) replication is semi-conservative; (ii) replication always initiates at defined origins with the participation of an origin recognition system; (iii) replication fork movement is typically bidirectional; (iv) replication is continuous on the leading strand and discontinuous on the lagging strand; (v) RNA primers are needed to start DNA replication; (vi) nucleases, polymerases and ligases replace the RNA primers with DNA and seal the remaining nicks ( 1 , 2 ). It is therefore surprising that the protein sequences of several central components of the DNA replication machinery, above all the principal replicative polymerases, show very little or no sequence similarity between bacteria and archaea/eukaryotes ( 3 , 4 ). These observations suggest that some of the replication system components may not be homologs at all, whereas others, while homologous, are highly diverged. This is in stark contrast to the highly significant sequence similarity between the principal components of the transcription machinery, such as the DNA-dependent RNA polymerase (DdRp) subunits and a number of translation apparatus components.
The last 10 years have witnessed significant progress in our understanding of the relationships between proteins and domains involved in DNA replication. Significant sequence similarity between the polymerase-associated proofreading exonucleases of pro- and eukaryotes was noted in early studies ( 5 ). The recognition of homology between other replication proteins where sequence similarity was initially hard to detect has been made possible by structural comparisons. This was the case for the sliding clamp ( 6 , 7 ), the single-stranded (ss)DNA-binding proteins ( 8–10 ) and the 5′→3′ (flap) endonucleases ( 11–14 ). No sequence similarity, however, has been detected between the principal replicative polymerases, namely the eubacterial family C (pol III) and the archaeal/eukaryotic family B polymerases, despite intense scrutiny at the sequence level ( 15–17 ) and despite the increasing availability of polymerase structures, including pol I from Escherichia coli ( 18 ) and Thermus aquaticus ( 13 ), HIV reverse transcriptase ( 19 ), T7 RNA polymerase ( 20 ) and a family B polymerase from phage RB69 ( 21 ). In the same vein, no sequence similarity could be found between the eubacterial and archaeal/eukaryotic primases ( 22 ).
Thus, the pattern of sequence conservation and divergence displayed by the replication proteins is fundamentally different from the pattern observed in the translation and transcription systems. It seems most likely that the core of the translation and transcription machinery was established in the last common ancestor (LCA) of all extant cells and subsequent evolution in different divisions of life did not involve dramatic alterations of the ancient molecular foundation. In contrast, major changes have occurred in the peripheral components, such as transcription regulators. Conversely, the core replication machinery, including the main replicative DNA polymerase, primase and the gap-filling polymerase, shows no detectable conservation. Several of the peripheral components, however, are clearly homologous or even orthologous. An obvious, though radical, explanation of the observed disparity is that the LCA did not have a DNA genome and its entire genetic system was RNA based. This hypothesis, however, does not account for the fact that several proteins involved in DNA replication as well as enzymes of deoxyribonucleotide biosynthesis and the recombination ATPase RecA are homologous in all extant organisms ( 23–26 ).
Edgell and Doolittle ( 4 ) delineated three distinct scenarios that could explain the existence of two versions of the replication machinery without invoking an RNA-only LCA. (i) The bacterial and archaeal/eukaryotic replicative systems have evolved from the LCA replication apparatus and the main replicative enzymes are actually homologs but, for some reason, have diverged rapidly and, in several cases, beyond recognition. (ii) The LCA possessed both a bacterial-type and an archaeal/eukaryotic-type DNA replication system (one of these could be responsible for repair) and the existence of two radically different systems in extant cells is due to differential gene loss in the bacterial and the archaeal/eukaryotic lineages. (iii) Either the bacterial or the archaeal/eukaryotic replication system is the direct descendant of the ancestral replication apparatus whereas the other version evolved by recruitment of non-homologous proteins accompanied by replacement of ancestral components.
To reach a clearer understanding of the origin(s) of the DNA replication system by comparative analysis of the sequences and structures of their components, additional, systematic effort in two directions seems to be necessary: (i) detecting subtle sequence and structural similarities that have escaped detection previously; (ii) solving the issue of orthologous relationships between replication components. The importance of the former aspect is underscored by the homologous relationship between the bacterial and eukaryotic sliding clamp proteins that was not originally recognized but became apparent when their structures had been determined ( 7 , 27 ). With the advent of more powerful methods for sequence analysis, such as PSI-BLAST ( 28 ), the similarity between the clamp proteins has become detectable at the sequence level. This suggests that systematic, careful comparisons of replication proteins might reveal additional subtle but evolutionarily and functionally important similarities. Such findings could shift the balance in our thinking about the evolution of DNA replication towards the common origin hypothesis, whereas the absence of detectable similarity in spite of a careful comparison might suggest independent origin for at least some of the components. It is critical for any meaningful evolutionary reconstruction to distinguish orthologs that likely evolved from an ancestral component of the replication machinery from homologous but not orthologous proteins that might have independently originated from proteins that had functions other than DNA replication.
With these considerations in mind, we attempted an exhaustive comparison of the sequences and structures of bacterial, archaeal and eukaryotic proteins known to be directly involved in DNA replication. We classified these proteins into orthologs, non-orthologous homologs and those components that appear to be completely unrelated. On the basis of this analysis, we propose a hypothesis that the LCA possessed a genetic system that involved both RNA and DNA, with the latter being produced by reverse transcription. Consequently, the modern-type system for double-stranded (ds)DNA replication might have evolved independently in the bacterial and archaeal/eukaryotic lineages.
Databases and Sequence Analysis
For all sequence searches, the non-redundant database (NR) at the National Center for Biotechnology Information (NIH, Bethesda, MD) was used. The protein sequence similarity searches were performed using the gapped BLAST program and the PSI-BLAST program ( 28 ). The PSI-BLAST program constructs a position-dependent weight matrix from multiple alignments generated from the BLAST hits above a certain expectation value (e-value) and carries out iterative database searches using the information derived from this matrix ( 28 , 29 ). Normally, an e-value of 0.01 is considered an indication that a database hit is statistically significant after regions of low compositional complexity that tend to produce artifactually low e-values in database searches have been masked in the query sequence ( 29 , 30 ). Compositionally biased regions in protein sequences were masked prior to searches using the SEG program ( 31 ). The taxonomic breakdown of the database hits was produced using the Tax_Collector program of the SEALS package ( 32 ). The likely orthologs were identified on the basis of consistent inter-genomic best hits as described previously ( 33 , 34 ) and derived shared characters (synapomorphies) manifest at the level of distinct sequence motifs or features of domain architectures; the reasoning behind the assignment of orthologs is discussed below for each individual case.
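The best-hit part of the ortholog criterion described above can be illustrated with a minimal sketch. It assumes that similarity search results are already available as (query, subject, e-value) records; the `Hit` type, the function names and the protein identifiers are hypothetical and are not part of the SEALS package or of BLAST itself. The sketch covers only the reciprocal best-hit test with the 0.01 e-value threshold; the synapomorphy-based reasoning applied in the individual cases is not modeled.

```python
from typing import Dict, List, NamedTuple, Set, Tuple

E_VALUE_CUTOFF = 0.01  # significance threshold quoted in the text


class Hit(NamedTuple):
    """One database hit (hypothetical record layout, not a BLAST output format)."""
    query: str
    subject: str
    e_value: float


def best_hits(hits: List[Hit], cutoff: float = E_VALUE_CUTOFF) -> Dict[str, str]:
    """Map each query to its lowest-e-value subject, ignoring hits above the cutoff."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.e_value > cutoff:
            continue  # not statistically significant
        current = best.get(hit.query)
        if current is None or hit.e_value < current.e_value:
            best[hit.query] = hit
    return {query: hit.subject for query, hit in best.items()}


def reciprocal_best_hits(a_vs_b: List[Hit], b_vs_a: List[Hit]) -> Set[Tuple[str, str]]:
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if backward.get(b) == a}


# Made-up identifiers and e-values, for illustration only.
bacteria_vs_archaea = [
    Hit("bactA", "archX", 1e-30),
    Hit("bactA", "archY", 1e-5),
    Hit("bactB", "archY", 0.5),   # above the cutoff, ignored
]
archaea_vs_bacteria = [
    Hit("archX", "bactA", 1e-28),
    Hit("archY", "bactA", 1e-6),
]
candidate_orthologs = reciprocal_best_hits(bacteria_vs_archaea, archaea_vs_bacteria)
```

In this toy input only the pair `bactA`/`archX` passes the test: `archY` also points to `bactA`, but `bactA` prefers `archX`, which is the kind of asymmetry that argues against orthology.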
Evolutionary Relationships Between Bacterial and Archaeal/Eukaryotic DNA Replication Systems
Table 1 lists the best database hits from archaea and eukaryotes for the principal bacterial proteins that are involved in DNA replication and DNA precursor synthesis in bacteria and in archaea/eukaryotes; analogous data for transcription machinery components are included as a control. Only a minority of the bacterial DNA replication machinery components show significant similarity to archaeal/eukaryotic homologs. Some of the strongest hits from bacteria to eukaryotes, such as those to the human NAD-dependent DNA ligase and the pol I homolog from Drosophila , are readily explained by horizontal gene transfer, most likely from organelles (see also 35). Additional cases of likely horizontal transfer, apparently from eukaryotes or archaea to bacteria, are seen in a reciprocal analysis of eukaryotic replication machinery components. These include the B family DNA polymerases, which are ubiquitous in eukaryotes and archaea but so far present only in the γ-proteobacterial lineage, and ATP-dependent DNA ligases, which show a sporadic presence in certain bacteria (data not shown).
These cases of apparent horizontal gene transfer apart, the striking contrast between the replication and transcription systems, in terms of conservation of the respective components (or lack thereof), in bacteria and archaea/eukaryotes is obvious ( Table 1 ). Although both the replication system and the transcription system include proteins that are highly conserved between bacteria and archaea/eukaryotes, along with ones that show little or no similarity, the breakdown of these systems into conserved and distinct components goes along very different lines. In the transcription machinery, the principal subunits of the DdRp show high levels of conservation, whereas accessory polymerase subunits and transcription factors are poorly conserved or show no detectable similarity at all. Amongst the replicative proteins, the situation is inverted; the DNA polymerases and primases are not detectably similar and only some of the accessory subunits, such as clamp-loading ATPases, enzymes that participate in replication but are not components of the replication fork, such as topoisomerase I, and at least some DNA precursor biosynthesis enzymes are highly conserved ( Table 1 ).
To solve the central conundrum in the evolution of replication (common versus independent origins of the bacterial and archaeal/eukaryotic systems), it is not enough to show that components of the DNA replication machinery are homologous or non-homologous. Replication of dsDNA poses a number of similar problems in any system and it would not be unexpected if independently evolving solutions were similar, given that ancient protein superfamilies, such as the P-loop ATPases, were already available for recruitment in the LCA. Thus the goal of comparative analysis of the replication systems is to distinguish, as best we can, between those components that appear to be orthologous and thus should have descended from an LCA protein that had the same function and those for which, whether they are homologous or not, independent origin is more likely. Proving independent origin is hard, if at all possible. The case, however, is strongly supported if, for example, an archaeal/eukaryotic protein with a central role in replication is most closely related not to its bacterial functional counterpart but to a protein family that performs functions outside replication.
The lack of detectable sequence similarity does not automatically mean that the respective proteins are not homologs; there are examples of very subtle relationships between bacterial and archaeal/eukaryotic proteins that nevertheless appear to indicate homology or even orthology (see for example 36). Conversely, even highly significant sequence similarity, such as that observed between the clamp-loader ATPases, is not necessarily a guarantee of orthology.
With these considerations in mind, we performed a more detailed, case-by-case analysis of the bacterial, archaeal and eukaryotic proteins involved in DNA replication. Figure 1 summarizes the domain arrangements seen in the protein components of the bacterial, archaeal and eukaryotic DNA replication machineries and the relationships between them. In Table 2 , replicative proteins are classified into four principal categories that are discussed below.
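The four-way classification used in Table 2 amounts to a short decision procedure, sketched below. This is only an illustration of the logic: the `Category` names and the `classify` function are hypothetical labels introduced here, the three boolean inputs stand for judgments that in practice require the case-by-case analysis that follows, and the example assignments simply restate conclusions reached in the text.

```python
from enum import Enum


class Category(Enum):
    """The four categories of replication components (labels are illustrative)."""
    UNRELATED = "no detectable homology"
    HOMOLOGOUS_NOT_ORTHOLOGOUS = "homologous, apparently recruited independently"
    ORTHOLOGOUS_DIVERGED = "orthologous but poorly conserved"
    ORTHOLOGOUS_CONSERVED = "orthologous and highly conserved"


def classify(homologous: bool, orthologous: bool, highly_conserved: bool) -> Category:
    """Assign a bacterial vs archaeal/eukaryotic component pair to a category."""
    if not homologous:
        return Category.UNRELATED
    if not orthologous:
        return Category.HOMOLOGOUS_NOT_ORTHOLOGOUS
    if highly_conserved:
        return Category.ORTHOLOGOUS_CONSERVED
    return Category.ORTHOLOGOUS_DIVERGED


# Inputs restate the conclusions drawn in the text for four representative components.
examples = {
    "replicative polymerase": classify(False, False, False),
    "replicative helicase": classify(True, False, False),
    "sliding clamp": classify(True, True, False),
    "RNase HII": classify(True, True, True),
}
```

The order of the tests reflects the argument of the paper: homology is established first, orthology second, and the degree of sequence conservation is considered only among orthologs.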
Unrelated components in the bacterial and archaeal/eukaryotic DNA replication machineries
This category consists of only four domains but, strikingly, these include all three functional types of DNA polymerases required for replication, namely the DNA polymerase involved in elongation, the primase that is responsible for primer synthesis and hence the initiation of DNA replication, and the DNA polymerase involved in gap-filling upon primer removal. Not only database searches but also direct comparisons fail to show any sequence similarity between the nucleotide polymerization domain of bacterial DNA polymerase III (pol III) α-subunit and the functionally analogous domain of the archaeal and eukaryotic family B DNA polymerases (or any other proteins). The same is true of the second archaeal DNA polymerase (pol IV), whose large subunit, with the exception of a Zn-ribbon domain, appears to be unrelated to either bacterial or eukaryotic polymerases ( 37 , 38 ). The three-dimensional structures of pol III and pol IV have not been determined and therefore it cannot be ruled out that they have the ‘palm-and-fingers’ structure similar to that seen in other DNA polymerases, including the bacteriophage RB69 polymerase ( 21 ), which represents the archaeal/eukaryotic family B. However, counterparts to the conserved motifs that appear to be shared by the eukaryotic and archaeal DNA polymerases, reverse transcriptases and RNA-dependent RNA polymerases (RdRps) ( 16 ) are not detectable in pol III and pol IV. This makes a specific evolutionary affinity between the bacterial and archaeal/eukaryotic DNA polymerase subunits involved in chain elongation during DNA replication most unlikely.
Both types of replicative DNA polymerases possess two additional enzymatic domains that also may function as separate subunits, namely a 3′→5′ exonuclease and a predicted phosphoesterase ( Fig. 1 ). The exonuclease domains are related but may not be orthologous, as discussed below. In contrast, the phosphoesterase domains/subunits belong to two distinct enzyme superfamilies, namely the PHP superfamily in bacteria and the calcineurin-type superfamily of metal-dependent phosphoesterases in archaea and eukaryotes, which show no indication of a homologous relationship ( 39 ).
DNA primases present a case where an independent origin of the bacterial and archaeal/eukaryotic enzymes appears to be supported by positive evidence as well as a lack of detectable sequence similarity. The catalytic domain of bacterial primases shows a subtle but statistically significant sequence similarity to the DNA-nicking-rejoining domains of type I, type II and type VI topoisomerases and a distinct group of nucleases; all these proteins are predicted to contain the conserved Toprim domain ( 22 ). Despite a careful search, we were unable to detect any similarity to the Toprim domain in the sequences of eukaryotic primases. The fact that bacterial primases show an apparent structural and evolutionary relationship not with their archaeal/eukaryotic functional counterparts but with enzymes that have significantly different, even if mechanistically related, functions seems to effectively rule out an origin of the two types of extant primases from an ancestral primase.
Finally, the bacterial gap-filling DNA polymerase (pol I) appears to be unrelated (or, at best, extremely distantly related), with the exception of the 3′→5′ exonuclease domain, to any other DNA polymerases, whereas eukaryotes utilize family B DNA polymerases for both elongation and gap-filling ( Fig. 1 and Table 2 ).
Homologous but not orthologous components of the DNA replication apparatus in bacteria and archaea/eukaryotes
Several important components of the DNA replication machinery in bacteria and archaea/eukaryotes, while homologous, are strong candidates for independent recruitment for a role in replication. The example of the principal replicative helicases is the most straightforward one. All helicases appear to be ultimately homologous as members of the P-loop NTPase fold ( 40–42 ). This generic relationship apart, however, the bacterial replicative helicase DnaB and the helicases involved in eukaryotic replication, such as the DNA polymerase α-associated helicase A from yeast (ORF YKL017c) ( 43 ), belong to different divisions of the P-loop NTPase fold. Yeast helicase A belongs to helicase superfamily I, which includes a variety of DNA and RNA helicases, such as bacterial UvrD, that are involved in repair functions and may also perform accessory roles in replication. Some of the highly conserved eukaryotic homologs of helicase A are RNA helicases, such as the NAM7/UPF1 proteins from fungi and animals, that are required for the processing of nonsense mRNAs ( 44 , 45 ), and yeast SEN1, that is involved in the endonucleolytic cleavage of introns from precursor tRNAs ( 46 ). Another group of highly conserved archaeal and eukaryotic DNA helicases involved in replication, the MCM proteins, belongs to the AAA+ superfamily of P-loop NTPases ( 42 , 47 ). In addition to the MCM helicases and the bacterial helicase RuvB, involved in repair, this superfamily includes a variety of ATPases with broadly defined chaperone-like functions, e.g. subunits of ATP-dependent proteases. In contrast, DnaB is a member of a distinct family that is specifically related to the RecA family, to the exclusion of other groups of ATPases ( Table 2 ; D.D.Leipe, L.Aravind and E.V.Koonin, unpublished observations). Thus, the principal replicative helicase seems to be a clear-cut case of independent drawing of enzymes from the pool of P-loop ATPases for a central function in DNA replication.
The case of the origin recognition and licensing ATPases is more complicated in that the protein that performs this function in bacteria (DnaA), its functional analogs in eukaryotes (the origin recognition complex subunits, e.g. ORC1) and their archaeal homologs all belong to the AAA+ superfamily of P-loop ATPases ( 42 ). Within this superfamily, however, DnaA does not cluster with its functional counterparts from eukaryotes or archaea, suggesting that there is no orthologous relationship between the bacterial and archaeal/eukaryotic origin recognition ATPases ( Table 2 ).
An even more complex relationship is seen between the 3′→5′ proofreading exonucleases of bacterial and archaeal/eukaryotic replicative polymerases. In bacteria they exist either separately as the ε subunit of pol III or are inserted into the PHP domain of one of the multiple α-subunits of pol III in the Gram-positive lineage and Thermotoga ( Fig. 1 ). In the archaea and eukaryotes, the 3′→5′ exonuclease is always fused to the DNA polymerase catalytic domain. Both bacterial and archaeal/eukaryotic proofreading exonucleases belong to the large superfamily of 3′→5′ exonucleases that includes not only DNases but also a variety of RNases ( 48 ). Phylogenetic tree analyses do not show enough resolution to meaningfully address the issue of the monophyly of the proofreading exonucleases to the exclusion of other nucleases in this superfamily (data not shown). The sequence similarity between the exonuclease domains of bacterial pol III and archaeal/eukaryotic polymerases is low (two to four iterations of PSI-BLAST are required to detect it). Bacterial pol III proofreading enzymes show the greatest similarity to a group of eukaryotic poly(A)-processing enzymes. The 3′→5′ exonuclease domains fused to bacterial pol I and to helicases, such as the vertebrate Werner syndrome protein, are also significantly similar to this group, which suggests that these domains were recruited for different functions on multiple occasions. Given the high level of divergence and the abundance of RNases in the 3′→5′ exonuclease superfamily, it is not certain whether the extant proofreading nucleases are all descendants of an ancestral proofreading enzyme or have been independently recruited for this task from the general pool of exonucleases.
The ssDNA-binding proteins represent another case of homologous domains that apparently have been independently recruited to perform a similar function in the archaeal/eukaryotic and bacterial lineages. Both bacterial and archaeal/eukaryotic ssDNA-binding proteins contain the ancient, widespread nucleic acid-binding domains of the OB-fold ( 49 ). A detailed sequence comparison showed that the eukaryotic ssDNA-binding protein that contains three OB-fold domains and its archaeal counterpart containing five OB-fold domains (the RPA proteins) are most closely related to the subclass of OB-folds typified by those in the lysyl- and aspartyl-tRNA synthetases (L.Aravind, unpublished observations). Similar OB-folds are also found in bacterial pol III α-subunits, the small subunit of the archaeal DNA polymerases (N-terminal to the phosphoesterase domain) and some bacterial and archaeal nucleases. Thus the archaeal/eukaryotic ssDNA-binding proteins belong to a distinct family of OB-folds that includes both RNA- and DNA-binding members. In contrast, sequence comparisons show that bacterial ssDNA-binding proteins form a separate family of OB-folds with distinct structural features, such as unusually long β-strands ( 9 , 50 ).
Orthologous components of the bacterial and archaeal/eukaryotic replication machineries
A considerable subset of the proteins that comprise the replication machinery appears to be represented by orthologs in all extant organisms. In only two cases, however, namely those of RNase HII and topoisomerase IA, do these proteins show obvious, high conservation at the sequence level ( Table 1 ).
The bacterial and archaeal/eukaryotic clamp-loader ATPases show a moderate but statistically significant similarity to each other ( Table 1 ). There are, however, considerable differences in the domain architectures of the bacterial and eukaryotic clamp-loaders, such as the presence of BRCT domains in eukaryotic but not bacterial clamp-loaders and, conversely, the presence of a zinc-finger in bacterial but not eukaryotic ones ( Fig. 1 ). Nevertheless, the presence of unique sequence signatures, such as the SRC motif ( 42 , 51 ), suggests that the ATPase domains of the clamp-loaders are orthologous.
Other proteins and domains, namely archaeal/eukaryotic FEN1/RAD2 nucleases and bacterial 5′→3′ exonuclease domains of polymerase I, the replication sliding clamps (PCNA) and DNA ligases (the NAD-dependent ligase in bacteria and the ATP-dependent ligase in eukaryotes), show very low sequence conservation but, nevertheless, appear to be orthologs ( Table 2 and Fig. 1 ). Until recently, the homologous relationships between these components of the replication machinery remained undetected. However, detailed sequence comparisons as well as structural superposition for the sliding clamps and the ligases ( 36 , 52 ; see also above) indicated that in each of these cases, the bacterial and archaeal/eukaryotic proteins are homologous. Moreover, apparent horizontal gene transfers apart, the bacterial proteins in each of these cases are more similar to their functional counterparts from archaea/eukaryotes than to any other archaeal or eukaryotic proteins ( Table 2 ). These observations suggest that orthologous relationships exist for each of these proteins, in spite of the high level of divergence.
Finally, some replication proteins, such as RNase HI and topoisomerase II, are highly conserved in bacteria and eukaryotes but are missing from the Archaea. This distribution might be indicative of a horizontal transfer from bacteria to eukaryotes, although it cannot be ruled out that these proteins were present in the LCA and have been lost in the archaeal lineage. Furthermore, archaeal topoisomerase VI appears to be orthologous to eukaryotic proteins involved in recombination (e.g. yeast Spo11) but is only distantly related to bacterial and eukaryotic topoisomerase II ( 53 ). This suggests that the lack of a distinct archaeal topoisomerase II ortholog might be alternatively explained by extreme divergence.
Hypothesis: a mixed, RNA/DNA genetic system in the LCA
As discussed above, the DNA replication machinery in bacteria, compared to that of archaea/eukaryotes, is built from a patchwork of orthologous (but sometimes highly diverged) proteins, proteins that are homologous but apparently have been independently recruited for replication and a core of polymerases that seem to be unrelated ( Table 2 and Fig. 1 ).
How can this mixture of ancestral and independently acquired features of the DNA replication systems be accounted for? Three principal models can be envisioned for the replication of the genome of the LCA. (i) The LCA had an RNA genome that was replicated by RdRp. (ii) The LCA already had a DNA genome, like modern-day cells, that was replicated by DNA-directed DNA polymerases (DdDp). (iii) The genome of the LCA had an RNA component and a DNA component, with the DNA being transcribed into RNA and RNA being reverse transcribed into DNA. Given the orthology and high conservation of the core components of the eubacterial and archaeal/eukaryotic transcription machinery, as well as the orthologous relationships between at least some enzymes of DNA precursor biosynthesis, several components of the replication machinery itself and the RecA/RadA recombinase, the first possibility seems unrealistic. The LCA must have been able to synthesize and make use of DNA. The second model must somehow explain the lack of orthology and, in several cases, any detectable homologous relationship whatsoever between key components of the DNA replication apparatus in bacteria compared to archaea/eukaryotes. As already mentioned, such explanations would involve one or more of the three main themes: (i) the principal components of the DNA replication are in fact orthologous in all forms of life but have diverged beyond recognition; (ii) there has been non-orthologous displacement of some but not other components of the DNA replication machinery in one of the divisions of life (e.g. bacteria); (iii) the LCA possessed two (partially) independent DNA replication systems that have been eliminated in a lineage-specific fashion during subsequent evolution.
The complexity of the eukaryotic chromatin in the form of linear chromosomes, larger genome size and higher order packaging does impose new problems on any DNA handling system ( 54 ). Such changes are visible in the basic repair enzymes ( 35 ) and transcription machinery of the eukaryotes and, in principle, might account for the rapid divergence of the replication systems. However, archaea have single circular chromosomes and genome size in the same range as bacteria but their replication machinery is orthologous to the eukaryotic one (with some important distinctions, such as the presence of a unique DNA polymerase) and dissimilar from the bacterial one, as discussed above. Thus the distinction between the bacterial and the archaeal/eukaryotic replication systems does not seem to correlate with the major changes in chromatin structure and genome organization which separate eukaryotes from both bacteria and archaea. The advent of the eukaryotic chromatin organization is associated with the recruitment of additional subunits to the replication complexes but not with dramatic changes to the core components. This makes a major acceleration of evolution a highly unlikely explanation for the disparity between the replication systems of bacteria and archaea/eukaryotes.
Non-orthologous gene displacement, i.e. recruitment of genes from outside the replication machinery, offers an alternative way to account for the lack of sequence similarity between replication machinery proteins. For some of the replication proteins, a possible source for such recruitment exists, e.g. topoisomerases for bacterial-type DNA primases or AAA+ ATPases with chaperone functions for ATPases involved in replication (DnaA or ORC1). It is hard to imagine, however, what could be the selective advantage of the displacement of key components of the replication apparatus, particularly if such displacements were to occur one at a time. Simultaneous displacement of multiple components, in contrast, would effectively amount to a takeover by an independently evolved replication system which would mean two origins rather than one for the DNA replication machinery.
The third option, namely the differential loss of one of the two DNA replication systems inherited from the LCA (one of them originally responsible for repair), is perhaps most difficult to refute. However, in addition to being based on the unlikely assumption that the replication system of the LCA was considerably more complex than modern ones, this hypothesis also runs into problems with non-orthologous displacement mentioned above, in this case with regard to the DNA repair machinery. Indeed, comparative analysis of the proteins involved in DNA repair reveals an extreme diversity of the repair systems in bacteria and archaea/eukaryotes ( 35 ).
As an alternative to all these explanations, we hypothesize that the modern-type systems for dsDNA replication evolved independently in bacteria and in the archaeal/eukaryotic lineage. In the proposed model, the LCA did not have a replicating DNA genome and instead maintained a mixed RNA/DNA genome that had the following basic properties ( Fig. 2 ): (i) genomic RNA was reverse-transcribed into an RNA/DNA heteroduplex by a reverse transcriptase; (ii) the RNA moiety of the RNA/DNA duplex was digested by a nuclease; (iii) the remaining ssDNA served as the template for the synthesis of a dsDNA molecule (this step can be catalyzed by the same reverse transcriptase as step i); (iv) RNA was transcribed by a DdRp from the DNA genome (this step is the evolutionary forerunner of modern-day transcription).
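The flow of sequence information through steps (i) to (iv) can be traced with a toy string model. This is a deliberately simplified sketch, not a biochemical simulation: it assumes perfect base pairing, ignores priming, processivity and errors, and all function and variable names are hypothetical.

```python
# Watson-Crick pairing tables for the three kinds of template copying in the cycle.
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}  # reverse transcription
DNA_TO_DNA = {"A": "T", "T": "A", "G": "C", "C": "G"}  # second-strand DNA synthesis
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}  # transcription


def copy_strand(template: str, pairing: dict) -> str:
    """Return the strand complementary to `template`, written 5'->3'."""
    return "".join(pairing[base] for base in reversed(template))


def lca_cycle(genomic_rna: str) -> str:
    """Run one round of the hypothesized cycle and return the new genomic RNA."""
    # (i) a reverse transcriptase copies genomic RNA into an RNA/DNA heteroduplex
    minus_dna = copy_strand(genomic_rna, RNA_TO_DNA)
    heteroduplex = (genomic_rna, minus_dna)
    # (ii) a nuclease digests the RNA strand, leaving ssDNA
    ssdna = heteroduplex[1]
    # (iii) the ssDNA templates synthesis of the second DNA strand, giving dsDNA
    plus_dna = copy_strand(ssdna, DNA_TO_DNA)
    dsdna = (plus_dna, ssdna)
    # (iv) a DdRp transcribes genomic RNA from the minus strand of the dsDNA
    return copy_strand(dsdna[1], DNA_TO_RNA)


new_rna = lca_cycle("AUGGCC")
```

For the arbitrary input `AUGGCC`, the minus-strand DNA is `GGCCAT`, the plus strand is `ATGGCC`, and transcription regenerates `AUGGCC`, so the cycle closes without any DNA-directed DNA polymerase acting on a double-stranded template.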
This model explains the universal conservation of the core transcription machinery, the enzymes for DNA precursor biosynthesis and those components of the extant replication machinery that are orthologous and highly conserved in all forms of life, namely RNase HII and FEN1-like 5′→3′ exonuclease. The role of the other universal components of the replication machinery, such as the sliding clamp, the clamp-loader, the ligase and topoisomerase I, is less obvious and they do not seem to be required for the postulated mixed genetic system to function. It is conceivable, however, that a sliding clamp and a clamp-loader functioned in the LCA to increase the processivity of reverse transcription.
This model assumes a central function for a reverse transcriptase in the replication cycle of the LCA. Moreover, the hypothetical cycle that we have inferred by comparing the cellular DNA replication machinery components strikingly resembles those of retroid viruses, particularly caulimoviruses and hepadnaviruses ( 55 ). The similarities between the retroviral replication system and that of a hypothetical ancient cellular organism have been considered by Wintersberger and Wintersberger ( 56 ). It is conceivable that present-day retroid viruses are descendants of ancient genetic elements that escaped during the reverse transcription stage of cellular replication. The existence of an astonishing variety of reverse transcribing genetic elements, both RNA- and DNA-based, in modern-day eukaryotes and bacteria is compatible with this idea. On the other hand, except for eukaryotic telomerases ( 57 ) and eubacterial multicopy ssDNA-related enzymes ( 58 ), reverse transcriptases are rarely encoded by cellular genomes. It appears that reverse transcriptase cannot be tolerated by DNA replication-competent cells. Once DdDps had evolved, selection would have favored elimination of the reverse transcription pathway to prevent the ‘backward’ propagation of damage to RNA into DNA.
A notable aspect of the conservation pattern of the transcription machinery components supports this reverse transcription-based model. While the principal RNA polymerase subunits are highly conserved in the three domains of life, the subunits that are required for gene-specific transcription, such as the σ-factors in bacteria and TFIIB/TBP in archaea/eukaryotes, show no relationship beyond the generic nucleic acid binding helix-turn-helix domain ( Table 1 ; 59 ). This suggests that in the LCA, the RNA polymerase might not have been used for gene-specific transcription, but rather as a ‘replicative enzyme’ ( Fig. 2 ).
An important feature of the discussed model (as probably of any RNA genome model) is that the genome of the LCA consisted of multiple segments, simply because very long RNA molecules are unstable. A further attractive possibility is that circular DNA intermediates could have been formed in the LCA via mechanisms similar to those involved in the formation of circular proviruses in extant retroviruses and/or the virion dsDNA of hepadnaviruses and caulimoviruses ( 55 ). The formation and subsequent transcription of such circular dsDNA elements could have required the function of DNA ligase and topoisomerase I, respectively, thus justifying their likely presence in the LCA. Furthermore, the size of these replicons could increase via recombination, leading to an increasing demand for the sliding clamp, the clamp-loader and the topoisomerase and mounting pressure for the ‘invention’ of a true DNA replication system. A hint that recombination might have been actively occurring at this stage is the ubiquity and substantial conservation of RecA/RadA (the principal recombination ATPase) in all extant life forms ( 35 ). The presence of replicons of substantial size (∼30 kb) at this point in evolution is suggested by the conservation in bacteria and archaea of the ribosomal protein super-operon, which encodes some of the most highly conserved proteins in all life forms, namely the ribosomal proteins and RNA polymerase subunits ( 60 , 61 ). In all likelihood, this super-operon has been inherited from the LCA. Thus the first, ‘provirus-like’ DNA molecules could have been the precursors of bacterial-size circular dsDNA replicons, probably the ancestral form for all modern-type DNA genomes. This could happen, however, only after an efficient DNA replication system came to be, which, according to our hypothesis, occurred independently in bacteria and in the archaea-like ancestors of modern archaea and eukaryotes.
The outlined model of a mixed (hybrid) RNA/DNA genome should be conceived of as an intermediate stage between a pure RNA genome and the current, DNA-based genetic system. Initially, autonomous (non-DNA-dependent) RdRp-mediated RNA replication might also have persisted ( Fig. 2 ). Once RNA replication had ceased, a true hybrid genome (rather than a dual genome) would have evolved, in which RNA depends on DNA for its replication and DNA, in turn, depends on RNA. Though cumbersome from the point of view of today's cells, in the absence of true DNA replication capabilities, this hybrid RNA/DNA genome seems to be the only way that a cell can benefit from the higher stability of DNA and its amenability to repair.
The portrait of the LCA emerging from this model has features that are similar to those proposed by other theories of early evolution, as well as unique ones. The model seems to be compatible with the notion of asynchronous ‘crystallization’ of different cellular systems recently discussed by Woese ( 62 ). In the postulated LCA with a mixed genetic system, the translation system is expected to be largely similar to the extant one and so are the principal aspects of transcription. Also, this organism should encode significant metabolic capabilities, including those for the synthesis of amino acids and ribo- and deoxynucleotides. In contrast, the replication system as we know it today is non-existent and the genome organization itself is not ‘crystallized’. This creates potential for rapid evolution via recombination and re-assortment of genome segments.
The hypothesis of an independent evolution of DNA replication offers a parsimonious explanation for the strange assortment of apparently unrelated, homologous but not orthologous, and orthologous components in the DNA replication machineries of bacteria and archaea/eukaryotes. Admittedly, this scenario cannot completely invalidate the competing hypothesis of an origin of the DNA replication machinery in the LCA followed by as yet unknown (but clearly dramatic) evolutionary events causing the observed dissimilarity. We may never know the final answer. It is conceivable, however, that sequencing of genomes from very early branchings of life, such as Korarchaeota, and determination of key protein structures that are still unresolved, such as the bacterial pol III α-subunit, the large subunits of the DdRp and the unique archaeal DNA polymerase, might shift the balance toward one or the other of these competing hypotheses.