Enzymic recognition of amino acids drove the evolution of primordial genetic codes

Abstract How genetic information gained its exquisite control over chemical processes needed to build living cells remains an enigma. Today, the aminoacyl-tRNA synthetases (AARS) execute the genetic codes in all living systems. But how did the AARS that emerged over three billion years ago as low-specificity, protozymic forms then spawn the full range of highly specific enzymes that distinguish between 22 diverse amino acids? A phylogenetic reconstruction of extant AARS genes, enhanced by analysing modular acquisitions, reveals six AARS with distinct bacterial, archaeal, eukaryotic, or organellar clades, resulting in a total of 33 families of AARS catalytic domains. Small structural modules that differentiate one AARS family from another played pivotal roles in discriminating between amino acid side chains, thereby expanding the genetic code and refining its precision. The resulting model shows a tendency for less elaborate enzymes, with simpler catalytic domains, to activate amino acids that were not synthesised until later in the evolution of the code. The most probable evolutionary route for an emergent amino acid type to establish a place in the code was by recruiting older, less specific AARS, rather than adapting contemporary lineages. This process, parafunctionalisation, differs from previously described mechanisms through which amino acids would enter the code.


Introduction
The primordial genetic codes would have looked significantly different from their contemporary descendants [1,2]. Whereas the genetic codes of today are almost deterministic and include up to twenty-two amino acids, the primordial genetic codes would have been ambiguous due to low translational fidelity [3] and used the limited pool of amino acids initially available for protein synthesis. Some amino acids were available through prebiotic geochemistry and simple metabolic pathways, but there would be no enrichment of the more complex or less stable molecules until protocellular metabolism had advanced sufficiently [4,5]. Further, the first genetic codes were likely geographically regional. Protocellular populations may have resided in different parts of the ocean or in confined aqueous environments, but were still able to exchange genetic material periodically. Under the error minimisation theory, these competing genetic codes were selected for their ability to dampen the effect of genetic mutation on protein structure and function [2]. With time, translational fidelity sharpened, the pool of amino acids diversified, and the pairing between amino acids and anticodons was optimised, offering the genetic code greater precision, utility, and robustness. Protocellular complexity grew to a tipping point where even minor changes to the genetic code would no longer be tolerable, a phenomenon termed a "frozen accident" by Crick [6].
In all contemporary living things, genetic coding is effected by the catalytic action of aminoacyl-tRNA synthetases (AARS), a large group of enzymes that attach amino acids to their cognate tRNA. Aminoacylation is a two-step reaction powered by adenosine triphosphate (ATP), which releases adenosine monophosphate (AMP) and pyrophosphate (PPi) as byproducts of the reaction:
Amino acid activation: amino acid + ATP → aminoacyl-AMP + PPi
tRNA charging: aminoacyl-AMP + tRNA → aminoacyl-tRNA + AMP
Any comprehensive explanation of the origin of the genetic code, a subject of considerable debate (see reviews: [1,3]), must pay close attention to the AARS. The RNA world hypothesis suggests that the genetic code originated in an environment where self-reproducing populations of diverse RNA governed life's reaction pathways, including aminoacylation through hypothetical ribozymal aminoacyl-tRNA synthetases [1]. Ribozymes would later be superseded by proteinaceous enzymes due to their superior catalytic properties. AARS enzymes are an afterthought in the RNA world version of the code's origin. The nucleopeptide world hypothesis challenges this classical theory. It proposes that the genetic code originated in an environment which supported the RNA catalysis of peptide synthesis, and peptide catalysis of RNA synthesis, with AARS serving the central integrating role [1,3,7], as these enzymes now do in all three domains of life (bacteria, archaea, and eukaryota) as well as in mitochondria and chloroplasts.
Contemporary AARS are curious enzymes, rife with idiosyncrasies (see review: [8]). They consist of a catalytic domain which recognises an amino acid (and ATP), one or more domains that recognise tRNA (typically its acceptor stem and anticodon), and sometimes an editing domain that expels mistargeted amino acids from the reaction pathway. AARS belong to two distinct, apparently unrelated, evolutionary groups, which are designated Class I and Class II. Nine of the 22 proteinogenic amino acids are exclusively encoded by Class I enzymes, eleven by Class II, and the remaining two amino acids, lysine [9] and cysteine [10], can be encoded by Class I or II analogs. In most cases, each AARS encodes a single amino acid type, specified in the naming of that enzyme; for example, alanyl-tRNA synthetase (AlaRS) encodes alanine. However, in some cases, an AARS can encode an additional amino acid through pretranslational modification of the amino acid substrate after its attachment to tRNA. This is the case for the non-discriminating aspartyl- and glutamyl-tRNA synthetases (AsxRS and GlxRS), which attach Asp to tRNA^Asn and Glu to tRNA^Gln, respectively [8]. Similarly, o-phosphoseryl-tRNA synthetase (SepRS) encodes cysteine for organisms which lack CysRS [10], and SerRS encodes both serine and selenocysteine [11].
The Class I AARS catalytic domain is characterised by a five-stranded parallel Rossmann fold, and Class II by a six-stranded antiparallel sheet [8]. But while there are just two evolutionary superfamilies of catalytic domains (Classes I and II), there are several superfamilies of domains which recognise tRNA molecules [12], and these have a history of "hopping" between enzymes [9,13]. Indeed, these domains are often auxiliary to tRNA recognition elements found in the catalytic domain [14,15], which are specific to different families [13,16,17,18]. Due to the central role of the catalytic domain in recognising both amino acids and tRNA acceptor stems, and the comparatively fluid nature of tRNA anticodon recognition, we restrict our focus to the catalytic domains.
We combine information from both sequence and structure using a phylogenetic method within a Bayesian framework. To that end, we assembled a taxonomically representative dataset of AARS structural predictions to recover a "snapshot of the tree of life". We identified structural elements common to either class, and the insertion modules that characterise subclasses and families. These insertion modules define a succession of AARS catalytic domain families. This succession suggests a piecewise assembly of aminoacyl-tRNA synthetases through evolutionary time, and we show how the resulting model explains key aspects of genetic code evolution.

Families of Catalytic Domains
Catalytic domain sequences and structures were compared in order to identify AARS families.
Although available experimentally solved AARS structures are manifold, they are oftentimes incomplete, harbour solubility-enhancing mutations or truncations, and are far from a representative sample of the biosphere, as they tend to be sourced from organisms that are culturable or have medical or economic significance. To overcome these biases, we used AlphaFold to generate 420 taxonomically representative AARS structural models, which were structurally aligned so they could be used for phylogenetic inference. To assess the reliability of these structural models, we compared them with closely related solved structures (Fig. S3). These results indicated that the variation within experimentally solved structures of the same family was similar to the variation between experimental and AlphaFold structures of the same family (p > 0.1), suggesting that the models could be informative in comparative analysis.
We identified 33 families of AARS catalytic domains: 13 for Class I and 20 for Class II, which is 9 more than Perona 2012 [19]. Each family meets the following requirements. First, there is a minimum of four samples from four phyla, and where possible, up to eight bacterial phyla, four archaeal phyla, four eukaryotic phyla, and one viral phylum, plus two organellar (mitochondrial or chloroplast) samples from two distinct eukaryotic phyla. Second, all members of a family are predicted to display common aminoacylation activity based on their similarity to functionally characterised homologs. Third, each family is monophyletic, or monophyletic with a second family contained within it. Finally, in the event of a family containing a clade that can be further distinguished by an insertion or deletion of at least 50 amino acids, it was recursively split into two families, provided that both candidates meet these four requirements. The families are summarised in Table S1.
Families are identified with unique short names. In this notation, an AARS that is largely restricted to a certain taxonomy is suffixed accordingly: 'A' for archaeal-like, 'B' for bacterial-like, 'E' for eukaryote-like, and 'M' for mitochondrial-like. Most catalytic domain families are unique in their aminoacylation activity, with the following six exceptions.
1. The dual forms of LysRS: as anticipated, LysRS belongs to two families, LysRS-I and LysRS-II, one for each class [9].
2. The dual forms of LeuRS: an archaeal-like form LeuRS-A and a bacterial-like form LeuRS-B, where eukaryotic genomes encode either one. The two forms differ in the placement of the editing domain within the catalytic domain [20,21].
3. The dual forms of SerRS: the standard SerRS found in most organisms differs from the SerRS-A form found in certain archaea [22].
4. The three forms for ProRS: ProRS-A, ProRS-B, and ProRS-M [23], where ProRS-B is characterised by an editing domain within the catalytic domain, which is absent from ProRS-A and most members of ProRS-M.
5. The three forms for GlyRS: GlyRS-A, GlyRS-E, and GlyRS-B. The first two are dimeric, and the third exists as a heterotetramer. GlyRS-E is differentiated from GlyRS-A by the presence of an ~90 amino acid insertion.
6.
As such, the β chains are omitted from our main evolutionary model, but have been included in Fig. S2.

Phylogeny of Insertion Modules
We examined protein structures from the 33 families to identify features endemic to each class and the insertion modules found in specific families (Fig. 1). An insertion module (IM) is defined as a conserved structural element that is contiguous in sequence, with an average length of at least 30 amino acids in over half the members of a single family, or at least 10 amino acids but with a distinct IM nested within it. These length requirements improve the reliability of inferring homology among IMs, but they do mean that some conserved elements (such as the 1-2 short helices downstream of connecting peptide 1 in TrpRS and TyrRS) were not included in the analysis. Our search was confined to the catalytic domains; we did not consider IMs in editing or anticodon binding domains, for instance. If an editing domain was nested within the catalytic domain (as in ProRS and ValRS), we considered the domain as a single IM and did not dissect any IMs within it. Our analysis identified 15 modules for Class I and 20 for Class II (Table 1). The elements common to all members of each class are helices H1-H5 and strands S1-S5 for Class I, and helices H1-H3 and strands S1-S5 for Class II. The final Class II strand is immediately followed by a helix and hence denoted as SH1 (which contains motif 3 [25]). Some of these helices contain a one-residue interruption, such as a turn, and therefore can be regarded as kinked helices, for example H4 in IleRS.
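The length criteria in the IM definition above can be written as a small predicate. The following Python function is only an illustrative sketch, not the procedure used to build Table 1: the function name and arguments are hypothetical, and it assumes one particular reading of the definition, namely that the element must be present in over half of the sampled family members and that its average length is taken over the members that carry it.

```python
from statistics import mean

MIN_LENGTH = 30         # residues, from the definition in the text
MIN_LENGTH_NESTED = 10  # residues, when a distinct IM is nested inside


def qualifies_as_im(lengths, family_size, has_nested_im=False):
    """Return True if a candidate element meets the length criteria.

    lengths: residue counts of the element in each family member that
             carries it (members lacking it are simply not listed).
    family_size: total number of sampled members in the family.
    has_nested_im: True if a distinct IM is nested within the element.
    """
    if family_size <= 0 or not lengths:
        return False
    # Assumed reading: "over half the members" refers to presence.
    if len(lengths) * 2 <= family_size:
        return False
    threshold = MIN_LENGTH_NESTED if has_nested_im else MIN_LENGTH
    return mean(lengths) >= threshold
```

Conservation and sequence contiguity, which the definition also requires, are not checked here and would need to be established from the structural alignment.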
We developed a Bayesian phylogenetic method to integrate IMs with amino acid sequence data (see Methods). This model differs from standard sequence-based phylogenetic methods because it explicitly accounts for modular insertion and deletion. Under our prior distributions, IMs were assumed to appear and disappear at characteristic birth and death rates, which are considerably lower than the rates of amino acid substitution. The estimated birth/death rates were further informed by the data, which is evident when comparing the peaked posterior distributions with the flat and uninformed prior distributions of Fig. 2, and Class II was estimated to have a higher birth rate than Class I (consistent with its higher count of IMs). In most cases, when an extant protein was lacking an IM, it was explained as lack-of-birth, as opposed to deletion. But notably, a post-transfer editing domain (Editing II) appears to have been deleted from the mitochondrial ProRS after it diverged from the bacterial-like form. We examined four ProRS-M samples: three are predicted to localise to mitochondria and have lost the domain, while the last, in Candida albicans, is predicted to reside in the cytoplasm and has retained the domain (or perhaps lost and reacquired it through horizontal gene transfer). When the editing domain was lost, it left behind an evolutionary scar, in the form of the small cysteine-rich ProRS IM. ProRS-M is the only AARS family for which there is no experimentally solved structure. Further, these results also suggest the deletion of the LeuRS-B IM in two bacterial lineages (see supporting information).
[Figure caption, beginning missing: ... and the Cryptococcus neoformans GlyRS-E (generated by AlphaFold). The GlyRS-E IM is intrinsically disordered [18], and therefore its predicted structure above (green) may be one of the many conformations it adopts. It exists as an insertion nested within the β-hairpin found in most members of IIa (yellow).]
The catalytic domain phylogenies informed by both IM and amino acid data are presented in Fig. 3. These analyses support splitting off the LysRS-I, ArgRS, CysRS, and PylRS families into singleton subclasses Id, Ie, If, and IIe respectively, due to the absence of close relatives or uncertainty concerning placement in existing subclasses. The results provide a number of new insights. First, our placement of HisRS into IIc, as opposed to IIa, is incongruent with most studies [8,12,19,43]. Many of these studies placed HisRS into IIa because of its mode of tRNA binding via an anticodon binding domain, which is homologous with members of IIa. Here, however, we considered the phylogeny of the catalytic domain in isolation from other domains, and our results suggest that the anticodon binding domain of HisRS was likely borrowed from IIa. Ic and IIc alike are structurally simple, are not characterised by any IMs, and adenylate some of the larger aromatic amino acids. Second, we placed PylRS into its own subclass IIe, which is closely related to IIb, congruent with a previous sequence-based analysis [44]. However, a previous structural analysis placed it with IIc [45]. Given that PylRS has the same profile of IMs as IIc, the high structural similarity scores with these families are not unexpected. Third, our placement of CysRS, ArgRS, and LysRS-I into singleton subclasses is at odds with some prior studies, many of which consider the mode of tRNA recognition in their classifications [12,19,43]. The deep phylogenies describing relationships between subclasses are challenging to resolve, as reflected by the comparatively low levels of posterior support on internal nodes closer to the roots of Fig. 3.
[Figure caption, beginning missing: ... where we have assigned Pyl and Sep to phase II. Insertion modules are numbered using the key in Table 1. HIGH and KMSKS are the motifs of Class I, and M1-M3 are the Class II motifs 1-3 [25]. Loops may contain other secondary structures (see Fig. 1).]

Discussion
We describe a likely assembly of AARS catalytic domains, layer by layer throughout evolutionary history (Fig. 4). This model was generated using a Bayesian phylogenetic method that integrated information from amino acid substitutions with the presence or absence of insertion modules (Fig. 3). The phylogenetic method is open-source and is readily available for future use (see Methods). To begin our discussion, we first provide a brief overview of the origins of the Class I and II AARS. We then consider possible processes by which extant catalytic domains were assembled from small structural modules, which grew progressively on the surface of the protein, under principles similar to those described by Petrov et al. [46] for the accretion of RNA onto the ribosome. This process enabled discrimination between closely related amino acid side chains and tRNA molecules. Finally, we discuss the implications of these findings for the interconnected evolution of the genetic code and metabolism.

Inception of the AARS
One major theory on the origin of the AARS suggests the two AARS classes arose simultaneously as opposing strands of a bidirectional gene [7,47,48]. This hypothesis, initially proposed by Rodin and Ohno [47], has prompted a series of experimental investigations into the reconstructed ancestral forms of the two AARS classes. These earliest forms were likely small, low-specificity, molten globules, known as protozymes [48]. Although model protozymes from both classes have been experimentally investigated and found to exhibit adenylation activity [26,27], it is not clear how tRNA would have been aminoacylated or how the first protozyme genes originated. In extant proteins, the protozymic region contains the HIGH motif for Class I, and motif 2 for Class II [25]. However, it is unlikely that the histidine in the HIGH motif, or the arginine in motif 2, were part of the coding alphabet at this early stage [4,5,49]. The Class I protozyme would later be modified by a second crossover, leading to the Rossmann fold, and the Class II protozyme would expand into an antiparallel β-sheet, giving rise to the Class I and II urzymes, which have been shown to aminoacylate tRNA [14]. These expansions included the KMSKS motif in Class I and motif 1 in Class II, respectively [25]. The subsequent steps introduced nested insertions that distinguished the different AARS families and would have necessarily decoupled bidirectional coding into separate Class I and II genes.
The structures resulting from all of these later steps have no bearing on whether the urzymes of the two AARS classes have a common bidirectional origin.

Class I Assembly
The phylogeny of the Class I catalytic domain resembles a "caterpillar tree" with a central lineage providing the trunk from which extant enzymes emerged. This hierarchy of enzymic complexity, the result of gradual modular accretion, is reflected in the nearly linear progression from structurally simpler enzymes (TrpRS and TyrRS) to intermediate (ArgRS and GluRS) to more elaborate ones (ValRS and LeuRS).
Connecting peptide 1 (CP1) occurred early in Class I history, wrapping around the core like an exoskeleton [29]. Two lineages diverged from the central Class I AARS lineage: one giving rise to subclass Ic (TrpRS and TyrRS), and another giving rise to Id (LysRS-I) with an anticodon binding domain similar to GluRS [9,13]. However, it is unlikely that there was an abundance of tryptophan, tyrosine, or lysine until much later in the evolution of metabolism [5], suggesting that the genesis of Ic and Id may have occurred much later in time than Ia and Ib.
The C-terminus of CP1 was later modified by inserting the Z-fold, an antiparallel β-sheet consisting of three strands Z1, Z2, and Z3. ArgRS presents this Z-shaped module in its most primitive form (Ins-2, [30]), which appears unrelated to the β-rich insert found in LysRS-I. This β-sheet provided a platform for future additions nested between its three strands, notably a cysteine-rich zinc finger (ZF) at the end of Z1, and a short two-helix bundle (connecting peptide 2, CP2) at the end of Z2. These two modules characterise subclass Ia and contribute to aminoacylation [31,32,33]. However, the zinc-coordinating cysteine and histidine residues are not entirely conserved, and therefore the ZF region does not always bind zinc [21]. The arrival of these two modules coincided with the extension of Z1 and Z2 from around 4 to around 10 amino acids in length, such that the pair resembled a β-hairpin. A post-transfer editing domain provided the means to discriminate between amino acids with very similar side chains: leucine, isoleucine, and valine. Interestingly, this module occurs in two distinct positions: between CP2 and Z3 for LeuRS-B, and nested within the zinc finger for other enzymes [20,21]. It is unclear whether the domain originated in one of these two positions or elsewhere in the proteome.

Class II Assembly
The phylogeny of the Class II catalytic domain is much more balanced, or "tree-like", than that of Class I (Fig. 4). This can perhaps be attributed to the structural plasticity of its antiparallel β-sheet fold, which, much like the smaller antiparallel sheet Z of Class I, provided fertile ground for the rapid proliferation of insertion modules within the loops connecting consecutive strands. Many of these insertions were stabilised by the formation of an additional strand running parallel to the sheet's C-terminal edge (Table 1). Taken together, it appears that the Class II fold is more receptive to insertions than the Class I Rossmann fold.
Early in the history of Class II, a short loop, known as the small interface (SI) [37], emerged on the surface of the protein. The N-terminal region of SI works intimately with the active site through a range of distinct mechanisms, sequence signatures, and structures, and has been termed the flipping loop [36], the ordering loop [50], and the helical loop [51]. Together with a strand in motif 1, the C-terminal region of SI appears at the dimeric interface where it often forms a six-stranded antiparallel sheet across the two subunits (C2-C3 loop, [38]). This β-hairpin would later acquire nested insertions on three independent occasions: PheRS-A, SepRS, and SerRS-A. SI emerged only after the divergence of IId, whose members oligomerise through mechanisms quite distinct from the rest of the class: a coiled coil for AlaRS [34] and a three-helix bundle for the tetrameric GlyRS-B [52].

Expansion of the primordial genetic code
Elaboration of the successive insertion modules defining the AARS families has revealed a curious inversion. AARS for the simplest amino acids have, in general, accumulated more insertion modules. Examining Fig. 4, we observe that the catalytic domains of AARS that bind to phase II amino acids (as defined by Wong [5]: see below), which supposedly appeared later in the coding alphabet, have, on average, significantly fewer insertion modules than those for phase I (p < 0.01; Fig. S4). This inversion is most clearly illustrated by tryptophan and tyrosine, which may have been the last two amino acids to enter the coding alphabet [4], and yet their AARS did not diverge from those of the earlier canonical amino acids, such as valine or glutamate, as one might expect. Rather, the genesis of TrpRS and TyrRS is rooted deep within the Class I phylogeny (Fig. 3) and their catalytic domains are similar to the earliest ancestral structures (Fig. 4).
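A comparison of this kind, between the number of insertion modules per family for phase I and phase II amino acids, can be illustrated with a permutation test. The statistical test behind the reported p-value is not stated in this section, so the Python sketch below is an assumption about one reasonable way to make the comparison rather than the test actually used; the function name is hypothetical, and no module counts from Table 1 are reproduced here.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(phase1_counts, phase2_counts):
    """Exact one-sided permutation test on a difference in means.

    Null hypothesis: the phase labels are exchangeable. Returns the
    fraction of relabellings whose mean difference (phase I minus
    phase II) is at least as large as the observed difference.
    """
    pooled = list(phase1_counts) + list(phase2_counts)
    n1 = len(phase1_counts)
    n2 = len(pooled) - n1
    observed = mean(phase1_counts) - mean(phase2_counts)
    total = sum(pooled)
    hits = 0
    n_splits = 0
    # Enumerate every way of assigning n1 of the pooled values to phase I.
    for idx in combinations(range(len(pooled)), n1):
        s1 = sum(pooled[i] for i in idx)
        diff = s1 / n1 - (total - s1) / n2
        if diff >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
        n_splits += 1
    return hits / n_splits
```

Exhaustive enumeration is only practical for small samples; with a few dozen families, one would instead sample random relabellings.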
Two interrelated observations help explain the unexpected strength of this inversion. First, as Pauling [53] noted, simpler amino acid side chains are harder to select without error. Rejecting small, similarly-shaped side chains required the acquisition of insertions to modulate the basic specificity determinants and eventually facilitate editing of incorrectly activated or misacylated amino acids. More complex side chains increase the scale of differences, facilitating discrimination with fewer structural tweaks.
Second, Wong's coevolutionary model for genetic code expansion suggests a complementary inference. Wong [5] distinguished those amino acids produced in abundance through prebiotic chemistry or simple metabolic pathways as phase I amino acids. He proposed that these served as metabolic precursors for more complex phase II amino acids that required more extensive biosynthetic pathways, arriving at a delineation similar to Trifonov's consensus approach [4]. The earliest proteins were presumably synthesised from a limited pool of phase I amino acids using promiscuous AARS and an ambiguous genetic code. With time, the binding specificities of AARS sharpened by acquiring new modules, allowing them to sterically discriminate between closely related amino acid types. This then enriched the types of molecules available through more elaborate metabolic pathways, eventually producing the amino acids of phase II. These, in turn, became particularly valuable for catalysis (notably the side chains of histidine, arginine, lysine, cysteine, and tyrosine [54]). This reasoning recently gained experimental support from a demonstration that the histidine and lysine side chains in the Class I sequence motifs contributed little to catalysis, and were in fact inhibitory, in an ancestral model of the LeuRS-A urzyme which lacked CP1 [49].
Suppose that a novel amino acid type, X, were to emerge in abundance from a new metabolic pathway. A number of scenarios could follow. First, in the event that X was not recognised by existing AARS to any significant extent, its production would have no material impact on the genetic code. Second, were X to be recognised by existing AARS in a way that interfered with the protein synthetic machinery by perturbing its products, the production of X would be selected against, or perhaps there would be selection for AARS to preclude X. For instance, meta-tyrosine is a toxic amino acid which competes with phenylalanine during protein synthesis, leading to defective proteins, but PheRS catalyses the removal of mistargeted meta-tyrosine through its editing activity [55]. In the third case, a midpoint between these two extremes, suppose X was recognised by AARS in a non-disruptive manner, allowing it to gradually work its way into the genetic code. Once X had established itself as an essential metabolite, X and the metabolic pathways for its production would be selected for.
The least disruptive way to incorporate X into the genetic code would be through its recognition by a promiscuous, and perhaps low-activity, AARS, as opposed to one of the more specialised enzymes, which would have evolved more precise substrate recognition and enabled, for example, discrimination between leucine and isoleucine, or serine and threonine. Thus, the most fruitful place to find such an AARS would be among the ancient lineages, perhaps acquired by exchanging genetic material with a geographically isolated population at a different stage of evolution. From there, the specificity of X-tRNA synthetase could be refined by using the newly available phase II amino acids, and their advanced catalytic propensities [54].
This proposed mechanism is a variation on the epistatic ratchet observed in the evolution of specificity in steroid hormone receptors [56].
Placement of X into the genetic code would be determined by the anticodons of whatever tRNA molecules were recognised by the adapted X-tRNA synthetase. As demonstrated by the dynamic phylogeny of tRNA specificity [57], and the sheer number of AARS modules (Table 1) and domain superfamilies [12] involved in tRNA recognition, the interaction between tRNA and AARS has been fairly malleable. Thus, the fluid nature of the pairing between amino acids and anticodons would enable X to assume a place in the genetic code, while also optimising the code's robustness under the error minimisation principle [2].
As the code evolved, amino acid types competed for a place in the parliament of sixty-four seats. There are several routes which amino acid types have taken to enter the genetic code. First, there is subfunctionalisation [58], whereby a promiscuous AARS duplicates, and its daughters adapt to discriminate between the amino acids recognised by the parent. This mechanism has been suggested for the ancestor of IleRS and ValRS [59]. Second, through neofunctionalisation, a duplicate of an existing specialised AARS is co-opted to encode a new amino acid; this has been suggested for the ancestor of TrpRS and TyrRS [60]. Third, pretranslational modification enabled unstable amino acids (asparagine, glutamine, and selenocysteine) to enter the coding alphabet without the need for an AARS duplication event [5,11,12]. Lastly, as proposed here, the recruitment of ancient, unspecialised AARS lineages provided a fourth route. However, much like the third route, this process does not readily fit into the framework of specificity-refinement or functional gain among gene duplicates. Rather, it is a change in environmental condition (i.e., substrate availability) that enables an unfulfilled capacity (i.e., recognition of that substrate), dormant within the broader pool of AARS genes, to manifest as a novel biological function much later in time. In contrast to neofunctionalisation, the new function would emerge from a change in environment rather than a change in sequence, and in contrast to subfunctionalisation, the drive for specialisation would not exist until its function was activated. This process of parafunctionalisation may have been the point of entry for tryptophan, tyrosine, arginine, histidine, phenylalanine, pyrrolysine, cysteine, and methionine, all of which most likely entered the genetic code quite late, and yet their cognate AARS often have comparatively primitive catalytic domains. Further consideration of the mode of operation and detailed effects of this mechanism may help resolve the order in which amino acids entered the code, irrespective of which AARS class encodes them, and may also prove useful in attempts to expand the repertoire of the code.

Conclusion
Many efforts to root the origin of the genetic code in a hypothetical RNA world downplay the role of the AARS, the only enzymes known to have operated the code since its inception. AARS phylogeny suggests that the chemical logic of the code was shaped simultaneously by an evolutionary pressure to refine AARS specificities, that is, the ability to discriminate between amino acids with similar side chains, and a pressure to expand the coding alphabet by recognising amino acids produced through emergent biosynthetic pathways. Unexpectedly, the complexity of an amino acid side chain is inversely related to that of its enzyme's modular structure (Fig. 4, S4). This inversion suggests that nature crafted specific enzymes for new, more specialised amino acids from the reservoir of relatively non-specific ancestral AARS, which served as blank canvases for expanding the coding alphabet. Following adaptation to the introduction of a new amino acid, the entrenchment of orthogonality (exclusivity in AARS-tRNA pair recognition) gives the code the appearance of being a "frozen accident" [6]. Widely known regularities in the coding table on which the error minimisation theory is founded [2] seem to have arisen from the coevolution of the coding table with the concurrent elaboration of metabolic pathways for more specialised amino acid side chains, as advocated by Wong [5].
Increasingly precise genetic coding can only have coevolved with enhanced control over biochemical pathways. The process of parafunctionalisation is distinct from the three previously observed mechanisms by which AARS lineages would differentiate: subfunctionalisation, neofunctionalisation, and pretranslational modification. Recognising the role of parafunctionalisation will be especially important in future efforts to characterise ancestral Class I and II aminoacyl-tRNA synthetases.

Limitations and Assumptions
These methods and results have limitations. First, the structures generated by AlphaFold [61] are merely predictions and are no match for experimentally determined structures [62]. Although the reliability of these predictions benefits from an abundance of close relatives in the Protein Data Bank, they may also induce reference biases which obscure true deviations between structures. Second, our evolutionary model assumes that the AARS started as small structures which grew in complexity through time. Insertions are therefore assumed to be more afterwards.

Bayesian phylogenetic inference
All phylogenetic analyses were performed using BEAST v2.7.3 [68]. Two independent Markov chain Monte Carlo (MCMC) chains were run for each class, and their convergence was assessed by confirming that their effective sample sizes were over 200 using Tracer v1.7 [69]. Trees were summarised using the maximum clade credibility tree [70] and visualised using UglyTrees [?].
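The effective sample size criterion can be illustrated with a minimal sketch. This is not the estimator implemented in Tracer; it is a simplified autocorrelation-based approximation, and the function names `effective_sample_size` and `has_converged` are hypothetical, introduced here only for illustration. It assumes a single trace of sampled parameter values with burn-in already removed.

```python
import random


def effective_sample_size(samples):
    """Approximate ESS as n / (1 + 2 * sum of positive-lag autocorrelations).

    The sum is truncated at the first non-positive autocorrelation, a common
    simplification (an assumption here, not Tracer's exact procedure).
    """
    n = len(samples)
    mean = sum(samples) / n
    centred = [x - mean for x in samples]
    var = sum(c * c for c in centred) / n
    if var == 0:
        # A constant trace carries no information about mixing.
        return 0.0
    rho_sum = 0.0
    for lag in range(1, n):
        acov = sum(centred[i] * centred[i + lag] for i in range(n - lag)) / n
        rho = acov / var
        if rho <= 0:
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)


def has_converged(samples, threshold=200):
    """Apply the ESS > 200 rule of thumb described in the text."""
    return effective_sample_size(samples) > threshold


if __name__ == "__main__":
    random.seed(0)
    well_mixed = [random.gauss(0, 1) for _ in range(2000)]
    # Hypothetical poorly mixing trace: the chain sits in one state for
    # 100 steps at a time before switching.
    sticky = [(i // 100) % 2 for i in range(2000)]
    print("well mixed:", has_converged(well_mixed))
    print("sticky:", has_converged(sticky))
```

In practice the check would be applied to every logged parameter of each chain, and the two independent chains would also be compared against each other.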

Insertion-deletion Dollo model
This Bayesian phylogenetic model has two components. First, IM evolution is modelled as a birth process, followed by either loss (a death event) or retention by extant taxa, following a stochastic Dollo process [76]. This approach distinguishes between IMs lost from ancestral proteins and those never present, and assumes that all forms of an IM are homologs of a common ancestor, thus requiring careful identification of IMs. Second, the amino acid sequence evolves down the tree originating at the birth event using established substitution models for protein evolution [72]. These module phylogenies are constrained within a family phylogeny, analogous to the multispecies coalescent model [77]. All parameters, including trees, IM birth and death rates, and amino acid substitution parameters, are jointly inferred within a Bayesian framework, allowing for hypothesis testing and quantification of Bayesian posterior support.
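The first component can be illustrated with a toy forward simulation of a Dollo-type process. This is a sketch under simplifying assumptions, not the inference model used in BEAST: the tree, the names `TREE`, `LEAVES`, `children` and `simulate_dollo`, and the placement of the birth event exactly at a node (rather than along a branch) are all hypothetical choices made for illustration.

```python
import math
import random

# Hypothetical toy tree: node -> (parent, branch length leading to the node).
TREE = {
    "root": (None, 0.0),
    "A": ("root", 1.0),
    "B": ("root", 1.0),
    "a1": ("A", 0.5),
    "a2": ("A", 0.5),
    "b1": ("B", 0.5),
    "b2": ("B", 0.5),
}
LEAVES = ["a1", "a2", "b1", "b2"]


def children(tree, node):
    """Return the direct descendants of a node."""
    return [n for n, (parent, _) in tree.items() if parent == node]


def simulate_dollo(tree, leaves, birth_node, death_rate, rng):
    """Simulate presence (1) or absence (0) of one module at the leaves.

    The module is born once, at birth_node. Along each descendant branch it
    survives with probability exp(-death_rate * length). Once lost it is
    never regained, and taxa outside the birth subtree never carry it.
    """
    state = {birth_node: True}
    stack = [birth_node]
    while stack:
        node = stack.pop()
        for child in children(tree, node):
            length = tree[child][1]
            survives = state[node] and rng.random() < math.exp(-death_rate * length)
            state[child] = survives
            stack.append(child)
    return {leaf: int(state.get(leaf, False)) for leaf in leaves}


if __name__ == "__main__":
    rng = random.Random(1)
    # With no death, every descendant of the birth node retains the module.
    print(simulate_dollo(TREE, LEAVES, "A", 0.0, rng))
    # With a moderate death rate, some descendants may have lost it.
    print(simulate_dollo(TREE, LEAVES, "root", 0.5, rng))
```

The resulting 0/1 pattern distinguishes the two kinds of absence described above: a taxon inside the birth subtree that shows 0 has lost the module, whereas a taxon outside it never had the module.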
The posterior density of this model is expressed in Equation 1, where the protein tree g is constrained within the protein family tree S. The insertion module data is represented in a binary form, where M_{i,j} = 1 if taxon j has module i (i = 1, 2, ..., k), and 0 otherwise. Taxon j has amino acid sequence D_{i,j} if and only if M_{i,j} = 1, whose sites are assumed to evolve independently down tree g under a continuous-time Markov process [78]. The stochastic Dollo model (the module likelihood) assumes that all 1's are homologous and were derived from a common birth event, such that loss of the module is irreversible [76]. Each node of the family tree S describes a population of modules which belong to the same family, constituting a tree prior distribution governing how module lineages coalesce within each population of families [77] with effective population size N_e, estimated per branch. The estimated model parameters θ include a pure-birth protein tree diversification rate and a module birth and death rate (all relative to the amino acid substitution rate, which is fixed at 1), as well as the vector N_e and other parameters pertaining to the OBAMA substitution model [72] and family tree relaxed clock

Fig. 1: Multiple sequence alignment of Class I (top) and Class II (bottom) catalytic domains. One AlphaFold-generated representative was randomly selected from each family, provided that the reference structure contained all of the insertion modules which characterise the family. Helices are depicted by blue cylinders; β-strands by yellow arrows; all other secondary structural elements by black lines; and multiple sequence alignment gaps are left blank. For simplicity, when an extended helix or strand is interrupted by a single secondary structural element (such as a turn or a bend), that element is omitted from the diagram.

Fig. 2: Prior and posterior distributions of birth and death rates of IMs, relative to amino acid substitution rate. Protein structures are the catalytic domains of the Thermus thermophilus LeuRS-B (PDB: 2V0C [42]) and the Cryptococcus neoformans GlyRS-E (generated by AlphaFold). The GlyRS-E IM is intrinsically disordered [18], and therefore its predicted structure above (green) may be one of the many conformations it adopts. It exists as an insertion nested within the β-hairpin found in most members of IIa (yellow).

Fig. 3: Phylogenies of Class I (top) and II (bottom) catalytic domains. A selection of module trees is shown in blue, with red lineages depicting deletions. Family tree (grey) internal nodes are labelled by clade posterior support. The y-axes depict the rate of change (amino acid substitutions per site and births/deaths per module, weighted according to their instantaneous rates, see Supporting information), in contrast to the phylogenies in Fig. S1-S2, which are expressed in substitutions per site and show similar heights for Class I and II trees. The remaining insertion modules, omitted from this diagram, are shown in Supporting information.

Fig. 4: AARS accretion model. Branching off from the central black-and-white ancestral lineages into extant proteins could have occurred at any time, and hence arrows do not denote the passage of time, but rather evolutionary relationships. The temporal component of this figure is depicted by the phase I and II amino acids, as identified by Wong 2005 [5], where we have assigned Pyl and Sep to phase II. Insertion modules are numbered using the key in Table 1. HIGH and KMSKS are the motifs of Class I, and M1-M3 are the Class II motifs 1-3 [25]. Loops may contain other secondary structures (see Fig. 1).

Table 1: Summary of modules and their proposed functional roles. Modules in bold font are ancestral catalytic domains, and those in standard font are insertions. Module length ranges are 95% credible intervals across all AlphaFold-generated structures. † These elements contain a strand which runs parallel to the N-terminal edge of the Rossmann fold (Class I) or the C-terminal edge of the β-sheet (Class II). * Universal urzyme structures were constructed from aligned helices and strands, excluding loops, so these values underestimate the expected lengths (∼130 aa).