Elucidation of Codon Usage Signatures across the Domains of Life

Abstract Because of the degeneracy of the genetic code, multiple codons are translated into the same amino acid. Despite being “synonymous,” these codons are not equally used. Selective pressures are thought to drive the choice among synonymous codons within a genome, while GC content, which is typically attributed to mutational drift, is the major determinant of variation across species. Here, we find that in addition to GC content, interspecies codon usage signatures can also be detected. More specifically, we show that a single amino acid, arginine, is the major contributor to codon usage bias differences across domains of life. We then exploit this finding and show that domain-specific codon bias signatures can be used to classify a given sequence into its corresponding domain of life with high accuracy. We then wondered whether the inclusion of codon usage codon autocorrelation patterns, which reflects the nonrandom distribution of codon occurrences throughout a transcript, might improve the classification performance of our algorithm. However, we find that autocorrelation patterns are not domain-specific, and surprisingly, are unrelated to tRNA reusage, in contrast to previous reports. Instead, our results suggest that codon autocorrelation patterns are a by-product of codon optimality throughout a sequence, where highly expressed genes display autocorrelated “optimal” codons, whereas lowly expressed genes display autocorrelated “nonoptimal” codons.

Although each species has a preference toward a specific subset of codons (Hershberg and Petrov 2009;Plotkin and Kudla 2011), the origin and evolutionary pressures driving these preferences remains largely unknown. Codon usage variation within genomes (intraspecies codon usage) is often attributed to selection, due to the significant positive correlation between protein expression levels and the presence of "preferred" or "optimal" codons (Sharp et al. 1986;Duret and Mouchiroud 1999). Indeed, both in Bacteria and Eukarya, codon bias is more extreme in highly expressed genes to match the skew of tRNA gene pools, providing a fitness advantage due to increased efficiency and/or accuracy in protein synthesis (Bulmer 1991;Akashi 1994;Dong et al. 1996;Duret 2000;Gingold and Pilpel 2011). In contrast, the processes that drive codon usage variation across genomes (interspecies codon usage) are generally thought to be mutational (Chen et al. 2004;Hershberg and Petrov 2008;Sharp et al. 2010), although the extent to which these processes are driven by mutation, selection, or biased gene conversion remains controversial (Lassalle et al. 2015;Long et al. 2018). Genomic GC content has been identified as the strongest determinant of codon usage variation across species (Knight et al. 2001;Palidwor et al. 2010). Consequently, GCrich organisms tend to favor GC-rich codons whereas AT-rich organisms are enriched in AT-rich codons (Hershberg and Petrov 2009). tRNA gene content and codon usage bias are thought to coevolve (Dong et al. 1996;Yona et al. 2013) as a means to modulate translation speed for accurate cotranslational Article ß The Author(s) 2019. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons. org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Open Access protein folding (Komar 2009;Yu et al. 2015). Indeed, tRNA deletions in Saccharomyces cerevisiae are recurrently corrected, with the anticodon of a second tRNA mutated to match that of the deleted tRNA (Yona et al. 2013). Supporting this, species belonging to the same domain of life (Archaea, Bacteria, Eukarya) were found to have evolved similarly in terms of their tRNA gene contents and decoding strategies , despite large differences in GC content. In the light of these observations, here we hypothesize and test whether species from the same domain of life may display similar codon usage biases despite their differences in GC content.
The elucidation of which evolutionary pressures have shaped extant genomes is crucial to comprehending why and how genomes evolve, but also can be exploited to build algorithms that can taxonomically annotate any given genomic sequence based on its properties. In this regard, nextgeneration sequencing has provided a great opportunity to explore complex ecological systems, such as microbiomes from the human gut or environmental samples. However, these samples often include a significant portion of uncharacterized species, and consequently, assigning sequence scaffolds to individual species, or even to higher-level taxa, remains challenging. Metagenomic annotation solutions, also known as "binning," often rely on similarity-based searches (Brady and Salzberg 2009;Gerlach and Stoye 2011;Huson et al. 2016). Unfortunately, such homology-based methods are unable to correctly annotate a significant portion of sequences (Prakash and Taylor 2012), such as those that are taxonomically restricted, or do not have detectable homologues in other lineages. De novo taxonomical predictions, independent of homology, can be extremely useful for these situations.
Here, we show that after removing variation associated with GC content, species from the same domain share similar codon bias signature, and identify that the codon usage bias of a single amino acid, arginine, is largely responsible for the separation of the species into their corresponding domains. We then show that coding sequences (CDS) can be correctly classified into their corresponding domain of life, with an accuracy of 85%, using exclusively their codon usage biases. We speculate that domain-specific preferences for arginine codons are related to translation speed, which would support the view that codon usage variation across genomes is shaped not only by mutational biases, but also by selective forces.

Beyond GC Content, Codon Usage Bias Shows Domain-Specific Patterns
The nonuniform usage of synonymous codons in a given sequence or genome can be measured as relative synonymous codon usage (RSCU), which is defined as the ratio of the observed frequency of codons to the expected frequency. In other words, the RSCU represents the deviation of the observed codon usage from a uniform distribution in which all codons encoding for the same amino acid have the same probability (Sharp et al. 1986). Therefore, the codon usage bias of each species can be represented by a 59-dimensional RSCU vector, where each element of the vector is the RSCU of an individual codon (Trp, Met, and stop codons are excluded). Upon hierarchical clustering of species based on their average RSCU, we find that species do not cluster following the tree of life, but rather, based on GC content, suggesting GC content is the major determinant of codon usage bias across species ( fig. 1), in agreement with previous works (Hershberg and Petrov 2009).
To deconvolute the bias related to GC content from that caused by codon usage, we applied principal component analysis (PCA) to the average RSCU of each analyzed species (see Materials and Methods), with the expectation that the first principal component (PC1) would capture the variance due to GC content. Indeed, we find that PC1 does not separate the species into domains ( fig We then wondered whether the arginine codon bias was sufficient to cluster species into their corresponding domains. To test this, we performed a new PCA analysis, this time using exclusively arginine codon biases, and finding that the differences in arginine codon usage across species alone were sufficient to recapitulate the clustering of species into domains ( fig. 2D). Therefore, we conclude that arginine codon usage bias is a major contributor to interspecies codon usage bias across domains.

The Usage of Arginine Codons Does Not Significantly Change across Highly and Lowly Expressed Proteins
Within a given species, the usage of individual codons across its genes is not uniform. Specifically, highly expressed proteins tend to be more enriched in "optimal" of "preferred" codons, compared with proteins with lower expression (Bulmer 1987;Duret 2000;Higgs and Ran 2008;McDonald et al. 2015). Therefore, considering this variation within-species, we wondered whether our findings would be applicable at the level of individual sequences. For this aim, we individually analyzed all sequences from the 1,625 EMBLCDS species from all three Codon Usage Signatures and Autocorrelation . doi:10.1093/molbev/msz124 MBE domains (see Materials and Methods). We computed RSCU values for each sequence, performed PCA dimensionality reduction, and retained the first three principal components for further analysis, based on Scree's test (Cattell 1966). We found that individual sequences also clustered by domain, suggesting that intraspecies codon usage variation is not larger than interspecies codon usage variation ( fig. 3).
Surprised by these results, we hypothesized that although global codon usage may strongly vary between highly and lowly expressed genes (Bulmer 1987;Duret 2000;Higgs and Ran 2008;McDonald et al. 2015), this might not be the case for all amino acids, such as arginine. To test this, we examined the codon usage of all CDS of S. cerevisiae, and determined how codon usage varied with protein expression-using previously published proteomics data sets (Newman et al. 2006)-for each individual amino acid and codon subtype ( fig. 4). As expected, we observed that codon usage drastically varied depending on protein abundance. For the majority of amino acids, codon preferences completely switch from "nonoptimal" to "optimal" depending on expression level (Lys, Asn, His, Phe, Asp, Tyr, Val, Ser, Ile)-here considering that a codon preference "switch" occurs if the most frequently used codon in lowly expressed proteins differs from the one that is most frequently used in highly expressed proteins. In other cases, however, codon preferences are maintained-although RSCU values may vary-when comparing lowly and highly expressed proteins (Gln, Glu, Cys, Gly, Arg). The relative consistency of arginine codon usage preferences across a genome might explain why our analyses can be applied not only at an average per-species level, but also at the level of individual sequences. Similar results were observed when the same analysis was performed on Escherichia coli, where codon preferences of certain amino acids switch from "nonoptimal" to "optimal" (His, Phe, Asp, Tyr, Ala, Gly, Val, Ser, Ile), whereas in others the same codon is preferentially used in both highly and lowly expressed proteins (Lys, Asn, Gln, Glu, Cys, Thr, Pro, Leu, Arg; supplemen-

Codon Usage Signatures Can Be Used to Taxonomically Annotate Sequences into Their Corresponding Domains of Life
We then wondered whether simple patterns of codon usage bias would be sufficient to classify a species into its corresponding domain of life. To test this, we used the previously built 59-dimensional RSCU vectors for each EMBLCDS sequence, subdivided the data into training and testing sets, and built a Support Vector Machine (SVM) with the training set data ( fig. 5A). We find that codon usage bias alone predicts the correct domain with AUC values ranging from 0.78 to 0.84 ( fig. 5B). The accuracy of prediction is dependent on the We then investigated whether this approach could be used to classify species into their corresponding phylum, and not just domain of life. For this aim, we retrieved the corresponding phylum for each of the species included in the analysis using NCBI Taxonomy, and trained three new SVMs for archaeal, bacterial and eukaryal EMBLCDS sequences, respectively (supplementary table S2, Supplementary Material online). We find that our methodology can identify the correct phylum with an overall accuracy of 0.359 (Bacteria), 0.589 (Eukarya), and 0.901 (Archaea) (supplementary table S3, Supplementary Material online). Overall, we find that sequences belonging to specific phylums, such as Fusobacteria or Chlamydiae, can be classified with reasonable accuracy (0.84 and 0.90, for Fusobacteria and Chlamydiae, respectively); however, in the majority of the cases, codon usage bias seems to be insufficient to accurately classify a

Codon Autocorrelation Does Not Reflect tRNA Reuse
Previous studies in yeast have shown that once a particular codon has been used, subsequent occurrences of the same amino acid in the same transcript are not random (Cannarozzi et al. 2010), a phenomenon termed as "codon autocorrelation" or "codon covariation." Mechanistically, it was argued that tRNA recycling was the driving force causing the observed biased distribution of synonymous codons along a sequence, that is, codons that would reuse the same tRNA would be favored as a means to increase the speed of translation (Cannarozzi et al. 2010). A subsequent study re-examined this question, and compared the autocorrelation between codons encoding the same amino acids to those encoding different ones (Hussmann and Press 2014). Intriguingly, this second study found that covariation between codons encoding different amino acids was as strong as covariation between codons encoding the same amino acid, concluding that there was insufficient evidence to claim that tRNA recycling is the force driving codon autocorrelation. Despite the uncertain cause of codon covariation, both studies show that the probability of observing a specific codon is dependent on previous codon occurrences, at least in the case of S. cerevisiae. Considering that species from the same domain of life share common codon usage signatures (figs. 2 and 3), we wondered whether codon covariation would also follow a similar behavior. For this aim, we calculated codon covariation as described in Cannarozzi et al. (2010) (see Materials and Methods), finding that codon covariation within same amino acids in S. cerevisiae partly supports a tRNA recycling model ( fig. 6A). For example, in the case of alanine codons, GCA and GCG show covariation, and are both decoded by tRNA Ala UGC . Similarly, GCC and GCT are decoded by tRNA Ala AGC and also show covariation.
In contrast, the covariation observed in other species was observed between codons that were decoded by different tRNAs (fig. 6B, see also supplementary fig. S4, Supplementary Material online). Taking as an example the same amino acid, alanine, covariation in human sequences was detected between GCA and GCT codons, which are decoded by two different tRNAs, tRNA Ala UGC and tRNA Ala AGC , as well as between GCG and GCC codons, despite being decoded by two different tRNAs, tRNA Ala CGC and tRNA Ala Overall, our results suggest that codon covariation is unrelated to tRNA reusage, in agreement with the second study described above (Hussmann and Press 2014).

Codon Autocorrelation Reflects Global Sequence Codon Optimality
We then wondered whether codon covariations may be in fact a simple consequence of codon optimality throughout a sequence, that is, whether "optimal" codons, which are abundant in highly expressed proteins, would appear as autocorrelated, and "nonoptimal" codons, which are abundant in lowly expressed proteins, would also appear as autocorrelated. To test this, we compared the observed codon covariations in S. cerevisiae and E. coli with the set of "optimal" and "nonoptimal" codons, defined as those that were highly or lowly abundant in highly expressed proteins, respectively ( fig. 7). We find that after binning codons into "optimal" and "nonoptimal," codon covariations were present within "optimal" or within "nonoptimal" codons, but not across them. More specifically, 97% (31/32) of the autocorrelated codon pairs (SD ! 3) in S. cerevisiae ( fig. 7) and 86% (25/29) of the autocorrelated pairs (SD ! 3) in E. coli (supplementary fig. S5, Supplementary Material online) could be explained by codon optimality. It is important to note that the remaining autocorrelated codon pairs in E. coli (4/29) and in S. cerevisiae (1/32) actually correspond to codons which we labeled as "intermediate optimal" (yellow boxes), for which we considered that we could not clearly assign the category of "optimal" or "nonoptimal," and thus were not counted as positive results. Overall, our results suggest that, at least in the case of S. cerevisiae and E. coli, codon autocorrelation may be a consequence of similar choice of "optimal" or "nonoptimal" codons throughout a sequence.

Codon Autocorrelation Is Not Domain-Specific
Regardless of the evolutionary forces driving codon covariation, our analysis demonstrates that codon covariation exists in all species analyzed ( fig. 6 and supplementary fig. S4, Supplementary Material online). Therefore, we extended Our results show that taxonomically related species display similar codon variation, however, multiple clusters appear within each domain of life, suggesting that codon covariation is not domain-specific ( fig. 8). Nevertheless, we do observe that some species belonging to the same domain cluster together, suggesting that codon covariation is not completely independent of their taxonomy. We then performed PCA analysis on the RSCPU values to test whether additional principal components might better separate the species into their corresponding domains; however, we find that species belonging to different domains largely overlap (supplementary fig. S6, Supplementary Material online). Overall, our results suggest that codon covariation patterns are not domainspecific, but they do show certain degree of clustering which is dependent on their taxonomy. Therefore, we suggest that codon covariation signatures may be used as additional features to improve the performance of current binning algorithms, but cannot be used alone to classify species into their corresponding domains of life.

Discussion
It is well established that the identity of favored codons varies among organisms (Chen et al. 2004;Hershberg and Petrov 2009;Sharp et al. 2010). However, the rules governing the identities of favored codons across organisms still remain obscure. High GC content organisms tend to have GC-rich favored codons, whereas low GC content organisms favor AT-rich codons, suggesting evolutionary pressures act in the same direction as the nucleotide substitution biases that determine overall nucleotide content of genomes (Hershberg and Petrov 2009). On the basis of these observations, previous works suggested that interspecies codon bias is driven mostly by genome-wide mutational biases. Here we suggest that, in addition to mutational forces, interspecies codon bias might also be shaped by selective forces.
Several studies have reported codon translation rates for a wide variety of species and conditions (Li et al. 2012;Gardin et al. 2014;Lareau et al. 2014;Bazzini et al. 2016). In a scenario of high abundance of nutrients, arginine codons have been recurrently identified as the slowest translated codons, both in Bacteria (Bonekamp and Jensen 1988;Chevance et al. 2014) and Eukarya (Charneski and Hurst 2013;Gardin et al. 2014;Requiao et al. 2016). However, within a species, not all arginine codons display slow translation rates. For example, in a bacterial system based on Salmonella enterica, CGC codons are rapidly translated, whereas AGG codons, which also encode for arginine, are slowly translated (Chevance et al. 2014). In the eukaryote S. cerevisiae, codons that are rapidly and slowly translated differ from those identified in bacteria. More specifically, CGC is slowly translated in S. cerevisiae, whilst AGA and CGU are rapidly translated (Gardin et al. 2014), matching the codon preferences of S. cerevisiae ( fig. 4). Moreover, in the case of Bacteria, the variance in translation speed between codons that encode for arginine is larger than for any other amino acid, and it is also high in the case of Eukarya (supplementary fig. S7A, Supplementary Material online). In the light of these observations, we suggest that selective pressures toward maintaining specific arginine codon usage biases, compared with other amino acids, might be responsible for the existence of domain-specific arginine codon usage biases.
A remaining question, however, is why such extreme variance in translation speed of arginine codons exists. It has In agreement with this hypothesis, Shine-Dalgarno-like sequences are typically depleted in bacterial genomes (Li et al. 2012). However, Shine-Dalgarno sequences are also employed by archaeal genomes to help recruit the ribosome to initiate protein synthesis, and in this domain, AGG codons-together with AGA codons-are in fact the most frequently used codons. Therefore, it is unlikely that the similarity to Shine-Dalgarno sequences alone is responsible for the depletion of AGG codons in Bacteria. An alternate-but not mutually exclusive-explanation for this phenomenon might reside in the differences between arginine tRNA decoding strategies used in Archaea and Bacteria ). More specifically, in Bacteria, the appearance of tRNA adenosine deaminases (tadA), which allows for efficient translation of CGC and CGU codons, might explain why these two codons are more frequently used and rapidly translated in Bacteria (supplementary fig. S7C, Supplementary Material online). Moreover, tadA does not exist in Archaea, and thus it may explain why archaeal genomes do not preferentially use CGN codons. Future work will be needed to decipher which are the forces causing distinct arginine codon usage bias across different domains of life.

A B
FIG. 6. Analysis of codon covariation across species does not support a universal tRNA recycling model. Codon covariation measured over all pairs comprised of one codon and the subsequent one encoding for the same amino acid, shown for Saccharomyces cerevisiae (A) and Homo sapiens (B). Values correspond to standard deviations from expected. Each codon has been labeled with its corresponding decoding tRNA, following parsimony-extended wobble rules when no Watson-Crick matching tRNA isoacceptor is available (as per gtRNAdb, Chan and Lowe 2016). Pairs have been shaded according to the number of standard deviations from expected: dark gray (>þ3SD; strongly favored codon pair), light gray (0-3SD; slightly favored codon pair), white ( 0 SD; nonfavored codon pair). In yeast, most codon pairs using the same tRNA are overrepresented, supporting a tRNA recycling model to explain the overrepresentation, but that is not true in other species. See also supplementary figure S4, Supplementary Material online for similar codon covariation analyses for Escherichia coli and Plasmodium falciparum.
Codon Usage Signatures and Autocorrelation . doi:10.1093/molbev/msz124 MBE Regardless of the evolutionary forces driving domainspecific codon preferences, here we find that codon usage bias of individual sequences can be used to taxonomically annotate them at the domain level ( fig. 5), largely due to differences in arginine codon usage ( fig. 2 and supplementary table S1, Supplementary Material online). To improve the performance of the algorithm, we wondered whether we could also include codon covariation, which is a welldocumented phenomenon in yeast (Cannarozzi et al. 2010;Hussmann and Press 2014). We find complex covariations of codon pairs are present in all analyzed species ( fig. 6 and   supplementary fig. S4, Supplementary Material online), which do not seem to be related to tRNA recycling, but rather, to similar codon optimality throughout a sequence ( fig. 7  Codon covariation is likely a consequence of the co-occurrence of "optimal" and "nonoptimal" codons, in highly and lowly expressed proteins, respectively. Codon covariation for Saccharomyces cerevisiae as depicted in figure 6, highlighting those pairs that are formed by two optimal codons (dark green), two nonoptimal codons (red), and codons with intermediate optimality (yellow). Optimal and nonoptimal codons have been defined as those that are highly abundant and lowly abundant in highly expressed proteins, and their relative abundance is shown for each individual amino acid and codon. See also supplementary figure S5, Supplementary Material online for the analysis for all amino acids S. cerevisiae, as well as for same analysis performed in Escherichia coli. Novoa et al. . doi:10.1093/molbev/msz124 MBE andLiao 2016;Lu et al. 2017). These methods exploit the uniqueness of base composition-from single to oligonucleotide levels-found across the genomes of different taxonomic entities, and have been implemented in tools such as PhyloPhythia (McHardy et al. 2007), TETRA (Teeling et al. 2004), and TACOA (Diaz et al. 2009), amongst many others. However, the underlying evolutionary principles governing the observed k-mer variation across species remains largely unknown. Consequently, it may be difficult to improve such algorithms without a working hypothesis of which additional variables might affect performance. Here, we find that taxonomically related species display common covariation patterns, although these patterns are not domain-specific. In the light of these findings, we propose that in addition to considering features such as overall k-mer counts (Teeling et al. 2004;Lu et al. 2017), codon covariation may be used to further improve the performance of current composition-based metagenomics binning algorithms.

Gene Sequences
The full set of EMBL CDS sequences was downloaded in July 2015 from ftp://ftp.ebi.ac.uk/. Species were clustered by its corresponding species and strain or subspecies (when available). To avoid overrepresented subsets of sequences (e.g., housekeeping genes from organisms for which only few sequences are included in the data set), species for which only few sequences were available were discarded. More specifically, species for which there were >1,000 genes in the EMBL CDS data set for that same species were selected for further analysis. The final set of EMBL CDS sequences analyzed consisted in 1,625 genomes, which included over 14 million sequences. The distribution of sequences per phylum can be found in supplementary table S2, Supplementary Material online.

Training and Test Set Preparation
The filtered set of EMBL CDS sequences was divided into training (10%) and test set (90%). Each training set sequence was individually analyzed in terms of codon usage bias, and converted into a 59-element vector (one for each codon, excluding Met, Trp, and stop codons) of RSCU values. RSCU was computed as defined in Sharp et al. (1986). PCA was applied on the training set matrix of RSCU values to reduce the dimensionality of the data. Scree's test (Cattell 1966) was used to determine the number of significant principal components to be retained. PCA scores of the selected subset of principal components were used as new vectors to define each sequence. Each sequence was assigned to a domain (Eukarya, Bacteria, Archaea) based on its taxonomical annotation in NCBI taxonomy.
Machine Learning PCA scores of the EMBL CDS training set with its corresponding domain annotations were used as input to train a SVM using the e1071 library from R, using a C-classification method and a class.weights vector to compensate for asymmetric class sizes. Parameters were optimized using the tune.svm function. The final SVM model was validated using 5-fold cross-validation. EMBL CDS test set sequences were converted into PCA scores by applying the same PCA loadings that were generated upon PC analysis on the training set, and its corresponding domains were predicted using the SVM model built on the training set. Our model correctly predicted the domain of the EMBL CDS test set sequences with an overall accuracy of 85%.

Codon Autocorrelation
Codon autocorrelation standard deviations were computed as described in Cannarozzi et al. (2010). Briefly, for each sequence, the number of consecutive pairs of codons for a same amino acid were counted. The expected numbers of pairs were computed as the products of the frequencies of the individual codons. Codons were Z-transformed by Codon Usage Signatures and Autocorrelation . doi:10.1093/molbev/msz124 MBE subtracting the expected counts from the observed and divided by the standard deviations from the expected value. The results were expressed as standard deviations from the expected value. Code for computing codon autocorrelation can be found in https://github.com/enovoa/codon Autocorrelation.

RSCPU
We define the RSCPU as the ratio of the observed frequency (f obs_pair ) of a given codon pair to the expected frequency of the codon pair (f exp_pair ) (eq. 1). The expected frequency of the codon pair is defined as the product of the individual codon frequencies (f obs_codon1 and f obs_codon2 ) observed in the genome (eq. 2).
RSCPU ¼ f obs pair =f exp pair (eq. 1) f exp pair ¼ f obs codon1 Ã f obs codon2 (eq. 2) Therefore, if the observed frequency matches the expected frequency, the RSCPU will have a value of 1. If the RSCPU is higher than 1, the pair is seen more frequently than what would be expected considering the individual codon frequencies, whereas if it is lower than 1, the pair is observed less frequently than what would be expected.

Protein Abundances
Protein abundance values of S. cerevisiae, used to build figures 4 and 7, were taken from the work of Newman et al. (2006). Protein abundance values of E. coli, used to build supplementary figure S2, Supplementary Material online, were taken from Lu et al. (2007). The fitting of RSCU values relative to protein abundances (log) has been done using the loess function in R, which uses Local Polynomial Regression fitting.

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.