Horizontal gene transfer (HGT) is a major force in microbial evolution. Previous studies have suggested that a variety of factors, including restricted recombination and toxicity of foreign gene products, may act as barriers to the successful integration of horizontally transferred genes. This study identifies an additional central barrier to HGT—the lack of co-adaptation between the codon usage of the transferred gene and the tRNA pool of the recipient organism. Analyzing the genomic sequences of more than 190 microorganisms and the HGT events that have occurred between them, we show that the number of genes that were horizontally transferred between organisms is positively correlated with the similarity between their tRNA pools. Those genes that are better adapted to the tRNA pools of the target genomes tend to undergo more frequent HGT. At the community (or environment) level, organisms that share a common ecological niche tend to have similar tRNA pools. These results remain significant after controlling for diverse ecological and evolutionary parameters. Our analysis demonstrates that there are bi-directional associations between the similarity in the tRNA pools of organisms and the number of HGT events occurring between them. Similar tRNA pools between a donor and a host tend to increase the probability that a horizontally acquired gene will become fixed in its new genome. Our results also suggest that frequent HGT may be a homogenizing force that increases the similarity in the tRNA pools of organisms within the same community.
Horizontal gene transfer (HGT), the passage of genetic material between organisms and lineages, is a major force in microbial evolution (1–3). It is estimated that >80% of gene families in prokaryotic genomes underwent HGT at some point of their evolutionary history (1). One fundamental question in this context is what determines the evolutionary success of HGT events and supports the fixation of the transferred genes. Previous studies in the field have identified several factors that influence the chance of successful HGT, including toxicity of the acquired gene product (2), the ability of the transferred segment to integrate into the host genome by recombination (4), the number of protein interactions of the transferred gene product (5) and the function of the gene(s), since genes that confer specific advantages are more likely to be retained. (6).
Recently, Kudla et al. (7) cloned differently encoded green fluorescent protein (GFP) variants into Escherichia coli and showed that over-expressing a single gene with incompatible codon usage can be severely deleterious to the organism. It is therefore perhaps not surprising that the codon usage in many viruses is adapted to the tRNA pools of their hosts (8). Although looking for genes with atypical codons is a common method for detecting HGT, Medrano-Soto and coworkers demonstrated that many horizontally-acquired genes were in fact compatible with the recipient genome’s codon usage (9). Codon bias, however, is not necessarily driven by translation efficiency alone, and can sometimes be the product of constraints such as GC content, and/or amino acid usage.
In this work, we use a direct measure of translation efficiency and show (while controlling for other variables) that there is an association between translation efficiency and HGT. In the first three sections we show that this association exists at the levels of gene families, pairs of organisms and communities of organisms. Specifically, in the first section, we show that genes that have comparable tRNA adaptation values across organisms tend to be exchanged more often. In the second section, we show that organisms with similar tRNA pools tend to share genes more frequently, generating dynamic networks of increased gene exchange. In the third section, we show that pairs of organisms that share a community (and thus usually undergo more frequent HGT) have similar tRNA pools.
In the two last sections we study the potential forces of selection underlying the associations described above. Specifically, we suggest that the bi-directional relation between translation efficiency and HGTs is related to the relation between ribosome allocation and translation speed (7,10). In section four, we explain why the adaptation of the codon bias of the transferred gene to the tRNA pool of the host is a major determinant of the success of HGT. In section five we explain how frequent HGTs can homogenize the tRNA pools of organisms within a community.
MATERIALS AND METHODS
Information about gene HGT and gene sharing
Data about gene sharing and inferred HGT between ancestral and extant organisms were obtained from Dagan et al. (1). Data about genes that underwent HGT (6) was kindly provided by Y. Nakamura. Additional data about gene sharing were obtained from (11).
Information about community structure
These data were downloaded from (12).
All the coding sequences were downloaded from the NCBI site (http://www.ncbi.nlm.nih.gov/Ftp/).
tRNA copy numbers
Gene expression in E. coli
These data were downloaded from (14).
The phylogenetic tree for reconstructing the ancestral tRNA pool
The phylogenetic tree was obtained from (15). This tree was reconstructed based on the rRNA operon (16S, 23S and 5S) from all 190 genomes using the maximum likelihood approach. The final phylogenetic tree includes 190 organisms. The list of organisms and their taxonomy appears in (1). We used Neyman’s two state model (16), a version of Jukes Cantor (JC) model (17) for inferring the edge lengths of the tree (the probability of gain/loss of a gene family) by maximum likelihood, as implemented in PAML (18). The edge lengths correspond to the probabilities that a protein family will appear/disappear along the corresponding lineage.
Reconstruction of ancestral tRNA copy number
Let denote the probability of gain/loss of a gene family along the edge tree edge . Let denote the copy number of tRNA in node (genome or ancestral genome) of the evolutionary tree. We inferred the ancestral tRNA copy number using a generalized maximum parsimony method, whose penalty for a change in the number of genes corresponding to a tRNA along the tree edge .
Computing tAI and tRs
tAI was computed following the work of dos Reis et al. (19), who defined this index. This measure gauges the availability of tRNAs for each codon.
The similarity in tRNA pools between two organisms, tRs, is defined as the non-parametric Spearman correlation between the two vectors (of length 61) of their codons’ tAI values (denoted by tRs; see also Supplementary Note S6). The tAI of a gene is the geometric mean of the tAI of all its codons (additional technical details are provided in Supplementary Note S1).
Connecting evolutionary changes in tRNA copy numbers and HGT
Let (x,y) denote an edge in the evolutionary tree described above where x is the node that is closer to the root of the tree.
For each pair of edges, (x1,y1) and (x2,y2), in the evolutionary tree we computed the number of HGT along the corresponding evolutionary interval (x1 to y1 and x2 to y2; i.e. the HGT between x1 and x2 or between y1 and y2) and the change in the tRs along the edge. We correlated these values to show that along different paths of the evolutionary tree an increased number of HGT corresponds to increase in tRs. This also represents an additional control for phylogenetic proximity.
Estimating the number of HGTs a gene family has undergone
We ranked the gene families (COGs) according to their tendency to undergo HGT in the following way: first we mapped all the genes to their COGs; next we used the data of Nakamura et al. (6) to count the number of times (number of organisms) a gene corresponding to each COG was detected as a recently transferred gene.
Measures for the variability and robustness of the tAI of a gene family
We considered two measures for the robustness of the tAI of particular gene families:
The first measure, VtAI, is the standard deviation (SD) of the tAI of a COG across organisms. COGs with higher SD of tAI are those whose codon bias and/or the tRNA pool that recognizes their codons is more variable between organisms, suggesting different levels of translation efficiency of gene homologs in different organisms. The second measure, tRNA robusteness, RtAI, was computed as follows: First, we computed for each codon the SD of the tAI score of that codon across all the analyzed organisms; let denote the SD of the tAI of codon . Next, we computed for each COG () in each organism the mean SD of the tAI of its codons:is the mean SD of the tAI of the codon defined by the k’th triplet on gene g; and is the length of COG in organism . is the number of organisms in the database. A COG with higher RtAI has a codon bias that is less robust to changes in the tRNA pool (e.g. due to HGT). To control for the fact that some of the COGs appear only in very few organisms we considered only COGs that appear in at least 85 organisms (i.e. >50% of the organisms in the data set; higher cut-offs gave very similar results). In addition, the number of organisms possessing a COG was always included as one of the covariates in the partial correlations.
Comparison between the number of HGT events and the variability of the tAI among COGs
To compare the variability in tAI versus the number of HGT events that are related to each COG, we counted the HGT predictions that are related to each COG (6), and the mean VtAI and RtAI corresponding to each COG, and subsequently correlated these measures.
The RVtAI—a combination of VtAI and RtAI
RVtAI was computed in the following way: first, we ranked all the COGs by their VtAI and RtAI. Next, for each COG we averaged the ranks of these two parameters. Thus, this index is based ‘both’ on the variability of the tAI of a gene among the analyzed organisms (VtAI) and the robustness of the translation rate of the gene to changes in the tRNA pools (RtAI) and better reflects the robustness of the gene to changes in the tRNA pool than either VtAI or RtAI individually.
Controls for possible covariates
Controls when comparing the tRs of pairs of organism to the number of HGTs between them or to community similarity
The following variables were used as covariates (together) in the multivariate analysis of tRs versus the number of HGT events/sharing a niche (see more details below).
For each organism we computed the G+C (GC) content (the percentage of G+C in its genome). To control for the possibility that the correlation between tRs and the number of HGT events is related to GC content, we computed for each pair of organisms the absolute value of the difference in their GC content.
Amino acid usage
For each organism we computed the amino acid usage (the vector of frequencies of all the 20 amino acids in its coding sequences). To control for amino acid usage, we computed for each pair of organisms the correlation between their vectors of amino acid usage, ARs.
To control for phylogenetic distance, we computed for each pair of organisms their distance in the evolutionary tree based on the inferred edge lengths (see above).
Control for the correlation between codon bias and translation efficiency
We computed for each organism the correlation between the codon bias and tAI of codons; next, we computed for each pair of organisms the absolute value of the difference in their CB-tAI correlations.
Controls for growth rate
The generation times of 214 organisms were downloaded from (20). To control for the possibility that the correlation between tRs and the number of HGT events is related to growth rate, we computed for each pair of organisms the absolute value of the difference in their generation times.
Controls for genome size
We considered two variables: the difference between the number of genes in each pair of genomes and the sum of the number of genes in each pair of genomes.
Controls for the comparison of COGs
The following variables were used as covariates (together) in the multivariate analysis of RVtAI and the number of HGT of a COG (see more details below).
GC content control
We computed for each COG in each organism the GC content of the COG in the organism. Next, we computed the SD of the GC of the COG.
Amino acid control
We computed for each amino acid, in each organism, the expected tAI, EtAI, which is the weighted average of the tAI (taking into account the codon bias of the organism) of all the codons that code for this amino acid. The EtAI of a COG in a certain organism is the geometric mean of the EtAI of all the amino acids in the COG. Next, we computed the SD of the EtAI of the COG.
Control for expression levels
It is well know that the expression levels of genes is strongly correlated with their tAI (10,21,22). Thus, to control for the possibility that the correlation between RVtAI and the number of HGT may be explained by the expression levels of the COG we first ranked the COGs according to their tAI in each organism (in each organism we computed the for each COG the percentage of COGs whose expression is higher than the expression level of the COG).
Next, we computed for each COG its mean tAI rank across all the organisms. For example, when we analyzed the 1594 COGs that appear in more than 50 organisms we found that the top 100 COGs with the highest mean tAI ranks include 16 ribosomal proteins (enrichment P-value = 1.6 × 10−7) and the lowest 100 COGs include no ribosomal proteins. It is known that the expression levels of ribosomal proteins are very high [see, for example, (23)] thus, this fact demonstrates that indeed our measure is a good approximation of expression levels.
Non-parametric multivariate analysis
Let X and Y denote two variables and Z = [Z1, Z2, Z3,…] denote a set of variables.
The non-parametric multivariate analysis that is reported in this paper includes partial Spearman correlations of the form R(X,Y|Z). Roughly, if such a correlation is significant it means that there is a relation between X and Y that can not be explained by the variable Z.
In the first section of the paper we tried to find distinct correlation with ‘genes shared among organisms’ (X in our case) we considered each of the variables defined above as the explaining variables (Y, specifically, the tRs); where in each case the rest of the variables were used as covariates (Z). In the second and the third sections similar analyses were performed for the number of HGT of a COG (X, second section) and ‘community co-membership’ (X, third section).
Finally, note that in our case (as is usually the case when biological data are analyzed) the data points are not independent, and this may affect the results (increase or decrease the correlations).
Statistical tests and correlations
We used the following non-parametric tests: Spearman correlation, partial Spearman correlation and KS-test. The statistical analysis was performed in MATLAB. We reported both asymptotical P-values and empirical P-values for the Spearman correlations. The empirical P-values for a Spearman correlation, R(X,Y|Z) were computed by randomly permuting X and Yn = 1000 times and counting the number of times the correlation of the permuted vectors was higher or equal to the original correlations (let m denote this number). The empirical P-value is pe = m/n.
A network of connections between pairs of organism with high tRs, modularity and significance
This network includes a node for each organism [a total of 648 organisms whose tRNA copy number was available (13)]; in this network, two organisms were connected by an edge if their tRs score was >97% of the tRs scores.
The modularity score of a network is a number between 0 and 1, where larger numbers correspond to stronger signal of modularity [see details in (24)]. The modularity score of the original network of tAI similarity was very high: 0.6518. We compared the modularity score of the original network to the score of 100 random networks that maintain the degree distribution of the original network; the random networks were generated similarly to the way such a network was generated in (25). In the case of the random networks the score was much lower (mean = 0.1235, SD = 0.0196). These results suggest that the modularity score of the network of tAI similarity is significantly higher than what is expected from random networks with similar properties (empirical P < 0.01).
We also compared the modularity score of the original network to the score of 20 random networks that were generated in a different way: each random network was generated as before but based on a permuted version of the tAI values of the codons of each organism. In this case also, the modularity score of the random networks was significantly lower (mean = 0.2451, SD = 0.0054; empirical P < 0.05).
Optimization of the correlation between the tAI and the expression levels
We computed the tRNA pool that optimizes the correlation between the tAI of genes and their mRNA levels [R(tAI, mRNA)] by performing a heuristic, hill climbing process. The procedure maintained the total sum of tRNA copy numbers and the tRNA copy numbers that are equal to zero [as was performed in (26)]. The starting point used was the original tRNA pool of the genome.
Gene families with more universally conserved tAI across organisms are exchanged more frequently
Based on the genomic tRNA copy number, a proxy for the expression levels of tRNAs [(27–30) and Supplementary Note S1], we computed the tRNA adaptation index (tAI) for 190 prokaryotes that appear in (1). The tAI of a codon [which is equal to its w-value (15); see ‘Materials and Methods’ section and Supplementary Note S1] is a measure of its translation efficiency (19,22) that combines the tRNA gene copy number with the efficiency of the codon–anticodon binding by the tRNA. This measure is more informative than the copy number of the tRNAs that recognize that codon (‘Materials and Methods’ section).
The tAI of a gene is the geometric mean of the tAI of its codons (19,22) which ranges between 0 (non-efficient translation) and 1 (highly efficient translation; see ‘Materials and Methods’ section and Supplementary Note S1). Unlike other measures such as codon bias or the codon adaptation index (CAI), the tAI is not only a good predictor of protein abundance and translation rate (10,19,21,22), but also measures the co-adaptation between codon usage and the tRNA pool directly (31) (Supplementary Table S1 provides the tRNA copy number and tAI scores of all the codons in all the analyzed organisms).
If a gene’s tAI is a major determinant of its HGT frequency, then one would expect to see that genes with more uniform levels of tAI across species tend to have higher levels of HGT. To test this hypothesis, we used the data of Nakamura et al. (6), who performed a large scale study of single genes that were horizontally transferred fairly recently [see also (32) and Supplementary Note S2]. We analyzed 4614 COGs (33) across 163 prokaryotic genomes, counting only the COGs that appear in more than half of the organisms (i.e. at least in 85 organisms). We considered two measures of tAI variance: the first one is simply the variance in the gene’s tAI, i.e. to what degree does the level of translational efficiency of a gene vary across taxa (this measure was names VtAI, see ‘Materials and Methods’ section for details). Indeed, the correlation between the VtAI of a gene and the number of times that this gene has been horizontally transferred (controlling for the fact that some genes appear in fewer organisms; see ‘Materials and Methods’ section) is significant and negative (r = −0.2958; P < 10−16; pe < 0.001; n = 917).
The second measure, RtAI, measures the robustness of the translation efficiency (tAI) of each COG to different tRNA pools (a higher RtAI means less robustness). This measure is based on the fact that some codons are more robust than others to changes in the tRNA pool across the analyzed genomes (‘Materials and Methods’ section). When we compared RtAI and the number of HGT events of each COG (again controlling for the fact that some genes appear in fewer organisms; see ‘Materials and Methods’ section) we, once more obtained a highly significant correlation (r = −0.4125 and P < 10−16, pe < 0.001; n = 917). This result suggests that genes whose translation efficiency is less sensitive to changes in the tRNA pool (i.e. to the host where they are expressed) have been transferred more often.
To better understand how different features of a COG explain the number of times it has been horizontally transferred, we computed a combined and more robust measure of tAI robustness, RVtAI, (a combination of the two previous measures, where higher RVtAI means less robustness to changes in the tRNA pool, see ‘Materials and Methods’ section). As can be seen in Figure 1A, the correlation between the RVtAI of a COG and the number of times that COG was horizontally transferred is very high (r = −0.4340; P < 10−16). Specifically, the 20% of the COGs with the lowest RVtAI scores have been horizontally transferred over 7-fold more frequently than the 20% of the COGs with the top RVtAI (Figure 1A) demonstrating again the strong relation between robustness to tRNA pools and HGT.
Thus, for example, the type I restriction-modification enzyme S subunit (COG0732V), known to be frequently mobile (34) and the thymydilate synthase protein ThyX (COG1351F), shown recently to be transferred multiple times in evolution (35), both have low RVtAI. In contrast, several ribosomal proteins (S2, S6, L7/12, L9, L21, L24P, L25, L36) considered to be rarely transferred (36), have high RVtAI values and poor robustness, although they tend to have high tAI values in most genomes (COGs with extreme RVtAI appear in Supplementary Table S2).
As before, we considered other COG features that may potentially explain the correlation above: (i) An estimation of the mean expression level of the COG across the analyzed organisms [based on the tAI, which is often a better predictor of protein abundance than mRNA levels (21,22), which generally correlate well with tAI; see ‘Materials and Methods’ section]. This is pertinent as it is known that highly expressed genes are involved in less HGT events and are also under stronger selection for translation efficiency (7,10,37). (ii) The variation in the GC content of the COG across all organisms. (iii) The variation in the amino acid usage across these organisms (as the set of organisms used for computing the RVtAI is identical in all COGs there is no need here to control for the genome size or effective population size).
As depicted in Figure 1B, the correlation between the RVtAI of a COG and the number of times a COG was horizontally transferred remains significant even when controlling for all the three variables above ‘together’ (r = −0.1125; P = 6.6 × 10−4, pe < 0.001); most of the decrease in correlation is due to the estimation of expression levels (Figure 1C).
Organisms that are involved in more HGT have more similar tRNA pools
The translational efficiency pattern (TEP) of the codons in each organism (based on the tRNA pool) can be described by a vector of length 61 (the number of coding triplets), where each entry denotes the tRNA adaptation index (tAI) of the respective codon (which is more informative then just the tRNA copy number, see ‘Materials and Methods’ section).
The similarity in the TEP of tRNA pools between any two organisms is then measured by the Spearman correlation between their corresponding TEP vectors, denoted as tRs. Accordingly, two organisms have a higher tRs if they have a more similar rank order of codons’ translation efficiency.
We used the network of gene sharing from (1), in which a gene is defined as shared by two organisms if these organisms have relatively similar orthologs of that gene. Gene sharing between a pair of organisms can arise when the two are evolutionarily close and the gene was inherited vertically by both of them, or when a gene was transferred horizontally between these organisms or their ancestors. It is important to note that this network is protein-based, i.e. derived from ‘amino acid alignments’, and therefore is independent of codon information. Analyzing this network, we observed that tRs of organisms that share more genes tend to be more similar (Figure 2A and B; Spearman correlation above 0.32, P < 10−30, empirical P < 0. 001). Similar findings were obtained when analyzing the data from Beiko et al. [(11); see Supplementary Note S3].
We aimed to distinguish between two possible explanations for this observation that are not necessarily mutually exclusive: (i) organisms that are involved in HGT are part of the same community and thus are under similar ecological constrains that select for the use of similar amino acids [e.g. see (38)] or a similar GC content, similar growth rates, etc. Due to this environmental similarity, the need to optimize the co-adaptation between their codon bias and the tRNA pool subsequently drives their tRNA pools to become similar. The association between HGT and tRNA similarity could therefore be merely a consequence of the shared niches and/or other variables. (ii) There are more successful HGT events between organisms with similar tRNA pools, by virtue of their codon compatibility, and this is not merely a by-product of the communities shared.
To distinguish between these two hypotheses we examined the association between tRs and gene sharing while controlling for six possible variables that may explain this association: (i) genome size (the sum of the genome sizes for each pair and the differences in genome size for each pair) which is correlated with population size of the corresponding organisms (39); (ii) phylogenetic distance (i.e. distance on the tree of life, see ‘Materials and Methods’ section); (iii) similarity in the amino acid usage; (iv) selection for translation efficiency (the correlation between codon bias and their translation efficiency); (v) similarity in the GC content; (vi) similarity in the optimal growth rate (20); and (vii) in addition, we controlled for a variable corresponding to the community associations of the organisms [a binary variable: 1/0—depending on whether or not these organisms have been shown to reside in the same community (12)].
Remarkably, the Spearman correlation between the number of shared genes and tRs for pairs of organisms remains significant also after controlling for all confounding factors defined above ‘together’ (as listed in the previous subsection). This result shows that both within and outside a community, HGT occurs more frequently between organisms with higher tRs (r = 0.212; P = 1.18 × 10−8; empirical P-value pe < 0.001 when controlling for all the variables above ‘together’). Figure 2C and D details how the different variables above contribute to the relation between the number of HGT events [90% similarity cut-off for gene sharing (1)] and the tRs (similarity in the tRNA pools) of pairs of organisms. Similar correlations were obtained for lower similarity cut-offs definitions of gene sharing (1).
Based on extant species tRNA abundances, tRNA pools for ancestral organisms (i.e. ancestral nodes in the phylogenetic tree) can be reconstructed (‘Materials and Methods’ section). When we compared the ancestral) HGT frequencies between pairs of organism and their ancestral tRs along the analyzed phylogenetic tree (‘Materials and Methods’ section), the correlation obtained was 0.25 (P < 10−16, pe < 0.001); see Figure 2E. Specifically, pairs of organisms whose tRs is high (mean tRs = 0.82) have been involved in 24 times more HGT events than organisms with very low tRs (mean tRs = 0.76; Figure 2E).
In addition, as an additional control for phylogenetic proximity, we computed the correlation between the HGT along pairs of edges and the change in the tRs along the pairs of edges (see details in the ‘Materials and Methods’ section). Indeed, we found a significant correlation between these variables showing again that more HGT corresponds to an increase in tRs (r = 0.42; P < 10−16, pe < 0.001). The results reported above can therefore be extended to longer evolutionary timescales and to different periods of the evolution of the analyzed organisms.
Organisms that live in the same environment have similar tRNA pools
In the previous sections, we showed that there is an association between HGT and similarity in tRNA pool measured by the tRs. It is known that organisms that share the same environment are involved in more HGT (for example, the correlation between community co-membership and the number of shared genes when controlling for all the other variables defined in the pervious section is r = 0.23, P = 2.4 × 10−10, pe < 0.001). Thus, we aimed to compare the similarity in the tRNA pools, measured by the tRs, within and between communities, using the environmental partitioning established by Freilich et al. (12). Indeed, microorganisms that occupy similar niches have significantly higher tRs than those that live in different ones (P = 2 × 10−30; Figure 3A).
In the next stage, we aimed to distinguish between the two possible explanations for this observation that have already been discussed before (in the section about the relation between gene sharing and similar tRNA pool). To this end, we considered the variables defined in the previous section and performed a multivariate non-parametric analysis to understand how the different variables are associated with ‘community co-membership’ and to study how they affect the relation between ‘community co-membership’ and tRs (‘Materials and Methods’ section). Figure 3B depicts the correlations between community co-membership and ‘each of the variables’ when controlling for the rest of the variables (i.e. the distinct correlation between community co-membership and each of the variables; ‘Materials and Methods’ section). As depicted in Figure 3B, the correlation between the community co-membership and tRs remains significant even when controlling for all the variables above combined (r = 0.076; P = 0.001, empirical pe < 0.001; ‘Materials and Methods’ section); similarly to the analysis reported in the previous subsection for HGT, most of the decrease in correlation is due to the contribution of genome sizes, phylogenetic distance and similarity in amino acid usage (Figure 3C). When we added to the list of covariate variables the number of shared genes (shown to be correlated with tRs in the previous sub-section) the correlation became not statistically significant(r = 0.018; P = 0.63; pe = 0.3030). This is in contrast to the number of HGT events that was significantly correlated with tRs even when controlling for community co-membership (see the previous sub-section).
Finally, the fact that there is a significant relation between similarity in tRNA pools and a shared community, suggests that similarity in tRNA pools (e.g. the tRs) can be used for clustering organisms into their communities. Indeed, an initial analysis reported in Supplementary Note S4 supports this conjecture.
The effect of codon (dis)similarity of transferred genes on the recipient organism
Over-expressing a gene with maladapted codons is deleterious to the organism, because ribosomes would then spend too much time on its slow translation, leading to inefficient ribosome allocation and delayed growth (31). Accordingly, a significant positive correlation was previously shown between codon optimality of highly transcribed genes [measured by the CAI (31)] and fitness (r = 0.54; P < 10−13) (7); see Figure 4A for the corresponding relation between tAI and fitness (r = 0.52; P < 1.7 × 10−11, P < 0.001). As a general rule, the fraction of the genes acquired by HGT that is highly transcribed is much lower than the fraction of highly expressed native genes [see, for example, (40,41)], mostly due to their foreign promoter regions. However, some horizontally acquired genes, especially of phage origin, can be highly transcribed [(14), see the next section]. Those genes can reduce host fitness if their codons are highly maladapted even if the protein they encode does not have a functional role in the recipient organism (42,43). In contrast, there are rare examples [Sorek et al. (2)] in which a ‘specific’ protein encoded by a foreign gene is somehow deleterious to the recipient organism. In these special cases, having highly compatible codons is likely to result in higher expression levels (more abundant deleterious proteins) leading to ‘increased damage’ to the host. Analyzing the data of Sorek et al., who identified a relatively small number of genes (65 genes) that could not be cloned in E. coli due to the ‘toxicity’ of the corresponding proteins (2), we indeed found a significant ‘positive’ correlation between the number of genes from other organisms that cannot be cloned in E. coli and the similarity in the tRNA pools of donor organisms and E. coli (r = 0.34, P = 0.0066); see Figure 4B. Thus, as one would expect, the closer the tRNA pools are, the functional effects of genes whose proteins are toxic becomes larger, and hence the number of such ‘forbidden’, potentially toxic HGT events increases.
The results reported in the previous sections (the positive correlation between similarity the tRs and the number of HGTs) suggest that genes that produce highly deleterious proteins in the new host are ‘relatively rare’. Thus, in general, it appears that beneficial genes and highly-transcribed neutral genes are frequent enough to create a selective pressure for similar tRNA pools in organisms from the same community and/or those that are involved in HGT, overcoming the inverse pressure exerted by the few genes whose proteins are ‘deleterious’ to the recipient organism.
HGTs may create a selective pressure for similar tRNA pools
In the previous sections we showed that organisms that share a community have more similar tRNA pools (i.e. more similar tRs), but how has this similarity evolved?
To try and answer that question we returned to the data set of Nakamura et al. (6), which contains 214 715 genes in the organisms analyzed (163 organisms in total) 12 897 of which underwent a relatively recent transfer [i.e. these genes have a foreign nucleotide composition and are not fully ameliorated (41)]. On average, in each of the organisms analyzed around 6% of the genes are ‘recent’ acquisitions. The method of Nakamura et al. (6) is based on codon bias; therefore, this is probably a ‘lower bound’ on the actual number of HT genes [see also (3)]. As we explain below, this ‘non-negligible’ fraction of genes that has been acquired can trigger changes in the copy number of some of the tRNA genes.
In all organisms that were analyzed before there is significant correlation between the expression levels of genes and translation efficiency [as measured, for example, by the tAI; see (10,19,21,22,26,37)]. One of the main reasons for this correlation is the fact that highly expressed genes potentially occupy more ribosomes. Thus, mutations that improve their translation efficiency are likely to have a higher effect on the organism’s ‘fitness’. As a result, these genes are under stronger selection for higher translation efficiency [see, for example, (7,10,37)]. In other words, improving the correlation between the expression level of genes and their translation efficiency (tAI) should improve the overall allocation of ribosomes and fitness of the organism. This correlation can be improved locally, by increasing the number of efficient codons in a highly expressed gene or globally, by changing the tRNA pool of the organism such that the correlation will increase. Changes in the tRNA pool may be due to duplication/deletion of a tRNA gene but also due to transfer of a tRNA(s) (possibly in the same HGT event) from organism(s) in the same environment/community.
To try and demonstrate how HGT can trigger selection for similar tRNA pools, we used E. coli [for which the mRNA levels for most genes, under known conditions, are available (14)] as a model organism. According to Nakamura et al. (6), 768 out of the 4376 E. coli genes underwent recent transfer. Figure 4C depicts the distribution of expression levels in E. coli genes, distinguishing between recently acquired genes and the rest of the genome. As can be seen, the average expression levels of recently acquired genes are lower than the mean expression level of the rest of the genes (Figure 4C). However, there is a non-negligible fraction of recently acquired genes that is highly expressed (6% of the recently acquired genes versus 11% of the rest of the genome). Hence, we expected that adjusting the tRNA pool to better fit the expression levels of these new genes, while maintaining the efficiency of the older and more established genes, should improve the fitness of the organism. A more efficient ribosome allocation should be reflected by a higher correlation between mRNA levels of genes and their tAI. In addition, it is important to note that even a ‘small’ change in this correlation may have a substantial effect on the fitness of the organism.
Quite surprisingly, in accordance with this hypothesis, the correlation between the expression levels and tAI is indeed higher when considering ‘all’ the genes (r = 0.47; P < 10−16, pe < 0.001) than when considering ‘only’ the non-recently transferred genes (r = 0.45; P < 10−16, pe < 0.001; Figure 4D; see Figure 4E for the distribution of tAI in the non-transferred genes and the transferred genes). This suggests that the tRNA pool of E. coli has underwent an adaptation to the new HT genes. Further support for this finding comes by observing that the correlation between the expression levels and tAI of all the genes (recently transferred and non-recently transferred) is significantly higher than when replacing the transferred genes with random groups of genes of similar size (whose codon biased underwent selection to fit their expression levels—controlling for the fact that the adaptation can be at the level of codon bias or that the HT genes have more compatible codon bias; see the previous subsection), or by randomly permuting the mRNA values of only the transferred genes (empirical P-value = 0.01 and 0.03, respectively). These results remain significant also when we sampled random groups with similar mean mRNA levels as the transferred genes (controlling for the fact that the recently acquired genes generally have lower mRNA levels; empirical P-value = 0.01).
In addition, we computed the tRNA pool that optimizes (‘Materials and Methods’ section) this correlation when considering all the genes and when considering only genes that ‘did not’ undergo recent HGT. These correlations were compared to the correlation obtained when using the actual tRNA pool. Again, the ‘optimal’ correlation was 33% ‘closer’ to the actual one obtained when considering ‘all’ the genes (Figure 4D).
These results may suggest that the tRNA pool of E. coli was shaped by the expression levels and codon bias of the transferred genes and not only its ancestral genes. More generally, these results hint that in practice there may be a sufficient level of HGT to trigger selection for changes in an organism’s tRNA pool—i.e. such changes that make it more similar to the tRNA pools of its partners for gene exchange (see also Supplementary Note S5). The fact that we have observed this findings in recently transferred genes that have not had time to ameliorate, may suggest that such changes in the tRNA pool are relatively rapid. Specifically, these changes are faster than the time required for full amelioration, which was estimated to be around 300 million years (44).
This study shows that there is a bi-directional association between translation efficiency and HGT: genes tend to be transferred between organisms that have similar tRNA pools (tRs) and frequent HGT between organisms can in turn homogenize their tRNA pools. Likewise, genes whose tAI is similar among many organisms tend to be transferred more frequently. The fact that the relation between the similarity in tRNA pools and HGT was observed in all the analyzed data sets demonstrates the robustness of this relation.
We show that this relation remains significant after controlling for many other possible variables (e.g. GC content, amino acid usage, phylogenetic distance and more). It is important to remember, however, that it is impossible to completely tease apart some of the variables mentioned above, as they are inherently inter-dependent. For example, similarity in tRNA pool (and thus codon composition) will be reflected in a similar GC content and amino acid usage; similarly, phylogenetic proximity can increase the number of HGT events due to similarity in the tRNA pools.
Thus, the correlations that were obtained after controlling for these variables (usually around 0.1–0.2) are only a ‘lower bound’ on the actual effect of translation efficiency on gene transfer. The fact that these correlations are significant, demonstrates that there is ‘a distinct’ effect of translation efficiency on HGT.
The correlation obtained without the controls (more than 0.4), on the other hand, represents an ‘upper bound’ for the effect of translation efficiency on HGT, suggesting overall that the actual association between HGT and compatibility of the tRNA pools is very substantial.
In summary, based on the results presented in this article, we suggest that when the tRNA pool of the donor organism is more similar to that of the recipient, this in turn increases the chance of successful HGT events (i.e. there is a higher probability that transferred genes will be fixed). Another possibility that is consistent with the data, is that when an organism receives genes from other organisms, most of them in its community/environment, its tRNA pool can also undergo selection to fit the new genes, especially those that are highly transcribed, thus improving the fitness of the organism. This scenario would be more likely when multiple genes are acquired from a single source together, as in the acquisition of a large plasmid. Thus, this alternative mechanism will also cause the tRNA pool of the organism to become more similar to other organisms in the community (Figure 4F). One can therefore speculate that a positive-feedback loop will exist, accounting for the increasing similarity in the tRNA pool of organisms in the same community/niche due to HGT, in turn promoting HGT between organisms in the same niche (Figure 4F).
Furthermore, this scenario is supported by reasonable population genetic considerations such as the effective size of bacteria, the fitness advantage of such changes in the tRNA pool, or the fitness disadvantage of receiving a gene with maladaptive codons (see details in Supplementary Note S5). These results encourage further studies in this direction, for example, by performing in vitro evolution experiments where bacteria receive a plasmid containing highly expressed genes and their tRNA pools are examined by genome re-sequencing every 1000 generations.
The results presented in this paper suggest that methods that detect (recent) HGT events based on difference in the codon bias of gene acceptor and the gene donor [see, for example, (6,45)] underestimate the number of horizontally transferred genes. Such methods search for genes whose codon bias is different from that of the host; however, as we report here, the donors of many of the horizontally transferred genes have tRNA pools (and similar codon bias) that are similar to the tRNA pools of the acceptors, making HGT detection difficult. This work therefore supports the conclusions of Medrano-Soto et al. (9), and stresses the importance of relying on phylogenetic tree reconstruction data as much as possible when detecting HGT, or when this is impossible, applying multiple HGT detection methods, as previously suggested (46). This is especially true when aiming to infer HGT between closely related organisms (with similar tRNA pool) that are either phylogenetically related or live in the same niche. Additionally, since a higher compatibility in tRNA pools facilitates gene transfer, this contributes to a higher level of gene exchange between related organisms. Thus, the apparent vertical phylogenetic signal one often associates with a tree of life can in fact be maintained by preferential bias of gene transfer among related taxa, as previously suggested by Gogarten et al. (47) for homologous recombination.
Finally, our results imply that considering that more HGTs are expected among organisms that have similar tRNA pools (and other cellular features), one can improve the current algorithms for phylogenetic network reconstruction (11,15,48,49) and for inferring ecological niches [see, for example, (12)], similarly to the way in which information about co-evolution (25,50) has been recently shown to improve ancestral gene reconstruction (51).
Supplementary Data are available at NAR Online.
T.T. is a Koshland Scholar at Weizmann Institute of Science and is supported by a travel fellowship from EU grant PIRG04-GA-2008-239317. M.K. was supported by grants from the Israel Science Foundation and the German-Israel Binational Fund (GIF). U.G. is supported by a grant from the Israeli Ministry of Health. E.R. is supported by grants from the Israel Science Foundation. M.K., U.G. and E.R. are supported by a grant from the McDonnell Foundation.
Conflict of interest statement. None declared.
We would like to thank Tal Dagan and Yoji Nakamura for providing us with datasets of horizontal gene transfer. We also would like to thank Hila Gingold, Johann Peter Gogarten, Amos Tanay, Elchanan Mossel, and Yitzhak Pilpel for very helpful and useful discussions. We wish to assert here that UG and ER have equally contributed to the paper.