Ofir Cohen, Uri Gophna, Tal Pupko, The Complexity Hypothesis Revisited: Connectivity Rather Than Function Constitutes a Barrier to Horizontal Gene Transfer, Molecular Biology and Evolution, Volume 28, Issue 4, April 2011, Pages 1481–1489, https://doi.org/10.1093/molbev/msq333
Abstract
Horizontal gene transfer (HGT) is a prevalent and highly important phenomenon in microbial species evolution. One of the important challenges in HGT research is to better understand the factors that determine the tendency of genes to be successfully transferred and retained in evolution (i.e., transferability). It was previously observed that the transferability of genes depends on the cellular process in which they are involved: genes involved in transcription or translation are less likely to be transferred than metabolic genes. It was further shown that gene connectivity in the protein–protein interaction network affects HGT. These two factors were shown to be correlated, and their influence on HGT is collectively termed the “complexity hypothesis.” In this study, we used a stochastic mapping method utilizing advanced likelihood-based evolutionary models to quantify gene family acquisition events by HGT. We applied our methodology to an extensive across-species genome-wide dataset that enabled us to estimate the overall extent of transfer events in evolution and to study the trends and barriers to gene transferability. Focusing on the biological function and the connectivity of genes, we obtained novel insights regarding the “complexity hypothesis.” Specifically, we aimed to disentangle the relationships between protein connectivity, cellular function, and transferability and to quantify the relative contribution of each of these factors in determining transferability. We show that the biological function of a gene family is an insignificant factor in the determination of transferability when proteins with similar levels of connectivity are compared. In contrast, we found that connectivity is an important and statistically significant factor in determining transferability when proteins with a similar function are compared.
Introduction
Comparative genomics has revealed vast and surprising variability in gene content even among closely related species (Berg and Kurland 2002; Mira et al. 2002; Konstantinidis and Tiedje 2004; Koonin and Wolf 2008). The dynamics of genome remodeling include drastic genome erosion by gene loss (Moran et al. 2009) and acquisition of novel genetic material by gene gain through horizontal gene transfer (HGT) (Syvanen 1994; Hacker and Carniel 2001). A pivotal role for HGT was demonstrated in the adaptation of organisms to new ecological niches (Gogarten and Townsend 2005), acquisition of novel functions (Pennisi 2004; Gogarten and Townsend 2005), metabolic network expansion (Pal et al. 2005), and speciation (Lawrence 1999). The transfer of genes among bacteria also bears significant medical implications, as the emergence of new virulent strains as well as their resistance to antibiotics is mainly attributed to HGT (Holden et al. 2004; Gal-Mor and Finlay 2006). Thus, studying HGT dynamics and the factors that determine gene transferability is important for evolutionary, ecological, and molecular biology studies.
Although most genes are susceptible to HGT (Sorek et al. 2007), it is well established that the tendency to undergo HGT is highly variable among genes (Nakamura et al. 2004; Cohen et al. 2008; Hao and Golding 2008). Over a decade ago, it was suggested that the biological process in which a gene is involved strongly affects its transferability. It was shown that informational genes are less transferable than operational genes. Later, it was additionally shown that the number of protein–protein interactions (PPIs) is an important factor in determining transferability. The dependency of transferability on these two factors, the biological process and the network connectivity, is now collectively referred to as the “complexity hypothesis” (Rivera et al. 1998; Doolittle 1999; Jain et al. 1999, 2002; Sicheritz-Ponten and Andersson 2001; Gogarten et al. 2002; Brown 2003; Wellner et al. 2007; Lercher and Pal 2008).
Since it was suggested, the complexity hypothesis has been at the center of the discussion regarding gene transferability: The hypothesis was extended (Aris-Brosou 2005) and received support from both bioinformatic analyses (Lercher and Pal 2008) and experimental studies (Wellner and Gophna 2008). Nevertheless, it was also debated and criticized (Brochier et al. 2000; Nesbo et al. 2001).
Testing the validity of the complexity hypothesis requires accurate inference of HGT events. There are three widely used computational approaches to detect HGT, each tailored toward the detection of only a subset of all transfer events. The first detects genes with phylogenetic incongruence as compared with the inferred ribosomal trees or trees that are supposed to represent the organismal evolutionary history. This approach is only suitable for relatively widespread genes with “not too much or too little” sequence divergence (e.g., Graybeal 1994). The second detects genes that are significantly different from the rest of the genome in some compositional attributes such as G+C content or codon usage. This approach can only detect recent transfer events due to sequence amelioration (Koski et al. 2001; Wang 2001). The third uses a presence-absence matrix of gene families across multiple genomes (phyletic pattern) to detect acquisition events of gene families along the assumed phylogeny. Although this approach is suitable for the detection of both recent and ancient events of all gene families, it is only capable of detecting transfer events that resulted in the acquisition of the first copy of a particular gene family. For example, this approach ignores xenologous gene replacements or HGT events that result in additional paralogs. This subset of transfer events may be only a fraction of all HGTs, but it is of a particular evolutionary significance as the acquisition of a novel gene family increases the proteomic repertoire of the recipient and holds the greatest potential for functional innovations and adaptations.
HGT from phyletic patterns has classically been inferred based on the parsimony criterion (Yang 1996; Mirkin et al. 2003; Cordero et al. 2008; Lercher and Pal 2008). Recently, more statistically robust models for phyletic pattern analysis were developed in which the dynamics of gain and loss of gene families are modeled as a stochastic process (Hao and Golding 2006). Several improvements to such evolutionary models were developed (Cohen et al. 2008; Hao and Golding 2008; Spencer and Sangaralingam 2009; Cohen and Pupko 2010), and we have utilized these maximum-likelihood models to develop a methodology to accurately detect branch-specific HGT events (Cohen and Pupko 2010).
Here, we apply our HGT detection methodology to characterize the factors that determine transferability in genome-wide data. Specifically, we test the complexity hypothesis and disentangle cellular function and the number of protein interactions as factors that determine transferability. Notably, in this manuscript, for brevity, we use the general term HGT in lieu of the more accurate expression, gene family acquisition by HGT. Thus, our conclusions regarding the complexity hypothesis are limited to this type of HGT event.
Methods
Phyletic Pattern and Phylogeny
The presence-absence matrix of gene families was extracted from the Clusters of Orthologous Groups of proteins (COG) database (Tatusov et al. 2003), which contains 4,873 gene families across 66 species (50 bacteria, 13 archaea, and 3 fungi). In this research, we focused on HGT inference from the set of 50 bacterial genomes. Previous research has shown that gain and loss dynamics differ between parasitic bacteria and free-living organisms (Spencer and Sangaralingam 2009). Because the models used here do not allow branch-specific changes in the evolutionary process, we removed the 12 known parasitic bacteria from our analysis (Mycoplasma pulmonis, Mycop. pneumoniae, Mycop. genitalium, Ureaplasma urealyticum, Buchnera sp. APS, Rickettsia prowazekii, R. conorii, Treponema pallidum, Borrelia burgdorferi, Chlamydia trachomatis, Chlamydophila pneumoniae, and Mycobacterium leprae), retaining 38 bacterial genomes. The COG definition of a gene family requires its presence in at least three genomes. Therefore, after excluding species from the original COG database, we retained only gene families that are present in at least three genomes within our data set. After this filtering, 3,915 gene families were retained.
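The filtering step described above can be illustrated with a minimal sketch. The toy phyletic pattern, the family names, and the helper functions `drop_genomes` and `filter_min_presence` are hypothetical and are not taken from the COG database or from the authors' pipeline; the sketch only shows how removing genomes can push a gene family below the three-genome requirement.

```python
# Hypothetical toy phyletic pattern: gene family -> presence (1) / absence (0)
# across five genomes. Names and values are illustrative, not COG data.
phyletic = {
    "famA": [1, 1, 1, 0, 0],
    "famB": [1, 0, 0, 0, 1],
    "famC": [1, 1, 1, 1, 1],
}

def drop_genomes(pattern, excluded):
    """Remove the columns (genomes) whose indices are listed in `excluded`."""
    return {
        fam: [v for i, v in enumerate(row) if i not in excluded]
        for fam, row in pattern.items()
    }

def filter_min_presence(pattern, min_genomes=3):
    """Keep only gene families present in at least `min_genomes` genomes."""
    return {fam: row for fam, row in pattern.items() if sum(row) >= min_genomes}

# Suppose genome index 2 is one of the excluded parasitic genomes.
reduced = drop_genomes(phyletic, excluded={2})
retained = filter_min_presence(reduced)
print(sorted(retained))
```

In this toy case, famA is present in three genomes before the exclusion but only two afterwards, so it is dropped together with famB.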
The analysis was based on the assumed tree topology of Ciccarelli et al. (2006), which was constructed from a set of “core” genes that are assumed to be resistant to gene transfer. As a control, the analysis was repeated with a topology constructed based on ribosomal RNA (rRNA) sequences (Yarza et al. 2008). In both cases, branch lengths are re-estimated from the phyletic pattern using the evolutionary models.
Inference of HGT Events Using Stochastic Mapping
The gain and loss dynamics were modeled using the gain loss mixture model (Cohen and Pupko 2010), in which variability in the gain and loss rates is allowed among gene families. Based on the evolutionary model and the assumed phylogeny, the stochastic mapping approach (Minin and Suchard 2008) allows for the inference of gain and loss events for each gene family along each branch (Cohen and Pupko 2010). This methodology allows for the computation of both the expected number of events and the probability of occurrence of gain and loss events.
The overall tendency of a gene family to undergo acquisition by HGT is measured by the posterior expectation of the number of gain events over all branches. We classify gene families as either transferable or not. Transferable gene families are those for which there is a high probability of HGT events during their evolution. To be conservative, we demand a gain event (HGT) in at least two branches, as described in previous research (Cohen and Pupko 2010). The transferability cutoff value is determined by limiting the number of false-positive predictions of gain events to 5% based on simulations. In this study, the transferability cutoff corresponds to a posterior probability of 0.25 for a gain event. Notably, the cutoff values that result in 5% false-positive predictions may vary with respect to simulation assumptions (Cohen and Pupko 2010). Thus, in this study, a relatively strict cutoff value was used, which may result in fewer than 5% false positives under realistic assumptions. Moreover, the computations were repeated with both stricter and more permissive cutoff values.
We classify all gain events as either ancient or recent. Recent gains are those that are mapped to external branches (i.e., branches leading to extant organisms). Gain events mapped to all other branches (i.e., internal branches) are considered ancient.
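One possible reading of the two classification rules above is sketched below. It assumes that a branch counts as carrying a gain when its posterior gain probability is at least the 0.25 cutoff (the inclusive comparison is an assumption), and that a family is transferable when at least two branches pass. The branch names and probabilities are hypothetical; in the actual analysis they would come from stochastic mapping.

```python
CUTOFF = 0.25  # posterior probability cutoff quoted in the text

def gained_branches(branch_probs, cutoff=CUTOFF):
    """Branches whose posterior gain probability reaches the cutoff.

    Using >= rather than > is an assumption of this sketch.
    """
    return [b for b, p in branch_probs.items() if p >= cutoff]

def is_transferable(branch_probs, cutoff=CUTOFF, min_branches=2):
    """A family is called transferable if gains map to >= min_branches branches."""
    return len(gained_branches(branch_probs, cutoff)) >= min_branches

def split_by_age(branches, external):
    """Split branches into recent (external) and ancient (internal) gains."""
    recent = [b for b in branches if b in external]
    ancient = [b for b in branches if b not in external]
    return recent, ancient

# Hypothetical posterior gain probabilities for one gene family.
probs = {"leaf_A": 0.90, "leaf_B": 0.10, "node_1": 0.40}
external = {"leaf_A", "leaf_B"}  # branches leading to extant organisms

gains = gained_branches(probs)
recent, ancient = split_by_age(gains, external)
print(is_transferable(probs), recent, ancient)
```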
Network and Protein Interactions
The PPI network and the number of interactions for each gene family were extracted from the STRING database version 8.3 (Jensen et al. 2009). This comprehensive PPI network is based on known interactions from several databases covering several model organisms (Salwinski et al. 2004; Alfarano et al. 2005; Joshi-Tope et al. 2005; Chatr-aryamontri et al. 2007; Kerrien et al. 2007; Vastrik et al. 2007; Breitkreutz et al. 2008; Kanehisa et al. 2008) and is augmented by methods that accurately predict interactions (Harrington et al. 2008; Skrabanek et al. 2008). As a control, the PPI network of the Database of Interacting Proteins (DIP) (Xenarios et al. 2000; Salwinski et al. 2004) version June 2010 was used. Unlike STRING, this network only considers experimentally validated interactions (i.e., it does not include predicted interactions).
Interactions in the STRING data set are given a confidence score based on benchmarking with manually curated interaction maps (Kanehisa et al. 2008), in which for each pair of gene families the interaction confidence is denoted by a value in the range 0–1,000. Protein families in which all interactions have a zero confidence score may reflect lack of data rather than genuinely non-interacting protein families. These were excluded from the analysis, resulting in 2,442 gene families. Notably, in our analyses, we only consider interactions with a confidence score above a certain threshold. For example, a protein family having one reported interaction with a confidence score of 400 will be analyzed as having a single interaction when the threshold is 150 and zero interactions when the threshold is 700.
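The thresholding example above can be written as a short sketch. The function name `count_interactions` is hypothetical, and the strict comparison follows the wording "above a certain threshold".

```python
def count_interactions(scores, threshold):
    """Count interactions whose confidence score is strictly above the threshold."""
    return sum(1 for score in scores if score > threshold)

# The example from the text: one reported interaction with confidence 400.
family_scores = [400]
print(count_interactions(family_scores, threshold=150))
print(count_interactions(family_scores, threshold=700))
```

With the scores from the text, the family counts as having one interaction at threshold 150 and none at threshold 700.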
Functional Categories
The biological process in which each gene family is involved (functional category) is extracted from the COG database, in which there are 25 specific categories grouped into four meta-categories. We limited our analysis to functional categories with at least three gene families. Thus, we retain the four meta-categories (Information storage and processing, Cellular processes and signaling, Metabolism, and Poorly characterized) and 20 specific categories (Translation, ribosomal structure, and biogenesis; Transcription; Replication, recombination, and repair; Cell cycle control, cell division, and chromosome partitioning; Defense mechanisms; Signal transduction mechanisms; Cell wall/membrane/envelope biogenesis; Cell motility; Intracellular trafficking, secretion, and vesicular transport; Posttranslational modification, protein turnover, and chaperones; Energy production and conversion; Carbohydrate transport and metabolism; Amino acid transport and metabolism; Nucleotide transport and metabolism; Coenzyme transport and metabolism; Lipid transport and metabolism; Inorganic ion transport and metabolism; Secondary metabolites biosynthesis, transport, and catabolism; General function prediction only; and Function unknown). In the analysis of functional categories, 158 gene families have more than one functional category label. These gene families were included independently in each functional category analysis.
Statistical Analysis of Function and Transferability Association
To test for association between a functional category and transferability, we computed the ratio of the fraction of transferable genes in this category to the fraction of transferable genes not in this category. We term this ratio “relative transferability,” which is equivalent to the often used term “relative risk.” A relative transferability significantly higher than one suggests a higher propensity for a gene family to be transferred when included in this functional category compared with all other functional categories. Statistical significance is determined using Fisher's exact test. The classification of a gene family as transferable is based on stochastic mapping (see above).
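As an illustration of the ratio defined above, the following sketch computes a relative transferability and a two-sided Fisher's exact P value for a single 2 × 2 table. The counts are hypothetical, and the two-sided rule used here (summing all tables no more probable than the observed one) is one common convention that the text does not specify.

```python
from math import comb

def relative_transferability(a, b, c, d):
    """Ratio of transferable fractions inside vs. outside a category.

    a: transferable, in category      b: non-transferable, in category
    c: transferable, not in category  d: non-transferable, not in category
    """
    return (a / (a + b)) / (c / (c + d))

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x transferable families in the category.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical counts: 2 of 10 families in the category are transferable,
# versus 10 of 20 families outside it.
rt = relative_transferability(2, 8, 10, 10)
p_value = fisher_exact_two_sided(2, 8, 10, 10)
print(rt, p_value)
```

Here the relative transferability is 0.2 / 0.5 = 0.4, meaning that families in the hypothetical category are called transferable less than half as often as the rest.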
We additionally compute the relative transferability while accounting for variable levels of connectivity. Specifically, we treated the data as stratified by the number of PPIs. We thus computed the relative transferability in each functional category accounting for this stratification using the Mantel–Haenszel test. Gene families were stratified into 45 levels of connectivity, in which each stratum has at least ten gene families. Notably, similar results were obtained when the data were stratified into 94 levels of connectivity (at least three gene families in each stratum) or into seven levels of connectivity (at least 100 gene families in each stratum). This indicates that the results are highly robust to the choice of stratification resolution (data not shown).
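To make the effect of stratification concrete, the sketch below computes the standard Mantel–Haenszel pooled risk ratio over connectivity strata. It covers only the pooled estimate, not the associated significance test, and the counts are hypothetical. They are chosen so that the category looks poorly transferable when strata are ignored, yet shows no effect within any stratum.

```python
def risk_ratio(a, b, c, d):
    """Relative transferability for one 2x2 table (same layout as below)."""
    return (a / (a + b)) / (c / (c + d))

def mantel_haenszel_rr(strata):
    """Mantel-Haenszel pooled risk ratio.

    Each stratum is (a, b, c, d):
    a: transferable, in category      b: non-transferable, in category
    c: transferable, not in category  d: non-transferable, not in category
    """
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        num += a * (c + d) / n
        den += c * (a + b) / n
    return num / den

# Hypothetical strata: low connectivity (50% transferable in both groups)
# and high connectivity (10% transferable in both groups).
strata = [(1, 1, 20, 20), (2, 18, 1, 9)]

# Crude ratio ignores the strata and pools all counts into one table.
pooled = [sum(col) for col in zip(*strata)]
crude = risk_ratio(*pooled)
adjusted = mantel_haenszel_rr(strata)
print(round(crude, 3), adjusted)
```

The crude ratio is about 0.325, whereas the stratified estimate is 1.0, because in this toy data the category is simply enriched for highly connected families.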
Gene families were stratified according to their connectivity as follows. All gene families are sorted according to their connectivity (number of PPIs). The first stratum starts with the group of gene families with the lowest number of interactions. We incrementally add gene families with the next lowest levels of connectivity to this stratum until at least ten gene families are included. All gene families with exactly the same level of connectivity are added to the same stratum, even if its size increases above ten. Once a stratum is defined, we build the next stratum in the same way.
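The stratification procedure can be sketched as follows. The function name `stratify` and the toy data are hypothetical. The text does not say how a final group smaller than the minimum size is handled; merging it into the last stratum is an assumption of this sketch.

```python
from itertools import groupby

def stratify(connectivity, min_size=10):
    """Group gene families into strata of increasing connectivity.

    connectivity: dict mapping gene family -> number of PPIs.
    Families with identical connectivity always share a stratum.
    """
    ordered = sorted(connectivity.items(), key=lambda kv: kv[1])
    strata, current = [], []
    for _, group in groupby(ordered, key=lambda kv: kv[1]):
        current.extend(fam for fam, _ in group)  # add a whole tie group at once
        if len(current) >= min_size:
            strata.append(current)
            current = []
    if current:
        # Assumption: leftover families join the last stratum if one exists.
        if strata:
            strata[-1].extend(current)
        else:
            strata.append(current)
    return strata

# Toy example with a minimum stratum size of three instead of ten.
toy = {"f1": 0, "f2": 0, "f3": 0, "f4": 1, "f5": 2, "f6": 2, "f7": 5}
result = stratify(toy, min_size=3)
print(result)
```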
Results and Discussion
High Number of Protein Interactions Acts As Barrier to HGT
Gain and loss dynamics of gene families were studied using the gain loss mixture model (Cohen and Pupko 2010). The ML estimate for each of the model parameters is given in supplementary table S1, Supplementary Material online. This model was used to infer gain (HGT) events for each gene family and for each branch using stochastic mapping. It was previously shown that the number of protein interactions (connectivity) is associated with gene family transferability by comparing transferability of genes with low versus high connectivity levels (Wellner et al. 2007; Davids and Zhang 2008; Lercher and Pal 2008). To gain further insight regarding this “connectivity barrier” (i.e., protein interactions that hinder or reduce transferability), instead of using such a binning approach, we directly computed the correlation between connectivity and transferability (fig. 1). Our analysis indicates that more HGT events are expected for protein families with low connectivity levels (Spearman coefficient computed for all gene families R = −0.422, P value < 8.18 × 10^−105, table 1).
Fig. 1. HGT as a function of PPIs. Each dot represents one gene family. The X axis is the overall number of PPIs of the gene family with other gene families. The Y axis corresponds to the posterior number of HGT events.
Each interaction is given a confidence value (see Methods). We performed the same computation with different cutoff levels for the inclusion of interactions. In all cases, the negative correlation was highly significant (supplementary table S2A, Supplementary Material online). Notably, increasing the threshold reduces the number of included interactions (overall number of network interactions 62,642 and 5,861 for the lowest and highest confidence threshold, respectively) and the corresponding R coefficients (R = −0.422 and −0.298 for the lowest and the highest confidence threshold, respectively). Although the increase in the confidence cutoff may have reduced the number of false interactions, the decrease in the correlation strength suggests that using a stringent cutoff value significantly reduced the number of true interactions as well. Notably, we found very similar correlation levels between HGT and connectivity with exclusively low and mid confidence levels. These results (supplementary table S2B, Supplementary Material online) suggest that with the STRING confidence score method, even interactions with low confidence contribute to the connectivity barrier.
It may be claimed that the connectivity barrier is mainly the result of the most extreme cases in the connectivity spectrum, that is, that the majority of the signal arises from isolated gene families and hub gene families. We therefore repeated the correlation analysis after removing gene families with fewer than one or more than 50 interactions. This analysis shows that connectivity is also informative with respect to HGT for intermediate levels of interaction (R = −0.362, P value < 9.2 × 10^−66).
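For readers unfamiliar with the statistic used in this section, the sketch below computes Spearman's rank correlation as the Pearson correlation of average ranks. The connectivity and HGT values are hypothetical and are not taken from the data set.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's coefficient: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical gene families: number of PPIs and expected number of HGT events.
ppi = [0, 2, 5, 10, 40]
hgt = [3.1, 2.5, 2.6, 1.0, 0.2]
rho = spearman(ppi, hgt)
print(round(rho, 3))
```

Because the coefficient depends only on ranks, it is insensitive to the heavy right tail of hub gene families in the connectivity distribution.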
The Biological Function As a Factor Determining HGT Extent
Others and we have previously shown that the biological function of a gene family is important in determining its propensity to undergo HGT (Rivera et al. 1998; Nakamura et al. 2004; Merkl 2006; Choi and Kim 2007; Hao and Golding 2008; Kanhere and Vingron 2009; Cohen and Pupko 2010). Here, we show that the mean number of HGT events varies dramatically among functional categories (table 2, HGT columns). In agreement with previous studies, the lowest HGT levels are observed for the informational genes (involved in transcription and translation), where the most pronounced trend was found in genes associated with the ribosome and related to translation (COG functional category: “translation, ribosomal structure, and biogenesis”). The average expected number of HGT events per gene family in this category was below 0.47, substantially lower than the 1.73 events per gene family, which is the average over all gene families. Statistically significant differences were found among the 20 specific functional categories and also among the four “meta-categories” (P values < 1.7 × 10^−40 and 4.52 × 10^−34, Kruskal–Wallis test, respectively).
We further studied the association between functional category and the propensity for HGT by computing the relative transferability factor of each functional category (see Methods for more details). In table 3, we summarize the relative transferability of all functional categories and find that several functional categories have relative transferability that is significantly different from one. In agreement with the lower computed average HGT, the relative transferability value of the function “translation, ribosomal structure, and biogenesis” is 0.276, which is highly significant even after correction for multiple testing (P value < 3.55 × 10^−6, Fisher's exact test).
The classification of a gene family as transferable is dependent on an estimated “transferability cutoff” (see Methods). To verify that the obtained results are robust in this respect, we performed additional computations with both stricter and more permissive cutoffs. Changing the cutoff substantially shifts the estimated overall percentage of transferable genes, from 32.31% to 23.91% and 42.51% for the stricter and more permissive cutoffs, respectively. However, the relative transferability factors of the various functional categories were very similar (supplementary tables S3A and Supplementary Data, Supplementary Material online).
Disentangling Biological Function and Connectivity in Determining HGT Frequency
The complexity hypothesis relates two biological factors to the tendency of gene families to undergo HGT: connectivity and function (or biological process). Our results above show that HGT is influenced by each of these factors when analyzed separately. However, the various functional categories vary in terms of their connectivity. In table 2, we show the average connectivity of each functional category. Substantial differences are observed among the functional categories, with averages ranging from as high as 84.51 for the “translation, ribosomal structure, and biogenesis” category to only 7.11 for the “function unknown” category. This observed difference among categories is statistically significant in the comparison of both the specific categories and the meta-categories (P values < 7.77 × 10^−60 and 5.48 × 10^−60, Kruskal–Wallis test, respectively). Given this strong association between functional category and connectivity, the dependence of HGT on function may be a side effect of this variance in connectivity. Alternatively, it is possible that the observed effect of connectivity on HGT propensity is a by-product of the differences in functionality. Here, we tested for the effect of each of these factors while controlling for the effect of the other.
We computed the correlation between connectivity and transferability for each functional group separately. Our results show that the connectivity barrier holds even when the functional category factor is accounted for. Specifically, for the vast majority of functional categories, a significant negative correlation was observed between connectivity and transferability (table 1). However, the impact of the connectivity barrier was different across functional groups. We found that connectivity was the most influential in informational genes with Spearman's coefficient of −0.518, while in both metabolic and cellular groups, the coefficients were lower: −0.39 and −0.353, respectively. The lowest correlation between connectivity and transferability was found for the “poorly characterized” meta-category and for the “function unknown” category with Spearman's coefficients of −0.244 (P value 2.35 × 10^−09) and −0.152 (P value 0.0096), respectively. The only two functional categories in which the correlation was not significant after correction for multiple testing are “cell motility” and “lipid transport and metabolism.”
Table 1. The Spearman's Correlation (R) between Connectivity and Transferability Computed Separately for Various Functional Categories.
| Functional Category | R | P | Number of Gene Families |
| All | −0.422 | 8.18 × 10^−105 | 2,442 |
| Information storage and processing | −0.518 | 1.08 × 10^−26 | 382 |
| Cellular processes and signaling | −0.353 | 5.95 × 10^−15 | 487 |
| Metabolism | −0.39 | 2.94 × 10^−37 | 1,017 |
| Poorly characterized | −0.244 | 2.35 × 10^−09 | 626 |
| Translation, ribosomal structure, and biogenesis | −0.213 | 0.0112 | 150 |
| Transcription | −0.349 | 4.78 × 10^−04 | 105 |
| Replication, recombination, and repair | −0.444 | 1.98 × 10^−07 | 133 |
| Cell cycle control, cell division, and chromosome partitioning | −0.464 | 0.00727 | 35 |
| Defense mechanisms | −0.474 | 0.00723 | 34 |
| Signal transduction mechanisms | −0.235 | 0.0265 | 93 |
| Cell wall/membrane/envelope biogenesis | −0.289 | 9.72 × 10^−04 | 138 |
| Cell motility | −0.255 | 0.0577 | 57 |
| Intracellular trafficking, secretion, and vesicular transport | −0.331 | 0.0105 | 63 |
| Posttranslational modification, protein turnover, and chaperones | −0.411 | 1.03 × 10^−05 | 115 |
| Energy production and conversion | −0.407 | 2.50 × 10^−08 | 185 |
| Carbohydrate transport and metabolism | −0.461 | 2.35 × 10^−09 | 162 |
| Amino acid transport and metabolism | −0.387 | 6.65 × 10^−09 | 223 |
| Nucleotide transport and metabolism | −0.471 | 1.79 × 10^−05 | 81 |
| Coenzyme transport and metabolism | −0.513 | 5.47 × 10^−10 | 139 |
| Lipid transport and metabolism | −0.207 | 0.0831 | 71 |
| Inorganic ion transport and metabolism | −0.174 | 0.0359 | 151 |
| Secondary metabolites biosynthesis, transport, and catabolism | −0.329 | 0.0203 | 52 |
| General function prediction only | −0.299 | 1.57 × 10^−07 | 314 |
| Function unknown | −0.152 | 0.00968 | 312 |
Note.—The data were partitioned into groups of gene families based on the COG functions. The P values were corrected for multiple testing using false discovery rate method (Benjamini and Hochberg 1995).
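The false discovery rate correction named in the note can be sketched as follows. This is a generic implementation of the Benjamini–Hochberg step-up adjustment, not the authors' code, and the raw P values are hypothetical (the table reports values that are already corrected).

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw P values for four functional categories.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
print([round(p, 4) for p in adj])
```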
Connectivity and HGT Propensity of All Gene Families and Specific Functional Categories.
| Functional Category | Mean HGT | SE HGT | Mean PPI | SE PPI |
| All | 1.731 | 0.04 | 25.65 | 1.02 |
| Information storage and processing | 1.188 | 0.1 | 56.5 | 3.89 |
| Cellular processes and signaling | 1.54 | 0.08 | 26.45 | 2.94 |
| Metabolism | 1.781 | 0.06 | 21.16 | 0.97 |
| Poorly characterized | 2.107 | 0.08 | 16.75 | 2.2 |
| Translation, ribosomal structure, and biogenesis | 0.469 | 0.08 | 84.51 | 6.22 |
| Transcription | 1.374 | 0.17 | 43.35 | 9.45 |
| Replication, recombination, and repair | 1.856 | 0.21 | 47.03 | 7.56 |
| Cell cycle control, cell division, and chromosome partitioning | 1.704 | 0.35 | 16.8 | 3.62 |
| Defense mechanisms | 2.49 | 0.47 | 15.94 | 5.35 |
| Signal transduction mechanisms | 1.559 | 0.16 | 33.35 | 9.14 |
| Cell wall/membrane/envelope biogenesis | 1.671 | 0.16 | 16.97 | 2.49 |
| Cell motility | 1.008 | 0.16 | 15.12 | 2.62 |
| Intracellular trafficking, secretion, and vesicular transport | 0.961 | 0.14 | 14.19 | 2.54 |
| Posttranslational modification, protein turnover, and chaperones | 1.394 | 0.16 | 45.41 | 8.98 |
| Energy production and conversion | 1.952 | 0.14 | 24.41 | 2.42 |
| Carbohydrate transport and metabolism | 2.309 | 0.17 | 19.87 | 2.21 |
| Amino acid transport and metabolism | 1.625 | 0.12 | 23.91 | 2.34 |
| Nucleotide transport and metabolism | 1.336 | 0.2 | 33.69 | 5.66 |
| Coenzyme transport and metabolism | 1.337 | 0.15 | 17.39 | 1.79 |
| Lipid transport and metabolism | 1.287 | 0.18 | 31.99 | 4.04 |
| Inorganic ion transport and metabolism | 1.889 | 0.16 | 15.61 | 2.37 |
| Secondary metabolites biosynthesis, transport, and catabolism | 2.137 | 0.29 | 18.08 | 4.67 |
| General function prediction only | 1.998 | 0.11 | 26.33 | 4.28 |
| Function unknown | 2.217 | 0.11 | 7.112 | 0.67 |
Note.—The PPI (connectivity) and HGT values are computed for each functional category and for all gene families as reference. SE, standard error.
The Relative Transferability of Gene Families in Each Functional Category.
| Functional Category | Relative Transferability | P | Relative Transferability (MH) | P value (MH) |
| Information storage and processing | 0.608 | 0.00104 | 0.774 | 0.344 |
| Cellular processes and signaling | 0.866 | 0.408 | 0.874 | 0.617 |
| Metabolism | 1.078 | 0.614 | 1.138 | 0.498 |
| Poorly characterized | 1.314 | 0.0216 | 1.072 | 0.971 |
| Translation, ribosomal structure, and biogenesis | 0.276 | 2.84 × 10−06 | 0.418 | 0.0864 |
| Transcription | 0.668 | 0.318 | 0.73 | 0.617 |
| Replication, recombination, and repair | 1.05 | 0.9 | 1.241 | 0.617 |
| Cell cycle control, cell division, and chromosome partitioning | 0.883 | 0.903 | 0.899 | 0.977 |
| Defense mechanisms | 1.094 | 0.88 | 0.955 | 0.977 |
| Signal transduction mechanisms | 0.861 | 0.782 | 0.905 | 0.977 |
| Cell wall/membrane/envelope biogenesis | 1.13 | 0.726 | 1.116 | 0.977 |
| Cell motility | 0.321 | 0.0263 | 0.314 | 0.0909 |
| Intracellular trafficking, secretion, and vesicular transport | 0.633 | 0.408 | 0.564 | 0.349 |
| Posttranslational modification, protein turnover, and chaperones | 0.8 | 0.596 | 0.899 | 0.977 |
| Energy production and conversion | 1.206 | 0.408 | 1.33 | 0.344 |
| Carbohydrate transport and metabolism | 1.37 | 0.156 | 1.364 | 0.344 |
| Amino acid transport and metabolism | 0.984 | 0.943 | 1.067 | 0.977 |
| Nucleotide transport and metabolism | 0.836 | 0.782 | 0.962 | 0.977 |
| Coenzyme transport and metabolism | 0.769 | 0.408 | 0.79 | 0.617 |
| Lipid transport and metabolism | 0.78 | 0.614 | 0.928 | 0.977 |
| Inorganic ion transport and metabolism | 1.027 | 0.903 | 0.98 | 0.977 |
| Secondary metabolites biosynthesis, transport, and catabolism | 1.134 | 0.853 | 1.073 | 0.977 |
| General function prediction only | 1.18 | 0.408 | 1.059 | 0.977 |
| Function unknown | 1.334 | 0.0587 | 1.057 | 0.977 |
NOTE.—Relative transferability refers to the fraction of transferable gene families within each functional category divided by the fraction of transferable gene families among other gene families. This computation is repeated, once when all connectivity levels are aggregated and once when accounting for connectivity stratification using Mantel–Haenszel (MH) test. The P values were corrected for multiple testing using the false discovery rate method (Benjamini and Hochberg 1995).
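The aggregated (unstratified) relative transferability defined in the note is a ratio of two fractions. A minimal sketch follows; the function name and the counts are hypothetical and are used only to make the arithmetic concrete.

```python
def relative_transferability(transferable_in, total_in,
                             transferable_out, total_out):
    """Fraction of transferable families inside a category divided by
    the fraction of transferable families among all other families."""
    frac_in = transferable_in / total_in
    frac_out = transferable_out / total_out
    return frac_in / frac_out


# Hypothetical counts: 30 of 150 families in the category are transferable,
# compared with 200 of 500 families outside the category.
rt = relative_transferability(30, 150, 200, 500)
print(rt)
```

A value below one means that families in the category are transferable less often than the remaining families, and a value above one means the opposite.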
We next tested whether biological function is a determining factor for transferability when controlling for the connectivity level. We thus computed the relative transferability in each functional category while accounting for different levels of connectivity using the Mantel–Haenszel test (see Methods). Our results show that when controlling for connectivity, the impact of functional category on transferability drastically diminishes and is no longer significant for any of the functional categories (table 3). For example, when accounting for connectivity levels, the relative transferability of informational genes rises from 0.61 to 0.77. Similarly, for poorly characterized genes, the relative transferability decreases from 1.31 to 1.07. Importantly, using the Mantel–Haenszel test, after correction for multiple testing, none of the functional categories is found to have a relative transferability that is significantly different from one. The only exception is the functional category “translation, ribosomal structure, and biogenesis,” in which the relative transferability is significantly lower than one when the more permissive criterion for transferability is used (supplementary table S3B, Supplementary Material online). This result is not surprising because these gene families are known to be among the so-called “core” of the genome, which is highly resistant to HGT (e.g., Ciccarelli et al. 2006; Sorek et al. 2007). To conclude, these results demonstrate that when the connectivity level is taken into account, the functional category is not a significant factor in determining the propensity of gene families to undergo HGT events.
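The effect of stratifying by connectivity can be illustrated with a small sketch of a Mantel–Haenszel pooled ratio. It assumes the risk-ratio form of the estimator, which may differ from the exact statistic used in the study, and all counts, stratum definitions, and function names are hypothetical.

```python
def crude_ratio(strata):
    """Relative transferability with all connectivity levels aggregated."""
    a = sum(s[0] for s in strata)   # transferable families in the category
    n1 = sum(s[1] for s in strata)  # all families in the category
    c = sum(s[2] for s in strata)   # transferable families outside it
    n0 = sum(s[3] for s in strata)  # all families outside it
    return (a / n1) / (c / n0)


def mantel_haenszel_ratio(strata):
    """Mantel-Haenszel pooled risk ratio across connectivity strata."""
    num = 0.0
    den = 0.0
    for a, n1, c, n0 in strata:
        total = n1 + n0
        num += a * n0 / total
        den += c * n1 / total
    return num / den


# Each tuple: (transferable in category, total in category,
#              transferable outside,     total outside).
strata = [
    (8, 10, 80, 100),   # low-connectivity stratum (hypothetical)
    (20, 100, 2, 10),   # high-connectivity stratum (hypothetical)
]
crude = crude_ratio(strata)
pooled = mantel_haenszel_ratio(strata)
print(crude, pooled)
```

In this constructed example the category is dominated by highly connected families, so the aggregated ratio is well below one even though the category and the remaining families are equally transferable within each connectivity stratum. The pooled ratio is therefore one, which mirrors the qualitative pattern reported in table 3.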
The Connectivity Barrier Holds Both for Recent and Ancient Acquisitions
The stochastic mapping methodology infers branch-specific gain events. We tested whether the connectivity barrier exists for both recent and ancient transfers by partitioning the branches of the tree into two groups, recent and ancient. Figure 2 depicts the phylogeny used in this research with branches color-coded as either recent or ancient. Our results show that this is indeed the case: connectivity is a strong predictor of transferability for both recent and ancient HGT events, with Spearman's coefficients of −0.39 and −0.43 and P values of 6.01 × 10−89 and 2.9 × 10−108, respectively. Notably, protein interaction data were derived from contemporary experimental observations, and thus, our observations show that current information regarding connectivity is highly informative for ancient HGT events. These results may be explained by a slow evolutionary rate of the PPI network, that is, the connectivity of gene families in current microbes closely resembles that of hypothetical ancestral lineages. This interpretation is in agreement with the findings of Lercher and Pal (2008).
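The comparison between recent and ancient branches amounts to computing a Spearman rank correlation between connectivity and the expected number of gains, once per branch class. The following is a minimal sketch with tie-aware ranking; the data values and variable names are hypothetical, and the P value computation is omitted.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical per-family connectivity and expected gain counts,
# with gains summed separately over recent and ancient branches.
connectivity = [1, 5, 10, 50, 100]
recent_gains = [3.0, 2.5, 2.0, 0.5, 0.1]
ancient_gains = [2.0, 2.2, 1.0, 0.4, 0.3]

rho_recent = spearman(connectivity, recent_gains)
rho_ancient = spearman(connectivity, ancient_gains)
print(rho_recent, rho_ancient)
```

A negative coefficient in both branch classes corresponds to the pattern described above, in which highly connected gene families are gained less often regardless of the age of the transfer.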
Figure 2. The phylogenetic tree used in this research, with branches color-coded as recent or ancient. Recent branches are colored red and ancient branches are colored gray.
Controls and Additional Tests
The above results were validated with respect to several assumptions. First, we inferred HGT events assuming the tree topology of Ciccarelli et al. (2006). The results obtained were qualitatively the same when all computations were repeated assuming the rRNA tree (Yarza et al. 2008), with detailed results in supplementary tables S4 and Supplementary Data, Supplementary Material online. The Ciccarelli tree was chosen as the main reference because it obtained a higher maximum log-likelihood value than the rRNA tree (supplementary table S1, Supplementary Material online). Second, connectivity was inferred based on the STRING database (Jensen et al. 2009). The conclusions were essentially the same when interactions were extracted from the DIP database instead (Salwinski et al. 2004), with detailed results in supplementary table S2, Supplementary Material online.
Conclusions
Since it was first suggested, the complexity hypothesis has been debated: it was shown that for cases of homologous gene acquisition, the complexity barrier may be low (Wellner et al. 2007; Omer et al. 2010). Here, however, we demonstrate that gene family acquisition apparently has very different evolutionary characteristics and involves a substantial complexity barrier that is not restricted to particular protein functions. Our results are based on robust statistical models and methodologies and on a large corpus of phyletic data, which are radically different from those that were available when the complexity hypothesis was first suggested. Using these data and methods, we were able to quantify the extent to which HGT of gene families is determined by their functional category and by the number of protein–protein connections that characterize them. When assessing barriers to HGT and the importance of these factors in determining transferability, we found that high connectivity hinders HGT events. Finally, we demonstrated that the functional category of a gene family is an insignificant factor in determining HGT once the connectivity level is controlled for.
This study focused on the elucidation of factors that determine HGT. We note that an interesting direction for future research is to apply the methodology presented here to quantify and characterize gene family loss dynamics, that is, to elucidate the factors that determine the propensity for gene family loss (dispensability). The importance of gene loss in shaping microbial genomes in evolution has been studied and quantified both computationally (Charlebois and Doolittle 2004; Csuros and Miklos 2006; Marri et al. 2006; Borenstein et al. 2007; Wapinski et al. 2007) and experimentally (Moran et al. 2009), and both gene function and network connectivity have been suggested to play an important role (Krylov et al. 2003; Pal et al. 2006; Wolf et al. 2006; Ochman et al. 2007; Yosef et al. 2009). Notably, because gene loss is known to be much more common in parasitic bacteria, models that account for a covarion-like type of evolution with regard to gain and loss parameters (heterotachy) should be more suitable for analyzing gene loss dynamics. An important step forward in this direction is the recent work of Spencer and Sangaralingam (2009), which clearly shows that a covarion-type model of evolution can better capture gene gain and loss dynamics when reductive evolution in some lineages is evident.
Another interesting direction for future research is to build evolutionary models that explicitly consider the association between connectivity and the gain (and loss) rates. Such models are becoming increasingly relevant as microbial genomic data accumulate and the knowledge regarding PPI becomes more accurate.
We thank Daniel Yekutieli for his help with the statistical analysis. We thank Matthew Spencer for reviewing this paper and for providing helpful criticism and suggestions that significantly improved this manuscript. We thank Nimrod Rubinstein for critically reading the manuscript. T.P. is supported by a grant from the Israel Science Foundation (878/09) and by the National Evolutionary Synthesis Center (NESCent), National Science Foundation #EF-0905606. O.C. is a fellow of the Edmond J. Safra program in bioinformatics.
Author notes
Associate editor: Andrew Roger

