Eran Elhaik, Giddy Landan, Dan Graur, Can GC Content at Third-Codon Positions Be Used as a Proxy for Isochore Composition?, Molecular Biology and Evolution, Volume 26, Issue 8, August 2009, Pages 1829–1833, https://doi.org/10.1093/molbev/msp100
Abstract
The isochore theory depicts the genomes of warm-blooded vertebrates as a mosaic of long genomic regions that are characterized by relatively homogeneous GC content. In the absence of genomic data, the GC content at third-codon positions of protein-coding genes (GC3) was commonly used as a proxy for the GC content of isochores. Oddly, in the postgenomic era, GC3 is still sometimes used as a proxy for the GC composition of isochores. Here, we use genic and genomic sequences from human, chimpanzee, cow, mouse, rat, chicken, and zebrafish to show that GC3 explains only a very small proportion of the variation in GC content of long genomic sequences flanking the genes (GCf), and that what little correlation there is between GC3 and GCf decays rapidly with distance from the gene. The coefficient of variation of GC3 was found to be much larger than that of GCf and, therefore, GC3 and GCf values are not comparable with each other. Comparisons of orthologous gene pairs from 1) human and chimpanzee and 2) mouse and rat show strong correlations between their GC3 values, but very weak correlations between their GCf values. We conclude that the GC content at third-codon positions cannot be used as a stand-in for isochoric composition.
Introduction
Isochores were first defined by Macaya et al. (1976) as long (>300 kb) genomic domains with homogeneous GC content. The genomes of warm-blooded vertebrates (mammals and birds) were described as a mosaic of isochores of alternating low and high GC contents, as opposed to the genomes of cold-blooded vertebrates (fishes and amphibians) that were supposed to lack GC-rich isochores (Bernardi et al. 1985; Bernardi 2000).
In the absence of genomic sequences, the GC composition at third-codon positions of protein-coding genes (GC3) was commonly used as a proxy for the GC composition of the isochore in which the gene resides (Bernardi et al. 1985; Aota and Ikemura 1986; Mouchiroud et al. 1991; Kadi et al. 1993; Duret et al. 1995; Zoubak et al. 1996; Bernardi et al. 1997; Robinson et al. 1997; Galtier and Mouchiroud 1998). In recent years, genomic sequences have become available and various methods for segmenting genomes into compositionally homogeneous segments have been proposed. Oddly, however, the practice of using GC3 as a proxy for the GC content of flanking isochores (GCf) still persists (Bernardi 2001; Ponger et al. 2001; Alvarez-Valin et al. 2002; D'Onofrio 2002; D'Onofrio et al. 2002; Scaiewicz et al. 2006; Costantini and Bernardi 2008), even though protein-coding regions, from which the value of GC3 is computed, comprise less than 5% of the human genome (IHGSC 2001) and about 10% of the chicken genome (ICGSC 2004).
In support of this common practice, several small-scale analyses have been conducted (Aissani et al. 1991; Clay et al. 1996; Musto et al. 1999; Eyre-Walker and Hurst 2001). For example, Eyre-Walker and Hurst (2001) found a strong correlation between the GC3 values of 369 genes located on human chromosomes 21 and 22 and the GC content of 25-kb upstream and downstream flanking regions. Moreover, it has been argued that GC3 is a more suitable indicator of flanking GC content than the mean GC content of all three codon positions (Bernardi 2000; Eyre-Walker and Hurst 2001).
The presumed relationship between GC3 and isochores has been used numerous times in the literature to study isochore function and evolution (Aota and Ikemura 1986; Kadi et al. 1993; Duret et al. 1995; Zoubak et al. 1996; Bernardi et al. 1997; Robinson et al. 1997; Galtier and Mouchiroud 1998; Eyre-Walker and Hurst 2001; Alvarez-Valin et al. 2002; Duret et al. 2002; Vinogradov 2003; Chojnowski et al. 2007). The purpose of this study is to test the appropriateness of GC3 as a stand-in for GC content of isochores.
Methods
Data Retrieval and Filtering
Coding sequences in the RefSeq database are annotated as “inferred,” “model,” “predicted,” “provisional,” “reviewed,” or “validated.” To increase the reliability of our data, we included only genes annotated as predicted, provisional, reviewed, or validated (PPRV). We used only fully sequenced eukaryotic genomes with more than 3,000 PPRV coding sequences, and only PPRV coding sequences longer than 300 bp that had at least 200 kb of sequence upstream and 200 kb downstream. Six species met our criteria: Homo sapiens (build 36.2), Bos taurus (build 3.1), Danio rerio (build 1.1), Gallus gallus (build 2.1), Mus musculus (build 36.1), and Rattus norvegicus (build 3.4). The genomes were downloaded from the NCBI ftp web site (ftp://ftp.ncbi.nlm.nih.gov/genomes/). Using the NCBI data file gene2accession (version 01/16/07), we retrieved all the coding sequences and their chromosomal locations for every genome. We then extracted the coding sequences from the RefSeq database and their flanking sequences from the downloaded genomic sequences. Introns were ignored because, to the best of our knowledge, they have not been used to predict “isochores.” Our data set is shown in table 1.
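The filtering criteria above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used in the study: the `CodingRecord` layout, its field names, and the 0-based coordinate convention are hypothetical. Only the status labels and the 300-bp and 200-kb thresholds come from the text.

```python
from dataclasses import dataclass

# Status labels and thresholds are taken from the text.
ACCEPTED_STATUS = {"predicted", "provisional", "reviewed", "validated"}
MIN_CDS_LENGTH = 300   # bp; coding sequences must be longer than this
MIN_FLANK = 200_000    # bp of sequence required on each side of the gene


@dataclass
class CodingRecord:
    """Hypothetical record layout, assumed for illustration only."""
    accession: str
    status: str        # RefSeq status label, lower case
    cds_length: int    # coding sequence length in bp
    start: int         # 0-based start of the gene on its chromosome
    end: int           # 0-based exclusive end of the gene
    chrom_length: int  # length of the chromosome in bp


def passes_filters(rec: CodingRecord) -> bool:
    """Return True if the record satisfies all three criteria."""
    has_flanks = (rec.start >= MIN_FLANK
                  and rec.chrom_length - rec.end >= MIN_FLANK)
    return (rec.status in ACCEPTED_STATUS
            and rec.cds_length > MIN_CDS_LENGTH
            and has_flanks)
```

A genome would then be retained only if more than 3,000 of its records pass `passes_filters`.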
Table 1. GC3 and GC123 for Six Vertebrate Taxa
| Species | No. of Genes | GC3 Mean (%) | GC3 σ | GC3 Range (%) | GC123 Mean (%) | GC123 σ | GC123 Range (%) |
| Homo sapiens | 17,451 | 60 | 17 | 22–97 | 45 | 6 | 32–80 |
| Bos taurus | 5,522 | 62 | 16 | 25–97 | 43 | 6 | 33–76 |
| Mus musculus | 17,009 | 59 | 11 | 21–96 | 43 | 5 | 27–76 |
| Rattus norvegicus | 8,983 | 59 | 11 | 23–96 | 42 | 6 | 33–73 |
| Gallus gallus | 3,036 | 56 | 15 | 28–99 | 42 | 5 | 36–80 |
| Danio rerio | 4,344 | 56 | 8 | 27–92 | 35 | 2 | 34–68 |
The mean, standard deviation (σ), and range are shown for each measure.
The orthologous genes for H. sapiens (NCBI36) and Pan troglodytes (CHIMP2.1), and for M. musculus (NCBIM37) and R. norvegicus (RGSC3.4), were identified with the BioMart tool (http://www.biomart.org/biomart/martview/) using the Ensembl implementation (Kasprzyk et al. 2004). The genomes were downloaded from Ensembl and the flanking sequences were extracted as previously described. Our data set included 13,078 and 15,344 pairs of orthologous genes and their flanking regions for Homo–Pan and Mus–Rattus, respectively.
Statistical Tests
We employed three analyses to test the relationship between GC3 and the GC content of the flanking regions of the gene (GCf). In the first analysis, we calculated four genic measures: the GC content at each of the three codon positions (GC1, GC2, and GC3) and the average GC content (GC123). Next, we calculated the GC content of 40 nonoverlapping 5-kb windows upstream and downstream of the gene. For each genome, we calculated the coefficient of determination ($r^2$) between every genic measure and the GCf of every window. The significance of $r^2$ was tested with the Bonferroni correction (Sokal and Rohlf 1995, pp. 240, 702–703) to adjust for multiple comparisons.
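The quantities in the first analysis can be illustrated with a short Python sketch. It is written under stated assumptions and is not the authors' code: the function names are hypothetical, ambiguous bases such as N are excluded from the denominator, the coding sequence is taken to be in frame, and each flank is assumed to be oriented so that index 0 is the base adjacent to the gene.

```python
def gc_fraction(seq):
    """GC fraction of a sequence, ignoring anything that is not A, C, G, or T."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / counted


def codon_gc(cds):
    """GC1, GC2, GC3, and GC123 of an in-frame coding sequence."""
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a positive multiple of 3")
    return {
        "GC1": gc_fraction(cds[0::3]),   # first codon positions
        "GC2": gc_fraction(cds[1::3]),   # second codon positions
        "GC3": gc_fraction(cds[2::3]),   # third codon positions
        "GC123": gc_fraction(cds),       # all positions together
    }


def window_gc(flank, window=5000, n_windows=40):
    """GC fraction of consecutive nonoverlapping windows, nearest the gene first."""
    return [
        gc_fraction(flank[i * window:(i + 1) * window])
        for i in range(n_windows)
        if len(flank) >= (i + 1) * window   # skip incomplete windows
    ]


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r squared is undefined when one variable is constant")
    return sxy * sxy / (sxx * syy)
```

For one genome and one window index, `x` would hold a genic measure such as GC3 for every gene and `y` the GC content of that window for the same genes. Repeating the calculation over all windows gives one $r^2$ curve per measure.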
To test the effect of window size on the correlations, we used windows ranging from 5 to 100 kb, but the results did not change. We also repeated all calculations using only genes whose 200-kb flanking regions did not overlap either with other 200-kb flanking regions or with other known genes. The results were unaffected.
In the second analysis, we compared the breadth of the distributions of GC3 and GCf by using the coefficient of variation (Sokal and Rohlf 1995, pp. 57–59; Zar 1999, p. 40). We used flanking regions of 200 kb upstream and downstream of the gene to estimate GCf.
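As a hedged illustration of the second analysis, the coefficient of variation is the standard deviation divided by the mean, which makes the spread of two measures with different means comparable. The function names below are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumes n - 1 in the variance)."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(values) / mean


def cv_from_summary(mean, sd):
    """Coefficient of variation from an already computed mean and standard deviation."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return sd / mean


# Rounded human values from table 1: GC3 has mean 60 and sigma 17,
# GC123 has mean 45 and sigma 6.
human_gc3_cv = cv_from_summary(60, 17)
human_gc123_cv = cv_from_summary(45, 6)
```

Because the table values are rounded to whole percentages, coefficients derived from them are only approximate.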
In the third analysis, we compared the orthologous gene pairs from Homo–Pan and Mus–Rattus and calculated the $r^2$ between the paired values of GC1, GC2, GC3, GC123, and GCf. We used nonoverlapping flanking regions of 5 kb up to 200 kb upstream and downstream of the gene. The significance of $r^2$ was tested with the Bonferroni correction (Sokal and Rohlf 1995, pp. 240, 702–703) to adjust for multiple comparisons.
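The Bonferroni correction used here divides the significance level by the number of tests in the family. The sketch below shows only that step. The function names are hypothetical, the p values are assumed to have been computed elsewhere, and the default level of 0.01 is the one reported for the orthologous-pair results. How the family of tests was defined in the study is not stated, so the family size is left to the caller.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold for a family of n_tests comparisons."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.01):
    """Flag each p value as significant or not after the Bonferroni correction."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]
```

For example, if one genic measure were tested against 40 upstream and 40 downstream windows and those 80 tests were treated as one family (an assumption), `bonferroni_threshold(0.01, 80)` would give a per-test threshold of 0.000125.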
Results
Means, standard deviations, and ranges of GC3 and GC123 for the different genomes are shown in table 1. We calculated the coefficient of determination between GC1, GC2, GC3, and GC123, on the one hand, and GCf, on the other, and found that for most genomes the GCf variation cannot be explained by any of these measures (fig. 1a).
Fig. 1. GC3 cannot predict GCf. Coefficients of determination ($r^2$) between GC1 (green), GC2 (turquoise), GC3 (blue), and GC123 (black), on the one hand, and GCf in 5-kb windows upstream and downstream of the gene, on the other. Calculations were carried out (a) for all genes and (b) for genes whose 200-kb flanking regions do not overlap with other 200-kb flanking regions or with other genes. The number of genes is noted.
The trend of decreasing $r^2$ values with increasing distance from the gene was observed in all genomes for both upstream and downstream directions. In the human and cow genomes, this trend was clearly observed as a sharp decrease in $r^2$ values for flanking regions within close range of the gene followed by a moderate decrease for the distant flanking regions. In these genomes, GC3 explained only a very small proportion of the variation in GC content of long genomic sequences flanking the genes (GCf). The GCf variation was not explained at all by any genic measure in mouse, rat, chicken, and zebrafish genomes. When we eliminated genes with overlapping flanking regions (fig. 1b), the coefficients of determination decreased but the overall trends remained the same.
When comparing the explanatory abilities of the four genic measures, we see that GC123 is a stronger predictor of GCf than GC3, although the difference is not significant. With the exception of cow and chicken, the four measures follow the inequality $r^2_{(\mathrm{GC123},\,\mathrm{GCf})} > r^2_{(\mathrm{GC3},\,\mathrm{GCf})} > r^2_{(\mathrm{GC1},\,\mathrm{GCf})} > r^2_{(\mathrm{GC2},\,\mathrm{GCf})}$ in all genomes. Additionally, we did not observe any correlation between coding sequence size and GCf.
The distributions of GC3 and mean GCf for 200 kb upstream and downstream of the gene are shown in figure 2. The shape of the GCf distribution is not affected by the size of the flanking regions (results are not shown) and, therefore, we only present the distribution for 200 kb. We note that, on average, the coefficient of variation for GCf is considerably smaller than that for GC3.
Fig. 2. Frequency distribution of GC3 (blue) and 200-kb GCf (red). Coefficients of variation are shown.
The frequency distribution of human GC content at all codon positions as well as in flanking regions of size 200 kb is plotted in figure 3. Interestingly, the distributions of GC2 and GCf are very similar in all the genomes, although GC2 explains less than 9% of the variation in GCf. All other genomes show a similar pattern of distributions and are, therefore, not shown.
Fig. 3. GC content in codon positions: GC1 (green), GC2 (turquoise), GC3 (blue), and 200-kb flanking regions (dashed red) in human.
Another way to test the evolutionary relationship between GC3 and GCf is to compare the GC3 and GCf of orthologous genes from two genomes. If the claim that the same natural processes occurred in both GC3 and GCf is true, then GC3 should be a good predictor of GCf and both GC3 and GCf would be highly correlated. We found a strong relationship between all genic measures (GC1, GC2, GC3, and GC123) of the orthologous genes. For the pair Homo–Pan: , , , and . For the pair Mus–Rattus: , , , and . In contrast, the correlation between GCf values was very weak. Figure 4 presents the $r^2$ between GC3 and GCf of orthologous genes for Homo–Pan and Mus–Rattus. For Homo–Pan, the $r^2$ for GCf ranges from 0.21 to 0.45 with a mean of 0.3. For Mus–Rattus, the $r^2$ for GCf ranges from 0.01 to 0.2 with a mean of 0.03. All results were significant at a 0.01 significance level. The decrease in $r^2$ values shows that GCf is not conserved among orthologous genes. The differences in $r^2$ values between the upstream and downstream directions were not significant for any genome.
Fig. 4. Coefficient of determination ($r^2$) between GCf values (circles) surrounding orthologous genes in Homo–Pan (left panel) and Mus–Rattus (right panel). The $r^2$ for GC3 is shown as a square at 0 on the x-axis. The correlation of GC3 values for orthologous genes is shown in the inset.
Discussion
GC3 is routinely used as a proxy for the GC composition of isochores (Bernardi 2001; Ponger et al. 2001; Alvarez-Valin et al. 2002; D'Onofrio 2002; D'Onofrio et al. 2002; Scaiewicz et al. 2006; Costantini and Bernardi 2008), although to the best of our knowledge, the relationship between GC3 and the GC content of very long flanking regions (the presumed size of isochores) has never been tested on a large genomic or taxonomic scale. Previous analyses used few genes and flanking regions that were so short as to be completely irrelevant to the definition of isochores (Aissani et al. 1991; Clay et al. 1996; Musto et al. 1999; Eyre-Walker and Hurst 2001).
Our analyses tested the ability of four genic composition measures (GC1, GC2, GC3, and GC123) to predict the GC content in flanking regions 5′ and 3′ of the gene. Because GC3 is mostly unconstrained by functional requirements, that is, by the need to code specific amino acids, the third-codon position is a natural candidate for a predictive proxy of flanking GC content. We note, however, that a proxy must be able to explain most of the variation in GCf, not merely be correlated with it. Our analyses reveal that GC3 explains very little of the variation in GC content of large flanking regions. Moreover, we see that the predictive power either decreases rapidly the further one gets upstream and downstream of the gene or does not exist at all. Our orthologous gene pair analysis indicates that different evolutionary processes affect codon usage (GC3) and flanking regions (isochores) and, therefore, GC3 cannot be used to predict GCf. Finally, we note that the predictive power of GC3 is almost nonexistent in nonhuman vertebrates.
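The difference between being correlated with GCf and explaining its variation can be made concrete with a short worked calculation. The sample size is the number of human genes in table 1. The correlation value $r = 0.05$ is hypothetical and chosen only for illustration, and the $t$ statistic is the standard test of a Pearson correlation against zero, which is not necessarily the test used in this study. For a simple linear regression with an intercept:

$$
r^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}, \qquad t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}
$$

$$
n = 17{,}451,\quad r = 0.05: \qquad r^2 = 0.0025, \qquad t = \frac{0.05\sqrt{17{,}449}}{\sqrt{0.9975}} \approx 6.6
$$

With this many genes, a correlation of that size would be significant at any conventional level, yet it would account for only 0.25% of the variation in GCf.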
We suggest that all associations between isochores and genic features (e.g., gene length, gene density, and chromosomal bands) that have been reported or suggested in the literature should be reevaluated if GC3 was used as a proxy for the GC content of isochores, as was almost invariably done in the past.
Sometimes, GC3 is used when genomic sequences are not available (Galtier 2003; Hamada et al. 2003; Montoya-Burgos et al. 2003; Romero et al. 2003; Cruveiller et al. 2004; Federico et al. 2004; Gu and Li 2006; Chojnowski et al. 2007; Fortes et al. 2007; Chojnowski and Braun 2008). We show here that in all probability GC3 lacks predictive power as far as large flanking regions are concerned.
This work was supported in part by National Science Foundation grant DBI-0543342 to D.G.
Author notes
Takashi Gojobori, Associate Editor



