Variations in GC content between genomes have been extensively documented. Genomes with comparable GC contents can, however, still differ in the apportionment of the G and C nucleotides between the two DNA strands. This asymmetric strand bias is known as GC skew. Here, we have investigated the impact of differences in nucleotide skew on the amino acid composition of the encoded proteins. We compared orthologous genes between animal mitochondrial genomes that show large differences in GC and AT skews. Specifically, we compared the mitochondrial genomes of mammals, which are characterized by a negative GC skew and a positive AT skew, to those of flatworms, which show the opposite skews for both GC and AT base pairs. We found that the mammalian proteins are highly enriched in amino acids encoded by CA-rich codons (as predicted by their negative GC and positive AT skews), whereas their flatworm orthologs were enriched in amino acids encoded by GT-rich codons (also as predicted from their skews). We found that these differences in mitochondrial strand asymmetry (measured as GC and AT skews) can have very large, predictable effects on the composition of the encoded proteins.
Genome-wide biases in nucleotide content have been extensively studied in a wide variety of organisms. 1 During the past decade, there has been accumulating evidence that these variations at the DNA level can result in parallel changes in the frequencies of amino acids in the encoded proteins. For example, the GC content of bacterial genomes has been shown to influence the amino acid composition of the proteome. 2 , 3
In addition to the variations in GC content, however, bacterial genomes can also show significant compositional asymmetry between the two DNA strands. 4 , 5 This strand asymmetry is usually measured as GC or AT skew 6 (see Methods) and it can be due to such factors as different substitutional patterns between the leading and lagging strands during replication. 4 , 7 , 8 By comparing the amino acid compositions of proteins encoded on the leading strand with those encoded on the lagging strand, it has been shown that these nucleotide skews can also affect the amino acid composition of bacterial proteins. 9 , 10
Among eukaryotes, the correlations between nucleotide content and amino acid compositions have been studied primarily in animal mitochondrial genomes. Animal mitochondria show considerable variation between species in both GC content and GC skew. 6 It has been shown that the variations in GC content affect the amino acid composition of the encoded proteins 11 but the effects of variations in nucleotide skew between genomes have not been studied. The recent availability of a very large number of completely sequenced mitochondrial genomes allows us to fill this gap. Mammalian mitochondria, including those of humans, are characterized by negative GC skews and positive AT skews, i.e. the major coding strand of mammalian mitochondria is relatively rich in the nucleotides C and A, and correspondingly poor in G and T. For instance, as noted by Perna and Kocher (1995), 6 although the human mitochondrial genome contains more than 40% GC pairs, the frequency of G on the coding strand is only 5%. The strength of the skew varies between species, and some species of invertebrate show opposite skews to those found in mammals, i.e. the coding strand is rich in G and poor in C. 6
In recent years, there has been an explosion of data on mitochondrial whole genome sequences. Since our goal was to measure the effect of strand asymmetry on amino acid composition, we chose a group of species, the Platyhelminthes (flatworms), in which the GC and AT skews are opposite to those seen in mammals. 12 , 13 Among the flatworms, the coding sequences are rich in G and T and correspondingly poor in C and A, i.e. a complete contrast to the patterns seen in mammals. For example, the main mitochondrial coding strand in Schistosoma mansoni (a flatworm) contains only 6.7% C, but it contains 25% G. 12 Despite the contrasting GC and AT skew patterns, however, both mammals and flatworms contain essentially the same set of homologous mitochondrial genes and they span similar ranges of GC content. Thus a comparison between these two groups provides us with an opportunity to assess the potential effects of DNA strand bias on the amino acid composition of a well-characterized set of orthologous proteins.
We downloaded all of the publicly available complete mitochondrial (mt) genome (mtDNA) sequences from both mammals and Platyhelminthes from the NCBI RefSeq organelle genome database ( http://www.ncbi.nlm.nih.gov/genomes/ORGANELLES/mztax_short.html ) (released in July, 2006). There were a total of 170 mt genomes from mammals and 13 mt genomes from Platyhelminthes (Supplementary Table S1).
The nucleotide frequencies and compositional biases of mtDNA sequences were calculated from the ‘ + ’-strand (the major coding strand used by NCBI for annotation) for each species. The GC and AT asymmetry was measured as GC and AT skews according to the formulae given in Perna and Kocher 6 : GC-skew = (G − C)/(G + C); AT-skew = (A − T)/(A + T), where G, C, A, and T are the counts of the four nucleotides on that strand.
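As an illustration of these two formulae, the following minimal Python sketch computes both skews from a single strand. The function name `nucleotide_skews` and the toy sequence are hypothetical and are not taken from the original analysis; ambiguous bases are simply ignored, which is an assumption.

```python
from collections import Counter


def nucleotide_skews(seq):
    """Return (GC-skew, AT-skew) for one strand.

    GC-skew = (G - C) / (G + C); AT-skew = (A - T) / (A + T).
    Characters other than G, C, A, T are ignored (an assumption).
    """
    counts = Counter(seq.upper())
    g, c, a, t = (counts[base] for base in "GCAT")
    gc_skew = (g - c) / (g + c) if (g + c) else 0.0
    at_skew = (a - t) / (a + t) if (a + t) else 0.0
    return gc_skew, at_skew


# Toy sequence (hypothetical, not real data): 1 G, 3 C, 4 A, 2 T
gc, at = nucleotide_skews("CCCGAAAATT")
print(gc, at)
```

A strand that is rich in C and A, as described for mammals, gives a negative GC skew and a positive AT skew under this definition.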
We predicted the amino acid compositions based on the mitochondrial genetic code (Table 1 ) by partitioning the mitochondrial codons into CA-rich codons, GT-rich codons, and other codons. The frequency of synonymous codon usage was measured by the nucleotide content of G + T or C + A at the third codon positions of fourfold degenerate codon families: GGN (Glycine), GTN (Valine), CGN (Arginine), ACN (Threonine), GCN (Alanine), and CCN (Proline) (Supplementary Fig. S1).
GT-rich codons (italic) include GT, TG, GG, TT codons at the first two codon positions. TTA and TTG are excluded as there are other codons for Leu. CA-rich codons (bold) include CA, AC, CC, AA codons at the first two codon positions. Different codon assignments in Mammals and Platyhelminthes are underlined. In Platyhelminthes, AGA and AGG code for Ser, ATA for Ile, and AAA for Asn. The numbers following each codon are codon usage per thousand codons in the 11 conserved proteins of mammals (the first number) and Platyhelminthes (the second number).
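The partition rule stated in this legend can be sketched in a few lines of Python. This is only an illustration of the rule as written (classification by the first two codon positions, with TTA and TTG excluded); the names `codon_class`, `GT_PREFIXES`, and `CA_PREFIXES` are hypothetical.

```python
from itertools import product

# First-two-position prefixes, as listed in the Table 1 legend
GT_PREFIXES = {"GT", "TG", "GG", "TT"}
CA_PREFIXES = {"CA", "AC", "CC", "AA"}
# TTA and TTG are excluded because Leu is also encoded by other codons
EXCLUDED = {"TTA", "TTG"}


def codon_class(codon):
    """Classify a codon as 'GT-rich', 'CA-rich', or 'other'."""
    codon = codon.upper()
    if codon in EXCLUDED:
        return "other"
    prefix = codon[:2]
    if prefix in GT_PREFIXES:
        return "GT-rich"
    if prefix in CA_PREFIXES:
        return "CA-rich"
    return "other"


# Classify all 64 codons
classes = {"".join(c): codon_class("".join(c)) for c in product("ACGT", repeat=3)}
gt_rich = sorted(c for c, k in classes.items() if k == "GT-rich")
ca_rich = sorted(c for c, k in classes.items() if k == "CA-rich")
print(len(gt_rich), len(ca_rich))
```

Under this rule, 14 codons fall in the GT-rich class and 16 in the CA-rich class; the remaining 34 are "other".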
The statistical significance of the average differences in amino acid composition between the two groups of species was scored using a Student's t -test.
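For readers who wish to reproduce this kind of comparison, the sketch below computes a two-sample Student's t statistic with pooled variance. The text does not state whether equal variances or a two-tailed test were assumed, so the pooled form here is an assumption; the function name `student_t` and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def student_t(sample_a, sample_b):
    """Return (t, degrees of freedom) for a pooled-variance two-sample t-test."""
    na, nb = len(sample_a), len(sample_b)
    df = na + nb - 2
    # Pooled estimate of the common variance (equal-variance assumption)
    pooled = ((na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)) / df
    t = (mean(sample_a) - mean(sample_b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df


# Toy values (hypothetical, not real data)
t_stat, dof = student_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(t_stat, dof)
```

Converting the statistic to a P value requires the t distribution; if SciPy is available, `scipy.stats.ttest_ind` returns both the statistic and the P value directly.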
Results and discussion
We first confirmed the published reports (see Introduction) of contrasting GC and AT skews in the mitochondria of mammals and flatworms. As can be seen from Fig. 1 , there is a negative GC skew, and a positive AT skew, in the major coding strand among the mammalian species, whereas the opposite is true among the flatworms. Despite the variations between species within both groups, there is a large average difference between them and this difference between the groups is statistically highly significant ( P < 0.0001). As previously noted by Perna and Kocher 6 and Le et al., 12 the major coding sequence of mammalian mtDNA is relatively rich in C and A, whereas G and T are much more common in the flatworm coding sequences.
Despite their differences in strand asymmetry, the major coding strands of the two groups encode essentially the same set of 11 conserved mitochondrial proteins including cytochrome b, three subunits of cytochrome c oxidase (subunit 1, 2, and 3), six subunits of NADH dehydrogenase (subunit 1, 2, 3, 4, 4L, and 5), and ATP synthase F0 subunit 6 (ATP6). The exception is that NADH dehydrogenase subunit 6 (ND6) is encoded on the major coding strand in flatworms, but is encoded on the opposite strand in mammals.
Given the contrasting patterns of mitochondrial DNA strand asymmetry between these two groups of animals, we wished to investigate whether there was a corresponding difference between the two groups in the frequencies of encoded amino acids. Specifically, because of their negative GC skews and positive AT skews (which reduce the frequencies of G and T nucleotides on the coding strand), we expected the mammalian coding strands to encode proteins that are relatively low in the proportions of cysteine (C), valine (V), phenylalanine (F), glycine (G), and tryptophan (W), all of which are encoded by GT-rich codons (Table 1 ). This is exactly what we observed (Fig. 2A). On average, the combined proportions of these five amino acids among the mammals are approximately half the value observed in the flatworm orthologs. Since these GC and AT skews result in a corresponding enrichment of C and A nucleotides on the mammalian coding strand, we expected the mammalian proteins to show a corresponding relative increase in the proportions of glutamine (Q), threonine (T), proline (P), histidine (H), asparagine (N), and lysine (K), all of which are encoded by CA-rich codons. Again, this prediction is borne out (Fig. 2B) and again the average difference between the two groups of species is approximately twofold. Not only are the average differences in the predicted direction, but they are statistically highly significant ( P < 0.0001) and they are consistent over all species within each group (see details in Supplementary Table S1). In addition to the consistency over species, there is also a consistency over all amino acids within the two codon groups. These differences are surprisingly large, given that we are dealing with a set of conserved orthologous proteins. Although, taken as a group, the amino acid frequencies show approximately a twofold difference between the two groups (Fig. 2 ), at the level of individual amino acids some of these differences are threefold or greater (Fig. 3 ). Again, the individual amino acid differences shown in Fig. 3 are highly statistically significant ( P < 0.0001). Thus, we can conclude that strand asymmetries at the level of DNA have had a major influence on the composition of these mitochondrial protein sequences.
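The combined proportions compared here can be computed with a short sketch such as the one below, which tallies the GT-rich (CVFGW) and CA-rich (QTPHNK) amino acid groups in a protein sequence given in one-letter code. The function name `group_fractions` and the toy peptide are hypothetical.

```python
GT_RICH_AA = set("CVFGW")   # amino acids encoded by GT-rich codons
CA_RICH_AA = set("QTPHNK")  # amino acids encoded by CA-rich codons


def group_fractions(protein):
    """Return (GT-rich fraction, CA-rich fraction) of a one-letter protein sequence."""
    residues = [r for r in protein.upper() if r.isalpha()]
    n = len(residues)
    if n == 0:
        return 0.0, 0.0
    gt = sum(r in GT_RICH_AA for r in residues) / n
    ca = sum(r in CA_RICH_AA for r in residues) / n
    return gt, ca


# Toy peptide (hypothetical, not real data): 5 GT-rich and 3 CA-rich residues out of 10
gt_frac, ca_frac = group_fractions("MCVFGWQTPL")
print(gt_frac, ca_frac)
```

Applying such a function to the concatenated orthologous proteins of each species would give one pair of fractions per genome, which is the kind of per-species value that a group comparison requires.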
These results show that differences in the patterns of strand asymmetry between the coding and template strands of a mitochondrial gene can produce very significant changes in the amino acid composition of the encoded proteins. The magnitude of these changes is comparable to those noted previously by Foster et al. 11 for the effects of differences in mitochondrial nucleotide composition, i.e. GC content. It should be noted, however, that the strand asymmetries described in this study do not affect the same subsets of the amino acids as those affected by changes in GC content. As an illustrative example, we can compare a single pair of mitochondrial genomes, one mammal and one flatworm: those of the red deer ( Cervus elaphus ) and the liver fluke ( Fasciola hepatica ). In both species, the overall nucleotide content of the mitochondrial coding sequences is virtually identical at 38% GC, and the proportions of GC-rich (GARP) and AT-rich (FYMINK) amino acids (see Foster et al. 11 for details) are also very similar between these two species. In other words, their similarity in nucleotide composition is reflected in a similarity in the proportions of GC-rich and AT-rich amino acids. But when we compare the same two species for the levels of CA-rich (QTPHNK) and GT-rich (CVFGW) amino acids, we see large differences reflecting the differences in strand asymmetry between the mammals and the flatworms. For example, the liver fluke proteins contain more than twice as many valine residues and more than five times as many cysteine residues as do their orthologs in the red deer. On the other hand, the liver fluke proteins have approximately a third as many glutamine and threonine residues as are found in the red deer. These differences are also statistically highly significant ( P < 0.001) and they are entirely consistent with what we see for the average differences between mammals and flatworms (Fig. 3 and Supplementary Table S1).
A possible alternative explanation for these results is that the amino acid differences are the cause, rather than the consequence, of the strand asymmetries. We can exclude this possibility, however, for two reasons. First, there is an even larger strand asymmetry at the synonymous codon sites (Supplementary Fig. S1), suggesting that the nucleotide skew is counterbalanced, to some extent, by functional constraint at the protein level. In other words, protein function does have an effect, but as a constraint rather than a cause. Another way to illustrate this point is to calculate the GC skew at each codon position separately. The results (Supplementary Table S2) show that the greatest differences in GC skew occur at the third codon position and the least skew occurs at the second position. This is consistent with the fact that many changes at the third codon position alter the codon usage but do not affect the protein sequence. Second, in mammalian mitochondria, one gene ( ND6 ) is encoded on the opposite strand from the other 12 genes and, in accordance with our prediction, the amino acid composition of this mammalian protein displays a pattern that is similar to that of the flatworm proteins rather than the other 12 mammalian proteins (Supplementary Fig. S2). Both of these observations indicate that the primary effect is at the level of nucleotide asymmetry between the two strands of the mitochondrial genome and that this DNA bias causes a secondary effect at the level of protein composition.
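The per-position calculation mentioned above can be sketched as follows, assuming an in-frame coding sequence that starts at the first codon position. The function name `gc_skew_by_codon_position` and the toy sequence are hypothetical; any trailing partial codon is dropped.

```python
def gc_skew_by_codon_position(cds):
    """Return [skew1, skew2, skew3]: GC skew at codon positions 1, 2, and 3.

    Assumes `cds` is in frame; a trailing partial codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    skews = []
    for pos in range(3):
        bases = cds[pos:usable:3]  # every third base, starting at this codon position
        g, c = bases.count("G"), bases.count("C")
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return skews


# Toy in-frame sequence (hypothetical, not real data): codons GCC, GCA, GCT
skews = gc_skew_by_codon_position("GCCGCAGCT")
print(skews)
```

Comparing the three values between species groups shows at which codon position the skew difference is largest.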
We performed a number of further tests to explore the interplay between functional constraint at the protein level and nucleotide skew at the DNA level. Since we limited our comparison to orthologous mitochondrial genes, we have eliminated the effect of different types of proteins in the two groups of species. Moreover, since mitochondrial function is highly conserved in metazoan animals, we expect that the orthologous proteins are performing essentially the same functions in the two species groups. There is still, however, the remote possibility that there are differences in physiological conditions between the mitochondria of mammals and those of flatworms and that these differences could contribute to the differences in protein composition which we observe between the two groups. To control for this possibility, we examined the sequences of three mitochondrial ribosomal protein genes: L11, L15, and L20 (Supplementary Table S2). Although these proteins function in the mitochondria, they are encoded by genes in the nuclear genome. The results show that the gene sequences do not show the characteristic mitochondrial gene skews and there is no significant difference between mammals and Platyhelminthes, either at the DNA level or at the protein level. This indicates that the key factor underlying the species differences is not the functional environment of the proteins but rather the location of the genes encoding those proteins.
Since our results point to a mutational force at the DNA level that is counteracted by functional constraint at the protein level, we asked what would happen if we confined our analysis to a region of the protein where there was reason to believe that the functional constraint would be especially strong. The transmembrane domain portions of mitochondrial proteins provide an opportunity to do such an analysis. Specifically, we looked at the patterns of GC skew and amino acid composition in the transmembrane domain regions of the COXI gene. The results (Supplementary Table S2) show that the increased functional constraint on amino acid composition does indeed lead to a decreased difference in GC skew between the two species groups. For example, the difference in GC skew between mammals and flatworms for the complete dataset is 0.84, whereas it is reduced to 0.57 for the transmembrane domains, but the same pattern remains both at the DNA and protein levels. For instance, even within the transmembrane domains, the mammalian sequences have less cysteine and valine, and more glutamine and threonine than do their flatworm orthologs, reflecting the same patterns as shown in Figure 3 . In other words, increased functional constraint decreases the effect of nucleotide skew, but it does not completely eliminate it.
Overall, our results show that the DNA strand asymmetry of animal mitochondrial genomes affects the amino acid composition of encoded proteins. A similar effect has been noted previously in bacterial genomes 9 , 10 but it has not been reported in animal or plant genomes. Since we infer that the DNA strand bias is the cause rather than the consequence of the protein differences, this raises the question of what causes the DNA strand bias in the first place. In bacterial genomes, there is good evidence that it is related to DNA replication, based primarily on the fact that the direction of the bias switches at the origin of replication. 4 , 7 , 8 Recent work indicates that although DNA strand biases are widespread in prokaryotes, eukaryotes, and viruses, the magnitude and the direction of the bias are variable, suggesting that the underlying causes are multifactorial. 14 , 15 In animal mitochondrial genomes, the strand bias appears to be caused by the varying durations of time that the heavy strand spends in the mutagenic single-strand state during replication. 16 In addition to the replication-associated effects, there is evidence that transcription can also generate DNA strand asymmetry in eukaryotes due to transcription-coupled mutations. 17 , 18
A recent study has shown that strand bias in mitochondrial sequences can lead to artefactual results in phylogenetic reconstructions. 19 This misleading result can be minimized by a recoding scheme that excludes transitions at rapidly evolving neutral sites. 19 , 20 Our results, however, show that mitochondrial strand biases can also have significant effects at non-neutral sites that change the amino acid sequence. This means that some degree of bias remains even after the correction has been implemented. As stated by Jones et al. (2007) 20 ‘In the case of mitochondrial genes, strand-bias should be of particular concern and the previous use of mitochondrial genomes in resolving deep phylogenies requires critical re-evaluation’.
It remains to be seen if DNA strand asymmetry can also affect the composition of proteins encoded by eukaryotic nuclear genes, as has been shown for biases in the GC content of nuclear genes. 21 , 22 It is already known that DNA strand asymmetries exist in nuclear genes 17 , 18 , 23 , 24 but their effects on the composition of nuclear-encoded proteins have not been studied.
Supplementary data are available online at http://dnaresearch.oxfordjournals.org .
This work was supported by a grant to DAH from the Natural Science and Engineering Research Council of Canada.