Abstract

Human promoters divide into 2 classes, the low CpG (LCG) and the high CpG (HCG), based on their CpG dinucleotide content. The LCG class of promoters is hypermethylated and is associated with tissue-specific genes, whereas the HCG class is hypomethylated and associated with broadly expressed genes. By analyzing several chordate genomes separated for hundreds of millions of years, here we show that the divide between low CpG and high CpG promoters is conserved in several distantly related vertebrate taxa (including human, chicken, frog, lizard, and fish) but not in close invertebrate outgroups (sea squirts). Furthermore, LCG and HCG promoters are distinctively associated with tissue-specific and broadly expressed genes in these distantly related vertebrate taxa. Our results indicate that the function of DNA methylation on gene expression is conserved across these vertebrate taxa and suggest that the 2 classes of promoters have evolved early in vertebrate evolution, as a consequence of the advent of global DNA methylation.

Introduction

Analyses of human and mouse promoters have revealed an intriguing structural and functional bimodality (Carninci 2006; Saxonov et al. 2006; Tang and Epstein 2007; Weber et al. 2007). Structurally, they divide into 2 distinct classes based on CpG dinucleotide content—the low CpG content (LCG) and the high CpG content (HCG) promoters; the latter being associated with CpG islands (Saxonov et al. 2006). At the functional level, LCG genes tend to be tissue specific, whereas HCG genes are broadly expressed. A common explanation for this phenomenon is germ line DNA methylation (Antequera 2003; Saxonov et al. 2006; Weber et al. 2007) because methylation influences both sequence property and gene expression.

In terms of its effect on sequence, DNA methylation is highly mutagenic. In vertebrate genomes, methylation occurs almost exclusively at the cytosines in CpG dinucleotides. Methylated cytosines undergo rapid deamination to become thymines, causing C-to-T transitions (or G-to-A transitions in the complementary strand) (Coulondre et al. 1978). In the human and chimpanzee genomes, for example, methylation-origin transitions occur almost 15-fold more frequently than other single-nucleotide substitutions (Elango et al. 2008). As a consequence, CpG dinucleotides are depleted from methylated regions over evolutionary timescale (Bird 1980). Thus, normalized CpG content (CpG O/E or “CpG content” in the rest of the paper; see Materials and Methods) is an indicator of the level of DNA methylation (low CpG content and high CpG content reflect hypermethylation and hypomethylation, respectively: Suzuki et al. 2007; Weber et al. 2007).

In terms of function, at least in mammals, promoter methylation can dampen gene expression. This is achieved either directly by interfering with transcription factor binding or indirectly through recruitment of methyl-CpG–binding proteins to alter chromatin structure (Jones and Takai 2001; Klose and Bird 2006). LCG promoters may therefore be associated with (somatic) tissue-specific genes as a consequence of germ line DNA methylation, whereas the promoters of broadly expressed genes remain unmethylated thereby forming the HCG class (Vinogradov 2005).

Although the role of mammalian promoter methylation and its effect on CpG content and expression breadth are well studied (Antequera 2003; Carninci 2006; Saxonov et al. 2006; Tang and Epstein 2007; Weber et al. 2007), the origin and evolution of this phenomenon are poorly understood. Here we investigated these aspects, first focusing on the observation that the patterns of DNA methylation differ greatly among diverse taxa. Mammalian genomes exhibit a global DNA methylation pattern (∼80% of the CpGs are methylated) in most cell types (Tweedie et al. 1997 and the references therein) and thus are largely depleted of CpG dinucleotides (Bird 1980). Most of the vertebrate genomes analyzed are similarly globally methylated (Bird 1980; Tweedie et al. 1997).

Studies indicate that such “global” genomic methylation is restricted to vertebrates. Invertebrate animals distantly related to vertebrates such as Drosophila and Caenorhabditis elegans generally lack germ line DNA methylation (Tweedie et al. 1997). Genomes of close outgroups of vertebrates, such as those of invertebrates within the chordate phylum (e.g., urochordate sea squirt and cephalochordate amphioxus), and echinoderms (e.g., sea urchin) exhibit a “mosaic” CpG methylation pattern with long methylated regions and equally long unmethylated regions. Based upon these observations, it has been proposed that the transition from a mosaic to a global methylation pattern occurred early in vertebrate evolution (Hendrich and Tweedie 2003).

Interestingly, the aforementioned functional role of promoter DNA methylation may also be unique to vertebrates. A recent study (Suzuki et al. 2007) showed that methylation at CpG sites in the urochordate Ciona intestinalis, which exhibits a mosaic methylation pattern, is targeted to intragenic regions of a subset of genes. Thus, intragenic regions (in contrast to promoters in humans) fall into low-CpG and high-CpG categories in this genome. Promoters analyzed in Suzuki et al. (2007) were not preferentially targeted by DNA methylation.

Several questions emerge when we synthesize these observations: does the structural bimodality in mammalian promoters exist in other vertebrates? If yes, does the relationship between structural bimodality and expression breadth hold in those species? When did the promoter bimodality evolve? If the structural bimodality coincides with the functional bimodality in distantly related vertebrate species, we may infer that DNA methylation is the underlying link for both phenomena, and that structural and functional promoter bimodality has evolved early in vertebrate evolution, rather than independently several times. In this study, we provide answers to some of these questions and propose a model for the evolution of bimodal vertebrate promoters.

Materials and Methods

Genome Sequences and Annotations

We analyzed 6 vertebrate and 3 invertebrate genomes, covering substantial phylogenetic depth. The genome build, source, and gene annotations used in the study are shown in Supplementary Table 1 (Supplementary Material online). We present results from the analyses of the following genomes in the main text: zebrafish (Danio rerio), frog (Xenopus tropicalis), chicken (Gallus gallus), human (Homo sapiens), and a sea squirt (C. intestinalis) because these genomes had relatively large numbers of curated RefSeq (Pruitt et al. 2007) gene annotation data (Supplementary Table 1, Supplementary Material online). For the human genome, we verified our results by using experimentally characterized, highly accurate transcription start site (TSS) annotations available in the Database of Transcription Start Sites (DBTSS; Wakaguri et al. 2008).

In all the analyses, we removed genes with more than one transcript to avoid errors in TSS annotation. Promoters were defined as 600-bp regions upstream of the TSS. Qualitative results of our analyses did not change when we used 1-kb upstream regions as promoters. We also removed promoters that lay within a distance of 3 kb from any other gene.
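The filtering and promoter definition described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names, the 0-based half-open coordinate convention, the input layout, and the treatment of a gene exactly 3 kb away are all assumptions made for the example.

```python
from collections import Counter

def single_transcript_genes(transcripts):
    """Return the gene ids that have exactly one annotated transcript.

    transcripts: iterable of (gene_id, transcript_id) pairs (hypothetical layout).
    """
    counts = Counter(gene_id for gene_id, _ in transcripts)
    return {gene_id for gene_id, n in counts.items() if n == 1}

def promoter_window(tss, strand, length=600):
    """Upstream promoter as a 0-based, half-open (start, end) interval.

    tss is the 0-based position of the first transcribed base. On the minus
    strand, "upstream" lies at higher genomic coordinates.
    """
    if strand == "+":
        return max(0, tss - length), tss
    if strand == "-":
        return tss + 1, tss + 1 + length
    raise ValueError("strand must be '+' or '-'")

def interval_distance(a, b):
    """Gap in bp between two half-open intervals (0 if they overlap or touch)."""
    return max(0, a[0] - b[1], b[0] - a[1])

def is_isolated(window, other_genes, min_dist=3000):
    """True if no other gene lies closer than min_dist bp to the promoter.

    Assumption: a gene exactly min_dist away is kept.
    """
    return all(interval_distance(window, g) >= min_dist for g in other_genes)
```

Passing `length=1000` to `promoter_window` gives the 1-kb alternative mentioned in the text.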

Because natural selection on coding sequence could potentially confound our results, exons were removed from all the intragenic analyses. First introns were also removed from our analyses because they often encode regulatory elements (Majewski and Ott 2002).

Non–first introns may still harbor some regulatory sequences; for example, some introns may carry CpG islands (Gardiner-Garden and Frommer 1987). The presence of intronic CpG islands may have caused the skew toward greater CpG O/E in human and chicken introns (fig. 1I,J).

FIG. 1.—

Contrasting distributions of normalized CpG contents (CpG O/E) of vertebrate and invertebrate promoters and introns. The distributions of normalized CpG contents (CpG O/E) in the 600-bp region upstream of protein-coding genes (A–E) and in introns (F–J) of the studied genomes. Approximate timescale and evolutionary relationships among the studied genomes are shown below the distributions (Hedges and Kumar 2003). The best-fitting normal distributions are shown above the observed distributions.

Repetitive elements were identified using RepeatMasker annotations in the UCSC genome browser (Karolchik et al. 2008). We masked repetitive elements from our analyses because CpG contents in repetitive elements do not faithfully represent the historical methylation status of the region, due to differences in the time of insertion. However, including repetitive elements in the analyses did not change our qualitative results (results not shown). We note that in the case of zebrafish, including repetitive elements has the effect of increasing overall CpG O/E, which is due to the recent origin of repetitive elements in this genome (Elango N and Yi S, unpublished data).

Measurement of Normalized CpG Contents

The “normalized CpG content” (CpG O/E) is defined as

$$\mathrm{CpG\ O/E} = \frac{P_{CpG}}{P_{C} \times P_{G}}$$

where $P_{CpG}$, $P_{C}$, and $P_{G}$ are the frequencies of CpG dinucleotides, C nucleotides, and G nucleotides, respectively. Several studies (e.g., Suzuki et al. 2007; Weber et al. 2007) confirmed experimentally that CpG O/E is a reliable indicator of methylation status.
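As an illustration, the ratio can be computed from a sequence as sketched below. The function name is hypothetical. Two details are assumptions: the dinucleotide frequency is taken over the number of dinucleotide positions (n − 1), and masked or ambiguous bases are not treated specially.

```python
def cpg_obs_exp(seq):
    """Normalized CpG content (CpG O/E) of a DNA sequence.

    Returns None when the ratio is undefined (sequence too short, or no C or no G).
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return None
    p_c = seq.count("C") / n  # frequency of C nucleotides
    p_g = seq.count("G") / n  # frequency of G nucleotides
    if p_c == 0 or p_g == 0:
        return None
    # "CG" cannot overlap itself, so str.count finds every occurrence.
    p_cpg = seq.count("CG") / (n - 1)  # frequency of CpG dinucleotides
    return p_cpg / (p_c * p_g)
```

A value near 1 means CpG occurs at about the frequency expected from the base composition, whereas values well below 1 indicate CpG depletion.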

Statistical Test for Bimodal Distribution

The unimodality or bimodality of normalized CpG content distributions was tested using the NOCOM software (Ott 1992). Briefly, the software uses an expectation maximization algorithm to fit the data to both unimodal and bimodal distribution models and finds the maximum likelihood values (L0 and L1 for unimodal and bimodal models, respectively). To test if the bimodal distribution model is a better fit to the data as compared with the unimodal distribution model, a statistic G2 = 2[ln(L1)−ln(L0)] was calculated. This statistic approximately follows a Chi-square distribution with 2 degrees of freedom.
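The test above can be sketched in a few lines. The code below is not NOCOM: it is a plain expectation-maximization fit of one and two Gaussians, written only to illustrate how L0, L1, and G2 relate. The function names, the median-split initialization, the iteration count, and the variance floor are assumptions. The P value uses the chi-square approximation with 2 degrees of freedom stated in the text, for which the upper-tail probability is exp(−G2/2).

```python
import math

def normal_pdf(x, mu, sd):
    """Density of a normal distribution with mean mu and standard deviation sd."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def loglik_unimodal(xs):
    """Maximized log-likelihood, ln(L0), of a single Gaussian."""
    n = len(xs)
    mu = sum(xs) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return sum(math.log(normal_pdf(x, mu, sd)) for x in xs)

def loglik_bimodal(xs, n_iter=200, min_sd=1e-3):
    """Approximate maximized log-likelihood, ln(L1), of a two-Gaussian mixture."""
    n = len(xs)
    ordered = sorted(xs)
    lower, upper = ordered[: n // 2], ordered[n // 2:]
    # Initialization (assumption): split at the median, share the overall sd.
    mu = [sum(lower) / len(lower), sum(upper) / len(upper)]
    grand = sum(xs) / n
    sd0 = max(min_sd, math.sqrt(sum((x - grand) ** 2 for x in xs) / n))
    sd = [sd0, sd0]
    w = 0.5  # mixing weight of the first component
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 0.
        resp = []
        for x in xs:
            a = w * normal_pdf(x, mu[0], sd[0])
            b = (1.0 - w) * normal_pdf(x, mu[1], sd[1])
            resp.append(a / (a + b) if a + b > 0 else 0.5)
        # M-step: update weight, means, and standard deviations.
        n0 = min(max(sum(resp), 1e-9), n - 1e-9)
        n1 = n - n0
        w = n0 / n
        mu[0] = sum(r * x for r, x in zip(resp, xs)) / n0
        mu[1] = sum((1.0 - r) * x for r, x in zip(resp, xs)) / n1
        var0 = sum(r * (x - mu[0]) ** 2 for r, x in zip(resp, xs)) / n0
        var1 = sum((1.0 - r) * (x - mu[1]) ** 2 for r, x in zip(resp, xs)) / n1
        sd = [max(min_sd, math.sqrt(var0)), max(min_sd, math.sqrt(var1))]
    total = 0.0
    for x in xs:
        dens = w * normal_pdf(x, mu[0], sd[0]) + (1.0 - w) * normal_pdf(x, mu[1], sd[1])
        total += math.log(max(dens, 1e-300))
    return total

def likelihood_ratio_test(ll_unimodal, ll_bimodal):
    """Return (G2, P) with G2 = 2[ln(L1) - ln(L0)] and P from chi-square, 2 df."""
    g2 = max(0.0, 2.0 * (ll_bimodal - ll_unimodal))
    return g2, math.exp(-g2 / 2.0)
```

For a list `values` of promoter CpG O/E values, `likelihood_ratio_test(loglik_unimodal(values), loglik_bimodal(values))` returns the statistic and its approximate P value.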

Analysis of Expression Data

Expression data were obtained from EST counts in the Unigene database (Wheeler et al. 2003). Genes with EST count ≥1 in a tissue were considered to be expressed in that tissue. The expression breadth of a gene is the number of tissues in which it is expressed. For the human genome, we additionally analyzed 2 microarray data sets. The first data set contains expression data from 79 tissues measured using 3′ arrays (Su et al. 2004). Genes with average difference value >200 in a tissue were considered expressed in that tissue (Su et al. 2004; Saxonov et al. 2006). The second data set contains exon array expression data from 6 tissues (Xing et al. 2007), namely heart, kidney, liver, muscle, spleen, and testis. The probes in the exon array are evenly spaced and dense (147 probes per gene, compared with 11 probes per gene in the 3′ array [Xing et al. 2007]). Therefore, exon arrays are considered to provide more accurate measures of gene expression compared with the 3′ arrays used in Su et al. 2004 (Kapur et al. 2007; Xing et al. 2007).
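The EST-based definition of expression breadth translates directly into code. The sketch below assumes a hypothetical input layout (one dictionary of tissue name to EST count per gene); the function names are illustrative only.

```python
def expression_breadth(est_counts, min_count=1):
    """Number of tissues in which a gene is expressed.

    est_counts: dict mapping tissue name -> EST count for one gene.
    A gene counts as expressed in a tissue when its EST count is >= min_count.
    """
    return sum(1 for count in est_counts.values() if count >= min_count)

def breadth_proportion(est_counts, n_tissues, min_count=1):
    """Expression breadth as a fraction of all tissues analyzed."""
    return expression_breadth(est_counts, min_count) / n_tissues
```

For example, a gene with ESTs in 3 of 14 tissues has a breadth of 3 and a proportion of 3/14, or about 21.4%.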

Results

Patterns of Intronic and Promoter Methylation in Invertebrate Genomes Closely Related to Vertebrates

We analyzed patterns of CpG dinucleotide depletion in upstream promoter regions and intragenic regions of several invertebrate and vertebrate genomes and related them to the known patterns of genomic methylation. Previous studies have shown that although most vertebrate genomes are globally methylated in many different tissue types (Tweedie et al. 1997 and references therein), invertebrates closely related to vertebrates such as urochordates (e.g., sea squirt) and echinoderms (e.g., sea urchin) exhibit a “mosaic” pattern of genomic methylation (Tweedie et al. 1997; Simmen et al. 1999; Suzuki et al. 2007). In particular, a recent study demonstrated that in the genome of a sea squirt (C. intestinalis), methylation is targeted to intragenic regions of a subset of genes (Suzuki et al. 2007).

We compared normalized CpG content (CpG O/E) of upstream promoter regions (defined as 600 bp upstream of the transcription start site [TSS]; results remained the same when 1 kb instead of 600 bp of upstream regions were analyzed) and of introns. Because our purpose is to compare patterns of intragenic methylation versus upstream promoter regions, we did not include promoter regions downstream of the TSS, which often include first exons and introns. Intron CpG O/E serves as an indicator of the level of genome-wide methylation. Intergenic CpG O/E is not used for this purpose because these genomes differ greatly in terms of genome size and the amount of intergenic regions. In the relatively well-annotated human genome, for example, CpG O/E distributions of intergenic and intronic regions are similar (Supplementary text and Figure S1, see Supplementary Material online).

Introns of C. intestinalis show 2 distinctive distributions, one with a mean CpG O/E of ∼1 and the other with a mean CpG O/E of around 0.5 (Suzuki et al. 2007; fig. 1F). This observation is in accord with the finding that the genomic methylation pattern is mosaic in the Ciona genome, where intragenic regions of a subset of genes are methylated and others are not methylated (Suzuki et al. 2007). The introns with high CpG O/E (∼1) represent nonmethylated genes, and those with low CpG O/E (∼0.5) represent methylated genes (Suzuki et al. 2007; fig. 1F). In contrast, we found that the CpG content of Ciona promoters follows a unimodal distribution with its mean ∼1 (CpG occurs at the expected frequency; fig. 1A), indicating that promoters are largely unmethylated in this genome.

We analyzed distributions of CpG O/E in promoters and introns of another sea squirt, Ciona savignyi. Genetic divergence between C. intestinalis and C. savignyi is known to be similar to that between human and chicken (Small et al. 2007). We obtained similar results (Figure S2, see Supplementary Material online): promoter regions follow a unimodal Gaussian distribution with mean ∼1, whereas intragenic regions show 2 distinctive curves.

We also analyzed data from sea urchin (Strongylocentrotus purpuratus), which also exhibits a patchy DNA methylation pattern (Tweedie et al. 1997). There are only 131 RefSeq genes from this species that satisfy the criterion of single transcript. Even with this small sample size, results from the promoter and intronic regions of this species are similar to those in the 2 Ciona genomes (Figure S2, see Supplementary Material online).

The above pattern in invertebrate genomes is opposite to that in the human genome. As discussed earlier, human promoters exhibit bimodality of hypo- and hypermethylated portions (fig. 1E), whereas introns show a unimodal distribution with mean around 0.2, reflecting heavy global methylation (fig. 1J).

Distribution of Promoter CpG Content Is Bimodal in Distantly Related Vertebrate Species

Next we investigated if the pattern of promoter and intron CpG depletion found in humans is conserved in nonmammalian vertebrates. We analyzed the following distantly related vertebrate genomes, in addition to the human genome: fugu (Takifugu rubripes), zebrafish (Danio rerio), frog (Xenopus tropicalis), lizard (Anolis carolinensis), and chicken (Gallus gallus). Among these species, results from fugu and lizard are presented in the Supplementary Material online (Figure S2) because of the potential inaccuracy of annotations. We will focus on results from the 4 remaining vertebrate genomes, which are relatively well annotated and cover sufficient phylogenetic depth (fig. 1).

Intronic CpG content of the 4 vertebrate genomes shows clear unimodal distributions (fig. 1G–J). The mean intronic CpG O/E values are all below 1, indicating that these genomes are heavily methylated. The suppression of CpG frequency is most pronounced in the human and chicken introns, where the mean CpG O/E is 0.23 and 0.22, respectively (fig. 1I,J and table 1). In the zebrafish, CpG frequency is approximately 40% of the expected (fig. 1G). The observed interspecies differences in intronic CpG O/E could be due to differences in methylation levels or to variability in the efficiency of deamination, which is a key step in methylation-induced CpG mutations (Frederico et al. 1993).

Table 1

Normalized CpG Contents in Promoter and Intronic Regions of the 4 Vertebrate Genomes

Species     Number of Genes Analyzed   Promoter LCG, Mean (Median)   Promoter HCG, Mean (Median)   Intron, Mean (Median) CpG O/E
Zebrafish   4,974                      0.34 (0.39)                   0.83 (0.88)                   0.38 (0.38)
Frog        3,638                      0.21 (0.21)                   0.62 (0.63)                   0.27 (0.26)
Chicken     2,375                      0.24 (0.25)                   0.83 (0.87)                   0.21 (0.22)
Human       7,869                      0.22 (0.20)                   0.74 (0.76)                   0.19 (0.17)

Bimodal distribution of low and high CpG promoters (LCG and HCG, respectively) is strongly supported based upon statistical tests, whereas introns show unimodal distributions. The mean and medians of CpG O/E are shown.

In contrast, upstream promoter regions of all 4 vertebrate genomes follow 2 distinctive distributions of LCGs and HCGs (fig. 1 and table 1). We used an expectation–maximization (EM) algorithm to fit the observed distributions to unimodal, bimodal, and trimodal Gaussian distributions and compared the likelihoods (see Materials and Methods, table 1). Bimodality is consistently a far better fit to the observed distributions than unimodal distributions (P < 10^−10 using likelihood ratio test in all species). A recent paper suggested a “trimodal” distribution rather than bimodal (Weber et al. 2007). However, the likelihood of the bimodal model was also significantly greater than that of the trimodal model in all the species analyzed (P < 10^−10 using likelihood ratio test in all species). Therefore, bimodality of hypo- and hypermethylated promoters is a common feature in several distantly related vertebrate taxa separated by hundreds of millions of years.

It is known that the G and C nucleotide content (G+C content) and CpG O/E are correlated (Gardiner-Garden and Frommer 1987; Duret and Galtier 2000; Fryxell and Zuckerkandl 2000). However, the bimodality of CpG O/E is not caused by GC content bimodality; for example, Saxonov et al. (2006) showed that G+C contents in human promoter regions follow a Gaussian distribution with 1 mean. Similarly, we demonstrate that the bimodality of CpG content in promoters is not caused by the underlying distribution of G+C contents (Supplementary text and Figure S3, see Supplementary Material online).

To test whether the observed pattern is caused by inaccurate TSS annotation, we restricted our analyses to experimentally verified TSS only, using the data from DBTSS (Wakaguri et al. 2008). There are 4,277 human genes in the DBTSS that overlap with our data set. Analyses of this subset of genes show the same results as those from the whole data set (Supplementary Figure S4, see Supplementary Material online). Therefore, our finding is not caused by bias in TSS annotation.

LCGs Are Associated with Tissue-Specific Genes and HCGs Are Associated with Broadly Expressed Genes in Distantly Related Vertebrate Genomes

Next, we investigated the functional implication of the observed bimodality. We first analyzed microarray data from humans to compare with previous results. We analyzed the gene expression data from 79 tissues in the gene atlas (Su et al. 2004). Genes with LCG promoters were expressed in fewer tissues (median: 38 tissues) than those with HCG promoters (median: 58 tissues). This difference is statistically significant (table 2, Mann–Whitney test, P < 10^−3) and confirms earlier results (Saxonov et al. 2006). Analysis of exon array expression data (Kapur et al. 2007; Xing et al. 2007) from 6 tissues yielded similar results (Supplementary text and Figure S5, see Supplementary Material online).

Table 2

Expression Breadths of Genes with Low and High CpG Content in Upstream Regions

Median Number of Tissues Expressed (Proportion of Expressed Tissues to All Tissues)

Species              Number of Tissues Analyzed   Bottom 100    Top 100       LCG           HCG
Zebrafish            14                           3 (21.4%)     6 (42.8%)     4 (28.5%)     5 (35.7%)
Frog                 21                           4 (19.0%)     8 (38.0%)     4 (19.0%)     6 (28.5%)
Chicken              18                           5 (27.7%)     9 (50.0%)     5 (27.7%)     8 (44.4%)
Human                49                           7 (14.2%)     33 (67.3%)    10 (20.4%)    30 (61.2%)
Human (microarray)   79                           31 (39.2%)    70 (88.6%)    38 (48.1%)    58 (73.4%)

Results from zebrafish, frog, and chicken are from EST database. For humans, results from EST and microarray are both presented.

To determine whether the same pattern holds in other species, we chose to analyze EST data because they are available from all the species studied here, allowing a meaningful comparison. We obtained EST data from the Unigene database (Wheeler et al. 2003). We first compared 100 genes with the highest promoter CpG O/E (Top 100; table 2) and the lowest promoter CpG O/E (Bottom 100; table 2). The top 100 had significantly broader expression than the bottom 100 (table 2, Mann–Whitney test, P < 10^−6 in all species). Second, we compared the median expression breadths of genes within LCG and HCG classes (obtained by dividing the data where the 2 Gaussian curves intersected). Genes associated with LCG promoters were expressed in significantly fewer tissues than those with HCG promoters (table 2, Mann–Whitney test, P < 10^−3 in all species). Furthermore, the relative frequency of HCG increases with the expression breadth in all the 4 species analyzed (fig. 2).
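The LCG/HCG split at the point where the 2 fitted Gaussian curves intersect can be found in closed form: setting the two weighted densities equal and taking logarithms gives a quadratic in the CpG O/E value. The sketch below is illustrative only. The function names are hypothetical, and whether the curves are compared with or without their mixing weights is an assumption (equal weights by default).

```python
import math

def normal_pdf(x, mu, sd):
    """Density of a normal distribution with mean mu and standard deviation sd."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def gaussian_intersection(mu1, sd1, mu2, sd2, w1=0.5, w2=0.5):
    """Point between mu1 and mu2 where w1*N(mu1, sd1) equals w2*N(mu2, sd2)."""
    # Coefficients of a*x^2 + b*x + c = 0, from equating the log densities.
    a = 1.0 / (2.0 * sd2 ** 2) - 1.0 / (2.0 * sd1 ** 2)
    b = mu1 / sd1 ** 2 - mu2 / sd2 ** 2
    c = (mu2 ** 2 / (2.0 * sd2 ** 2) - mu1 ** 2 / (2.0 * sd1 ** 2)
         + math.log((w1 * sd2) / (w2 * sd1)))
    if abs(a) < 1e-12:  # equal standard deviations: the equation is linear
        roots = [-c / b]
    else:
        disc = b * b - 4.0 * a * c
        if disc < 0:
            raise ValueError("the two curves do not intersect")
        root = math.sqrt(disc)
        roots = [(-b + root) / (2.0 * a), (-b - root) / (2.0 * a)]
    lo, hi = min(mu1, mu2), max(mu1, mu2)
    for x in roots:
        if lo <= x <= hi:
            return x
    raise ValueError("no intersection between the two means")

def promoter_class(cpg_oe, cutoff):
    """Label a promoter as 'LCG' or 'HCG' relative to the cutoff."""
    return "LCG" if cpg_oe < cutoff else "HCG"
```

With equal weights and equal standard deviations, the cutoff falls midway between the two means; unequal spreads shift it toward the narrower component.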

FIG. 2.—

Relative frequency of HCG promoters increases with expression breadth in vertebrate genomes. For each vertebrate species analyzed, the genes were divided into 4 bins based on the percentage of tissues in which they are expressed. Within each bin, the proportion of genes with HCG promoters is plotted. The total number of tissues analyzed in each species is shown in table 2.

Vinogradov (2005) has shown that intronic CpG content influences expression breadth, potentially as much as the absence or presence of promoter CpG islands (which is equivalent to the distinction between LCGs and HCGs). To gauge whether promoter bimodality affects expression breadth independent of intronic CpG content, we divided human genes into low and high intronic CpG groups (based on the median intronic CpG O/E of 0.17). In each group, LCG genes are expressed in significantly fewer tissues than HCG genes (Supplementary Table 2, see Supplementary Material online). Notably, within each promoter class (LCG or HCG), the expression breadths of genes with low and high intronic CpG content are almost identical (Supplementary Table 2, see Supplementary Material online). Thus, HCG genes tend to be expressed in more tissues on average than LCG genes, and this distinction is independent of intronic CpG content.

Discussion

We have shown that the structural and functional distinctions between LCGs and HCGs are conserved in several distantly related vertebrate genomes. This indicates that the functional role of DNA methylation in gene expression is conserved in these taxa, which have been separated for hundreds of millions of years (fig. 1). What is the underlying mechanistic basis of the association of LCG promoters with tissue-specific genes and HCG promoters with broadly expressed genes in these vertebrate genomes? As mentioned earlier, DNA methylation can suppress gene expression directly by interfering with transcription factors or indirectly by recruiting chromatin modification enzymes. A recent study (Vinogradov 2005) showed that in the human genome, CpG content is negatively correlated with chromatin condensation potential, suggesting that LCG promoters will be highly condensed in the germ line, rendering the associated gene unsuitable for transcription. Thus, vertebrates may have adapted promoter DNA methylation to epigenetically suppress somatic tissue–specific genes in the germ line. Indeed, we found that in the human genome, promoter CpG content is strongly positively correlated with germ line (testis) expression level (fig. 3; Spearman's correlation coefficient = 0.38; P < 10^−6). The median germ line expression level of LCG genes is ∼6-fold lower than that of HCG genes (fig. 3).

FIG. 3.—

Positive relationship between promoter CpG content and germ line expression level in the human genome. The promoter CpG contents of human genes are plotted against their germ line (testis) expression levels (in log2 scale) from exon array data. LCGs and HCGs are colored black and red, respectively. The white crosses indicate the median expression level of LCGs and HCGs (6.1 and 8.9, respectively).

The invertebrate chordates with mosaic genomic DNA methylation pattern analyzed here exhibit a single class of hypomethylated promoters (fig. 1 and Supplementary Figure S2, see Supplementary Material online). This finding, along with the observation that the structural and functional bimodality in promoters is conserved across several distantly related vertebrate species spanning zebrafish to human, suggests that the bimodality originated early in vertebrate evolution, potentially as a consequence of the transition from mosaic to global methylation of genomes. Hypomethylation of promoters, as found in Ciona, is likely to be the ancestral state because the mosaic DNA methylation pattern in Ciona is typical of methylated chordates (Suzuki et al. 2007), and other invertebrate genomes such as those of arthropods and nematodes are free of DNA methylation (Tweedie et al. 1997). In other words, LCG promoters are a derived feature of vertebrate genomes.

The fact that the CpG contents of LCGs are similar to those of the rest of the genome, whereas HCGs preserve CpG contents in several distantly related vertebrate genomes (fig. 1 and table 1; also see Supplementary text, see Supplementary Material online), provides a clue to the origin of the vertebrate LCG promoters. Specifically, it indicates that the level of DNA methylation in LCG promoters is similar to the genome-wide level. We propose that LCG promoters, and consequently the bimodal distribution of CpG contents in vertebrate promoters, originated through mutational decay of CpG dinucleotides following DNA methylation. Our functional analyses suggest that this process has occurred preferentially in upstream regions of tissue-specific genes.

HCG promoters, on the other hand, maintain high CpG contents despite the global genomic methylation. One possible explanation is that broadly expressed genes have selectively avoided DNA methylation and remained as HCGs because silencing of such genes due to DNA methylation would have been deleterious. For instance, aberrant promoter methylation in the human genome is highly deleterious, often associated with several diseases including cancer (Robertson and Wolffe 2000; Esteller and Herman 2002; Egger et al. 2004).

According to this model, the rate of CpG loss should be greater in LCG promoters than in HCG promoters. This is indeed the case in the human genome (Weber et al. 2007). Comparing expression patterns in human and mouse genomes has further revealed that CpG islands have been preferentially lost from promoters of tissue-specific genes (Jiang et al. 2007). These observations provide strong support for the role of DNA methylation in the origin and evolution of the bimodality of vertebrate promoters.

Supplementary Material

Supplementary text, tables, and figures are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

We thank members of the Yi laboratory, especially Brendan Hunt, for their comments. Comments from anonymous reviewers on previous versions of the manuscript provided helpful insights. This study was supported by funds from the Blanchard-Milliken Fellowship and the Alfred P. Sloan Foundation to S.Y.

References

Antequera F. Structure, function and evolution of CpG island promoters. Cell Mol Life Sci, 2003, vol. 60 (pg. 1647-1658)
Bird A. DNA methylation and the frequency of CpG in animal DNA. Nucl Acids Res, 1980, vol. 8 (pg. 1499-1504)
Carninci P. Tagging mammalian transcription complexity. Trends Genet, 2006, vol. 22 (pg. 501-510)
Coulondre C, Miller JH, Farabaugh PJ, Gilbert W. Molecular basis of base substitution hotspots in Escherichia coli. Nature, 1978, vol. 274 (pg. 775-780)
Davuluri RV, Suzuki Y, Sugano S, Zhang MQ. CART classification of human 5′ UTR sequences. Genome Res, 2000, vol. 10 (pg. 1807-1816)
Duret L, Galtier N. The covariation between TpA deficiency, CpG deficiency, and G+C content of human isochores is due to a mathematical artifact. Mol Biol Evol, 2000, vol. 17 (pg. 1620-1625)
Egger G, Liang G, Aparicio A, Jones PA. Epigenetics in human disease and prospects for epigenetic therapy. Nature, 2004, vol. 429 (pg. 457-463)
Elango N, Kim S-H, Vigoda E, Yi SV. Mutations of different molecular origins exhibit contrasting patterns of regional substitution rate variation. PLoS Comput Biol, 2008, vol. 4 pg. e1000015
Esteller M, Herman JG. Cancer as an epigenetic disease: DNA methylation and chromatin alterations in human tumours. J Pathol, 2002, vol. 196 (pg. 1-7)
Frederico LA, Kunkel TA, Shaw BR. Cytosine deamination in mismatched base pairs. Biochemistry, 1993, vol. 32 (pg. 6523-6530)
Fryxell KJ, Zuckerkandl E. Cytosine deamination plays a primary role in the evolution of mammalian isochores. Mol Biol Evol, 2000, vol. 17 (pg. 1371-1383)
Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. J Mol Biol, 1987, vol. 196 (pg. 261-282)
Hedges SB, Kumar S. Genomic clocks and evolutionary timescales. Trends Genet, 2003, vol. 19 (pg. 200-206)
Hendrich B, Tweedie S. The methyl-CpG binding domain and the evolving role of DNA methylation in animals. Trends Genet, 2003, vol. 19 (pg. 269-277)
Jiang C, Han L, Su B, Li W-H, Zhao Z. Features and trend of loss of promoter-associated CpG islands in the human and mouse genomes. Mol Biol Evol, 2007, vol. 24 (pg. 1991-2000)
Jones PA, Takai D. The role of DNA methylation in mammalian epigenetics. Science, 2001, vol. 293 (pg. 1068-1070)
Kapur K, Xing Y, Ouyang Z, Wong WH. Exon arrays provide accurate assessments of gene expression. Genome Biol, 2007, vol. 8 pg. R82
Karolchik D, Kuhn RM, Baertsch R, et al. (25 co-authors). The UCSC genome browser database: 2008 update. Nucl Acids Res, 2008, vol. 36 (pg. D773-D779)
Klose RJ, Bird AP. Genomic DNA methylation: the mark and its mediators. Trends Biochem Sci, 2006, vol. 31 (pg. 89-97)
Majewski J, Ott J. Distribution and characterization of regulatory elements in the human genome. Genome Res, 2002, vol. 12 (pg. 1827-1836)
Ott J. NOCOM and COMPMIX programs release, 1992. New York: Rockefeller University
Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucl Acids Res, 2007, vol. 35 (pg. D61-D65)
Robertson KD, Wolffe AP. DNA methylation in health and disease. Nat Rev Genet, 2000, vol. 1 (pg. 11-19)
Saxonov S, Berg P, Brutlag DL. A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc Natl Acad Sci USA, 2006, vol. 103 (pg. 1412-1417)
Simmen MW, Leitgeb S, Charlton J, Jones SJ, Harris BR, Clark VH, Bird A. Nonmethylated transposable elements and methylated genes in a chordate genome. Science, 1999, vol. 283 (pg. 1164-1167)
Small KS, Brudno M, Hill MM, Sidow A. A haplome alignment and reference sequence of the highly polymorphic Ciona savignyi genome. Genome Biol, 2007, vol. 8 pg. R41
Su AI, Wiltshire T, Batalov S, et al. (13 co-authors). A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA, 2004, vol. 101 (pg. 6062-6067)
Suzuki MM, Kerr ARW, De Sousa D, Bird A. CpG methylation is targeted to transcription units in an invertebrate genome. Genome Res, 2007, vol. 17 (pg. 625-631)
Tang CSM, Epstein RJ. A structural split in the human genome. PLoS ONE, 2007, vol. 2 pg. e603
Tweedie S, Charlton J, Clark V, Bird A. Methylation of genomes and genes at the invertebrate-vertebrate boundary. Mol Cell Biol, 1997, vol. 17 (pg. 1469-1475)
Vinogradov AE. Dualism of gene GC content and CpG pattern in regard to expression in the human genome: magnitude versus breadth. Trends Genet, 2005, vol. 21 (pg. 639-643)
Wakaguri H, Yamashita Y, Suzuki S, Sugano R, Nakai K. DBTSS: database of transcription start sites, progress report 2008. Nucl Acids Res, 2008, vol. 36 Suppl 1 (pg. D97-D101)
Weber M, Hellmann I, Stadler MB, Ramos L, Pääbo S, Rebhan M, Schübeler D. Distribution, silencing potential and evolutionary impact of promoter DNA methylation in the human genome. Nat Genet, 2007, vol. 39 (pg. 457-466)
Wheeler DL, Church DM, Federhen S, et al. (11 co-authors). Database resources of the National Center for Biotechnology. Nucl Acids Res, 2003, vol. 31 (pg. 28-33)
Xing Y, Ouyang Z, Kapur K, Scott MP, Wong WH. Assessing the conservation of mammalian gene expression using high-density exon arrays. Mol Biol Evol, 2007, vol. 24 (pg. 1283-1285)

Author notes

David Irwin, Associate Editor