Abstract

Proximal promoters are fundamental genomic elements for gene expression. They vary in terms of GC percentage, CpG abundance, presence of TATA signal, evolutionary conservation, chromosomal spread of transcription start sites and breadth of expression across cell types. These properties are correlated, and it has been suggested that there are two classes of promoters: one class with high CpG, widely spread transcription start sites and broad expression, and another with TATA signals, narrow spread and restricted expression. However, it has been unclear why these properties are correlated in this way. We reexamined these features using the deep FANTOM5 CAGE data from hundreds of cell types. First, we point out subtle but important biases in previous definitions of promoters and of expression breadth. Second, we show that most promoters are rather nonspecifically expressed across many cell types. Third, promoters’ expression breadth is independent of maximum expression level, and therefore correlates with average expression level. Fourth, the data show a more complex picture than two classes, with a network of direct and indirect correlations among promoter properties. By tentatively distinguishing the direct from the indirect correlations, we reveal simple explanations for them.

INTRODUCTION

CAGE data and the nature of transcription initiation

CAGE (Cap Analysis of Gene Expression) is a powerful method for profiling transcription start sites (TSSs). In this method, large numbers of short sequence tags are obtained from the 5′-ends of capped RNA molecules [short RNAs <100 bases are excluded by AMPure purification (1)]. These tags are then aligned to a reference genome sequence, and they indicate both the location of TSSs and the expression level, i.e. the proportion of RNA molecules starting at each site (2).

It is important to remember that CAGE tags are samples from the RNA population. For example, in a pineal gland, one TSS might have a true expression level of 7.62 parts per million (ppm), meaning that 7.62 ppm of all the capped RNAs in that pineal gland at that moment were transcribed from that TSS. However, if we obtain a million CAGE tags, then by chance, 6 (or 10) of them might match that TSS. This relationship between population and sample is illustrated in Figure 1.
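The sampling effect in this example can be made concrete with a minimal sketch. It assumes, purely for illustration, that the number of tags observed at a TSS follows a Poisson distribution whose mean is the true level (7.62 ppm) times the number of tags sequenced; the function name `poisson_pmf` is ours, not part of any CAGE pipeline.

```python
import math

def poisson_pmf(k: int, mean: float) -> float:
    """Probability of observing exactly k tags when the expected count is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

true_ppm = 7.62              # true expression level from the example in the text
tags_sequenced = 1_000_000   # one million CAGE tags
expected_tags = true_ppm * tags_sequenced / 1_000_000

# Probability of seeing 6, 8 or 10 tags at this TSS under the Poisson assumption
for k in (6, 8, 10):
    print(k, round(poisson_pmf(k, expected_tags), 3))
```

Under this assumption, counts of 6 and of 10 are both quite likely outcomes for the same true expression level.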

Figure 1.

A fictional example of transcription initiation and CAGE tags. The lower graph shows the true proportions, in parts per million, of capped RNAs starting at each base on one strand of a short chromosomal segment. The upper graph shows the number of CAGE tags starting at each base. The horizontal gray lines show some ways of clustering the CAGE tags into four clusters, two clusters or three clusters.

In the classic textbook view, each gene has one promoter (or a few alternative promoters), and each promoter has one TSS at exactly one nucleotide. CAGE and other data have revealed that the reality is much messier (Figure 1): transcription can initiate at many alternative nucleotides (2–4).

In some promoters, transcription starts are spread over a wide span of nucleotides, and in others, they are concentrated in a few nucleotides. Previous studies have suggested that TSS spread correlates with CpG islands (CGIs), non–cell-specific expression and other properties (2,5–8). In these studies, the raw CAGE data (e.g. upper graph in Figure 1) was first ‘clustered’, with the idea that each cluster corresponds to a promoter. Then, each cluster’s degree of spread could be measured.

The problem is that there are different ways to cluster CAGE tags, and it is not clear which is correct. In fact, it may not even be falsifiable. For example, in Figure 1, it could be argued that there are four tight clusters, or two loose clusters. In other words, there is some arbitrariness in defining discrete promoters, just as there is arbitrariness in defining discrete mountains in a rugged landscape.

The simplest clustering method is ‘distance-based clustering’, which just links CAGE start sites that are within (say) 20 bases of each other. Previous studies have often used ‘tag overlap clustering’, which links tags that overlap after alignment to the genome. Because the tags in these earlier studies had length 20 or 21, this is almost the same as distance-based clustering. These approaches are intuitively flawed because they produce wider clusters when there are more tags. For example, in Figure 1, they may produce three clusters, even though the true expression of the left-hand double-peak has the same shape as the right-hand double-peak. This introduces a spurious correlation between expression level and TSS spread.
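The depth dependence of distance-based clustering can be illustrated with a small sketch. The function `cluster_by_distance` and the toy coordinates are hypothetical; they only mimic the linking rule described above (sites within 20 bases of each other are joined).

```python
def cluster_by_distance(starts, max_gap=20):
    """Link sorted tag start sites that lie within max_gap bases of the previous site."""
    clusters = []
    for pos in sorted(starts):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)   # close enough: extend the current cluster
        else:
            clusters.append([pos])     # too far: start a new cluster
    return clusters

shallow = [100, 105, 130, 135]   # two tight peaks whose nearest sites are 25 bases apart
deep = shallow + [118]           # deeper sampling adds one rare tag between the peaks

print(len(cluster_by_distance(shallow)))
print(len(cluster_by_distance(deep)))
```

With the extra tag, the two peaks merge into one wider cluster, although the underlying expression shape has not changed.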

An alternative clustering method that avoids this flaw is ‘density-based clustering’ (9), and there are also methods that consider the similarity of each TSS’s expression profile across cell types (10,34). However, we lack certainty that any of these clustering methods avoids all subjectivity and bias. In the present study, we attempt to measure TSS spread objectively, by not using any specific clustering.

Expression specificity

Some promoters are mainly expressed in one or a few cell types, and others are broadly expressed across many cell types. It is often desired to quantify a promoter’s expression specificity using a single number. This is another case where there are multiple ways to do it, and no one way is obviously best (11).

For example, suppose we have 100 cell types:

  • Promoter A is expressed at 100 ppm in one cell type and 0 ppm in the others,

  • B is expressed at 50 ppm in one and 20 ppm in the others,

  • C is expressed at 30 ppm in 50 cell types and 0 ppm in the others and

  • D is expressed at 10 ppm in all cell types.

Clearly A is the most and D the least specific, but opinions can differ about B and C.

This problem of measuring diversity has long been faced by ecologists and economists, who have proposed various diversity indices, such as Shannon entropy, Simpson index and Gini coefficient (11) (http://en.wikipedia.org/wiki/Diversity_index). More recently, Shannon entropy has been used to quantify gene expression specificity (12).
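As one illustration, Shannon entropy can be computed for the four example promoters A to D. This is a sketch of a single index among several possible ones; the function name `shannon_entropy` is ours, and the expression values are those of the example above.

```python
import math

def shannon_entropy(levels):
    """Shannon entropy (bits) of expression levels normalized to sum to 1."""
    total = sum(levels)
    probs = [x / total for x in levels if x > 0]
    return sum(p * math.log2(1 / p) for p in probs)

promoters = {
    "A": [100] + [0] * 99,      # 100 ppm in one cell type
    "B": [50] + [20] * 99,      # 50 ppm in one, 20 ppm in the others
    "C": [30] * 50 + [0] * 50,  # 30 ppm in half of the cell types
    "D": [10] * 100,            # 10 ppm everywhere
}
for name, levels in promoters.items():
    print(name, round(shannon_entropy(levels), 2))
```

By this index, A has the lowest entropy and D the highest, and B comes out broader than C; a different index could rank B and C the other way.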

Another problem is that we need to estimate expression specificity from limited samples of CAGE tags. Unfortunately, Shannon entropy and other diversity indices have a systematic bias: when calculated from limited samples, the estimated specificity tends to be higher than the true specificity (13–16).

A different problem is that it can be unclear whether two cell types are different. For example, suppose we measure a promoter’s expression in aortic smooth muscle, bladder smooth muscle, bronchial smooth muscle, prostate smooth muscle, hepatocytes and macrophages. If we treat all these cell types as equally different, it seems we may get a biased picture of expression specificity. On the other hand, it has been suggested that smooth muscle in particular (in addition to fibroblasts and neurons) may represent an ‘uncalculated diversity of cell types’ (17). All existing expression atlases, including FANTOM5, are biased toward easily accessible cell types (such as muscle) and lack others (such as rare embryonic cell types).

CpG islands

Mammalian genomes are strongly depleted in CG dinucleotides (relative to the abundance of C and G), and CGIs are short segments that are less depleted in CG. The likely reason for CG depletion is that CG dinucleotides are usually methylated on the cytosine, and methylcytosine has a high mutation rate. CGIs often overlap proximal promoters, and promoter CG methylation is associated with transcriptional silencing. This leads to a simple explanation of CGIs: they reflect promoters that are active and thus unmethylated in germ line cells. (Recall that germ line cells are those whose DNA and mutations can pass to offspring.) In particular, ‘the pattern of CGIs in the genome should reflect a weighted average of methylation patterns in the germ line for which the weight is proportional to the time spent in the particular methylation state’ (18).

This is a beautifully parsimonious explanation of CGIs, which does not require ascribing any function to them: they are merely passive consequences of methylation and mutation patterns in germ cells. (Of course, this does not rule out that CGIs may occasionally contribute motifs that become exapted to function as, say, Sp1 binding sites.)

An alternative theory is that CGIs are functional elements. It has been suggested that they are nucleosome-destabilizing elements (19), and that they influence chromatin modification state (20). If CGIs are functional, they ought to experience evolutionary selection (19,20). A recent study examined this question using mathematical models of sequence evolution, concluding that hypomethylation, and not selection, largely explains CGIs (21). All in all, there is confusion regarding why CGIs exist.

The FANTOM5 data

The FANTOM5 phase 1 data has CAGE tags from 517 human samples (after pooling replicates). These include ‘primary cells’ (e.g. amniotic membrane cells, salivary acinar cells, tenocytes), ‘tissues’ (e.g. achilles tendon, adrenal gland, amygdala) and ‘cell lines’ (e.g. glioma cell line GI-1, teratocarcinoma cell line PA-1). The primary cells will not perfectly represent pure in vivo cell types because of the imperfect procedures used to obtain and handle them. The tissues and cell lines are useful because they include cell types and promoters missing from the primary cell collection. In this study, we used all the samples on an equal footing, except where stated otherwise.

There are also 290 mouse samples, with more tissues than the human samples, but fewer primary cells and far fewer cell lines. Twelve human samples (all cell lines) and no mouse samples are germ cells, according to the FANTOM5 ontology.

This work is part of the FANTOM5 project. Data downloads, genomic tools and copublished manuscripts are summarized here http://fantom.gsc.riken.jp/5/.

MATERIALS AND METHODS

We conservatively chose to use only CAGE tags near starts of RNAs in the RefSeq database (22,23). There are many more CAGE clusters than RefSeq starts, but it is unclear how many of them reflect promoters and how many reflect something else (e.g. artifacts of the CAGE method). There is a risk that RefSeq starts underrepresent promoters with low and/or highly cell-specific expression. Another danger is that RefSeq RNAs are based on RNA sequence data, which is not necessarily more reliable than CAGE and may suffer from the same artifacts. However, to make progress, we provisionally accept that these adequately represent true promoters.

Our first thought was to use CAGE tags starting up to d bases from each RefSeq start (d = 50). The problem is that RefSeq starts are sometimes slightly upstream of the main CAGE peak (perhaps because RefSeq uses the most-upstream transcript evidence). Therefore, we (i) found locally maximal CAGE start sites that have more tags than any other site up to d bases either side, (ii) discarded locally maximal sites more than d bases from a RefSeq start and (iii) used CAGE tags starting up to d bases from each locally maximal site. This gave us a set of 17 039 ‘promoters’.
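The three steps can be sketched as follows. This is a simplified illustration for one strand of one chromosome, not the code used in the study; the function `define_promoters`, its inputs and the handling of ties (tied sites are not counted as maxima) are assumptions.

```python
def define_promoters(tag_counts, refseq_starts, d=50):
    """tag_counts: dict {position: tag count} for one strand of one chromosome.
    refseq_starts: list of annotated start positions on the same strand.
    Returns {peak position: total tags within d bases of the peak}."""
    promoters = {}
    for pos, count in tag_counts.items():
        # (i) locally maximal: more tags than any other site up to d bases either side
        neighbors = (tag_counts.get(p, 0) for p in range(pos - d, pos + d + 1) if p != pos)
        if count <= max(neighbors, default=0):
            continue
        # (ii) discard peaks more than d bases from every RefSeq start
        if not any(abs(pos - s) <= d for s in refseq_starts):
            continue
        # (iii) collect all tags starting up to d bases from the peak
        promoters[pos] = sum(tag_counts.get(p, 0) for p in range(pos - d, pos + d + 1))
    return promoters

toy_counts = {1000: 5, 1010: 20, 1030: 3, 5000: 9}   # hypothetical tag counts
print(define_promoters(toy_counts, refseq_starts=[990]))
```

In the toy input, the peak at 1010 is kept because it lies within d bases of the annotated start, whereas the isolated peak at 5000 is discarded.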

To ensure our conclusions are robust, we also tried d = 20, d = 100 and d = 200, as well as mouse data with d = 50 (Supplementary Figures S1–S8). The conclusions do not change.

Sample ontology

To identify and pool replicate CAGE samples, we looked for samples that have identical ‘is_a’ terms in the FANTOM5 ontology.

Entropy

The expression level of promoter g in sample t is $x_{gt} = c_{gt} / N_t$, where $c_{gt}$ is the number of CAGE tags in that promoter and sample, and $N_t$ is the number of mapped tags in the sample. To calculate entropy with pseudocounts, we used this alternative formula: $x_{gt} = (c_{gt} + 1) / (N_t + N_g)$, where $N_g$ is the number of promoters (17 039). The entropy of promoter g is $H_g = -\sum_t p_{gt} \log_2 p_{gt}$, where $p_{gt} = x_{gt} / \sum_{t'} x_{gt'}$.
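A minimal sketch of this calculation is given below. It assumes that expression is the tag count divided by the number of mapped tags, that the pseudocount variant adds one tag per promoter to each sample, and that entropy is measured in bits over expression levels normalized across samples; the function names are illustrative.

```python
import math

def expression_levels(tag_counts, mapped_tags, n_promoters=None):
    """Per-sample expression of one promoter.
    tag_counts[t]: tags for this promoter in sample t; mapped_tags[t]: mapped tags in t.
    If n_promoters is given, add a pseudocount of one tag per promoter."""
    if n_promoters is None:
        return [c / n for c, n in zip(tag_counts, mapped_tags)]
    return [(c + 1) / (n + n_promoters) for c, n in zip(tag_counts, mapped_tags)]

def entropy_bits(levels):
    """Shannon entropy (bits) of expression levels normalized across samples."""
    total = sum(levels)
    return sum((x / total) * math.log2(total / x) for x in levels if x > 0)

depths = [1_000_000] * 4   # hypothetical mapped-tag totals for four samples
print(entropy_bits(expression_levels([7, 0, 0, 0], depths)))
print(entropy_bits(expression_levels([7, 0, 0, 0], depths, n_promoters=17039)))
```

Without pseudocounts, a promoter seen in a single sample has entropy 0; with pseudocounts, the same counts give a positive entropy.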

Simulated promoters

Each simulated promoter is uniformly expressed in all cell types, 50% of cell types (chosen at random per promoter), 25% of cell types or 10% of cell types. Each simulated promoter has the same total tag count as the corresponding real promoter. Each tag was randomly assigned to a cell type, with probability proportional to the number of mapped tags for that cell type.
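The simulation can be sketched as below. The function `simulate_promoter`, the rounding of the number of expressing cell types and the toy sequencing depths are assumptions made for illustration.

```python
import random

def simulate_promoter(total_tags, mapped_tags, fraction, rng):
    """Distribute total_tags over a random subset of cell types.
    mapped_tags[t]: number of mapped tags in cell type t.
    fraction: fraction of cell types in which the promoter is expressed."""
    n_types = len(mapped_tags)
    n_expressing = max(1, round(fraction * n_types))
    expressing = rng.sample(range(n_types), n_expressing)
    weights = [mapped_tags[t] for t in expressing]
    counts = [0] * n_types
    # each tag goes to one expressing cell type, with probability proportional to its depth
    for t in rng.choices(expressing, weights=weights, k=total_tags):
        counts[t] += 1
    return counts

rng = random.Random(0)
depths = [1_000_000, 2_000_000, 500_000, 1_500_000]   # hypothetical mapped-tag totals
print(simulate_promoter(100, depths, 0.5, rng))
```

Weighting by the number of mapped tags corresponds to uniform expression (in tags per million) across the expressing cell types.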

CpG and %G+C

We counted CpGs and %G+C in the length-201 sequence centered on each promoter.

TATA motifs

We found the highest-scoring TATA match within the length-(forumla) sequence centered on each promoter, on the coding strand. The score for base x at position k in the motif is forumla, where ckx is the count for base x at position k in JASPAR matrix MA0108.2 (24).

Percentage identity versus mouse

We measured the percentage identity for the length-100 sequence immediately upstream of each promoter’s central base. Percentage identity here means the percentage of human bases that are aligned to an identical mouse base, in the hg19 vsMm9 axtNet files from the University of California, Santa Cruz (UCSC) genome database (23).

Miscellaneous

We used RefSeq annotations from FANTOM5’s 1 January 2012 snapshot of the UCSC genome database. We calculated correlation coefficients and their P-values using the R function cor.test, and partial correlations using the R package ppcor (25).

RESULTS

Sampling depth confounds expression specificity

We attempted to quantify each promoter’s expression specificity, using entropy, which varies from 0 for promoters expressed in a single sample to $\log_2$(number of samples) for promoters with perfectly uniform expression (12). As mentioned above, entropy estimates from limited samples have a systematic bias, but it is not obvious whether this will be significant or negligible in our case. To examine this, we performed a simulation: we took the 3287 promoters with $\geq 10^5$ tags, randomly sampled $10^2$, $10^3$ or $10^4$ tags from each promoter and calculated the entropies. The entropy tends to decrease as the sample size decreases (Figure 2, left column). The difference between $10^3$ and $10^4$ looks small, but it is large relative to the tightness of the distribution.

Figure 2.

Sampling depth confounds three measures of expression specificity. We took the 3287 promoters with at least $10^5$ tags, and from each promoter, we randomly sampled (with replacement) $10^4$ tags (upper row), 1000 tags (middle row) or 100 tags (lower row). The lower-right panel is empty because with 100 tags and >200 cell types the median expression is necessarily 0.

The underlying problem is that, if we get a limited sample of tags from a promoter with fairly uniform expression across cell types, the expression looks more spiky than it really is. One possible solution is to use pseudocounts, i.e. add one to each promoter’s tag count in each sample (12). In our simulation, pseudocounts did not solve the problem: the entropy still tends to decrease with decreasing sample size (Figure 2, middle column). Again, the decrease in entropy looks small, but it is large relative to the tightness of the distribution. Moreover, with $10^2$ tags there are more pseudocounts than real counts, so the specificity estimates are not believable.

An alternative measure of specificity is a promoter’s maximum expression level in any sample divided by its median expression level (34). This varies from one for uniform expression, to large values for specific expression. In our simulation, this also exhibits a bias: the apparent specificity increases with decreasing sample size, even though the real specificity does not change (Figure 2, right column).

In summary, it is not straightforward to quantify expression specificity in a way that does not correlate artifactually with sampling depth. This artifact matters because tag count correlates with other properties, such as average expression level and CpG content. So, for example, it might introduce a spurious correlation between CGIs and apparent breadth of expression across cell types. An important message is that future studies need to be careful when assessing expression specificity. In this study, we solve the problem by using the entropy of a random sample of 100 tags from each promoter, thus fixing the sampling depth to a constant.
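A fixed-depth entropy of this kind might be computed as in the sketch below. It assumes that the 100 tags are drawn without replacement and that the entropy is taken directly over the drawn tags' sample labels, without any further normalization by sequencing depth; the function name is illustrative.

```python
import math
import random
from collections import Counter

def entropy_at_fixed_depth(tag_counts, depth=100, rng=random):
    """Entropy (bits) of `depth` tags drawn at random from one promoter.
    tag_counts[t] is the promoter's tag count in sample t."""
    # expand counts into one sample label per tag, then draw without replacement
    labels = [t for t, c in enumerate(tag_counts) for _ in range(c)]
    drawn = Counter(rng.sample(labels, depth))
    return sum((c / depth) * math.log2(depth / c) for c in drawn.values())

rng = random.Random(0)
print(entropy_at_fixed_depth([1000, 1000, 1000, 1000], rng=rng))   # broad toy promoter
print(entropy_at_fixed_depth([4000, 0, 0, 0], rng=rng))            # specific toy promoter
```

Because every promoter is reduced to the same number of tags, promoters with fewer than `depth` tags cannot be scored and would have to be omitted.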

TSS spread

We quantified each promoter’s TSS spread using two standard measures of spread: interquartile range (IQR) and standard deviation (SD). That is, for every tag in the promoter, we noted its start coordinate, and calculated the IQR and SD of these coordinates. IQR shows a bimodal distribution (Figure 3A): there is one class of promoters with narrow spread (IQR < 7, peaking around 2), and another with wide spread (IQR > 7, peaking around 20). This agrees with previous findings (2,5). It is a bit disturbing that SD does not show clear bimodality (Figure 3B), but the distribution has a bulge at low SD that could be interpreted as the narrow-TSS-spread class.
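Both measures of spread can be sketched with the Python standard library. The quartile convention (the "inclusive" method) and the use of the population SD are assumptions, since the text does not specify them; the coordinates are toy values.

```python
import statistics

def tss_spread(tag_starts):
    """Return (IQR, SD) of the start coordinates of a promoter's tags."""
    q1, _, q3 = statistics.quantiles(tag_starts, n=4, method="inclusive")
    return q3 - q1, statistics.pstdev(tag_starts)

narrow = [500] * 8 + [501] * 2   # almost all tags start at one base
wide = list(range(480, 521))     # tags spread evenly over 41 bases

print(tss_spread(narrow))
print(tss_spread(wide))
```

IQR ignores the outermost quarter of tags on each side, whereas SD is sensitive to a few distant tags, which may be one reason the two measures have differently shaped distributions.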

Figure 3.

Histograms of 10 promoter properties: chromosomal spread of TSSs measured by IQR (A) and SD (B), observed/expected CpGs (C), G+C percentage (D), score of TATA motif match (E), evolutionary conservation (F), breadth of expression across cell types (G), number of CAGE tags (H), average expression level (I), and maximum expression level (J). Expression breadth is the entropy (in bits) of a random sample of 100 tags from a promoter.

Expression specificity

Most promoters show broad expression across many cell and tissue samples (Figure 3G). It might be objected that this is due to the tissue samples being mixtures of cell types. To address this concern, we also measured expression breadth using the ‘primary cell’ samples only. This hardly changes the picture (Figure 4A), although there is a small peak of promoters with highly cell-specific expression (entropy close to 0).

Figure 4.

Histograms of promoters’ expression breadth across cell types, using FANTOM5 ‘primary cells’ only. Expression breadth is the entropy (in bits) of a random sample of 100 tags from a promoter. (A) Real data. (B) Simulated data, where each promoter has uniform expression across all cell types. (C) Simulated data, where each promoter has uniform expression across 50% of the cell types (chosen at random) and 0 in the other 50%. (D) Simulated data: uniform expression across 25% of cell types and 0 in the others. (E) Simulated data: uniform expression across 10% of cell types and 0 in the others.

To calibrate our entropy scale, we show results for four simulated data sets: promoters with uniform expression in 100, 50, 25 and 10% of the cell types (Figure 4B–E). This confirms that most real promoters are broadly expressed across many cell types, though not uniformly expressed in all cells. For instance, the real promoters have a median entropy of 5.41 (Figure 4A), similar to the simulated promoters with uniform expression in 50% of cell types (Figure 4C). Only 18% of the real promoters have entropy <4 (similar to the simulated promoters with uniform expression in 10% of cell types), which might be considered somewhat specific. Promoters expressed in just one cell type have entropy = 0, and there are only 48 of those.

Curiously, 31 of the 48 cell-specific promoters are active in hepatocytes. They include promoters for ADH1A (alcohol dehydrogenase 1A), ALDOB (aldolase B), coagulation factors F9 and F12 and several cytochrome P450 genes.

That most promoters are broadly expressed was also shown using a richness index (34) (Supplementary Note 3). Richness is simply the number of samples in which we would expect to find one or more tags for a given promoter, if that promoter had (say) 100 tags in total (11). For most promoters, if we had 10 tags in total, they would likely be distributed among 9 or 10 different samples. Thus, most promoters are expressed almost as uniformly across samples as they could possibly be.
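The richness calculation can be sketched as follows. This version assumes that tags are drawn independently in proportion to a promoter's observed tag counts, which is a simplification and may differ from the exact procedure in Supplementary Note 3; the function name is illustrative.

```python
def expected_richness(tag_counts, n_tags=10):
    """Expected number of samples that receive at least one of n_tags tags,
    drawing tags independently in proportion to tag_counts."""
    total = sum(tag_counts)
    return sum(1 - (1 - c / total) ** n_tags for c in tag_counts)

print(round(expected_richness([1] * 517), 2))         # perfectly uniform over 517 samples
print(round(expected_richness([1] + [0] * 516), 2))   # confined to a single sample
```

For perfectly uniform expression across 517 samples, 10 tags are expected to fall in just under 10 distinct samples, which is the upper limit against which real promoters can be compared.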

Other promoter properties

We also measured the following:

  • %G+C.

  • Observed/expected CpGs, i.e. the observed number of CG dinucleotides divided by the number expected from %G+C.

  • Score of strongest TATA motif match. We found the best match in every promoter, no matter how weak, so this will be a mixture of real signals and random matches.

  • Percentage sequence identity versus mouse.

  • Tag count.

  • Average expression level (tags per million, tpm).

  • Maximum expression level (tpm).

Their distributions are not too surprising (Figure 3). CpG richness is bimodally distributed (18). Some promoters have 0% identity versus mouse, either because they are in large evolutionary insertions or deletions, or because the UCSC genome alignments failed to align them.
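The two composition measures can be sketched as below. The expected CpG count here assumes that C and G are equally frequent and occur independently, so that it depends only on %G+C; this is one plausible reading of "expected from %G+C", not necessarily the exact formula used.

```python
def gc_and_cpg(seq):
    """Return (%G+C, observed/expected CpG) for a DNA sequence.
    Expected CpGs assume C and G are equally frequent and independent."""
    seq = seq.upper()
    gc_fraction = (seq.count("C") + seq.count("G")) / len(seq)
    observed = seq.count("CG")
    # number of dinucleotide positions times the probability of C followed by G
    expected = (len(seq) - 1) * (gc_fraction / 2) ** 2
    ratio = observed / expected if expected > 0 else float("nan")
    return 100 * gc_fraction, ratio

print(gc_and_cpg("ATGCGCGTACGATATTTACG"))   # toy sequence, not a real promoter
```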

Expression specificity is independent of maximum expression level

It is not obvious how expression specificity relates to expression level. One hypothesis is that promoters’ expression specificity should be independent of their average expression level across cell types. This fits a picture where each promoter has a default expression level, which can be upregulated in some cell types and downregulated in others. The more the regulation, the greater is the specificity. In this case, we would expect specificity to correlate with maximum expression level.

Another hypothesis is that promoters’ specificity should be independent of their maximum expression level. This fits a picture where each promoter has a maximum expression level, which is near-fully achieved in more or fewer cell types. The fewer the cell types where the promoter is near-fully on, the greater is the specificity. In this case, we would expect specificity to anticorrelate with average expression level.

In fact, expression breadth (entropy of a sample of 100 tags) is almost independent of maximum expression level, and correlates with average expression level (Figure 5). This fits the second hypothesis.

Figure 5.

Pairwise correlations among nine promoter properties. The numbers in the upper-right boxes are Kendall rank-order correlation coefficients (Txy). Statistically significant (two-sided forumla) coefficients are in red, and nonsignificant ones are in black.

Correlations

We examined pair-wise correlations among nine promoter properties (Figure 5). The picture is complex: nearly all pairs are significantly correlated or anticorrelated. Some of these associations could be indirect. For example, there is a correlation between expression breadth and %G+C, but there are stronger correlations between each of these and germ cell expression. So the correlation between expression breadth and %G+C might be merely an indirect consequence of the stronger correlations.
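One hypothetical way to formalize "explained by stronger correlations" is sketched below: an association between x and y is flagged as potentially indirect if some third property z is more strongly associated with both, with signs that are compatible. This rule and the function `putatively_direct` are our assumptions for illustration; the example coefficients are the three pairwise values reported in Table 1.

```python
from itertools import combinations

def putatively_direct(tau):
    """tau: dict {frozenset({x, y}): correlation coefficient} for all pairs.
    An association x-y is treated as potentially indirect if some third property z
    is more strongly associated with both x and y, with compatible signs."""
    properties = sorted(set().union(*tau))
    direct = {}
    for x, y in combinations(properties, 2):
        t_xy = tau[frozenset((x, y))]
        explained = False
        for z in properties:
            if z in (x, y):
                continue
            t_xz, t_yz = tau[frozenset((x, z))], tau[frozenset((y, z))]
            stronger = min(abs(t_xz), abs(t_yz)) > abs(t_xy)
            same_sign = (t_xz * t_yz > 0) == (t_xy > 0)
            if stronger and same_sign:
                explained = True
        if not explained:
            direct[(x, y)] = t_xy
    return direct

# Pairwise coefficients from Table 1 (CpG = observed/expected CpGs)
tau = {
    frozenset(("CpG", "TATA")): -0.259,
    frozenset(("CpG", "%G+C")): 0.491,
    frozenset(("TATA", "%G+C")): -0.428,
}
print(putatively_direct(tau))
```

With these three values, the CpG and TATA association is the one flagged as potentially indirect, via %G+C.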

We attempted to distinguish direct from indirect associations, by discarding all associations that can potentially be explained by stronger ones. This leads to an interesting model that explains all the correlations (Figure 6). Strikingly, almost all of the putatively direct associations have simple explanations:

Figure 6.

A model for correlations among several promoter properties. Solid lines represent positive correlations and dashed lines represent negative correlations. Line widths are proportional to correlation strength.

Average tpm and germ cell average tpm

This is the strongest correlation in Figure 5. It is not surprising because average tpm includes germ cell tpm. (Therefore, average tpm must be $\geq \frac{12}{517} \times$ germ cell average tpm.)

Average tpm and maximum tpm

This is the second-strongest correlation. It is also not surprising because average tpm must lie between maximum tpm and (maximum tpm)/517. Accordingly, the scatterplot lies in a diagonal band between these bounds (Figure 5).

Germ cell expression and expression breadth

In fact, expression breadth correlates not only with germ cell expression, it correlates with expression in any given cell type. For example, it correlates with expression in kidney (forumla) and lung (forumla) samples. This can be explained by the following two observations: (i) expression breadth is independent of maximum expression level; (ii) the greater a promoter’s breadth of expression, the more likely it is to be expressed near its maximum level in any given cell type.

Observed/expected CpGs and %G+C

This correlation can be explained by the methylation/mutation mechanism behind CpG depletion. This mechanism strongly depletes CG dinucleotides, and weakly depletes C and G mononucleotides (26). Because this mechanism acts more strongly in some promoters than others, it causes a correlation between %G+C and observed/expected CpGs. Another possible explanation is GC-biased gene conversion during recombination. This phenomenon increases %G+C in recombination-prone regions of the genome (27), whereas DNA methylation and transcriptional silencing suppress recombination (28).

Germ cell expression and observed/expected CpGs

This correlation fits the simple explanation of CGIs that they are due to nonmethylation of active promoters in germ line cells (29).

%G+C and TATA score

There are perhaps two reasons for this anticorrelation because our TATA matches include both real signals and random matches. The first is trivial: random TATA matches are likely to be weaker in GC-rich sequences. The second is less trivial: we suggest that real TATA signals are evolutionarily favored in GC-poor sequences. The assumption is that many promoters can function equally well with either a real TATA signal, or alternative non-TATA promoter signals. For such promoters, if mutational patterns tend to enrich for A and T, evolution is more likely to produce real TATA signals.

TSS spread and maximum tpm

TSS spread (IQR) is only weakly associated with the other properties (Figure 5). The strongest association is anticorrelation with maximum tpm. We cannot explain this as easily as the other associations. One possible explanation is that, when the maximum tpm is low, the CAGE tags might include a proportionally larger amount of diffuse ‘noise’.

Evolutionary conservation and maximum tpm

Evolutionary conservation (Percentage identity versus mouse) is only weakly associated with the other properties (Figure 5). The strongest correlation is with maximum tpm. This can be explained by the plausible hypothesis that promoters with higher maximum expression tend (slightly) to be more important and conserved in mammalian biology.

TSS spread and TATA score

TSS spread (IQR) is weakly anticorrelated with TATA score (Figure 5). This makes sense because the TATA motif is one of several signals that influence TSS position (9).

Partial correlations

We have suggested that many of the associations in Figure 5 are indirect, i.e. merely consequences of other direct associations. This can be tested using partial correlation. The partial correlation coefficient $T_{xy.z}$ indicates the correlation between x and y after eliminating the influence of z. Thus, if x and y are indirectly correlated via z, $T_{xy.z}$ will be near-zero and nonsignificant.

Unfortunately, there is a practical limitation. If x and y are indirectly correlated via z, but our measurements of z are not perfect, then $T_{xy.z}$ can be highly significant (30).
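For a single conditioning variable, Kendall's partial rank correlation has a standard closed form in terms of the three pairwise coefficients. The sketch below applies it to the pairwise values reported in Table 1; the function name `partial_tau` is ours, and the ppcor package is not called here.

```python
import math

def partial_tau(t_xy, t_xz, t_yz):
    """Kendall partial rank correlation of x and y given z, from pairwise coefficients."""
    return (t_xy - t_xz * t_yz) / math.sqrt((1 - t_xz ** 2) * (1 - t_yz ** 2))

# x = CpG, y = TATA, z = %G+C, using the pairwise coefficients in Table 1
print(round(partial_tau(-0.259, 0.491, -0.428), 3))

# x = CpG, y = %G+C, z = TATA
print(round(partial_tau(0.491, -0.259, -0.428), 3))
```

Up to rounding of the inputs, these agree with the partial coefficients listed in Table 1 (–0.0619 and 0.435).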

Table 1 shows some examples. It suggests that the anticorrelation between TATA score and observed/expected CpGs is mostly, but not entirely, explained by the associations of both with %G+C. One interpretation of this is that %G+C is an imperfect proxy for the mutational patterns that influence TATA evolution.

Table 1.

Partial correlations

x ∼ y         z      Txy (p)      Txy.z (p)
CpG ∼ TATA    %G+C   –0.259 (0)   –0.0619 (9.01e–34)
CpG ∼ %G+C    TATA   0.491 (0)    0.435 (0)
TATA ∼ %G+C   CpG    –0.428 (0)   –0.358 (0)
CpG ∼ ExB     GCE    0.366 (0)    0.208 (0)
CpG ∼ GCE     ExB    0.407 (0)    0.28 (0)
ExB ∼ GCE     CpG    0.493 (0)    0.404 (0)

Txy, Kendall rank-order correlation coefficient; Txy.z, Kendall partial rank-order correlation coefficient; p, two-sided P-value; ExB, expression breadth (entropy of a random sample of 100 tags); GCE, germ cell expression (average tpm). (The correlation between CpG and GCE differs from that in Figure 5 because here we omitted 7.6% of promoters with <100 tags.)

On the other hand, the correlation between expression breadth and observed/expected CpGs seems to be only partly explained by the correlation of both with germ cell expression (Table 1). However, our measurement of germ cell expression is extremely imperfect because our germ cell data comes entirely from cell lines. So it remains possible that the correlation between expression breadth and observed/expected CpGs is entirely indirect via germ line activity. In future, it will be informative to get CAGE data from various types of primary germ line cell, given their importance for sequence evolution.

A closer look at TATA motifs

There is more than one way to find TATA motifs. So far, we have looked for good matches to a TATA model relative to a uniform background model. It is also possible to use a per-promoter background model, which uses the base frequencies of each promoter. With a per-promoter background model, the score of a given TATA sequence is reduced if the promoter is AT-rich (because TATA matches are less surprising), and increased if the promoter is AT-poor. So it is not obvious whether our conclusions will be the same.
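The effect of the background model can be illustrated with a toy log-odds scorer. The count matrix below is hypothetical and is NOT the JASPAR MA0108.2 matrix, and no pseudocounts are applied; the sketch only shows how switching from a uniform to a per-promoter background changes the score of the best match.

```python
import math

# Hypothetical toy count matrix (one dict per motif position); not the JASPAR counts
TOY_COUNTS = [
    {"A": 2, "C": 1, "G": 1, "T": 16},
    {"A": 17, "C": 1, "G": 1, "T": 1},
    {"A": 1, "C": 1, "G": 1, "T": 17},
    {"A": 16, "C": 1, "G": 1, "T": 2},
]

def best_match_score(seq, counts, background):
    """Highest log-odds score (bits) of the motif over all windows of seq."""
    width = len(counts)
    best = -math.inf
    for i in range(len(seq) - width + 1):
        score = 0.0
        for k, base in enumerate(seq[i:i + width]):
            freq = counts[k][base] / sum(counts[k].values())
            score += math.log2(freq / background[base])
        best = max(best, score)
    return best

def base_frequencies(seq):
    """Per-promoter background: base frequencies of the sequence itself."""
    return {b: seq.count(b) / len(seq) for b in "ACGT"}

uniform = {b: 0.25 for b in "ACGT"}
at_rich = "ATTATATAATTAACGT"   # toy AT-rich sequence containing TATA
print(best_match_score(at_rich, TOY_COUNTS, uniform))
print(best_match_score(at_rich, TOY_COUNTS, base_frequencies(at_rich)))
```

For this AT-rich toy sequence, the per-promoter background gives a lower best score than the uniform background, as described above.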

In fact, our conclusions do not change when we use a per-promoter background model. The anticorrelation between TATA score and %G+C is reduced, but not greatly so (Figure 7). This anticorrelation is not surprising because we observe an anticorrelation if we randomly shuffle each promoter sequence (Figure 7).

Figure 7.

Correlation between TATA motif and G+C composition. (A) Real promoters, uniform background model. (B) Real promoters, per-promoter background model. (C) Shuffled promoters, uniform background model. (D) Shuffled promoters, per-promoter background model. The inset numbers are Kendall rank-order correlation coefficients (Txy).

Promoter classes

It has been suggested that there are two (or perhaps three) classes of promoter: one class with wide TSS spread, often CGIs, fast evolution and broad expression across cell types; and another with narrow TSS spread, low CpG content, often TATA signals, slow evolution and cell-restricted expression (2,7,8). On one hand, the correlations in Figure 5 do not suggest clear-cut classes. On the other hand, Figure 3 suggests two CpG classes (high and low), and two TSS spread classes (wide and narrow). So the question arises of whether the high-CpG class is the wide-spread class.

To visualize this intuitively, we show a mosaic plot reflecting the numbers of promoters in the four possible combinations of high/low CpG with wide/narrow TSS spread (Figure 8). This confirms there is a correlation, but in our opinion the correlation is not strong enough to say that the high-CpG class is the wide-spread class.
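The quantities behind such a mosaic plot are simply the fractions of promoters in each of the four cells. A minimal sketch follows; the cutoffs, the units of TSS spread and the toy values are all hypothetical, since the class boundaries used in this study are not reproduced here.

```python
from collections import Counter

def mosaic_fractions(promoters, cpg_cutoff, spread_cutoff):
    """Fraction of promoters in each (CpG class, TSS-spread class) cell.

    promoters: iterable of (observed/expected CpG, TSS spread) pairs.
    Both cutoffs are hypothetical parameters chosen by the caller.
    """
    cells = Counter(
        ("high" if cpg >= cpg_cutoff else "low",
         "wide" if spread >= spread_cutoff else "narrow")
        for cpg, spread in promoters
    )
    total = sum(cells.values())
    return {cell: n / total for cell, n in cells.items()}

# Made-up toy values, for illustration only.
toy = [(0.9, 40), (0.8, 35), (0.85, 5), (0.2, 4), (0.3, 6), (0.25, 30)]
fractions = mosaic_fractions(toy, cpg_cutoff=0.5, spread_cutoff=20)
for cell, frac in sorted(fractions.items()):
    print(cell, round(frac, 2))
```

If the two classifications coincided, the off-diagonal cells (high CpG with narrow spread, low CpG with wide spread) would be nearly empty.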

Figure 8.

Mosaic plot of high/low CpG versus high/low chromosomal spread of TSSs. The area of each rectangle is proportional to the number of promoters in each category.

We still lack explanations for why CpG rate and TSS spread are bimodal. Presumably, the CpG rates indicate that many promoters are either usually (housekeeping) or rarely active in the germ line, and fewer are active for intermediate amounts of time.

DISCUSSION

The first message of this study is that the definition of promoters, and of their expression specificity, is partly arbitrary, and that there is a danger of artifactual correlations with other promoter properties. The underlying reality is a rugged landscape of transcription initiation at each genomic nucleotide, and promoters (peaks in this landscape) are a partly subjective abstraction. Simple promoter definitions based on nearness of CAGE tags produce wider promoters in regions with more tags, introducing an artifactual correlation between TSS spread and expression level. This in turn causes artifactual indirect correlations, e.g. between TSS spread and CGIs.

The definition of expression specificity is also partly arbitrary, and typical measures such as Shannon entropy have a bias when applied to a sample of expression counts: the measured specificity tends to decrease as the sample size increases. This causes a spurious anticorrelation between expression level and specificity.
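This sampling bias can be illustrated with a small simulation, assuming a promoter whose true expression is perfectly uniform across a hypothetical 50 cell types. The entropy calculated directly from a random sample of tags underestimates the true entropy, and more so for smaller samples, so weakly expressed promoters look more specific than they are. The number of cell types, the sample sizes and the function names are illustrative choices.

```python
import random
from collections import Counter
from math import log2

def sample_entropy(counts):
    """Shannon entropy (bits) calculated directly from tag counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def mean_sampled_entropy(n_tags, n_cell_types=50, n_trials=500, seed=0):
    """Mean entropy of n_tags tags drawn uniformly from n_cell_types."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        tags = Counter(rng.randrange(n_cell_types) for _ in range(n_tags))
        total += sample_entropy(tags.values())
    return total / n_trials

true_entropy = log2(50)  # entropy of the uniform population
for n_tags in (10, 100, 1000):
    print(n_tags, round(mean_sampled_entropy(n_tags), 2),
          "true:", round(true_entropy, 2))
```

Sampling the same number of tags from every promoter (100 in this study) does not remove the bias, but it makes the bias the same for all promoters, so it no longer depends on expression level.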

To understand the correlations among promoter properties, the first step must be to avoid such biases. This study has done so in crude but effective ways (fixed promoter width and sample size). A frightening point is that these biases were not obvious to us initially, and similar biases probably exist undetected in other genomic studies. This seems especially likely in large projects with many contributors and diverse results. Genomic research is perhaps especially prone to statistical artifacts (30–32).

The second message of this study is that most promoters have rather nonspecific (but not perfectly uniform) expression across many cell types. Some other studies have emphasized that most promoters have nonuniform regulated expression patterns (12,34). We do not contradict that, but we do emphasize that most promoters have broad, albeit not completely uniform, expression.

This result has implications for how cell types are determined, and how they evolved. We can rule out the idea that most cell types are determined by expressing hundreds of unique genes. (Because there are hundreds of human cell types (17), if most expressed hundreds of unique genes, there would have to be tens of thousands of cell-specific promoters, which is contradicted by Figure 4.) Actually, this is obvious when we consider that humans have many more cell types than some invertebrates, but not many more genes. It implies that cell types are determined by unique combinations of genes, or perhaps by what genes they do not express.

It has been suggested that cell types evolved by three main mechanisms: (i) segregation of functions, starting from multifunctional ancestral cell types, with loss of gene expression in descendant cells; (ii) divergence of functions, often driven by gene duplication and divergence; and (iii) acquisition of new functions, sometimes by co-option of genes from other cell types (33). Our data suggest that loss of gene expression and co-option may have been the major mechanisms.

Our third message is that promoters’ expression breadth is independent of their maximum expression level, and therefore correlates with average expression level (at a given maximum, a promoter expressed in more cell types tends to have a higher average). This may seem an obscure finding, but it is in fact fundamental, and it is not obvious a priori. In evolutionary terms, it suggests that genes became tissue-specific mainly by being downregulated, rather than upregulated, in newly evolving cell types.

Our fourth message is that the intriguing correlations between basic properties of promoters can almost all be explained simply. The heart of this explanation is that expression breadth correlates with expression level in any given class of cells, including germ cells, and germ cell expression reduces the CpG mutation rate. In addition, TATA motifs are naturally anticorrelated with %G+C and CpGs, and they reduce TSS spread. Thus we need not invoke any direct functional relationship between CGIs and expression breadth or TSS spread.

These explanations are consistent with the parsimonious theory of CGIs, that they are nonfunctional consequences of mutation patterns in the germ line. It has been shown that CGI-containing sequences have a nucleosome-destabilizing function (19). However, correlation is not necessarily causation. It is possible that some unknown sequence property causes the nucleosome destabilization, which might then lead to expression in germ cells, causing CGI evolution.

FUNDING

FANTOM5 was made possible by a research grant for the RIKEN Omics Science Center from the Ministry of Education, Culture, Sports, Science and Technology, Japan (MEXT) (to Y.H.) and a grant of the Innovative Cell Biology by Innovative Technology (Cell Innovation Program) from MEXT, Japan (to Y.H.). Funding for open access charge: AIST (National Institute for Advanced Industrial Science and Technology).

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

M.C.F. especially thanks Al Forrest, Charles Plessy, Kiyoshi Asai, Michiaki Hamada and Paul Horton for helpful discussions. The authors thank all members of the FANTOM5 consortium for contributing to generation of samples and analysis of the data set as well as GeNAS for data production.

REFERENCES

1. Itoh M, Kojima M, Nagao-Sato S, Saijo E, Lassmann T, Kanamori-Katayama M, Kaiho A, Lizio M, Kawaji H, Carninci P, et al. Automated workflow for preparation of cDNA for cap analysis of gene expression on a single molecule sequencer. PLoS One, 2012, vol. 7, pg. e30809.
2. Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J, Semple CA, Taylor MS, Engstrom PG, Frith MC, et al. Genome-wide analysis of mammalian promoter architecture and evolution. Nat. Genet., 2006, vol. 38, pg. 626-635.
3. Smale ST, Kadonaga JT. The RNA polymerase II core promoter. Annu. Rev. Biochem., 2003, vol. 72, pg. 449-479.
4. Suzuki Y, Taira H, Tsunoda T, Mizushima-Sugano J, Sese J, Hata H, Ota T, Isogai T, Tanaka T, Morishita S, et al. Diverse transcriptional initiation revealed by fine, large-scale mapping of mRNA start sites. EMBO Rep., 2001, vol. 2, pg. 388-393.
5. Ni T, Corcoran DL, Rach EA, Song S, Spana EP, Gao Y, Ohler U, Zhu J. A paired-end sequencing strategy to map the complex landscape of transcription initiation. Nat. Methods, 2010, vol. 7, pg. 521-527.
6. Rach EA, Winter DR, Benjamin AM, Corcoran DL, Ni T, Zhu J, Ohler U. Transcription initiation patterns indicate divergent strategies for gene regulation at the chromatin level. PLoS Genet., 2011, vol. 7, pg. e1001274.
7. Valen E, Sandelin A. Genomic and chromatin signals underlying transcription start-site selection. Trends Genet., 2011, vol. 27, pg. 475-485.
8. Lenhard B, Sandelin A, Carninci P. Metazoan promoters: emerging characteristics and insights into transcriptional regulation. Nat. Rev. Genet., 2012, vol. 13, pg. 233-245.
9. Frith MC, Valen E, Krogh A, Hayashizaki Y, Carninci P, Sandelin A. A code for transcription initiation in mammalian genomes. Genome Res., 2008, vol. 18, pg. 1-12.
10. Balwierz PJ, Carninci P, Daub CO, Kawai J, Hayashizaki Y, Van Belle W, Beisel C, van Nimwegen E. Methods for analyzing deep sequencing expression data: constructing the human and mouse promoterome with deepCAGE data. Genome Biol., 2009, vol. 10, pg. R79.
11. Hurlbert SH. The non-concept of species diversity: a critique and alternative parameters. Ecology, 1971, vol. 52, pg. 577-586.
12. Schug J, Schuller WP, Kappen C, Salbaum JM, Bucan M, Stoeckert CJ. Promoter features related to tissue specificity as measured by Shannon entropy. Genome Biol., 2005, vol. 6, pg. R33.
13. Herzel H, Schmitt A, Ebeling W. Finite sample effects in sequence analysis. Chaos Solitons Fractals, 1994, vol. 4, pg. 97-113.
14. Roulston MS. Estimating the errors on measured entropy and mutual information. Physica D., 1999, vol. 125, pg. 285-294.
15. Bonachela JA, Hinrichsen H, Muñoz MA. Entropy estimates of small data sets. J Phys A Math Theor., 2008, vol. 41, pg. 202001.
16. Horášek M, Zvárová J. Traditional measures of diversity and sensitivity of power entropies. EJBI, 2011, vol. 7, pg. 17-21.
17. Vickaryous MK, Hall BK. Human cell type diversity, evolution, development, and classification with special reference to cells derived from the neural crest. Biol. Rev. Camb. Philos. Soc., 2006, vol. 81, pg. 425-455.
18. Saxonov S, Berg P, Brutlag DL. A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc. Natl Acad. Sci. USA, 2006, vol. 103, pg. 1412-1417.
19. Ramirez-Carrozzi VR, Braas D, Bhatt DM, Cheng CS, Hong C, Doty KR, Black JC, Hoffmann A, Carey M, Smale ST. A unifying model for the selective regulation of inducible transcription by CpG islands and nucleosome remodeling. Cell, 2009, vol. 138, pg. 114-128.
20. Thomson JP, Skene PJ, Selfridge J, Clouaire T, Guy J, Webb S, Kerr AR, Deaton A, Andrews R, James KD, et al. CpG islands influence chromatin structure via the CpG-binding protein Cfp1. Nature, 2010, vol. 464, pg. 1082-1086.
21. Cohen NM, Kenigsberg E, Tanay A. Primate CpG islands are maintained by heterogeneous evolutionary regimes involving minimal selection. Cell, 2011, vol. 145, pg. 773-786.
22. Pruitt KD, Tatusova T, Brown GR, Maglott DR. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res., 2012, vol. 40, pg. D130-D135.
23. Meyer LR, Zweig AS, Hinrichs AS, Karolchik D, Kuhn RM, Wong M, Sloan CA, Rosenbloom KR, Roe G, Rhead B, et al. The UCSC Genome Browser database: extensions and updates 2013. Nucleic Acids Res., 2013, vol. 41, pg. D64-D69.
24. Bryne JC, Valen E, Tang MH, Marstrand T, Winther O, da Piedade I, Krogh A, Lenhard B, Sandelin A. JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res., 2008, vol. 36, pg. D102-D106.
25. Kim SH, Yi SV. Correlated asymmetry of sequence and functional divergence between duplicate proteins of Saccharomyces cerevisiae. Mol. Biol. Evol., 2006, vol. 23, pg. 1068-1075.
26. Arndt PF, Burge CB, Hwa T. DNA sequence evolution with neighbor-dependent mutation. J. Comput. Biol., 2003, vol. 10, pg. 313-322.
27. Duret L, Arndt PF. The impact of recombination on nucleotide substitutions in the human genome. PLoS Genet., 2008, vol. 4, pg. e1000071.
28. Lees-Murdock DJ, Walsh CP. DNA methylation reprogramming in the germ line. Epigenetics, 2008, vol. 3, pg. 5-13.
29. Ponger L, Duret L, Mouchiroud D. Determinants of CpG islands: expression in early embryo and isochore structure. Genome Res., 2001, vol. 11, pg. 1854-1860.
30. Drummond DA, Raval A, Wilke CO. A single determinant dominates the rate of yeast protein evolution. Mol. Biol. Evol., 2006, vol. 23, pg. 327-337.
31. Semon M, Mouchiroud D, Duret L. Relationship between gene expression and GC-content in mammals: statistical significance and biological relevance. Hum. Mol. Genet., 2005, vol. 14, pg. 421-427.
32. Plotkin JB, Fraser HB. Assessing the determinants of evolutionary rates in the presence of noise. Mol. Biol. Evol., 2007, vol. 24, pg. 1113-1121.
33. Arendt D. The evolution of cell types in animals: emerging principles from molecular studies. Nat. Rev. Genet., 2008, vol. 9, pg. 868-882.
34. Forrest ARR, Kawaji H, Rehli M, Baillie JK, de Hoon MJL, Haberle V, Lassmann T, Kulakovskiy IV, Lizio M, Itoh M, et al. A promoter level mammalian expression atlas. Nature, 2014, doi: 10.1038/nature13182.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
