Interindividual variation in cytosine modifications could contribute to heterogeneity in disease risks and other complex traits. We assessed the genetic architecture of cytosine modifications at 283 540 CpG sites in lymphoblastoid cell lines (LCLs) derived from independent samples of European and African descent. Our study suggests that cytosine modification variation was primarily controlled in local by single major modification quantitative trait locus (mQTL) and additional minor loci. Local genetic epistasis was detectable for a small proportion of CpG sites, which were enriched by more than 9-fold for CpG sites mapped to population-specific mQTL. Genetically dependent CpG sites whose modification levels negatively (repressive sites) or positively (facilitative sites) correlated with gene expression levels significantly co-localized with transcription factor binding, with the repressive sites predominantly associated with active promoters whereas the facilitative sites rarely at active promoters. Genetically independent repressive or facilitative sites preferentially modulated gene expression variation by influencing local chromatin accessibility, with the facilitative sites primarily antagonizing H3K27me3 and H3K9me3 deposition. In comparison with expression quantitative trait loci (eQTL), mQTL detected from LCLs were enriched in associations for a broader range of disease categories including chronic inflammatory, autoimmune and psychiatric disorders, suggesting that cytosine modification variation, while possesses a degree of cell linage specificity, is more stably inherited over development than gene expression variation. About 11% of unique single-nucleotide polymorphisms reported in the Genome-Wide Association Study Catalog were annotated, 78% as mQTL and 31% as eQTL in LCLs, which covered 37% of the investigated diseases/traits and provided insights to the biological mechanisms.

INTRODUCTION

Interindividual gene expression variation underlies many cellular and organism-level phenotypes; therefore, study of genetic regulation of gene expression enhances our understanding of the genetic basis of disease susceptibility and other complex traits. Previous studies utilizing the International HapMap Project (1,2) lymphoblastoid cell lines (LCLs) derived from apparently healthy individuals of major ethnic groups have demonstrated genetic contribution to gene expression in the form of single-nucleotide polymorphisms (SNPs) (37), known as expression quantitative trait loci (eQTL). Furthermore, eQTL have been shown to be enriched among SNPs associated with a broad spectrum of complex traits (8) including pharmacologic traits (9) identified from genome-wide association studies (GWAS). Some eQTL are known to affect the binding of transcription factors to regulate gene expression (10). However, in general, the functional basis and regulatory mechanisms of the identified eQTL and most GWAS loci remain to be fully characterized.

Both genetic and environmental regulation of gene expression can be mediated through epigenetic mechanisms. Genetic effects on cytosine modification were first assessed in human brain tissues to identify cytosine methylation quantitative trait loci (11,12). Local or cis-acting genetic variants were also found in LCLs that contribute to within- and between-population epigenetic variations (13,14) and that in turn correlated with gene expression (15,16). On the other hand, the relative flexibility of the epigenetic systems in response to environmental and physiological changes, for example exposure to drugs (17), development (18) and aging (19), indicates that they play a critical role in regulating gene expression beyond DNA sequence variations, which may be subject to more functional constraints and selective pressure (20). Epigenetic and genetic dissection of gene expression variation therefore will aid in our understanding of the environmental, developmental and genetic contributions to gene regulation (21). The DNA methylome is known to maintain genome integrity (22) and X chromosome inactivation (23). Cytosine methylation in cis silences gene expression and is thought to be a global repressive mechanism (24). Given the importance of cytosine modifications (primarily CpG methylation) in gene regulation, the cytosine modification quantitative trait loci (mQTL) represent a novel layer of annotations for genetic variants associated with complex traits (25).

We profiled cytosine modification levels in a panel of LCLs derived from Western African (YRI: Yoruba people from Ibadan, Nigeria) and Northern/Western European (CEU: Caucasian residents from Utah, USA) ancestry using the Illumina Infinium HumanMethylation450 BeadChip (450K array), which covers genome-wide CpG sites with high density (>480 000 CpGs) (26). The 450K array requires bisulfite conversion of genomic DNA in order to distinguish modified from unmodified cytosines. This technique, however, cannot distinguish between 5-methylcytosine and 5-hydroxymethylcytosine (27,28), a recently rediscovered cytosine modification type whose functional consequence is not fully understood (29,30). Therefore, we chose to use the term ‘cytosine modification’ to avoid any potential bias. In this study, we assessed the genetic architecture of cytosine modifications to gain novel biological insights into the genetic basis of human complex traits including disease predispositions, and characterized the epigenetic and genetic contributions to gene expression variation.

RESULTS

Genetic architecture of cytosine modifications

For a given CpG site, variation in both local sequence context and distant factors such as cytosine modification enzymes could affect cytosine modification level. To assess local and distant genetic regulation, we associated cytosine modification levels of 283 540 CpG sites with 2.3 million common SNPs genome wide by single-locus linear models. At 5% false discovery rate (FDR), 10 655 associations were detected for 1190 CpG sites in the CEU samples. The majority of mQTL located nearby their target CpG sites, as shown in Figure 1A, indicating predominantly local genetic regulation. A similar pattern was observed for the YRI samples (Fig. 1B). No association hotspots, or master regulators that controlled cytosine modification levels at multiple distant CpG sites, were observed. For a small number of mQTL that did associate with more than one CpG site, the mQTL and the target CpG sites were closely positioned, suggesting local regulation of regional CpG modification levels. If we defined mQTL on the same chromosome of the target CpG sites as local and those on different chromosomes as distant, only 11% of the associations detected in the CEU samples were distant. A proportion of these distant associations were likely due to cross-hybridization of CpG probes, despite our attempt to exclude such probes by relatively stringent criteria (31). For example, several SNPs located >2 kb upstream of nuclear receptor-binding factor 2 pseudogene 3 (NRBF2P3) on chromosome 1 appeared to be associated with the modification levels of CpG site cg15830792 located in nuclear receptor-binding factor 2 (NRBF2) on chromosome 10. Further examinations revealed that this association was due to two artifacts: (i) cross-hybridization of the probe, designed to interrogate cg15830792 within NRBF2, to a homologous sequence within NRBF2P3 and (ii) a SNP with T and C alleles located at the single-base extension site within NRBF2P3, which resulted in assay readouts highly correlating with the genotypes of the SNP and its tagged SNPs. Supplementary Material, Figure S1 illustrates how the two artifacts led to false discovery, suggesting that mismatches at the 5′ end of probes are particularly well tolerated for target hybridization and that polymorphisms at single-base extension site may need to be considered in probe quality control procedures.

Figure 1.

Genetic architecture of cytosine modifications. (A and B) For associations detected in whole-genome scan, the chromosomal positions of CpG sites were plotted against the positions of SNPs for CEU (A) and YRI (B). (C) For associations detected in ±1 Mb scan, r2 was plotted against the relative distance from SNPs to CpG sites for CEU (upper panel) and YRI (lower panel). SNPs were pruned by linkage disequilibrium (r2 > 0.3) for presentation. The ±100 kb regions of CpG sites were marked by orange vertical lines. (D) For CpG sites mapped to multiple additive mQTL (≥2), the adjusted association r2 of multiple mQTL was plotted against that of the most significant mQTL, for CEU (blue) and YRI (orange). (E) An example of local genetic epistasis showed the mean (points) and standard deviation (vertical lines) of M-values at cg00680696 according to genotypes at rs1050816 (x-axis) and rs4674390 (colored by black, blue and orange).

Figure 1.

Genetic architecture of cytosine modifications. (A and B) For associations detected in whole-genome scan, the chromosomal positions of CpG sites were plotted against the positions of SNPs for CEU (A) and YRI (B). (C) For associations detected in ±1 Mb scan, r2 was plotted against the relative distance from SNPs to CpG sites for CEU (upper panel) and YRI (lower panel). SNPs were pruned by linkage disequilibrium (r2 > 0.3) for presentation. The ±100 kb regions of CpG sites were marked by orange vertical lines. (D) For CpG sites mapped to multiple additive mQTL (≥2), the adjusted association r2 of multiple mQTL was plotted against that of the most significant mQTL, for CEU (blue) and YRI (orange). (E) An example of local genetic epistasis showed the mean (points) and standard deviation (vertical lines) of M-values at cg00680696 according to genotypes at rs1050816 (x-axis) and rs4674390 (colored by black, blue and orange).

Association scans of SNPs <1 Mb away from target CpG sites revealed that mQTL with strong effects were predominantly located within the ±100 kb regions (Fig. 1C and D), on which our final analysis of local genetic regulation was focused. At 5% FDR, 72 526 associations for 5240 CpG sites were identified in the CEU (Supplementary Material, Table S1), and 55 066 associations for 7306 CpG sites in the YRI samples (Supplementary Material, Table S2), representing 1.8 and 2.6% of the analyzed CpG sites, respectively (Table 1). Combining the results from the two populations, which were similar in summary statistics, the association r2 ranged from 0.23 to 0.97 with a median of 0.35. Less than 13% of mQTL associated with multiple CpG sites, potentially reflecting regional cytosine modifications. CpG sites mapped to mQTL occurred more frequently in gene bodies than in CpG islands (2.0 versus 1.3% in CEU, 2.7 versus 1.6% in YRI; Supplementary Material, Table S3). mQTL detected in this study highly overlapped previously published mQTL in the CEU (14) and YRI (13,14) samples using the Illumina 27K array (>200-fold enrichment, see Supplementary Material, Table S4). Among the CpG sites mapped to mQTL, we chose three CpG sites and measured their modification levels using bisulfite sequencing (Supplementary Material, Fig. S2 and Table S5): SNP rs7641545 associated with modification levels of CpG site cg01566999 (P = 0.001) located in the gene body of NIPAL2; SNP rs11346248 associated with modification levels of cg11346248 (P = 0.02) located in the promoter region of SLFN12; SNP rs2238542 associated with modification levels of cg03233876 (P = 0.04) located in the gene body of BSG. The mQTL mapping results have been deposited into our integrated SCAN Database (32) (http://www.scandb.org) developed to provide novel annotations for genetic variants.

Table 1.

Summary of local mQTL, meQTL and CpG-gene expression correlations

Type of association Population No. of association No. of CpGa No. of Genea %CpGb %Geneb 
mQTL CEU 72 526 5240  1.8%  
YRI 55 066 7306  2.6%  
meQTL CEU 5472  390  4.5% 
YRI 2184  289  2.7% 
CpG-gene expression CEU 1277 1232 614 0.47% 3.7% 
YRI 1120 1090 535 0.41% 3.2% 
Type of association Population No. of association No. of CpGa No. of Genea %CpGb %Geneb 
mQTL CEU 72 526 5240  1.8%  
YRI 55 066 7306  2.6%  
meQTL CEU 5472  390  4.5% 
YRI 2184  289  2.7% 
CpG-gene expression CEU 1277 1232 614 0.47% 3.7% 
YRI 1120 1090 535 0.41% 3.2% 

aThe number of unique CpG sites (or genes) mapped.

bThe percentage of unique CpG sites (or genes) mapped among the unique CpG sites (or genes) analyzed.

As LCLs, especially the CEU cell lines, have been in culture for many years, there is a concern about the extent of genomic correlation between LCLs and primary lymphocytes. To address this, we compared the eQTL between LCLs and non-transformed peripheral blood samples (33). Local eQTL detected in peripheral blood samples were enriched by more than 2-fold with those detected in LCLs (Supplementary Material, Table S6). Furthermore, we compared the mQTL between LCLs and cerebellum samples from a Caucasian cohort using 27K array (11). We found >5-fold enrichment of local cerebellum mQTL with those detected in LCLs (Supplementary Material, Table S7). These observations demonstrated significant genomic correlations between LCLs and primary cells/tissues, suggesting that LCL is a biologically relevant cell model for human genetic studies.

Local polygenic control and genetic epistasis

Although the recognition motifs of cytosine modification enzymes, for instance, DNA methyltransferases, are simple, structural variations of the target DNA strands may generally modulate the enzyme–substrate interaction (34). Cytosine modification levels of a CpG site could therefore be affected by several local polymorphisms with distinct combinatory effects. We first assessed additive local polygenic control, assuming that polymorphisms contributed independently to the variation of cytosine modification levels. For each CpG site mapped to mQTL at 5% FDR, we fitted cytosine modification levels on allele dosages of all mQTL detected for the site, and defined the optimal number of mQTL by forward–backward model selection. We found that >30% of CpG sites were associated with multiple mQTL. For these CpG sites, however, the majority of the genetically explained variation was frequently attributed to one single major locus. Figure 1D shows the comparison of the adjusted association r2 between the most significant mQTL and multiple mQTL for these CpG sites, suggesting that, in general, additive polygenic control could be interpreted as the combinatorial effects of one major mQTL and several minor loci.

Polygenic control can be due to genetic epistasis as well, where polymorphisms at one site modify the effects of polymorphisms at the other sites. We searched for simple two-locus interactions within ±100 kb regions of target CpG sites. To be statistically sound, we focused on SNP pairs that were common, with minor allele frequency (MAF) >0.2 and genotype frequency >0.1, and were not in linkage disequilibrium (LD) of each other (r2 < 0.3). We detected a small number of CpG sites with local genetic interactions (104 in the CEU and 60 in the YRI samples). Figure 1E shows an interacting SNP pair, rs1050816 and rs4674390, in the YRI samples, which explained no variation in cytosine modification levels at CpG site cg00680696 by additive effect, whereas explained 47% of variation when allowing interaction effect. Large epistatic effect, however, was not common (Supplementary Material, Fig. S3). Several studies have observed prominent population specificity of mQTL effects between the CEU and YRI samples, and hypothesized that genetic epistasis could be one explanation (14,16). To test this hypothesis, we identified 6726 CpG sites mapped to common mQTL and 1216 CpG sites mapped to population-specific mQTL (Supplementary Material, Fig. S4; also see Materials and Methods). Compared with CpG sites mapped to common mQTL, CpG sites mapped to population-specific mQTL were enriched for CpG sites under epistatic regulation in the CEU samples by 9.4-fold (binomial test P < 4 × 10−29), suggesting that local genetic epistasis may contribute to population specificity of mQTL effects. Supplementary Material, Figure S5 illustrates two possible scenarios including population-specific epistasis and different genetic heterogeneity of the interacting loci.

Genetic and epigenetic dissection of gene expression variation

mQTL were more likely to associate with gene expression levels compared with local SNPs within the same regions. As shown in Figure 2A, local SNPs were enriched for eQTL in the CEU samples, due to the predominant contribution of cis-acting polymorphisms to gene expression variation (7), while mQTL were further enriched with eQTL compared with local SNPs. To specifically identify mQTL that were eQTL, we associated mQTL with gene expression levels if the mQTL were <100 kb away from the gene starts or gene stops. At 5% FDR, we detected associations of mQTL with expression traits of 390 genes in the CEU samples (Supplementary Material, Table S8) and 289 genes in the YRI samples (Supplementary Material, Table S9), representing 4.5 and 2.7% of the analyzed genes, respectively (Table 1). We referred to these expression associated mQTL as meQTL (cytosine modification-gene expression quantitative trait loci). CpG sites mapped to meQTL frequently fell in the target genes and the proximal regions, but also located within farther intergenic regions (Fig. 2B). The proportion of CpG sites that mapped to meQTL was not significantly different between gene bodies and CpG islands (13.0 versus 15.5% in CEU, 7.9 versus 6.5% in YRI; Supplementary Material, Table S3). The cytosine modification-gene expression correlation was significant at P< 0.05 for 66% of the identified CpG-gene pairs in the CEU samples, with the direction consistent with that deduced from SNP-cytosine modification and SNP-gene expression associations, indicating that the identified pairs were indeed enriched for genetically driven CpG-gene expression correlations.

Figure 2.

Genetic and epigenetic dissection of gene expression variation. (A) mQTL were enriched for eQTL. For genes containing mQTL within the ±100 kb regions, P-value distribution of associations between expression levels and mQTL (black) were compared with that between expression levels and all SNPs within the regions (gray). (B) Distribution of relative positions between CpG sites and the correlated genes for genetically dependent CpG-gene pairs. TSS: transcriptional start site; TES: transcriptional end site. (C) Epigenetic regulation of gene expression. For cytosine modification-gene expression correlations detected at 5% FDR, the association r2 between gene expression levels and residual cytosine modification levels was compared with the association r2 between gene expression levels and unadjusted cytosine modification levels. The majority of correlations, after adjusted for genetic variations, remained to be significant. (D) Distribution of relative positions between CpG sites and the correlated genes for genetically independent CpG-gene pairs. In (B) and (D), gray bars represent counts and step-wise lines represent proportions.

Figure 2.

Genetic and epigenetic dissection of gene expression variation. (A) mQTL were enriched for eQTL. For genes containing mQTL within the ±100 kb regions, P-value distribution of associations between expression levels and mQTL (black) were compared with that between expression levels and all SNPs within the regions (gray). (B) Distribution of relative positions between CpG sites and the correlated genes for genetically dependent CpG-gene pairs. TSS: transcriptional start site; TES: transcriptional end site. (C) Epigenetic regulation of gene expression. For cytosine modification-gene expression correlations detected at 5% FDR, the association r2 between gene expression levels and residual cytosine modification levels was compared with the association r2 between gene expression levels and unadjusted cytosine modification levels. The majority of correlations, after adjusted for genetic variations, remained to be significant. (D) Distribution of relative positions between CpG sites and the correlated genes for genetically independent CpG-gene pairs. In (B) and (D), gray bars represent counts and step-wise lines represent proportions.

Besides genetic variations, developmental and/or environmental perturbations can also induce cytosine modification changes, which in turn modulate gene expression. To identify such genetically independent CpG-gene expression correlations, we first associated expression levels of 16 468 genes with cytosine modification levels of 264K CpG sites that were <100 kb away from the gene starts or gene stops. At 5% FDR, we detected 1277 CpG sites associated with 614 genes in the CEU samples (Supplementary Material, Table S10), and 1120 CpG sites with 535 genes in the YRI samples (Supplementary Material, Table S11), representing 3.7 and 3.2% of the analyzed genes, respectively (Table 1). For these CpG-gene pairs, we then evaluated the extent of cytosine modification-gene expression correlation in the absence of genetic variation. We identified the most significant mQTL within ±1 Mb regions for each CpG site and subsequently regressed out the mQTL effect from the corresponding cytosine modification levels. The residual cytosine modification levels were further associated with gene expression levels. For 92% of the detected CpG-gene pairs in the CEU samples, the cytosine modification-gene expression correlations remained to be significant (Fig. 2C). One caveat of this approach is that we did not regress out genetic polymorphisms exhaustedly. However, considering the relatively simple genetic architecture of cytosine modifications, the 92% CpG-gene pairs should be enriched for non-genetic regulation. We will henceforth refer to these CpG sites as genetically independent CpG sites, while refer to the CpG sites mapped to meQTL as genetically dependent CpG sites. Compared with genetically dependent CpG sites, genetically independent CpG sites were highly enriched in the target genes with only a few located in farther intergenic regions (Fig. 2D). Similar distributions were observed in the YRI samples (Supplementary Material, Fig. S6). Figure 3 depicts our flow of analyses to define genetically dependent CpG sites and genetically independent CpG sites.

Figure 3.

A flowchart of analyses that define genetically dependent and genetically independent CpG-gene expression correlation. mQTL: SNPs whose genotypes associate with cytosine modification levels. meQTL: mQTL whose genotypes associate with gene expression levels. Genetically dependent CpG-gene expression correlation: cytosine modification and gene expression traits that are mapped to meQTL and that themselves are correlated at P < 0.05; the corresponding CpG site was referred to as ‘genetically dependent CpG site.’ Genetically independent CpG-gene expression correlation: cytosine modification and gene expression traits that are correlated at genome-wide significance, which remains significant after adjusting for genetic variation; the corresponding CpG site was referred to as ‘genetically independent CpG site.’ Facilitative CpG site: genetically dependent or genetically independent CpG site whose cytosine modification levels positively correlate with gene expression levels. Repressive CpG site: genetically dependent or genetically independent CpG site whose modification levels negatively correlate with gene expression levels.

Figure 3.

A flowchart of analyses that define genetically dependent and genetically independent CpG-gene expression correlation. mQTL: SNPs whose genotypes associate with cytosine modification levels. meQTL: mQTL whose genotypes associate with gene expression levels. Genetically dependent CpG-gene expression correlation: cytosine modification and gene expression traits that are mapped to meQTL and that themselves are correlated at P < 0.05; the corresponding CpG site was referred to as ‘genetically dependent CpG site.’ Genetically independent CpG-gene expression correlation: cytosine modification and gene expression traits that are correlated at genome-wide significance, which remains significant after adjusting for genetic variation; the corresponding CpG site was referred to as ‘genetically independent CpG site.’ Facilitative CpG site: genetically dependent or genetically independent CpG site whose cytosine modification levels positively correlate with gene expression levels. Repressive CpG site: genetically dependent or genetically independent CpG site whose modification levels negatively correlate with gene expression levels.

Genetic and non-genetic gene regulation mediated through cytosine modifications

To explore the potential mechanisms of gene regulation underlying genetically dependent and genetically independent CpG-gene expression correlations, as defined in the previous section, we characterized the co-localization of these CpG sites with transcription factor binding sites and histone modification markers obtained from the Encyclopedia of DNA Elements (ENCODE) data (35), on the CEU cell line NA12878. For each type of ENCODE peaks (72 canonical transcription factors and four well studied histone markers H3K4me3, H3K36me3, H3K9me3 and H3K27me3), we estimated null distributions of co-localization with CpG sites along positional bins of genes and flanking regions. Counts of CpG sites whose co-localization with the ENCODE peaks exceeded the 95% percentile of the null distributions were displayed in Figure 4 for the CEU samples. Genetically dependent CpG sites were significantly enriched for transcription factor binding sites (Fig. 4A). However, CpG sites whose modification levels negatively correlated with gene expression levels (repressive sites) predominantly overlapped with H3K4me3 peaks (Fig. 4B, lower panel) that signature active promoters, while CpG sites whose modification levels positively correlated with gene expression levels (facilitative sites) showed few overlaps with H3K4me3 peaks (Fig. 4B, upper panel), even when the threshold was relaxed to 50% percentile of the null distributions (Supplementary Material, Fig. S7).

Figure 4.

Genetic and non-genetic gene regulation mediated through cytosine modifications. Co-localization of genetically dependent (A and B) and genetically independent (C and D) CpG sites detected in the CEU samples with transcription factor binding sites (A and C) and histone markers (B and D). Facilitative sites were plotted in upper panels and repressive sites were plotted in lower panels. Counts of co-localization were represented by bars, for each type of peaks that exceeded 95% percentile of the corresponding null distributions. The aggregated counts of co-localization with transcription factor binding sites were colored by orange, while counts of co-localization with H3K4me3, H3K36me3, H3K9me3 and H3K27me3 peaks were colored by black, red, green and blue, respectively. Total counts of the repressive or facilitative sites were displayed as step-wise lines.

Figure 4.

Genetic and non-genetic gene regulation mediated through cytosine modifications. Co-localization of genetically dependent (A and B) and genetically independent (C and D) CpG sites detected in the CEU samples with transcription factor binding sites (A and C) and histone markers (B and D). Facilitative sites were plotted in upper panels and repressive sites were plotted in lower panels. Counts of co-localization were represented by bars, for each type of peaks that exceeded 95% percentile of the corresponding null distributions. The aggregated counts of co-localization with transcription factor binding sites were colored by orange, while counts of co-localization with H3K4me3, H3K36me3, H3K9me3 and H3K27me3 peaks were colored by black, red, green and blue, respectively. Total counts of the repressive or facilitative sites were displayed as step-wise lines.

Similar to the genetically dependent repressive sites, genetically independent repressive sites were significantly enriched for transcription factor binding sites (Fig. 4C, lower panel) and H3K4me3 peaks (Fig. 4D, lower panel) in the 5′ end of genes. In contrast, genetically independent facilitative sites showed little enrichment for transcription factor binding sites (Fig. 4C, upper panel), but significantly overlapped with H3K27me3 peaks in gene boundaries, which mark polycomb repressive complex 2 (PRC2) targeting (Fig. 4D, upper panel). Interestingly, H3K9me3 peaks that mark heterochromatin and repetitive elements were significantly enriched in genetically independent repressive and facilitative sites, in the 5′ end of genes and gene boundaries, respectively (Fig. 4D). Similar patterns were observed in the YRI samples (Supplementary Material, Fig. S8).

We reasoned that as chromatin accessibility to transcription apparatus is a global feature, whereas transcription factor binding is gene-specific, the strength of correlation between absolute cytosine modification levels and absolute gene expression levels across CpG-gene pairs may reflect the underlying regulatory mechanism of the CpG sites. The correlation would be strong if CpG sites modulate the accessibility of chromatin to transcription apparatus, but would be weak if CpG sites affect expression levels by interfering with transcription factor binding. Compared with genetically dependent CpG sites, genetically independent CpG sites showed stronger correlations of absolute cytosine modification levels with absolute gene expression levels across CpG-gene pairs (Spearman's ρ = −0.24 versus ρ = −0.06 for repressive sites; ρ = 0.29 versus ρ = 0.11 for facilitative sites), suggesting that genetically independent CpG sites were more likely to modulate chromatin accessibility. Such influence on chromatin was highly localized, as only 3% of genetically independent CpG sites correlated (across samples) with expression levels of multiple (≥2) genes, compared with 16% of genetically dependent sites. Notably, none of the genetically independent CpG sites, whereas 25 of the genetic dependent CpG sites, correlated with expressions of multiple genes in opposite directions (test of equal proportions, P = 0.001).

Annotations of GWAS loci by mQTL and eQTL

GWAS to date have generated a large number of disease/trait associations for which biological interpretation is still lacking. Our previous study has demonstrated that GWAS loci were enriched for eQTL (8), suggesting that linking trait-associated loci with intermediate phenotypes such as gene expression likely provided further insights into GWAS results (36). To assess the extent of overrepresentation of mQTL or eQTL in GWAS loci, we obtained 4332 unique SNPs from the National Human Genome Research Institute (NHGRI) GWAS Catalog which had a nominal association P-value < 5 × 10−8 (37). We estimated null distributions of the number of overlaps with GWAS SNPs, by sampling random SNPs based on the number and MAF spectrum of mQTL or eQTL as described previously (8). As the GWAS have, in general, been carried out in Caucasian samples, we used mQTL or eQTL detected in the CEU samples. As shown in Figure 5, the number of overlap with GWAS SNPs was significantly greater than the null for both mQTL (P < 0.0001, Fig. 5A) and eQTL (P < 0.0001, Fig. 5B), demonstrating that GWAS loci were indeed enriched for functional SNPs associated with cytosine modification and gene expression, two intermediate phenotypes.

Figure 5.

Annotation of GWAS loci by mQTL and eQTL. Enrichment of mQTL (A) and eQTL (B) for GWAS SNPs. The null distributions of the number of random SNPs that overlapped with GWAS SNPs were displayed as histograms. The red points mark the number of mQTL and eQTL that overlapped with GWAS SNPs. (C) Enrichment of mQTL and eQTL in GWAS disease categories. The null distributions of the number of random GWAS SNPs annotated as mQTL (left sides) and eQTL (right sides) were displayed as gray points, with darkness representing density of points. The red and green points mark the numbers of GWAS SNPs annotated as mQTL or eQTL for the given disease categories, respectively. Cardiovas.: cardiovascular; neuro.: neurological; psychia.: psychiatric; inflamm.: inflammatory.

Figure 5.

Annotation of GWAS loci by mQTL and eQTL. Enrichment of mQTL (A) and eQTL (B) for GWAS SNPs. The null distributions of the number of random SNPs that overlapped with GWAS SNPs were displayed as histograms. The red points mark the number of mQTL and eQTL that overlapped with GWAS SNPs. (C) Enrichment of mQTL and eQTL in GWAS disease categories. The null distributions of the number of random GWAS SNPs annotated as mQTL (left sides) and eQTL (right sides) were displayed as gray points, with darkness representing density of points. The red and green points mark the numbers of GWAS SNPs annotated as mQTL or eQTL for the given disease categories, respectively. Cardiovas.: cardiovascular; neuro.: neurological; psychia.: psychiatric; inflamm.: inflammatory.

We further carried out a systematic annotation of these GWAS SNPs, by associating them with cytosine modification levels and gene expression levels in the CEU samples. Here, only the GWAS SNPs < 100 kb away from CpG sites or gene starts/stops were considered. At 5% FDR, we detected 361 unique GWAS SNPs, covering 193 (34%) unique diseases/traits, as mQTL associated with cytosine modification levels of 316 CpG sites (Supplementary Material, Table S12). In addition, we detected 144 GWAS SNPs, covering 83 (15%) diseases/traits, as eQTL associated with expression levels of 92 genes (Supplementary Material, Table S13). Overall, 11% of the unique GWAS SNPs were annotated, with 78% being mQTL and 31% being eQTL, covering 37% of the investigated diseases/traits. The annotated GWAS loci shed light into the genetic associations. In Table 2, we present diseases/traits whose GWAS loci were simultaneously mapped to mQTL and eQTL. The annotations frequently narrowed down to one of the candidate genes reported by the original GWAS, but occasionally suggested alternative or additional mechanisms. For example, SNPs associated with progressive supranuclear palsy (38), Parkinson's disease (39) and interstitial lung disease (40) were thought to alter MAPT (microtubule-associated protein tau) function. We found that these SNPs did associate with cytosine modifications at CpG sites within MAPT, but also with expression of the downstream KANSL1 (KAT8 regulatory NSL complex subunit 1) gene, suggesting more complicated mechanisms underlying the disease associations, as pointed out by a previous study (41).

Table 2.

The GWAS loci annotated by both mQTL and eQTL

Category Disease/trait GWAS SNP Associated CpG sites (−log10P)a Associated genes (−log10P)b 
Autoimmune disorder Rheumatoid arthritis rs6457620 cg23464743 (6.1) HLA-DQA2 (3.6) 
Rheumatoid arthritis rs6457617 cg23464743 (6.1) HLA-DQA2 (3.6) 
Systemic sclerosis 
Graves’ disease 
Rheumatoid arthritis rs2736340 cg21701351 (4.3) BLK (4.5), FAM167A (3.9) 
Kawasaki disease 
Crohn's disease rs3764147 cg00423124 (6.0) LACC1 (6.6) 
Ulcerative colitis rs798502 cg01139696 (4.0), cg08377331 (8.5), cg15092213 (5.0), cg17393140 (10), cg24648384 (12) GNA12 (4.7) 
Systemic lupus erythematosus rs13277113 cg21701351 (4.1) BLK (4.3) 
Systemic lupus erythematosus rs2618476 cg21701351 (4.0) BLK (4.0), FAM167A (4.1) 
Type 1 diabetes rs2647044 HLA-DQB2 (3.2)  
Inflammatory bowel disease rs2382817 cg00012203 (5.5), cg05991184 (9.5) TMBIM1 (5.1) 
Bacterial infection Helicobacter pylori serologic status rs10004195 cg09316306 (4.1) TLR6 (4.6) 
Leprosy rs3764147 cg00423124 (6.0) LACC1 (6.6) 
Cancer Hepatocellular carcinoma (hepatitis B virus related) rs9275319 cg04418355 (4.2), cg14255617 (4.9) HLA-DQA2 (3.6) 
Multiple myeloma (hyperdiploidy) rs2272007 cg05589743 (5.1) ULK4 (9.3) 
Multiple myeloma rs1052501 cg05589743 (5.1) ULK4 (9.3) 
Nasopharyngeal carcinoma rs3129055 cg02157626 (4.1), cg02355653 (6.4), cg03449857 (6.4), cg06458771 (4.6), cg13835168 (4.0), cg15570656 (6.8), cg15708526 (3.9), cg16994534 (5.2), cg24100841 (5.3), cg24444631 (4.2), cg26021304 (10), cg27230769 (5.3) ZFP57 (4.8) 
Prostate cancer rs130067 cg07414487 (7.1), cg21334609 (4.4) CCHCR1 (4.6), TCF19 (7.3) 
Wilms tumor rs3755132 cg12888861 (4.6), cg24674087 (5.1) DDX1 (5.0) 
Chronic inflammatory disease COPD-related biomarkers rs2074488 cg04710276 (5.2), cg18923388 (3.9) HLA-B (3.7) 
COPD-related biomarkers rs1265093 cg07414487 (13) TCF19 (4.3) 
Asthma rs9273349 cg19864342 (4.2), cg23464743 (7.5) HLA-DQB2 (6.0) 
Coagulation End-stage coagulation rs10665, rs2181540, rs6041 cg19086603 (19, 19, 22) LOC100506079 (4.1, 3.5, 3.2) 
Prothrombin time rs561241 cg19086603 (23) LOC100506079 (3.2) 
Factor VII 
Metabolism syndrome related Palmitoleic acid (16:1n-7) plasma levels rs11190604 cg07080220 (4.4), cg07570687 (4.7) SEC31B (3.5) 
Phospholipid levels (plasma) rs1077989 cg16224230 (4.8) TMEM229B (3.9) 
Metabolic syndrome rs782590 cg18811423 (4.2) PNPT1 (4.8) 
Body mass index rs4771122 cg22138327 (8.9) GTF3A (4.6) 
Neurological disorder Progressive supranuclear palsy rs8070723 cg07368061 (4.9), cg18228076 (7.3) KANSL1 (5.2) 
Parkinson's disease 
Other Height rs1182188 cg01139696 (5.5), cg08377331 (11), cg15092213 (4.6), cg24648384 (14) GNA12 (5.8) 
Height rs798489 cg01139696 (4.6), cg08377331 (10), cg15092213 (4.5), cg17393140 (9.1), cg24648384 (15) GNA12 (4.5) 
Height rs2665838 cg23090583 (4.6) FTSJ3 (3.9) 
Height rs6457620 cg23464743 (6.1) HLA-DQA2 (3.6) 
Height rs12694997 cg17240976 (33) FARP2 (3.4) 
Height rs2941551 cg23090583 (4.6) FTSJ3 (4.0) 
 Diastolic blood pressure rs9815354 cg05589743 (4.3) ULK4 (9.0) 
Menopause (age at onset) rs2303369 cg04845466 (4.2), cg12000995 (7.2), cg12648201 (6.8), cg17158414 (4.0), cg24768116 (4.5) NRBP1 (3.4) 
Liver enzyme levels (gamma-glutamyl transferase) rs2739330 cg04234412 (9.2), cg09033563 (6.8), cg17293426 (4.1), cg18538332 (4.2), cg24846343 (4.9) GSTT1 (9.7) 
Metabolic traits rs2403254 cg24872173 (5.6) GTF2H1 (4.4) 
Mean corpuscular hemoglobin concentration rs4580814 cg04380229 (6.0), cg04467119 (4.2), cg08344943 (3.9), cg08702805 (4.8), cg12199905 (4.0), cg22955595 (4.3), cg25388971 (4.5), cg25970471 (4.1) SLC12A7 (5.8) 
Response to temozolomide rs477692 cg05714579 (6.8) MGMT (3.3) 
Renal function-related traits (BUN) rs11123170 cg07594247 (5.0), cg17445212 (5.0), cg23564664 (5.0) PAX8 (4.8) 
Multiple sclerosis (OCB status) rs3129720 cg23464743 (6.9) HLA-DQB2 (5.6) 
Multiple sclerosis rs2523393 cg02355653 (11), cg15570656 (4.3), cg17221604 (4.2), cg25046571 (5.8), cg26021304 (7.3), cg27230769 (24) HLA-F (3.3), ZFP57 (3.9) 
Interstitial lung disease rs1981997 cg07368061 (4.9), cg18228076 (7.3), cg24801230 (11) KANSL1 (5.2) 
Category Disease/trait GWAS SNP Associated CpG sites (−log10P)a Associated genes (−log10P)b 
Autoimmune disorder Rheumatoid arthritis rs6457620 cg23464743 (6.1) HLA-DQA2 (3.6) 
Rheumatoid arthritis rs6457617 cg23464743 (6.1) HLA-DQA2 (3.6) 
Systemic sclerosis 
Graves’ disease 
Rheumatoid arthritis rs2736340 cg21701351 (4.3) BLK (4.5), FAM167A (3.9) 
Kawasaki disease 
Crohn's disease rs3764147 cg00423124 (6.0) LACC1 (6.6) 
Ulcerative colitis rs798502 cg01139696 (4.0), cg08377331 (8.5), cg15092213 (5.0), cg17393140 (10), cg24648384 (12) GNA12 (4.7) 
Systemic lupus erythematosus rs13277113 cg21701351 (4.1) BLK (4.3) 
Systemic lupus erythematosus rs2618476 cg21701351 (4.0) BLK (4.0), FAM167A (4.1) 
Type 1 diabetes rs2647044 HLA-DQB2 (3.2)  
Inflammatory bowel disease rs2382817 cg00012203 (5.5), cg05991184 (9.5) TMBIM1 (5.1) 
Bacterial infection Helicobacter pylori serologic status rs10004195 cg09316306 (4.1) TLR6 (4.6) 
Leprosy rs3764147 cg00423124 (6.0) LACC1 (6.6) 
Cancer Hepatocellular carcinoma (hepatitis B virus related) rs9275319 cg04418355 (4.2), cg14255617 (4.9) HLA-DQA2 (3.6) 
Multiple myeloma (hyperdiploidy) rs2272007 cg05589743 (5.1) ULK4 (9.3) 
Multiple myeloma rs1052501 cg05589743 (5.1) ULK4 (9.3) 
Nasopharyngeal carcinoma rs3129055 cg02157626 (4.1), cg02355653 (6.4), cg03449857 (6.4), cg06458771 (4.6), cg13835168 (4.0), cg15570656 (6.8), cg15708526 (3.9), cg16994534 (5.2), cg24100841 (5.3), cg24444631 (4.2), cg26021304 (10), cg27230769 (5.3) ZFP57 (4.8) 
Prostate cancer rs130067 cg07414487 (7.1), cg21334609 (4.4) CCHCR1 (4.6), TCF19 (7.3) 
Wilms tumor rs3755132 cg12888861 (4.6), cg24674087 (5.1) DDX1 (5.0) 
Chronic inflammatory disease COPD-related biomarkers rs2074488 cg04710276 (5.2), cg18923388 (3.9) HLA-B (3.7) 
COPD-related biomarkers rs1265093 cg07414487 (13) TCF19 (4.3) 
Asthma rs9273349 cg19864342 (4.2), cg23464743 (7.5) HLA-DQB2 (6.0) 
Coagulation End-stage coagulation rs10665, rs2181540, rs6041 cg19086603 (19, 19, 22) LOC100506079 (4.1, 3.5, 3.2) 
Prothrombin time rs561241 cg19086603 (23) LOC100506079 (3.2) 
Factor VII 
Metabolism syndrome related Palmitoleic acid (16:1n-7) plasma levels rs11190604 cg07080220 (4.4), cg07570687 (4.7) SEC31B (3.5) 
Phospholipid levels (plasma) rs1077989 cg16224230 (4.8) TMEM229B (3.9) 
Metabolic syndrome rs782590 cg18811423 (4.2) PNPT1 (4.8) 
Body mass index rs4771122 cg22138327 (8.9) GTF3A (4.6) 
Neurological disorder Progressive supranuclear palsy rs8070723 cg07368061 (4.9), cg18228076 (7.3) KANSL1 (5.2) 
Parkinson's disease 
Other Height rs1182188 cg01139696 (5.5), cg08377331 (11), cg15092213 (4.6), cg24648384 (14) GNA12 (5.8) 
Height rs798489 cg01139696 (4.6), cg08377331 (10), cg15092213 (4.5), cg17393140 (9.1), cg24648384 (15) GNA12 (4.5) 
Height rs2665838 cg23090583 (4.6) FTSJ3 (3.9) 
Height rs6457620 cg23464743 (6.1) HLA-DQA2 (3.6) 
Height rs12694997 cg17240976 (33) FARP2 (3.4) 
Height rs2941551 cg23090583 (4.6) FTSJ3 (4.0) 
 Diastolic blood pressure rs9815354 cg05589743 (4.3) ULK4 (9.0) 
Menopause (age at onset) rs2303369 cg04845466 (4.2), cg12000995 (7.2), cg12648201 (6.8), cg17158414 (4.0), cg24768116 (4.5) NRBP1 (3.4) 
Liver enzyme levels (gamma-glutamyl transferase) rs2739330 cg04234412 (9.2), cg09033563 (6.8), cg17293426 (4.1), cg18538332 (4.2), cg24846343 (4.9) GSTT1 (9.7) 
Metabolic traits rs2403254 cg24872173 (5.6) GTF2H1 (4.4) 
Mean corpuscular hemoglobin concentration rs4580814 cg04380229 (6.0), cg04467119 (4.2), cg08344943 (3.9), cg08702805 (4.8), cg12199905 (4.0), cg22955595 (4.3), cg25388971 (4.5), cg25970471 (4.1) SLC12A7 (5.8) 
Response to temozolomide rs477692 cg05714579 (6.8) MGMT (3.3) 
Renal function-related traits (BUN) rs11123170 cg07594247 (5.0), cg17445212 (5.0), cg23564664 (5.0) PAX8 (4.8) 
Multiple sclerosis (OCB status) rs3129720 cg23464743 (6.9) HLA-DQB2 (5.6) 
Multiple sclerosis rs2523393 cg02355653 (11), cg15570656 (4.3), cg17221604 (4.2), cg25046571 (5.8), cg26021304 (7.3), cg27230769 (24) HLA-F (3.3), ZFP57 (3.9) 
Interstitial lung disease rs1981997 cg07368061 (4.9), cg18228076 (7.3), cg24801230 (11) KANSL1 (5.2) 

The −log10P was displayed within parentheses for associations.

aBetween the cytosine modification levels of the CpG site and the GWAS SNP.

bBetween the gene expression levels of the gene and the GWAS SNP. COPD, chronic obstructive pulmonary disease.

The GWAS Catalog includes a range of diseases for which pathogenesis stemmed from distinct tissues and cell types, whereas our mQTL and eQTL were detected from LCLs. To examine the cell lineage specificity of mQTL and eQTL in disease determination, we looked at the types of diseases preferentially associated with mQTL or eQTL. We classified GWAS diseases/traits that have clear etiologies to seven major categories (Fig. 5C), and estimated the null distributions of the number of random GWAS SNPs (nominal association P-value < 5 × 10−8) annotated as mQTL or eQTL for each of the disease categories. As shown in Figure 5C, chronic inflammatory diseases and autoimmune disorders were enriched for GWAS loci annotated as mQTL (P = 0.005 and P = 0.002, respectively) as well as loci annotated as eQTL (P < 0.0001 and P = 0.03, respectively). Psychiatric disorders and cancers were relatively enriched for GWAS loci annotated as mQTL (P = 0.09 and P = 0.1, respectively) compared with eQTL (P = 0.8), suggesting both common and specific disease enrichment between mQTL and eQTL (Supplementary Material, Table S14).

DISCUSSION

We assessed the genetic architecture of cytosine modifications for >280K autosomal CpG sites in two population samples. Our observations suggest that variation in local sequence context was the primary determinant of interindividual cytosine modification differences. Such local genetic regulation appeared to be simple, in that the majority of CpG sites were controlled by a single major mQTL, with 30% of CpG sites further modulated by additional mQTL that had relatively small combinatory effects. The impact of local genetic epistasis on within-population variation was also moderate, judged by the effect size of genetic interactions for the small number of detectable CpG sites (Supplementary Material, Fig. S3). Our analysis, however, did reveal a >9-fold enrichment of CpG sites mapped to population-specific mQTL for CpG sites under local epistatic regulation, suggesting that local genetic epistasis was an important contributor to the population specificity of mQTL effects observed previously (14,16). The paucity of trans regulatory loci and the lack of master regulators found in this study could be interpreted by less complex genetic regulation for cytosine modification variations, but could also be due to a lack of power to detect the trans effects.

Our genetic and epigenetic dissection of gene expression variations revealed several interesting patterns. Genetically dependent CpG sites strongly co-localized with binding sites of canonic transcription factors. However, the repressive sites predominantly overlapped with H3K4me3 peaks at active promoters (42) while the facilitative sites overlapped few of such peaks. As the gene expression levels estimated in our study represented the averages across all exons annotated for the given reference genes, these observations suggest that greater levels of cytosine modification at active promoters suppress overall gene expression but at inactive promoters enhance overall gene expression. Cytosine modifications at transcribing promoters have been experimentally shown to suppress gene expression by disruption of transcription factor binding (43). We reasoned that cytosine modifications at inactive promoters may suppress incompetent transcriptional initiation, thereby facilitating the overall efficiency of transcriptional elongation. Indeed, a large proportion of the facilitative sites overlapped with H3K36me3 peaks (Supplementary Material, Fig. S7) that mark transcriptional elongation (44,45). Many genetically dependent CpG sites located in intergenic regions, which is consistent with recent discoveries of intergenic CpG islands that associate with development-specific promoters (46).

Genetically independent CpG sites preferentially modulate gene expression through local chromatin accessibility. We found that the facilitative sites significantly co-localized with repressive H3K27me3 and H3K9me3 peaks around gene ends. Negative correlation between DNA methylation and H3K27me3 was observed genome wide (47,48). Profiling of H3K27me3 in DNA hypomethylated somatic cells further suggested that DNA methylome restricted PRC2 targeting (49). The positive interindividual correlations between gene expressions and cytosine modifications at H3K27me3 peaks are therefore consistent with previous observations. Expanding of repressive H3K27me3 and H3K9me3 blocks is the most prominent changes observed in lineage-committed human cells compared with pluripotent cells (50), which may explain why 73% of genetically independent CpG sites were facilitative sites. The co-localization of genetically independent repressive sites with H3K9me3 and H3K4me3 peaks may reflect the recruitment or exclusion of cytosine modifications at the already silenced or activated promoters, respectively, marked by the histone modifications (51,52).

We found that mQTL and eQTL were relatively enriched in GWAS SNPs associated with chronic inflammatory and autoimmune disorders (Supplementary Material, Table S14), two disease categories related to the functions of lymphocytes from which LCLs were derived. This observation demonstrated the cell linage specificity of cytosine modifications and gene expressions in determining disease outcomes. On the other hand, mQTL were relatively enriched in GWAS SNPs associated with psychiatric disorders and cancers compared with eQTL (Supplementary Material, Table S14). Cytosine modifications can be inherited over many cell generations (53). It is therefore likely that genetically regulated cytosine modifications, together with potential variations induced by mQTL × environment interaction, are more stably maintained over development than gene expression variations.

Genetic associations based on the 27K array data detected a very small number of mQTL. For example, only 37 CpG sites were mapped to mQTL in the YRI samples (13) and the numbers of CpG sites mapped to mQTL in the CEU and YRI samples in another LCL study were in a similar scale (14). Besides the 17× coverage difference, another cause for the discrepancy between results obtained from the 27K and 450K array data is the distinct genomic regions covered by the two platforms. The 27K array primarily interrogates promoter CpG sites, whereas the 450K array also interrogates a large number of CpG sites within gene bodies. As indicated by our study, the proportion of CpG sites mapped to mQTL is significantly lower in CpG islands compared with gene bodies. Different functional constrains on cytosine modifications and/or different mQTL detection sensitivity for CpG sites between these regions may contribute to the observation. Relatively comprehensive mQTL lists (including our LCL mQTL lists) will facilitate discovery and across-study comparisons (e.g. the correlation between mQTL and eQTL, and the correlation of mQTL across cell/tissue types).

Mapping of mQTL and eQTL aids in the study of complex traits in two ways. As shown in Table 2, annotations of known GWAS SNPs by mQTL and eQTL may help identify potential causal genes and often provide novel biological insights. Furthermore, in rare disease mapping with limited sample size, prioritization of SNPs with strong genetic effect and clear functional interpretation such as mQTL and eQTL could effectively reduce the burden of multiple comparisons and improve the power of the study. mQTL detected here have been deposited to the SCAN Database (32), which has gained >2 million queries from >56 000 unique IP addresses since 2009. We expect that mQTL and eQTL mapping studies using diverse cell types and treatment conditions will greatly expand our understanding of complex traits in general.

MATERIALS AND METHODS

Cell lines

Genomic DNA samples for 60 unrelated CEU (phase II) and 73 unrelated YRI (58 phase II plus 15 phase III samples) HapMap LCLs were purchased from Coriell Institute for Medical Research (Camden, NJ, USA). The cell line identities were confirmed by genotyping 47 SNPs from the Sequenom iPLEX Sample ID Plus Panel in 24 randomly chosen LCLs maintained by the Pharmacogenetics of Anticancer Agents Research Group Cell Core at The University of Chicago.

Cytosine modification data processing and validation

Cytosine modification levels were profiled with the 450K array (Illumina, Inc., San Diego, CA, USA), as described in a previous publication (16). Briefly, 150 ng DNA after bisulfite conversion was obtained for each sample, randomized by population identity and run on the 450K array plates with the Illumina HiScan System. We excluded CpG probes ambiguously mapped to the human genome (31) and CpG probes containing common SNPs (MAF >0.01) based on dbSNP (54) v135. The final dataset is comprised of 283 540 autosomal CpG sites with good hybridization quality (16). M-values, defined as the log2 ratio of the intensities of modified probe versus unmodified probe, were quantile normalized (55) across 133 samples, and adjusted for batch effect (56). Cytosine modification levels at selected CpG sites were validated by bisulfate sequencing (16). The 450K array data for three HapMap samples, NA12891, NA12892 and NA19239, highly correlated with the ENCODE (35) 450K array data (r: 0.95–0.99). The raw cytosine modification data have been deposited into the NCBI Gene Expression Omnibus (GEO Accession Number: GSE39672).

Genotype and gene expression data

SNP genotypes were downloaded from the International HapMap Project (57) (release 27 July 2009). SNPs that deviated from the Hardy–Weinberg equilibrium (P < 0.0001) within a population were removed. The analytic set was comprised of >2 million common SNPs with MAF > 0.05 that were genotyped in at least 48 samples in each population. Imputed genotypes were only used in multilocus model comparisons. The Affymetrix Human Exon Array 1.0ST was previously used to profile gene expression in the CEU and YRI samples (GEO Accession Number: GSE9703) (6). Probe sequences were aligned to human genome (GRCh37) allowing ≤1 miss match. Probes with perfect, unique alignments were further filtered by excluding probes containing common SNPs (MAF > 0.05) based on dbSNP (54) v135 and probes interrogating multiple genes. Probe intensities were log2 transformed, background corrected (58) and quantile normalized. Probe intensity was subtracted by the corresponding probe mean across samples. Gene-level expression intensities were summarized as mean probe intensity within each gene. Genes for which 90% samples had absolute expression levels less than five percentile of all expression levels were further excluded. In total, 16 476 autosomal genes with unique Entrez Gene IDs were included in the analyses. The expression levels of several selected genes that correlated with cytosine modification levels have been validated in the LCLs using quantitative RT–PCR (16).

Genome-wide and local scan of mQTL

In local scans, the top five principal components (PCs) estimated from M-values across CpG sites were regressed out for the 60 CEU samples, to account for potential confounding variables, whereas the top two PCs were regressed out for the 73 YRI samples. The number of PCs regressed out was selected to achieve the greatest detection sensitivity in each population. Cytosine modification levels were regressed on SNP allele dosages within each population. For local scans, FDR was estimated by 100 permutations. For whole-genome scan, FDR was estimated by three permutations in CEU due to the large amount of computation involved. The numbers of random associations called at a range of P-value thresholds were highly consistent across the three permutations. Furthermore, the number of significant calls at permutation-based FDR of 0.05 was close to the number of calls at Bonferroni-adjusted P = 0.05. We therefore used Bonferroni correction, corresponding to nominal P = 10−13, for both CEU and YRI samples in whole-genome scan.

Local genetic epistasis

SNPs with MAF > 0.2 and genotype frequency >0.1 within population were selected for analysis. For each CpG site, SNP pairs located within ±100 kb of the target cytosine were further selected so that LD between the two SNPs of a pair was constrained at r2 < 0.3. A linear model: cytosine modification levels ∼SNP1 + SNP2 + SNP1 × SNP2 + error was used to test for epistasis assuming an additive genetic mode. To avoid model collinearity, the correlation between the interaction term and each of the additive terms was constrained at r2 < 0.7. The significance of the interaction effect was estimated by an F-test that compared the full model with a reduced model without the interaction term. FDR was estimated by 100 permutations using an approximate test (59).

Common and population-specific mQTL

We pooled the CEU and YRI samples to test for common and population-specific mQTL. To avoid associations driven by one population, SNPs with MAF > 0.2 within each population were selected for analysis. A linear model: cytosine modification levels ∼population identity + SNP dosage + error was used to test for common genotype effect. A linear model: cytosine modification levels ∼population identity + SNP dosage + population identity × SNP dosage + error was used to test for population-specific genotype effect. The significance of common genotype effect and population-specific genotype effect was estimated by an F-test comparing a full model with a reduced model without SNP dosage term or without population identity × SNP dosage term, respectively. FDR was estimated by 100 permutations with an approximate test (59).

CpG-gene correlation and local scan of eQTL

We evaluated the within-population correlation between gene expression levels and cytosine modification levels of CpG sites that were <100 kb away from gene starts and gene stops. FDR was evaluated by 100 permutations. We mapped eQTL in 58 CEU and 59 YRI samples that overlapped with samples used in mQTL mapping. SNPs < 100 kb away from gene starts and gene stops were tested. The top 11 PCs estimated from gene expression levels across genes were regressed out for each population samples. Gene expression levels were regressed on SNP allele dosages within population. FDR was estimated by 100 permutations. To test for correlation between absolute cytosine modification levels and absolute gene expression levels across CpG-gene pairs for genetically dependent and genetically independent CpG sites, mean β values across samples and mean expression levels across samples were used.

Co-localization of CpG sites with the ENCODE peaks

We obtained uniformly processed narrow peaks for transcription factor binding sites and broad peaks for histone markers of cell line NA12878 from the ENCODE project (35). Peaks for each of the canonical transcription factors and histone modification markers were examined individually. We mapped all analyzed CpG sites to positional bins including 5 kb bins along the upstream 100 kb from transcriptional start site, 5′UTR, 10 percentile-bin along coding region, 3′UTR and 5 kb bins along the downstream 100 kb from transcriptional end site. To estimate null distributions, we randomly sampled 1000 times the same number of CpG-gene pairs as the number of true CpG-gene expression correlations, matching the positional bin distribution, and counted the number of CpG sites co-localized with the peaks for the given marker at each bin. The true numbers of co-localization were then compared with the 95 percentile thresholds of the null distributions.

Enrichment of GWAS SNPs in mQTL and eQTL

The NHGRI GWAS Catalog (37) has collected published GWAS results for complex traits including risks for common diseases. We obtained the NHGRI SNPs (released in November 6, 2013) reported to be associated with complex traits with a P-value of <5 × 10−8. To assess the enrichment of GWAS SNPs for mQTL and eQTL, we generated null distributions of the number of SNPs overlapping with GWAS SNPs, by random sampling SNPs 10 000 times according to the number and MAF distribution of mQTL and eQTL, respectively. The number of mQTL or eQTL overlapping with GWAS SNPs was then compared with the null distribution. Due to the relatively small number of GWAS SNPs selected at P-value < 5 × 10−8, mQTL and eQTL without LD pruning were used in the enrichment analyses present in Figure 5A and B. However, the enrichments observed were not due to LD, as using mQTL and eQTL pruned by r2 > 0.3 resulted in similar trend: P = 0.0002 for mQTL overlapping and P < 0.0001 for eQTL overlapping. To annotate GWAS SNPs as mQTL or eQTL, GWAS SNPs were associated with cytosine modification levels or gene expression levels in the CEU samples, if they were located within the regions ±100 kb of analyzed CpG sites or genes, respectively. FDR was estimated by 100 permutations. To assess the enrichment of mQTL and eQTL in disease categories, we again generated null distributions for each of the disease categories. For a given disease category, we randomly sampled GWAS SNPs 10 000 times according to the number and MAF distribution of the annotated mQTL or eQTL in that disease category. The true number of mQTL or eQTL annotated to the given disease category was then compared with the null distribution.

Bisulfite sequencing validation

Three CpG sites with local mQTL were selected for bisulfite sequencing validation (Supplementary Material, Table S5). Bisulfite sequencing at these loci was performed on 10–24 random samples from the original DNA collection used in the 450K array profiling. ZymoTaq polymerase (Zymo Research, Irvine, CA, USA) and primers listed in Supplementary Material, Table S15 were used to amplify ∼250 bp fragments that included the CpG sites interrogated by the 450K array as well as adjacent CpGs.

SUPPLEMENTARY MATERIAL

Supplementary Material is available at HMG online.

FUNDING

This work was supported by National Institutes of Health (R21HG006367 to W.Z., L.A.G. and M.E.D.; U01GM061393 to M.E.D.).

ACKNOWLEDGEMENTS

We are grateful to Jennifer McQuade and Jamie Myers for technical support.

Conflict of Interest statement. None declared.

REFERENCES

1
HapMap
The international HapMap project
Nature
 
2003
426
789
796
2
HapMap
A haplotype map of the human genome
Nature
 
2005
437
1299
1320
3
Morley
M.
Molony
C.M.
Weber
T.M.
Devlin
J.L.
Ewens
K.G.
Spielman
R.S.
Cheung
V.G.
Genetic analysis of genome-wide variation in human gene expression
Nature
 
2004
430
743
747
4
Stranger
B.E.
Forrest
M.S.
Dunning
M.
Ingle
C.E.
Beazley
C.
Thorne
N.
Redon
R.
Bird
C.P.
de Grassi
A.
Lee
C.
et al.  
Relative impact of nucleotide and copy number variation on gene expression phenotypes
Science
 
2007
315
848
853
5
Spielman
R.S.
Bastone
L.A.
Burdick
J.T.
Morley
M.
Ewens
W.J.
Cheung
V.G.
Common genetic variants account for differences in gene expression among ethnic groups
Nat. Genet.
 
2007
39
226
231
6
Zhang
W.
Duan
S.
Kistner
E.O.
Bleibel
W.K.
Huang
R.S.
Clark
T.A.
Chen
T.X.
Schweitzer
A.C.
Blume
J.E.
Cox
N.J.
et al.  
Evaluation of genetic variation contributing to differences in gene expression between populations
Am. J. Hum. Genet.
 
2008
82
631
640
7
Duan
S.
Huang
R.S.
Zhang
W.
Bleibel
W.K.
Roe
C.A.
Clark
T.A.
Chen
T.X.
Schweitzer
A.C.
Blume
J.E.
Cox
N.J.
et al.  
Genetic architecture of transcript-level variation in humans
Am. J. Hum. Genet.
 
2008
82
1101
1113
8
Nicolae
D.L.
Gamazon
E.
Zhang
W.
Duan
S.
Dolan
M.E.
Cox
N.J.
Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS
PLoS Genet.
 
2010
6
e1000888
9
Gamazon
E.R.
Huang
R.S.
Cox
N.J.
Dolan
M.E.
Chemotherapeutic drug susceptibility associated SNPs are enriched in expression quantitative trait loci
Proc. Natl. Acad. Sci. USA
 
2010
107
9287
9292
10
Schaub
M.A.
Boyle
A.P.
Kundaje
A.
Batzoglou
S.
Snyder
M.
Linking disease associations with regulatory information in the human genome
Genome Res.
 
2012
22
1748
1759
11
Zhang
D.
Cheng
L.
Badner
J.A.
Chen
C.
Chen
Q.
Luo
W.
Craig
D.W.
Redman
M.
Gershon
E.S.
Liu
C.
Genetic control of individual differences in gene-specific methylation in human brain
Am. J. Hum. Genet.
 
2010
86
411
419
12
Gibbs
J.R.
van der Brug
M.P.
Hernandez
D.G.
Traynor
B.J.
Nalls
M.A.
Lai
S.L.
Arepalli
S.
Dillman
A.
Rafferty
I.P.
Troncoso
J.
et al.  
Abundant quantitative trait loci exist for DNA methylation and gene expression in human brain
PLoS Genet.
 
2010
6
e1000952
13
Bell
J.T.
Pai
A.A.
Pickrell
J.K.
Gaffney
D.J.
Pique-Regi
R.
Degner
J.F.
Gilad
Y.
Pritchard
J.K.
DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines
Genome Biol.
 
2011
12
R10
14
Fraser
H.B.
Lam
L.L.
Neumann
S.M.
Kobor
M.S.
Population-specificity of human DNA methylation
Genome Biol.
 
2012
13
R8
15
McVicker
G.
van de Geijn
B.
Degner
J.F.
Cain
C.E.
Banovich
N.E.
Raj
A.
Lewellen
N.
Myrthil
M.
Gilad
Y.
Pritchard
J.K.
Identification of genetic variants that affect histone modifications in human cells
Science
 
2013
342
747
749
16
Moen
E.L.
Zhang
X.
Mu
W.
Delaney
S.M.
Wing
C.
McQuade
J.
Myers
J.
Godley
L.A.
Dolan
M.E.
Zhang
W.
Genome-wide variation of cytosine modifications between European and African populations and the implications for complex traits
Genetics
 
2013
194
987
996
17
Eadon
M.T.
Wheeler
H.E.
Stark
A.L.
Zhang
X.
Moen
E.L.
Delaney
S.M.
Im
H.K.
Cunningham
P.N.
Zhang
W.
Dolan
M.E.
Genetic and epigenetic variants contributing to clofarabine cytotoxicity
Hum. Mol. Genet.
 
2013
22
4007
4020
18
Reik
W.
Dean
W.
Walter
J.
Epigenetic reprogramming in mammalian development
Science
 
2001
293
1089
1093
19
Issa
J.P.
Aging and epigenetic drift: a vicious cycle
J. Clin. Invest.
 
2014
124
24
29
20
Surani
M.A.
Hayashi
K.
Hajkova
P.
Genetic and epigenetic regulators of pluripotency
Cell
 
2007
128
747
762
21
Zhang
X.
Richards
E.J.
Borevitz
J.O.
Genetic and epigenetic dissection of cis regulatory variation
Curr. Opin. Plant Biol.
 
2007
10
142
148
22
Xu
G.L.
Bestor
T.H.
Bourc'his
D.
Hsieh
C.L.
Tommerup
N.
Bugge
M.
Hulten
M.
Qu
X.
Russo
J.J.
Viegas-Pequignot
E.
Chromosome instability and immunodeficiency syndrome caused by mutations in a DNA methyltransferase gene
Nature
 
1999
402
187
191
23
Wutz
A.
Jaenisch
R.
A shift from reversible to irreversible X inactivation is triggered during ES cell differentiation
Mol. Cell.
 
2000
5
695
705
24
Siegfried
Z.
Eden
S.
Mendelsohn
M.
Feng
X.
Tsuberi
B.Z.
Cedar
H.
DNA methylation represses transcription in vivo
Nat. Genet.
 
1999
22
203
206
25
Gamazon
E.R.
Badner
J.A.
Cheng
L.
Zhang
C.
Zhang
D.
Cox
N.J.
Gershon
E.S.
Kelsoe
J.R.
Greenwood
T.A.
Nievergelt
C.M.
et al.  
Enrichment of cis-regulatory gene expression SNPs and methylation quantitative trait loci among bipolar disorder susceptibility variants
Mol. Psychiatry
 
2013
18
340
346
26
Bibikova
M.
Barnes
B.
Tsan
C.
Ho
V.
Klotzle
B.
Le
J.M.
Delano
D.
Zhang
L.
Schroth
G.P.
Gunderson
K.L.
et al.  
High density DNA methylation array with single CpG site resolution
Genomics
 
2011
98
288
295
27
Jin
S.G.
Kadam
S.
Pfeifer
G.P.
Examination of the specificity of DNA methylation profiling techniques towards 5-methylcytosine and 5-hydroxymethylcytosine
Nucleic Acids Res.
 
2010
38
e125
28
Huang
Y.
Pastor
W.A.
Shen
Y.
Tahiliani
M.
Liu
D.R.
Rao
A.
The behaviour of 5-hydroxymethylcytosine in bisulfite sequencing
PLoS ONE
 
2010
5
e8888
29
Tahiliani
M.
Koh
K.P.
Shen
Y.
Pastor
W.A.
Bandukwala
H.
Brudno
Y.
Agarwal
S.
Iyer
L.M.
Liu
D.R.
Aravind
L.
et al.  
Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1
Science
 
2009
324
930
935
30
Ito
S.
D'Alessio
A.C.
Taranova
O.V.
Hong
K.
Sowers
L.C.
Zhang
Y.
Role of Tet proteins in 5mC to 5hmC conversion, ES-cell self-renewal and inner cell mass specification
Nature
 
2010
466
1129
1133
31
Zhang
X.
Mu
W.
Zhang
W.
On the analysis of the illumina 450k array data: probes ambiguously mapped to the human genome
Front. Genet.
 
2012
3
73
32
Gamazon
E.R.
Zhang
W.
Konkashbaev
A.
Duan
S.
Kistner
E.O.
Nicolae
D.L.
Dolan
M.E.
Cox
N.J.
SCAN: SNP and copy number annotation
Bioinformatics
 
2010
26
259
262
33
Westra
H.J.
Peters
M.J.
Esko
T.
Yaghootkar
H.
Schurmann
C.
Kettunen
J.
Christiansen
M.W.
Fairfax
B.P.
Schramm
K.
Powell
J.E.
et al.  
Systematic identification of trans eQTLs as putative drivers of known disease associations
Nat. Genet.
 
2013
45
1238
1243
34
Song
J.
Teplova
M.
Ishibe-Murakami
S.
Patel
D.J.
Structure-based mechanistic insights into DNMT1-mediated maintenance DNA methylation
Science
 
2012
335
709
712
35
Bernstein
B.E.
Birney
E.
Dunham
I.
Green
E.D.
Gunter
C.
Snyder
M.
An integrated encyclopedia of DNA elements in the human genome
Nature
 
2012
489
57
74
36
Zhang
X.
Cal
A.J.
Borevitz
J.O.
Genetic architecture of regulatory variation in Arabidopsis thaliana
Genome Res.
 
2011
21
725
733
37
Hindorff
L.A.
Sethupathy
P.
Junkins
H.A.
Ramos
E.M.
Mehta
J.P.
Collins
F.S.
Manolio
T.A.
Potential etiologic and functional implications of genome-wide association loci for human diseases and traits
Proc. Natl. Acad. Sci. USA
 
2009
106
9362
9367
38
Hoglinger
G.U.
Melhem
N.M.
Dickson
D.W.
Sleiman
P.M.
Wang
L.S.
Klei
L.
Rademakers
R.
de Silva
R.
Litvan
I.
Riley
D.E.
et al.  
Identification of common variants influencing risk of the tauopathy progressive supranuclear palsy
Nat. Genet.
 
2011
43
699
705
39
Spencer
C.C.
Plagnol
V.
Strange
A.
Gardner
M.
Paisan-Ruiz
C.
Band
G.
Barker
R.A.
Bellenguez
C.
Bhatia
K.
Blackburn
H.
et al.  
Dissection of the genetics of Parkinson's disease identifies an additional association 5′ of SNCA and multiple associated haplotypes at 17q21
Hum. Mol. Genet.
 
2011
20
345
353
40
Fingerlin
T.E.
Murphy
E.
Zhang
W.
Peljto
A.L.
Brown
K.K.
Steele
M.P.
Loyd
J.E.
Cosgrove
G.P.
Lynch
D.
Groshong
S.
et al.  
Genome-wide association study identifies multiple susceptibility loci for pulmonary fibrosis
Nat. Genet.
 
2013
45
613
620
41
Tobin
J.E.
Latourelle
J.C.
Lew
M.F.
Klein
C.
Suchowersky
O.
Shill
H.A.
Golbe
L.I.
Mark
M.H.
Growdon
J.H.
Wooten
G.F.
et al.  
Haplotypes and gene expression implicate the MAPT region for Parkinson disease: the GenePD Study
Neurology
 
2008
71
28
34
42
Lauberth
S.M.
Nakayama
T.
Wu
X.
Ferris
A.L.
Tang
Z.
Hughes
S.H.
Roeder
R.G.
H3K4me3 interactions with TAF3 regulate preinitiation complex assembly and selective gene activation
Cell
 
2013
152
1021
1036
43
Tate
P.H.
Bird
A.P.
Effects of DNA methylation on DNA-binding proteins and gene expression
Curr. Opin. Genet. Dev.
 
1993
3
226
231
44
Carrozza
M.J.
Li
B.
Florens
L.
Suganuma
T.
Swanson
S.K.
Lee
K.K.
Shia
W.J.
Anderson
S.
Yates
J.
Washburn
M.P.
et al.  
Histone H3 methylation by Set2 directs deacetylation of coding regions by Rpd3S to suppress spurious intragenic transcription
Cell
 
2005
123
581
592
45
Keogh
M.C.
Kurdistani
S.K.
Morris
S.A.
Ahn
S.H.
Podolny
V.
Collins
S.R.
Schuldiner
M.
Chin
K.
Punna
T.
Thompson
N.J.
et al.  
Cotranscriptional set2 methylation of histone H3 lysine 36 recruits a repressive Rpd3 complex
Cell
 
2005
123
593
605
46
Illingworth
R.S.
Gruenewald-Schneider
U.
Webb
S.
Kerr
A.R.
James
K.D.
Turner
D.J.
Smith
C.
Harrison
D.J.
Andrews
R.
Bird
A.P.
Orphan CpG islands identify numerous conserved promoters in the mammalian genome
PLoS Genet.
 
2010
6
e1001134
47
Lister
R.
Pelizzola
M.
Dowen
R.H.
Hawkins
R.D.
Hon
G.
Tonti-Filippini
J.
Nery
J.R.
Lee
L.
Ye
Z.
Ngo
Q.M.
et al.  
Human DNA methylomes at base resolution show widespread epigenomic differences
Nature
 
2009
462
315
322
48
Meissner
A.
Mikkelsen
T.S.
Gu
H.
Wernig
M.
Hanna
J.
Sivachenko
A.
Zhang
X.
Bernstein
B.E.
Nusbaum
C.
Jaffe
D.B.
et al.  
Genome-scale DNA methylation maps of pluripotent and differentiated cells
Nature
 
2008
454
766
770
49
Reddington
J.P.
Perricone
S.M.
Nestor
C.E.
Reichmann
J.
Youngson
N.A.
Suzuki
M.
Reinhardt
D.
Dunican
D.S.
Prendergast
J.G.
Mjoseng
H.
et al.  
Redistribution of H3K27me3 upon DNA hypomethylation results in de-repression of Polycomb target genes
Genome Biol.
 
2013
14
R25
50
Hawkins
R.D.
Hon
G.C.
Lee
L.K.
Ngo
Q.
Lister
R.
Pelizzola
M.
Edsall
L.E.
Kuan
S.
Luu
Y.
Klugman
S.
et al.  
Distinct epigenomic landscapes of pluripotent and lineage-committed human cells
Cell Stem Cell
 
2010
6
479
491
51
Feldman
N.
Gerson
A.
Fang
J.
Li
E.
Zhang
Y.
Shinkai
Y.
Cedar
H.
Bergman
Y.
G9a-mediated irreversible epigenetic inactivation of Oct-3/4 during early embryogenesis
Nat. Cell Biol.
 
2006
8
188
194
52
Ooi
S.K.
Qiu
C.
Bernstein
E.
Li
K.
Jia
D.
Yang
Z.
Erdjument-Bromage
H.
Tempst
P.
Lin
S.P.
Allis
C.D.
et al.  
DNMT3L connects unmethylated lysine 4 of histone H3 to de novo methylation of DNA
Nature
 
2007
448
714
717
53
Bird
A.
DNA methylation patterns and epigenetic memory
Genes Dev.
 
2002
16
6
21
54
Sherry
S.T.
Ward
M.H.
Kholodov
M.
Baker
J.
Phan
L.
Smigielski
E.M.
Sirotkin
K.
dbSNP: the NCBI database of genetic variation
Nucleic Acids Res.
 
2001
29
308
311
55
Bolstad
B.M.
Irizarry
R.A.
Astrand
M.
Speed
T.P.
A comparison of normalization methods for high density oligonucleotide array data based on variance and bias
Bioinformatics
 
2003
19
185
193
56
Johnson
W.E.
Li
C.
Rabinovic
A.
Adjusting batch effects in microarray expression data using empirical Bayes methods
Biostatistics
 
2007
8
118
127
57
Frazer
K.A.
Ballinger
D.G.
Cox
D.R.
Hinds
D.A.
Stuve
L.L.
Gibbs
R.A.
Belmont
J.W.
Boudreau
A.
Hardenbol
P.
Leal
S.M.
et al.  
A second generation human haplotype map of over 3.1 million SNPs
Nature
 
2007
449
851
861
58
Borevitz
J.O.
Liang
D.
Plouffe
D.
Chang
H.S.
Zhu
T.
Weigel
D.
Berry
C.C.
Winzeler
E.
Chory
J.
Large-scale identification of single-feature polymorphisms in complex genomes
Genome Res.
 
2003
13
513
523
59
Anderson
M.J.
Braak
C.J.F.T.
Permutation tests for multi-factorial analysis of variance
J. Stat. Comput. Simul.
 
2003
73
85
113