Methylation of CpG islands (CGIs) plays an important role in gene silencing. For genome-wide methylation analysis of CGIs in female white blood cells and in sperm, we used four restriction enzymes and a size selection step to prepare DNA libraries enriched with CGIs. The DNA libraries were treated with sodium bisulfite and subjected to a modified 454/Roche Genome Sequencer protocol. We obtained 163 034 and 129 620 reads from blood and sperm, respectively, with an average read length of 133 bp. Bioinformatic analysis revealed that 12 358 (7.6%) blood library reads and 10 216 (7.9%) sperm library reads map to 6167 and 5796 different CGIs, respectively. In blood and sperm DNA, we identified 824 (13.7%) and 482 (8.5%) fully methylated autosomal CGIs, respectively. Differential methylation, which is characterized by the presence of methylated and unmethylated reads of the same CGI, was observed in 53 and 52 autosomal CGIs in blood and sperm DNA, respectively. Remarkably, methylation of X-chromosomal CGIs in female blood cells was most often incomplete (25–75%). Such incomplete methylation was mainly found on the X-chromosome, suggesting that it is linked to X-chromosome inactivation.
CpG islands (CGIs) are CpG-rich DNA sequences, which are often found at the transcription start site of mammalian genes (1,2). With the exception of imprinted genes, genes on the inactive X-chromosome in females, germline-specific genes as well as a few developmental genes, the cytosine residues within CGIs >500 bp are usually unmethylated (3). The methylation of CGIs imposes transcriptional silencing that can be transmitted by clonal inheritance in somatic cells for many cell generations. The role of DNA methylation at CGIs <500 bp as well as in the gene body is less clear, although certain differentially methylated sequences have been identified as regulatory elements (4,5). Aberrant methylation of CGIs is found in many disease states, especially cancer. In order to identify all the genes that are regulated by differential DNA methylation, it will be necessary to determine the DNA methylation patterns of all CGIs and other CG-rich sequences in different tissues at different stages, both in normal as well as in disease states.
In recent years, several techniques for genome-wide DNA methylation studies have been developed. For many years, restriction genomic landmark scanning was the only method capable of analyzing more than 1000 loci in parallel (6). Later, microarray-based techniques were developed (7,8). Although the latter techniques make it possible to survey all CGIs in a single experiment, the resolution is only moderate, and no allele-specific information can be obtained. Single base resolution requires sodium bisulfite sequencing. With next generation sequencing technology, it has now become possible to perform massive parallel bisulfite sequencing of DNA. Using the Illumina platform, which generates reads of ∼35 bp, Lister et al. (9) have studied the DNA methylation of the Arabidopsisthaliana genome at single base resolution. At present, sequencing of an entire mammalian genome, which is ∼30 times larger than the A. thaliana genome, with or without sodium bisulfite treatment, is not feasible, and DNA sequences of interest have to be enriched prior to sequence analysis. Meissner et al. (10) have generated genome-scale DNA methylation maps of murine cells by end sequencing of size-selected MspI fragments on the Illumina platform.
The 454/Roche platform generates fewer reads compared with the other platform, but these are 100–400 bp in length (11). Therefore, it is possible to obtain reads that cover a substantial fraction of a CGI and thus provide comprehensive methylation information of a single allele in a single cell. This should make it possible, for example, to distinguish between allele-specific methylation differences and biallelic partial methylation. Furthermore, we anticipated that longer reads can be more accurately mapped to the reference sequence, especially reads from unmethylated sequences, which contain only three different bases. On the other hand, very efficient enrichment methods are required. We have performed in silico restriction digestions of the human genome to select a combination of enzymes that generate a DNA fragment set with optimally enriched CGI content for 454/Roche sequencing. We have modified the sequencing protocol and have developed algorithms for mapping the reads onto the reference sequence and determine the degree of methylation. Finally, we used the novel strategy to analyze and compare DNA methylation of human blood and sperm samples.
Enrichment of CpG-island-containing reads
The human haploid genome reference sequence (NCBI Build 36.1; http://genome.ucsc.edu/index.html) has a length of 3 142 044 949 bp and contains 27 639 annotated CGIs with a total length of 21 101 376 bp (0.67%). Thus, performing unselected sampling of DNA fragments <1% of all reads hit a CGI. Therefore, the first step of our approach was to prepare a genomic DNA library enriched for a defined fraction of CGIs fragments. The overall strategy is to digest genomic DNA using an appropriate combination of restriction enzymes followed by gel-based size fractionation of restriction fragments. To best meet the requirements of the 454/Roche sequencing technology, the desired length of restriction fragments is 300–800 bp. For degradation of AT-rich regions into smaller fragments, we chose MseI (TTAA) and Tsp509I (AATT). For further enrichment of CGI fragments within the pool of fragments with desired length, the following enzymes were tested by computationally digesting the human reference genome: AluI (AGCT), NlaIII (CATG), BfaI (CTAG), HpyCH4 (TGCA), DpuI (GATC), MboII (GAAGA), MlyI (GAGTC) and BccI (CCATC). We found optimal enrichment of CGI fragments by the use of following restriction enzyme mix (REM): (MseI, Tsp509I, NlaIII and HpyCH4V), resulting in 99 187 different in silico restriction fragments (ISRFs)/male genome with an optimal length range of 346–800 bp (see Supplementary Material, Table S1). Under these conditions, almost all other (∼70 million) ISRFs are shorter than 346 bp, 4.964 ISRFs are longer than 800 bp. We find that 16 312 (16.4%) of the ISRFs hit 12 482 distinct CGIs, which is 45% of all CGIs in the genome, thus resulting in a 24-fold enrichment of CGI fragments compared with random sampling.
To check the in silico enrichment of CGIs, we digested female genomic DNA and subjected the size-selected fragments (346–800 bp) to 454/Roche sequence analysis. We obtained 389 188 reads with an average length of 102 bp. Mapping the reads to the genome, we found that 149 141 reads (38%) were not mappable to the reference sequence and 15 841 reads (4%) could not unambiguously be assigned as they map to more than one position in the human genome. A number of 120 738 reads (31%) clearly map to the predicted reference sequence, which is defined as 123 797 distinct ISRFs of lengths 330–850 bp (accounting for ∼5% inaccuracy in size fractionation). The remaining 103 468 reads (27%) map elsewhere in the genome. The ISRFs are not covered uniformly: 61.879 ISRFs were not hit by a read, 30.672 were hit once and 16.365 twice. Further coverage frequencies follow a geometric progression with a factor of 0.5 (Supplementary Material, Fig. S1). Based on in silico restriction, we expected reads from 12 482 distinct CGIs within the 120 738 mappable reads. We found 7728 of the expected CGIs were hit at least once and 4754 CGIs were not covered. In detail, among these 7728 covered CGIs, 2894 were hit once, 1934 twice and 2900 more than twice (Supplementary Material, Fig. S2). Overall, among the 120 738 mappable reads, 19.238 (15.9%) hit a CGI (that is 4.94% of all 389 188 reads). Thus, resulting in an enrichment of CGI reads in our sample by a factor of 7.4.
Massively parallel bisulfite sequencing
We performed massively parallel sequencing of CGI-enriched and bisulfite-treated DNA fragments from human genomic blood and sperm DNA samples. Using half a plate for each of both samples, we obtained 163 034 and 129 620 reads, respectively, with an average read length of 133 bp. Bioinformatic analysis revealed that 12.358 (7.6%) blood library reads map to 6167 different CGIs. Almost the same proportion of sperm reads (7.9%) maps to 5796 different CGIs. By this analysis, we obtained methylation data of 22 and 21% of all CGIs in the blood and sperm sample, respectively. Depending on the degree of methylation (=methylated CpGs/all CpGs × 100), each read is grouped into one of the three categories: unmethylated (<25% methylated), partially methylated (between 25 and <75%) and fully methylated (≥75%). Next, we defined different CGI classes according to the degree of methylation of the matching reads (Table 1).
|CGI classes||Degree of methylation of matching reads|
|1. Fully methylated||−||−||+|
|3. Partially methylated||−||+||−|
|4. Differentially methylated||+||−||+|
|5. Partially + unmethylated||+||+||−|
|6. Partially + Fully methylated||−||+||+|
|7. Mixed methylated||+||+||+|
|CGI classes||Degree of methylation of matching reads|
|1. Fully methylated||−||−||+|
|3. Partially methylated||−||+||−|
|4. Differentially methylated||+||−||+|
|5. Partially + unmethylated||+||+||−|
|6. Partially + Fully methylated||−||+||+|
|7. Mixed methylated||+||+||+|
CGIs larger than 500 bp are presumed to be of higher regulatory relevance, therefore, the methylation patterns of CGI <500 bp (small CGI) and ≥500 bp (large CGI) were analyzed separately. The proportion of un-, fully, partially and differentially methylated CGIs depending on genomic location and CGI length is presented in Figure. 1 (see also Supplementary Material, Tables S2 and S3).
Autosomal methylated CGIs
We detected a substantial number of methylated autosomal CGIs in both samples. In blood DNA, we identified 824 (13.7%) fully methylated and 231 (3.8%) partially methylated autosomal CGIs. In sperm DNA, the number of fully and partially methylated CGIs is considerably lower with 482 (8.5%) and 148 (2.6%), respectively. Overall, in both samples, autosomal small CGIs tend to be more frequently methylated compared with large CGIs (Fig. 1 and Supplementary Material, Table S3). In blood and sperm DNA we identified 394 and 172 fully methylated large CGIs, respectively (Supplementary Material, Table S3). Among these, 428 are associated with 380 different genes. In 66 (17%) of these genes, the CGI intersects or overlaps with the promoter region, which we define as the region 1 kb upstream of the transcription start site (Supplementary Material, Tables S4 and S5). CGIs that are hit by methylated and unmethylated reads are referred to as differentially methylated (dmCGI). In somatic cells, autosomal dmCGIs are indicative of allelic methylation and imprinting. In blood and sperm samples, we identified 53 and 52 dmCGI (<1%), respectively. Among them, 75 dmCGIs are associated with 81 known genes (some are associated with more than one gene) (Supplementary Material, Table S6). Interestingly, the data revealed that dmCGIs are clearly distinct from fully methylated CGIs: dmCGIs are predominantly ≥500 bp in length (Supplementary Material, Table S3) and they are more frequently associated with promoter regions (50%) compared with fully methylated large CGIs (15%) (Supplementary Material, Tables S4 and S6). Seven of the genes associated with dmCGIs are already known or predicted to be imprinted (Supplementary Material, Table S6) (12).
In DNA from female blood cells, X-chromosomal CGIs are more frequently fully, partially or differentially methylated compared with autosomal CGIs (Fig. 1 and Supplementary Material, Table S7). This is most likely due to the inactivation of X-linked genes in female cells. Unexpectedly, among the X-chromosomal CGIs in female blood cells, there is a comparatively large fraction (25%) of partially methylated CGIs showing a degree of methylation between 25 and 75% (Fig. 1). On autosomes, this intermediate methylation state is observed in only 3.8% of all CGIs (Fig. 1 and Supplementary Material, Table S3). The evidence of an overall lower degree of methylation of X-chromosomal reads is also supported by the large number of CGIs that are hit by both, partially and unmethylated reads. Figure 2 shows the pattern of the partially methylated X-chromosomal reads. To take a closer look at the distribution of partially methylated reads, we subdivide the reads into 10 different classes depending on their degree of methylation (Fig. 3) and found that the vast majority of autosomal reads (>93%) are derived from genomic sequences with 0–20 or 80–100% methylation (Fig. 3A). The distribution of X-chromosomal reads is distinct, with a large proportion of reads derived from genomic sequences with 50−80% methylation (Fig. 3B). This applies to reads mapping to small CGIs and to large CGIs. Using Pearson’s χ2 test, we found that the difference between the distribution of autosomal and X-chromosomal methylated reads is highly significant (P = 2.4E−25). For this analysis, we omitted the unmethylated reads (0–30%), because a different distribution of these reads as a consequence of X-chromosome inactivation is to be expected.
To validate the results, we determined the methylation patterns of regions overlapping with reads generated on the 454/Roche platform by conventional bisulfite sequencing. We randomly chose eight fully, three differentially as well as one autosomal and five X-chromosomal partially methylated CGIs. Cloning of PCR products from bisulfite-treated DNA and sequence analysis of at least eight individual clones per loci revealed complete or almost complete methylation of seven CGIs that were classified as fully methylated based on one to six reads generated by the 454/Roche sequencer (MIB2, three reads; ZNF672, three reads; DPPA5, two reads; BEND3, six reads; RAI1, four reads; DLGAP1, one read; GLI2, two reads). We found complete methylation in 11 of 12 clones amplified from the KIAA0408 promoter: one clone showed a degree of 55% methylation (Supplementary Material, Fig. S3A). Differential methylation of TNFGSF4 in sperm DNA and FUT4 in blood DNA is also confirmed as we detected fully methylated as well as unmethylated alleles in both dmCGIs by conventional bisulfite sequencing (Supplementary Material, Fig. S3B). ASCL2 was classified as differentially methylated based on the presence of seven unmethylated and one fully methylated reads. Detailed inspection of the methylated read revealed a perfect match to the ASCL2 target sequences, and eight out of nine CpGs within the read were methylated. Sequence analysis of 13 individual cloned PCR fragments showed an overall low degree of methylation varying from 0 to 8% in all clones. A fully methylated allele was not found. Partial methylation of the CGI associated with the promoter of HLCS was confirmed as 13 clones show a degree of methylation between 0 and 42%.
For validation of X-chromosomal partially methylated reads, we analyzed five randomly chosen CGIs, which showed 46–70% methylation. Sequence analysis of individual clones revealed a broad range of methylation, varying from 26 to 90% among the methylated clones (Fig. 4 and Supplementary Material, Table S8). Excluding the unmethylated clones, which are most probably derived from the active X, the mean degree of methylation of the analyzed loci varied from 43 to 69% (Supplementary Material, Fig. S3A; and Table 2). This is distinct from the results of autosomal fully methylated CGIs, which showed a uniform methylation pattern among individual clones and a mean degree of methylation >92% (Table 2). The results confirm the 454/Roche sequencing data and emphasize the difference in the degree of methylation of X-chromosomal and autosomal CGIs.
|Position||Gene||Length (bp)||% CpG||% C or G||Observed/expected||Degree of methylation (%)|
|Position||Gene||Length (bp)||% CpG||% C or G||Observed/expected||Degree of methylation (%)|
Digestion of genomic DNA with the used REM followed by size selection of restriction fragments resulted in a 7.4-fold enrichment of CGI fragments. This is much lower than 24-fold enrichment as expected on the basis of in silico digestion of the human reference genome. Several reasons might account for this.
(i) The in silico study is based on the RefSeq assembly, which excludes highly repetitive sequence elements such as centromeric satellite repeats. In our experiments, ∼38% of the reads are unmappable. Inspection of these reads revealed that they mainly consist of repetitive DNA including satellite sequences. Therefore, further depletion of repetitive elements would markedly increase the CGI enrichment. (ii) Fragment length selection in agarose gels is usually inaccurate. In particular, smaller DNA fragments might co-migrate with larger fragments. The fraction of smaller, unpredicted fragments might be further increased in the library due to preferential amplification in the PCR following the bisulfite modification. (iii) Normal genomic variants affecting the restriction enzyme recognition sites and fragment length such as single nucleotide polymorphisms and Copy Number Variations (CNVs) might also explain the deviation of in silico and real digestion experiments. Another 27% of all reads are probably attributed to the last two conditions as they are mappable to the genome but are not predicted by the in silico digestion.
After bisulfite modification of CGI-enriched DNA libraries, we obtained methylation data from more than 20% of all CGIs of a human genome. Validation experiments performed by conventional bisulfite sequencing proved the reliability of the 454/Roche platform and the suitability of the newly developed algorithms for global methylation analysis. A minor discrepancy was observed for the CGI associated with ASCL2, which was classified as dmCGI based on seven un- and one fully methylated read, but no fully methylated clone was detected in the validation experiment. The results of both experiments suggest that fully methylated ASCL2 alleles are rare and therefore might have been missed in the validation experiment. Furthermore, we cannot completely rule out biased amplification of different methylation states by PCR.
Concordant with the results from other studies based on different techniques, we identified a substantial number of methylated CGIs in blood and sperm DNA (13,14). In this study, we found 824 (17.6%) autosomal CGIs fully methylated in normal blood DNA. This corresponds well with the results of a study by Yamada et al. (13), in which a restriction enzyme-based methylation analysis of human chromosome 21q revealed that 31 out of 149 CGIs (21%) are fully methylated in normal peripheral blood cells. These results confirm that a remarkable fraction of CGIs, other than previously thought, is methylated in the human genome. It is also apparent from our and published data that larger CGIs tend to be less frequently methylated (3,15,16). Here, we report that 289 CGIs with a length of ≥500 bp are methylated. Most of these CGIs (78%) do not overlap with promoter regions of known genes, and therefore their regulatory relevance is less clear. The remaining 63 methylated large CGIs intersect with transcription start sites. Provided that promoter methylation results in long-term silencing of associated genes, expression of these genes might be restricted to germ cells or preimplantation embryos that undergo extensive epigenetic reprogramming (17). Such developmentally specific methylation patterns have been described for the key transcription factors POU5F1 and NANOG, which are necessary for maintaining a cell in pluripotent state (4). Alternatively, tissue-specific methylation might apply to these genes. Finally, we cannot exclude that at least a subset of the methylated CGIs is in fact differentially methylated, and gene expression is initiated from the second unmethylated allele. As most CGIs are covered by only one read or a few reads of the same methylation type, the portion of differentially methylated CGIs is certainly underestimated.
Due to X-chromosome inactivation in female cells, most X-chromosomal CGIs are expected to be differentially methylated. This implies that on the read level, the portion of methylated reads must be close to 50%. In our study, the number of X-chromosomal dmCGIs in the blood sample is lower than expected. As mentioned above, this can, at least in part, be attributed to the overall underestimation of dmCGIs. The portion of fully methylated reads (16%) is also clearly below the expected range, which cannot be easily explained by an ascertainment bias. However, including all reads with a degree of methylation >40%, the fraction of methylated reads increases to 43%; thus coming close to the expectation. This fits to the observation that CGIs that are hit by an unmethylated and a partially methylated read are predominantly observed on the X-chromosome. These results show that partial CGI methylation is a characteristic feature of X-chromosome inactivation.
To gain insight into the functional relevance of partial methylation, we analyzed if promoter and non-promoter reads are methylated differentially, but found highly similar degrees of methylation in both subgroups (data not shown). We also addressed the possibility that partial methylation might correlate with X inactivation escape as observed in pseudoautosomal and further regions on the X-chromosome. However, only a few X-chromosomal reads map to genes with biallelic expression as determined by Carrel and Willard (18). Therefore, our results largely reflect the methylation pattern of regions that are subjected to X inactivation (data not shown). Guided by a publication from Pfeifer et al. (19), we also analyzed a possible influence of the sequence context on the degree of methylation of individual CpGs. In this study, GCG and CGC trinucleotides located in the CGI of the PGK1 gene were found to partially escape methylation on the inactive X-chromosome. However, we could not find such a methylation escape in the partially methylated CGIs validated by conventional methylation analysis. At a first glace, a characteristic methylation pattern of the partially methylated reads is not evident in our results (Figs 2 and 4). So far, however, we analyzed only one female blood sample. Therefore, it will be interesting to see if this methylation pattern is typical for further female individuals and cell types and if variations in this pattern are associated with phenotype alterations.
Compared with other next generation sequencers, the 454/Roche platform can reliably detect and quantify the degree of methylation of partially methylated reads. The average read length of >130 nt makes it possible to distinguish between allele-specific methylation differences and partial methylation. Owing to the low coverage, however, the methylation status of a specific locus cannot always be reliably determined.
In conclusion, we show that the FLX 454/Roche sequencing technology can be adopted for genome-wide methylation analysis with minor modifications of the standard procedure. Compared with other approaches that are based on array analysis of DNA fragments precipitated by antibodies against 5mC (MeDip), our method allows an unbiased evaluation of the methylation patterns within the reduced representation of a mammalian genome. Even more, high resolution sequence analysis provides precise measures of the degree of methylation of single alleles over a length of up to 250 bp. Using half a plate (200 000 reads) per sample, we determined the methylation status of 22% of all CGIs of a human genome. We expect that further depletion of repetitive elements by the use of additional restriction enzymes combined with the use of an entire plate that provides a sequencing capacity of more than 400 000 reads per run, results in coverage of nearly 50% of all CGIs in the human genome.
MATERIAL AND METHODS
DNA isolation from blood and sperm
Genomic DNA was extracted from blood and sperm cells using the FlexiGene DNA Kit (Qiagen, Hilden, Germany) following the manufacturer's instructions. Whole blood DNA was prepared from a female healthy donor. Semen samples were obtained from volunteers whose ejaculates were considered to be normal based on WHO criteria (20). DNA was quantified using the NanoDrop ND1000 instrument. For sperm library generation and sequence analysis, we pooled the DNA from four different semen samples.
Preparation of DNA samples for massively parallel bisulfite pyrosequencing
Genomic DNA (30–40 µg) was digested using 30–50 U of each restriction enzyme MseI, Tsp509I, NlaIII and HpyCH4V, following the manufacturer's (NEB, Frankfurt, Germany) recommendations. Enzymes were applied sequentially to the DNA samples in the order listed above. After 2 h of incubation, the enzyme activity was replenished by adding 30–50 U of fresh enzyme to each sample and increasing the reaction volume accordingly with 1× reaction buffer. Incubation was continued for another 2 h. After Tsp509I digestion, restriction fragments were phenol extracted followed by ethanol precipitation. After final HpyCH4V digestion, DNA fragments were phenol extracted followed by ultra-filtration with Microcon 100 (Amicon) filtering device at 500g. Typical yields were 10 µg of DNA. Restriction fragments were size fractionated by electrophoresis on a 1.8% agarose gel with 1 µg DNA applied per lane. Slices containing the 350–800 bp fractions were excised from the ethidium bromide-stained gel, and DNA was recovered by Wizard SV Gel and PCR Clean-Up System (Promega, USA). DNA was concentrated by centrifugation in Microcon 100 devices at 500g. Recovery from the gel was typically 400 ng.
Library preparation for sequence analysis of bisulfite-treated DNA fragments on GS FLX
Library preparation of samples was performed according to the manufacturer's protocol ‘GS DNA Library Preparation Kit’ (11) with the following modifications.
Without prior nebulization, restriction fragments were blunt-ended and phosphorylated through the activity of an enzyme mix [T4 DNA polymerase, Escherichiacoli DNA polymerase (Klenow fragment), T4 polynucleotide kinase]. The polishing reaction was performed in the presence of 5-methyl-dCTP (Jena Bioscience GmbH) instead of dCTP.
The A and B adapters each comprised 44 bp double-stranded oligodeoxyribonucleotides were ligated to the polished ends of the DNA fragments. To preserve the sequence in the following bisulfite modification step, we used adapters that contain 5-methyl-C instead of C. The 5′ modification of the B adapter with a biotin tag remains unaltered. After adapter ligation, samples were purified using MinElute columns (Qiagen).
The two nicks at the 3′-junctions were repaired by the strand-displacement activity of Bst DNA polymerase. For preservation of methylated cytosines in the adapter sequence, the reaction was performed in the presence of 5-methyl-dCTP instead of dCTP. Single-stranded AB adapted library fragments were isolated as described in the manufacturer's protocol. DNA yield varied from100 pg to 5 ng.
Bisulfite modification and massively parallel pyrosequencing
Single-stranded DNA fragments were subjected to bisulfite modification as described elsewhere (21,22) and finally resuspended in 10 µl of water. PCR amplification of 2 µl bisulfite-converted library was performed in a total reaction volume of 25 µl using GoTaq Green master Mix (Promega) and 200 nm of each of the primers GSFLXA and GSFLXB. The primers GSFLXA and GSFLXB both have a length of 20 nt and bind to the 5′ part of the adapters A and B, respectively (11). After initial denaturation at 95°C for 2 min, 20 cycles were carried out with denaturation at 95°C for 30 s annealing at 62°C for 30 s and extension at 72°C for 1 min, followed by a final extension at 72°C for 5 min. To remove unspecific reaction products, gel purification of PCR products was performed as described above, however, DNA fragments in the size of 440–900 bp were recovered. The DNA was quantified using the Agilent 2100 BioAnalyzer on a RNA 6000 Pico LabChip.
Massively parallel pyrosequencing
Bead preparation and sequence analysis was performed according to the Shotgun protocol of the GS emPCR Kit User’s Manual and Genome Sequencer FLX. Massively parallel pyrosequencing of the amplification products was performed according to the protocol provided by the manufacturer.
Single locus methylation analysis
For conventional methylation analysis, the regions of interest were amplified on bisulfite-treated DNA. Primers were designed using MethPrime (23) and were provided with an additional 5′ tag-sequence (Tag) that does not bind to the template (Supplementary Material, Table S9). A common fwTag was synthesized to all forward (fw) primers and a distinct revTag to all reverse (rev) primers. PCR was performed in a total volume of 25 µl containing 3 µl of bisulfite-treated DNA, 0.2 mm of each dNTP, 0.25 µm of each primer, 2.5 µl 10× PCR-Puffer, 2.5 mm MgCl2 and 1.5 U Taq Polymerase (AmpliTaq Gold, Applied Biosystems, Foster City, CA, USA) using a GeneAmp 9700 system. A touch down protocol was adopted for amplification as follows (24): After denaturation at 95°C for 5 min, the annealing temperature was decreased 0.7°C for every cycle from 63 to 56°C, at which temperature 35 cycles were carried out. For all cycles, annealing was performed for 1 min, denaturation at 95°C for 20 s and extension at 72°C for 1 min, followed by a final extension at 72°C for 5 min. After agarose gel electrophoresis and purification using MinElute™ Gel Extraction Kit (Qiagen), PCR products were cloned and individual clones were sequenced on an ABI 3100 automated capillary genetic analyzer (Applied Biosystems) using Big Dye 1.1 chemistry. Cloning and sequence analysis of PCR product was performed as described elsewhere (25).
Pre-preprocessing of sequence reads
All reads are searched for overlaps with any of the adapter sequences using a simple ends-free sequence alignment algorithm. If an adapter is found, it is removed from the read. The process is repeated at most three times, taking into account the rare instances of an adapter sequence found twice at the end of a read.
The sequence reads were mapped back to the genome using our own VerJInxer software available at http://verjinxer.googlecode.com.
For mapping of reads from CGI-enriched fragments, a CGI reference sequence was generated, by extracting CGI sequences as defined by the UCSC browser from the repeat-masked human genome reference sequence including additional 200 nt at both ends of each CGI (henceforth called CGI-RefSeq) (26). The resulting FASTA file contains 32 156 976 nt in total, of which 2 689 539 nt (9.1%) are repeats and counted as mismatches in further processing. For mapping reads (query sequences) to the genome, we first used a q-gram index of the CGI-RefSeq (target sequence) with a word length of q = 10 to find exact matches between query and target. Matches of a minimum length of n0 = 25 (seeds) are extended by computing an optimal alignment to the left and to the right such that they cover the entire read. Reads for which there are at most 5% (relative to the read length) errors (indels or mismatches) are recorded. After repeating the process with the reverse complements of all reads, all reads are reported that map uniquely to one position on the CGI-RefSeq. A regular q-gram index of sequence contains, for each string of length q (q-gram), the positions at which that string occurs in the sequence (27).
To find a query in the target, the query’s q-grams are looked up in the index and thus located in the target sequence, giving initial (short) matches of length q. Consecutive q-grams of the read that match consecutive q-grams in the index correspond to a longer match (>q).
Mapping bisulfite reads
To map bisulfite reads, we create a bisulfite q-gram index of the CGI-RefSeq, which indexes not the sequence itself, but a simulated bisulfite-treated version of it. For each q-gram found in the target sequence, all possible methylation patterns are simulated. All of the resulting q-grams are recorded in the index as belonging to this position. As an example for q = 5, let the reference sequence contain CGACA at some position. Since the first cytosine can be either methylated or not, bisufite treatment results in a read containing either CGATA or TGATA. Both of these q-grams are included in the index. The VerJInxer software also includes the non-bisulfite-treated q-gram in the index in order to detect entirely or partially unmodified reads. After PCR amplification of bisulfite-treated DNA fragments that are complementary to the RefSeq, in the resulting reads Gs opposite unmethylated Cs are replaced by A (relative to the RefSeq). The respective q-grams are also simulated and included in the index. Note that this is orthogonal to the fact that for each read, its reverse complement is computed and separately mapped to the reference sequence. In the following discussion, the G/A substitutions are not mentioned anymore. It is possible that the number of q-grams resulting from the simulation is large. For example, if q = 10 and a q-gram consisting of five consecutive CG dinucleotides is considered, a total of 25 + 25 − 1 = 63 q-grams must be recorded in the index. The average number is much lower. The bisulfite q-gram index (q = 10) of CGI-RefSeq contains 4.36 q-grams per sequence position on average.
For mapping bisulfite reads to the CGI-RefSeq, we used the algorithms as described above, but the procedure is modified such that, when extending the seeds, the possible substitutions are observed, accomplished by an alteration of the cost function that is used for the computation of the optimal alignment. In order to detect partially unmodified reads, we allow a C in the RefSeq to match a C in the read even if not part of a CpG, which would not occur if bisulfite modification was perfect. As before, the process is repeated for the reverse complement of each read and only those reads mapping uniquely to the CGI-RefSeq are reported.
Degree of methylation
To identify and exclude apparently unmodified Cs that can also arise by the incorporation of methylated Cs in the course of the experimental setup (Nick Repair), we compare the reads with the reference sequence and classify Cs outside CpG dinucleotides as unmodified.
To determine the degree of methylation, unmodified parts of the reads must be ignored. We assume that there is a boundary in each read which splits it into two parts (front and back) and optimize the position of this border such that the number of unmodified nucleotides in one part plus the number of modified nucleotides in the other part is maximized. After that, any modified (unmodified) nucleotides that are in the unmodified (modified) part are added to the number of errors that were already reported for this read. If the error rate increases above the maximum error rate of 5%, the read is discarded. Otherwise, its methylation rate is determined. Only the part of the read that is bisulfite modified and that intersects a CGI is analyzed. CG dinucleotides replaced by a TG count as unmethylated, and CGs that are unchanged count as methylated. The degree of methylation is computed as the number of methylated CGs divided by the number of methylated CGs plus the number of unmethylated CGs. Reads with a degree of methylation ≥75 and <25% are referred to as fully methylated and unmethylated, respectively. Reads with a degree of methylation in-between are referred to as partially methylated. All CGI reads, including their chromosomal position and degree of methylation, are listed in BED files (see Supplementary Data: Blood.bed and Sperm.bed). For visualization of the results from large-scale bisulfite sequencing, both BED files were imported into the IGB browser available at: https://www.affymetrix.com/partners_programs/programs/developer/tools/download_igb.affx.
The authors thank Lars Maßhöfer for technical assistance and Jörg Gromoll for providing sperm DNA.
Conflict of Interest statement. The manuscript reports on results of a joint project involving authors employed by a pharmaceutical company (ROCHE, G.B., A.K., C.S. and B.F.), and scientists from different academic institutions. M.Z., B.H., S.R. and E.F. are co-applicants of patent application filed by Roche for parts of the methodology.