Highly sensitive targeted methylome sequencing by post-bisulfite adaptor tagging

The current gold standard method for methylome analysis is whole-genome bisulfite sequencing (WGBS), but its cost is substantial, especially for the purpose of multi-sample comparison of large methylomes. Shotgun bisulfite sequencing of target-enriched DNA, or targeted methylome sequencing (TMS), can be a flexible, cost-effective alternative to WGBS. However, the current TMS protocol requires a considerable amount of input DNA and hence is hardly applicable to samples of limited quantity. Here we report a method to overcome this limitation by using post-bisulfite adaptor tagging (PBAT), in which adaptor tagging is conducted after bisulfite treatment to circumvent bisulfite-induced loss of intact sequencing templates, thereby enabling TMS of a 100-fold smaller amount of input DNA with far fewer cycles of polymerase chain reaction than in the current protocol. We thus expect that the PBAT-mediated TMS will serve as an invaluable method in epigenomics.


Introduction
Methylation occurring at the position 5 of cytosine residues in DNA is an epigenetic modification critically involved in the regulation of eukaryotic genomes. Accordingly, the genome-wide distribution of 5-methylcytosine residues, or the methylome, has been attracting intense attention from a wide audience in a variety of research disciplines. The recent advent of next-generation sequencing has revolutionized the way of interrogating the methylome: it has realized whole-genome bisulfite sequencing (WGBS) or genome-wide methylation analysis at single-base resolution. 1,2 The power of WGBS has been well demonstrated by many findings that would never have been achieved with other technologies. Although WGBS represents the gold standard for methylome analysis and is rapidly becoming the method of choice, its cost has remained substantial, thereby preventing it from being widely used for multi-sample comparison of large methylomes including those of mammals.
The most popular alternative to WGBS is reduced-representation bisulfite sequencing (RRBS), which efficiently enriches CpG-rich regions through restriction enzyme digestion to reduce the cost of sequencing, while maintaining deep coverage of a subset of CpG sites. 3 Notably, RRBS can be applied to a minute amount of input DNA. 4 However, it cannot be used to examine particular regions of interest unless they are adequately flanked by the restriction enzyme sites. In this context, the shotgun bisulfite sequencing of subgenomic regions enriched using solution hybridization capture technology, referred to hereafter as targeted methylome sequencing (TMS), is ideal, because it can in principle target any unique genomic region. [5][6][7] However, all of the TMS protocols reported thus far require not only a considerable amount of input DNA (i.e. 3 µg or more), but also a large number of polymerase chain reaction (PCR) cycles (i.e. 10-20 cycles), making it difficult to apply TMS to samples of limited quantity.
We recently developed a highly efficient protocol for WGBS library construction termed post-bisulfite adaptor tagging (PBAT). 8 Although it is well known that bisulfite treatment destroys DNA, all of the conventional ligation-based WGBS protocols as well as the tagmentation-based one 9 treat adaptor-tagged library DNAs with bisulfite, inevitably leading to a considerable loss of intact sequencing template molecules. To circumvent this bisulfite-induced loss, we proposed the PBAT strategy in which adaptor tagging is performed after bisulfite treatment. This simple trick allowed us to prepare a PCR-free WGBS library from as little as 125 pg of input DNA. 8 We routinely achieve a PCR-free, 30-fold coverage of mammalian methylomes from ∼30 ng of input DNA. Indeed, PBAT has been applied to mouse WGBS from 400 to 1,000 germinal vesicle-stage oocytes 10,11 and a few thousand primordial germ cells, 12 notably without any global PCR amplification. More recently, it has even been applied to single-cell genome-wide bisulfite sequencing with the aid of PCR. 13 As all of the current TMS protocols include bisulfite treatment of adaptor-tagged library DNA, we reasoned that PBAT can improve the efficiency of TMS to develop a low-input protocol (Fig. 1). We have indeed succeeded in the development of a highly efficient TMS method applicable to samples of limited quantity.

Preparation of DNA
Both human and mouse genomic DNAs used in the model experiments were purchased from Promega. Genomic DNA from IMR90 primary human lung fibroblasts was a generous gift from Yae Kanai. The indicated amount of genomic DNA was dissolved in 130 µl of 10 mM Tris-HCl ( pH 8.0) and sheared with Covaris S220 to the indicated size. We used AMPure XP to purify the fragmented DNA as follows. First, the shared DNA (130 µl) was mixed with 1.8× volume (234 µl) of the AMPure XP reagent and stood for 15 min at room temperature. Next, the beads were collected using a magnet stand, and the supernatant was removed. The pelleted beads were then rinsed with 70% ethanol and dried by standing at 37°C for 5 min. Finally, DNA was eluted from the beads to 20 µl of RNase-free water. The eluted DNA solution was dried in a vacuum concentrator and dissolved in 7 µl of RNase-free water.

Target enrichment
Enrichment of targets with liquid-phase hybridization capture was performed using the reagents in SureSelect Human or Mouse Methyl-Seq kit (Agilent). Genomic DNA (7 µl) fragmented and purified as above was supplemented with 3 µl of formamide (Wako, Biochemistry grade) and overlaid with 80 µl of mineral oil (Sigma). The DNA was completely denatured by incubating the tube at 99°C for 10 min, cooled down to 65°C and kept at 65°C for at least 5 min before adding the following reagents. Hybridization buffer was prepared by mixing 7.5, 0.3, 3.0 and 3.0 µl of Hyb#1, #2, #3 and #4, respectively. Capture probe mix was prepared by mixing 5.0, 0.5 and 1.0 µl of capture probe solution, RNase Inhibitor and RNase-free water, respectively. The hybridization buffer and the capture probe mix were individually overlaid with 80 µl of mineral oil and incubated at 65°C for 10 min. These two solutions were then combined and mixed thoroughly by pipetting. The combined solution was transferred to the tube containing the denatured input DNA kept at 65°C as above and mixed thoroughly with the DNA solution by pipetting. The tube was incubated at 65°C for at least 24 h to allow hybridization between the probes and the targets. Fifty microlitres of well-suspended solution of DynaBeads MyOne Streptavidin T1 (Life Technologies) was taken into a 1.5-ml tube, and the beads were washed 2 times with 200 µl of Binding Buffer. To the pelleted beads, the hybridization reaction supplemented with 200 µl of Binding Buffer was added and mixed well. After incubation with rocking at room temperature for 30 min, the beads were collected using a magnetic stand and washed with 500 µl of Wash Buffer 1. The beads were then subjected to three rounds of washing, each composed of resuspension in pre-warmed Buffer 2 followed by incubation at 65°C for 10 min. Following thorough removal of washing solutions from the tube, enriched DNA was eluted by incubating the beads in 20 µl of Elution solution at room temperature for 20 min. The eluate was immediately used for bisulfite treatment.

Bisulfite treatment
EZ DNA Methylation-Gold kit (Zymo Research) was used for the bisulfite treatment of target-enriched DNA, according to manufacturer's instruction. The enriched DNA solution (20 µl) was directly mixed with 130 µl of CT conversion reagent freshly prepared before use. The mixture was incubated at 64°C for 2.5 h. Note that the incubation step at 98°C for 10 min described in the instruction was omitted, because the target-enriched DNA was already denatured. Following the purification and desulfonation steps, bisulfite-treated DNA was eluted with 20 µl of M-Elution buffer.

PBAT library construction and Illumina sequencing
We used the bisulfite-treated DNA for library preparation according to the PBAT protocol 14 (also available from http://crest-ihec.jp/english/ epigenome/index.html), except for the primers used in the firstand second-strand synthesis. The primer used for the first-strand synthesis was 5′-ACA CTC TTT CCC TAC ACG ACG CTC TTC CGA TCT WWW WNN NN-3′ (W = A or T). The indexed primers used for the second-strand synthesis was 5′-CAA GCA GAA GAC GGC ATA CGA GAT XXX XXX GTA AAA CGA CGG CCA GCA GGA AAC AGC TAT GAC WWW WNN NN-3′, in which XXX XXX stands for the index sequence of each primer. We sequenced the constructed TMS libraries using Illumina HiSeq2500 as described previously. 14

Data analysis
The obtained reads were mapped to human and mouse DNA using the hg19 and mm9 assemblies, respectively, analysed and visualized as described previously. 8

PBAT to target-enriched DNA
To test whether PBAT is applicable to TMS library construction, we applied it to target-enriched DNA prepared using the RNA probes provided in the Agilent SureSelect Mouse Methyl-Seq kit, which are designed to cover mouse genomic regions spanning 109 Mb in total. As PBAT employs random primer extension, it is ideal that the primer hybridizes to the 3′-flanking region of each target region in the input DNA and initiates the synthesis of a complementary DNA strand to fully span the target region. We therefore assumed the size of the input DNA to be critical and first examined its effects on the yield of the library. Starting from 3 µg of input DNA, as recommended by the manufacturer, we prepared DNA fragments with an average length of 180 bp, 400 bp or 2 kb (Supplementary Fig. S1A). From these DNA preparations, we enriched the targets by solution  hybridization capture and successfully generated PCR-free single-read PBAT libraries, each of which was sufficient for one or more lanes in the HiSeq2500 system ( Supplementary Fig. S1B). Note that the original Methyl-Seq protocol requires a total of 14 cycles of global PCR amplification when starting from the same amount of input DNA (i.e. 3 µg). As expected, the raw yield of library increased with the size of input DNA (Supplementary Fig. S1B). However, the normalized yield by the size of input DNA indicated that the 400-bp DNA fragments were most efficient. In addition, the library generated from the 2-kb DNA fragments contained a substantial number of offtarget reads that were mapped near to but not onto the targets (Supplementary Fig. S1C). We therefore decided to use input DNA fragmented to a size of 400 bp for further development.

Novel primers for improved yield of indexed TMS libraries
It is highly likely that the users of TMS intend to sequence two or more indexed libraries in a single lane for efficient data collection. Thus, we next attempted to improve the efficiency of the PBAT paired-end/indexed read protocol, because it is 2-fold less sensitive than the singleread protocol. 14 While the original PBAT protocol uses a random tetramer sequence (N 4 ) attached to the 3′-end of primers containing Illumina adaptor sequences, 8 we found that novel primers containing a semi-random tetramer composed solely of A or T bases immediately upstream of the random tetramer (W 4 N 4 ; W = A or T) improved the yield of TMS library by ∼4-fold, with a marginal effect on the GC bias in target coverage ( Supplementary Fig. S2). We also found that the addition of formamide to the hybridization capture reaction at a final concentration of 10% improved the reproducibility of the target enrichment step (data not shown). These modifications were critical to perform highly sensitive and robust construction of TMS libraries.

Characterization of low-input PBAT-mediated TMS libraries
Using the protocol optimized as above, we prepared indexed TMS libraries from 3,000 to 10 ng of human and mouse DNA using the human and mouse RNA probe sets obtained from Agilent, respectively ( Table 1). The yield of library DNA was nearly linearly correlated with the amount of input DNA (Table 1; Supplementary Fig. S3A). The smaller the amount of input DNA, the more evident the adaptor dimers were (Supplementary Fig. S3B). We performed the minimum cycles of PCR amplification to obtain sufficient DNA for a single  (Table 1). (B) Consistency among TMS data. Methylation levels were compared among the six TMS libraries generated from 3,000 to 10 ng of human genomic DNA using the PBAT-mediated procedure as well as the one generated from 3,000 ng of input DNA using the original Methyl-Seq protocol (DRA002274-002280). The numbers and the images in the boxes above and below the diagonal indicated the coefficients of determination (R 2 ) and the scatter plot of methylation levels, respectively, between all the possible combinations among the seven data sets. The moving averages of methylation levels (window size, 500 bp; step size, 250 bp) were calculated based on CpGs covered by 20 or more reads. (C) A snapshot of TMS data. Data around the imprinted control region (ICR) for PEG3 were compared among the seven libraries generated with either PBAT or Methyl-Seq using the indicated amount of input DNA. Red bars and grey shadows indicated the methylation levels of individual CpG sites and the depth of reads, respectively. Note that most reads were mapped to the bottom strand, as the RNA probes used in the experiment were designed from the top strand. As expected, a region around the PEG3 promoter showed ∼50% methylation level due to the imprinted monoallelic methylation. The green dashed box denoted the ICR of PEG3 (chr.19: 57,351,728 to 57,352,173 in hg19 human reference genome sequence).
HiSeq2500 lane (Table 1). We then used a half lane to sequence each of the human libraries and mapped the obtained reads to the reference human genome sequence. (Note that we combined the library DNA with the same amount of the PhiX control to compensate the extreme base bias of bisulfite-converted sequences.) The mapping rate was inversely correlated with the amount of input DNA within the range of 3,000 to 30 ng, but it showed a prominent decline when the DNA input was reduced to 10 ng (Table 1). In addition, all of the libraries except the one generated from 10 ng of DNA showed similar statistics regarding the coverage of the targets (Fig. 2A). The methylation levels of CpG sites covered by 20 or more reads were highly consistent among these five libraries (R 2 > 0.95, when the window size and the step were 500 and 250 bp, respectively) (Fig. 2B). We inspected the methylation status of imprinted genes and found that the expected 50% methylation level was faithfully recapitulated, even in the library generated from 30 ng of DNA, but not in the library obtained from 10 ng of DNA (Fig. 2C). These results suggest that the PBAT-mediated TMS method is reliably applicable to as little as 30 ng of input DNA. We also performed Methyl-Seq from the same human genomic DNA according to the protocol provided by the manufacturer, except that fewer PCR cycles were used (i.e. 11 cycles instead of 14 cycles), and compared the results with those of PBAT-mediated TMS. While the two data sets showed largely consistent methylation levels (R 2 = 0.88) (Fig. 2B), they were distinct in terms of the GC content of the covered targets: Methyl-Seq and PBAT-mediated TMS preferentially covered AT-rich and GC-rich targets, respectively ( Supplementary  Fig. S4). This was presumably because the extensive global PCR step in Methyl-Seq failed to amplify GC-rich regions and because the random priming from the PBAT adaptor primers was rather compromised in AT-rich regions. We also note that the Methyl-Seq data recapitulated the methylation status of imprinted genes less precisely than the PBAT data (Fig. 2C).
Having confirmed the high efficiency and consistency of PBATmediated TMS, we next compared it with WGBS. We performed TMS using 300 ng of genomic DNA from human IMR90 cells and compared the resulting data with PBAT-mediated WGBS data from the same cell line. The methylation levels of CpG sites covered by 20 or more reads showed an excellent correlation between the two data sets in either the window-based or the nucleotide-based analysis (Fig. 3). These results indicated that TMS can be a reliable alternative to WGBS.

Discussion
WGBS has become the gold standard method in methylomics for its unsurpassed resolution and coverage. However, it is too expensive to be used for analysing multiple samples. Efforts have thus been paid to develop cost-effective alternatives to WGBS, including various TMS methods based on padlock probes, array capture and solution hybridization capture. 7 Among these approaches, the last one is highly flexible, but current protocols require a considerable amount of input DNA and are not applicable to samples of limited quantity. 5,6 Furthermore, they include extensive global PCR amplification, which may increase the risk of inaccurate estimate of methylation levels, especially when the quantity of input DNA is limited. Therefore, a novel TMS protocol is desirable that requires a much smaller amount of input DNA with far fewer cycles of global PCR amplification than current ones. As the inefficiency of current protocols is likely attributable to bisulfite-induced degradation of target-enriched library DNA (Fig. 1A), we reasoned that the PBAT strategy can circumvent the adverse effect of bisulfite treatment to improve the efficiency of TMS (Fig. 1B).
We tested the possibility using the biotinylated RNA probes in the Agilent Methyl-Seq kit designed to cover CpG islands with their shores and shelves, enhancers, promoters, differentially methylated regions and other regulatory elements. Our results demonstrated that PBAT enables TMS from a 100-fold smaller amount of input DNA (i.e. 30 versus 3,000 ng) with considerably fewer cycles of PCR amplification (i.e. 4 versus 11-14 cycles) than the original protocol provided by the manufacturer (Table 1; Fig. 2). It also enables PCR-free TMS, provided that 1 µg of input DNA is available. Furthermore, the coverage statistics and the methylation levels of imprinted genes indicated that the PBATmediated protocol has superior coverage and accuracy than the original one (Fig. 2). We also confirmed high consistency between PBAT-mediated TMS and WGBS (Fig. 3), proving that the former can serve as a reliable surrogate for the latter. Although further improvements are necessary to generate high-quality TMS libraries from <30 ng of input DNA and to achieve more even coverage of the targets regardless of their GC contents, the PBAT-mediated protocol significantly outperforms the original Methyl-Seq protocol. It will be also useful for base-resolution analysis of 5-hydroxymethylcytosine when applied to genomic DNA treated with adequate oxidants or enzymes. 15,16 We also evaluated the cost performance of PBAT-mediated TMS in comparison with those of PBAT-mediated WGBS and RRBS. The . Consistency between TMS and WGBS data. Methylation levels were compared between the TMS data (DRA002281) ( Table 1) and a publicly available WGBS data (DRA002248) on human IMR90 cells using the CpG sites covered by 20 or more reads. Methylation levels were plotted for moving windows (window size, 500 bp; stepping size, 250 bp) (A) and for individual CpG sites (B).
TMS is ∼10 times less expensive than the WGBS, when the former and the latter use the high output mode of HiSeq to achieve 40-fold coverage of the target regions and the whole genome, respectively. Note that, the deeper the coverage is, the more cost effective the TMS is. While the target regions in the TMS cover only ∼2.8% of the human genome, they include approximately one-seventh of the total CpG sites. On the other hand, the TMS costs 4-5 times more but covers only ∼1.5 times more CpG sites than RRBS does. Nevertheless, the TMS likely remains highly competitive to RRBS in many instances, since the former covers a much broader range of genomic elements than the latter.
Taken together, PBAT significantly enhances the utility of TMS to enable various novel applications, especially those analysing a large number of precious samples. The PBAT-mediated TMS will thus serve as an invaluable tool for epigenomics.

Availability
The TMS data sets from this study have been submitted to the DDBJ Sequence Read Archive under the accession numbers DRA002274-002281.