Abstract

Motivation

The rapid development of next-generation sequencing technology provides an opportunity to study genome-wide DNA methylation at single-base resolution. However, depletion of unmethylated cytosines brings challenges for aligning bisulfite-converted sequencing reads to a large reference. Software tools for aligning methylation reads have not yet been comprehensively evaluated, especially for the widely used reduced representation bisulfite sequencing (RRBS) that involves enrichment for CpG islands (CGIs).

Results

We specially developed a simulator, RRBSsim, for benchmarking analysis of RRBS data. We performed extensive comparison of seven mapping algorithms for methylation analysis in both real and simulated RRBS data. Eighteen lung tumors and matched adjacent tissues were sequenced by the RRBS protocols. Our empirical evaluation found that methylation results were less consistent between software tools for CpG sites with low sequencing depth, medium methylation level, on CGI shores or gene body. These observations were further confirmed by simulations that indicated software tools generally had lower recall of detecting these vulnerable CpG sites and lower precision of estimating methylation levels in these CpG sites. Among the software tools tested, bwa-meth and BS-Seeker2 (bowtie2) are currently our preferred aligners for RRBS data in terms of recall, precision and speed. Existing aligners cannot efficiently handle moderately methylated CpG sites and those CpG sites on CGI shores or gene body. Interpretation of methylation results from these vulnerable CpG sites should be treated with caution. Our study reveals several important features inherent in methylation data, and RRBSsim provides guidance to advance sequence-based methylation data analysis and methodological development.

Availability and implementation

RRBSsim is a simulator for benchmarking analysis of RRBS data and its source code is available at https://github.com/xwBio/RRBSsim or https://github.com/xwBio/Docker-RRBSsim.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

DNA methylation is involved in various biological processes including transcriptional activity, genomic imprinting, development and many diseases including cancer (Bock, 2012; Feinberg et al., 2016; Jones, 2012; Schubeler, 2015). Bisulfite treatment with next-generation sequencing (BS-Seq) has been a popular and powerful method to profile genome-wide DNA methylation patterns at single-base resolution (Cokus et al., 2008). Notably, an enriched approach called reduced representation bisulfite sequencing (RRBS) is becoming increasingly common for studying genome-wide DNA methylation due to its significantly reduced sequencing cost and high coverage of gene promoters (Gu et al., 2010; Meissner et al., 2005; Plongthongkum et al., 2014; Yin et al., 2016). Moreover, the extremely low input requirement and the applicability of the protocol to formalin-fixed and paraffin-embedded samples brings RRBS to a wide range of biological and clinical samples and research applications (Orozco et al., 2015). This technique uses restriction endonuclease such as MspI to digest DNA (5′-CCGG-3′) and enriches CpG-rich DNA fragments for sequencing. As a consequence of treatment with bisulfite, most of the unmethylated cytosines (Cs) in the sequencing reads become thymines (Ts) but methylated cytosines remain unchanged.

The depletion of unmethylated cytosines brings a challenge for aligning bisulfite-converted sequencing reads to a large reference genome and standard short-read aligners are not suitable for bisulfite-sequencing reads (Laird, 2010). Therefore, two types of algorithms, wildcard and three-letter, have been put forward for bisulfite-sequencing read alignment. The wildcard aligners (e.g. BSMAP, GSNAP) regard Cs in the reference genome as wildcards and can match either Cs or Ts in the bisulfite sequencing reads. Three-letter aligners (e.g. BS-Seeker2, Bismark) convert all Cs into Ts on both strands of reference genome and the bisulfite-converted reads, so the mapping is carried out on a three-letter alphabet (A, G and T). The methylation level of a CpG site (CpGs) is calculated based on methylated and unmethylated reads mapped to the reference genome. Therefore, precise reads mapping is essential, otherwise it would give rise to biased estimation of DNA methylation levels and subsequent false-positive differential methylated regions (DMRs) and incorrect biological interpretation (Bock et al., 2010). From this point of view, it is very necessary to assess the accuracy of methylation calls for different mapping algorithms.

Comparisons of several bisulfite-sequencing alignments were performed in 2012 (Chatterjee et al., 2012) and 2014 (Kunde-Ramamoorthy et al., 2014) using a real sequencing dataset, but software versions have been considerably upgraded and novel alignment methods have been developed since then. Due to the lack of ground truth for the methylation profile of real bisulfite-sequencing data, it is difficult to determine which software tools perform best and which, if any, software tools perform adequately in previous studies. Simulating data for benchmarking alignment methods is more powerful and convenient. However, benchmarking analysis of bisulfite-sequencing alignment algorithms using simulated data is rarely seen. Moreover, several key questions can potentially be answered using simulated data, which have not yet been assessed in previous studies. For instance, how probable is it that a CpG site will be detected when it is sequenced (termed recall)? How precisely can the methylation level at a CpG site be estimated (termed precision)? What factors (e.g. sequence depth, genomic feature and methylation level) potentially influence the precision and recall of detecting methylation? Hence, comprehensive evaluation and benchmarking of bisulfite-sequencing data using simulated data is needed.

To evaluate and benchmark RRBS alignment algorithms, we specially developed a RRBS simulator, RRBSsim. RRBSsim can produce paired-end bisulfite-sequencing reads that contain significant attributes of real data. It can take a reference genome or user-defined DNA sequence and generate simulated reads with customized length, polymorphisms and sequence errors. RRBSsim was integrated with sequence context-dependent error models, which is reflective of Illumina reads. As far as we know, RRBSsim is the first simulator software that can satisfy the goals of our benchmarking analysis. However, there are a few simulator software programs available for whole genome bisulfite sequencing (WGBS), e.g. WGBSsuite, but they were not specifically designed for RRBS data and do not capture characteristics of real sequencing data (Rackham et al., 2015). Though Michelle et al. reported a simulator for RRBS, they only generated methylation values at a single CpG site through statistical models rather than simulated sequencing reads (Lacey et al., 2013). Hence, the tool falls short of the complexity of real RRBS data and cannot be used for assessing the performance of RRBS aligners. Under the help of RRBSsim, it is possible to generate simulated RRBS data that is used for meaningful benchmarking analysis of alignment software tools.

Here, we present and discuss five commonly used mapping algorithms for benchmarking analysis in RRBS data, including Bismark (Krueger and Andrews, 2011), BS-Seeker2 (Guo et al., 2013), BSMAP (Xi and Li, 2009; Xi et al., 2012), GSNAP (Wu and Nacu, 2010) and bwa-meth (Pedersen et al., 2014) (Table 1). They were systematically assessed in a large RRBS real dataset and simulated dataset generated by RRBSsim under a wide range of scenarios. To assess the potential effects of short read alignment on the methylation profiling, two aligners, bowtie (BS-Seeker2-bowtie) (Langmead et al., 2009) and bowtie2 (Langmead and Salzberg, 2012) with local (BS-Seeker2-local) and end-to-end (BS-Seeker2-e2e) alignment modes, were further compared in BS-Seeker2 for analyzing RRBS data.

Table 1.

Brief description of different software tools for methylation analysis

BismarkBS-Seeker2bwa-methBSMAPGSNAP
Mapping strategyThree-letterThree-letterThree-letterWildcardWildcard
AlignerBowtie, bowtie2Bowtie, bowtie2, soapBwaSoapGsnap
WGBS/RRBSWGBS/RRBSWGBS/RRBSWGBS/RRBSWGBS/RRBSWGBS/RRBS
Adapter trimmingNoYesNoYesYes
Multi-coresYesYesYesYesYes
Directional/undirectionalYes/YesYes/YesYes/NoYes/YesYes/Yes
Single-end/paired-endYes/YesYes/YesNo/YesYes/YesYes/Yes
Programming languagePerlPythonPythonC++C and Perl
BismarkBS-Seeker2bwa-methBSMAPGSNAP
Mapping strategyThree-letterThree-letterThree-letterWildcardWildcard
AlignerBowtie, bowtie2Bowtie, bowtie2, soapBwaSoapGsnap
WGBS/RRBSWGBS/RRBSWGBS/RRBSWGBS/RRBSWGBS/RRBSWGBS/RRBS
Adapter trimmingNoYesNoYesYes
Multi-coresYesYesYesYesYes
Directional/undirectionalYes/YesYes/YesYes/NoYes/YesYes/Yes
Single-end/paired-endYes/YesYes/YesNo/YesYes/YesYes/Yes
Programming languagePerlPythonPythonC++C and Perl
Table 1.

Brief description of different software tools for methylation analysis

BismarkBS-Seeker2bwa-methBSMAPGSNAP
Mapping strategyThree-letterThree-letterThree-letterWildcardWildcard
AlignerBowtie, bowtie2Bowtie, bowtie2, soapBwaSoapGsnap
WGBS/RRBSWGBS/RRBSWGBS/RRBSWGBS/RRBSWGBS/RRBSWGBS/RRBS
Adapter trimmingNoYesNoYesYes
Multi-coresYesYesYesYesYes
Directional/undirectionalYes/YesYes/YesYes/NoYes/YesYes/Yes
Single-end/paired-endYes/YesYes/YesNo/YesYes/YesYes/Yes
Programming languagePerlPythonPythonC++C and Perl
BismarkBS-Seeker2bwa-methBSMAPGSNAP
Mapping strategyThree-letterThree-letterThree-letterWildcardWildcard
AlignerBowtie, bowtie2Bowtie, bowtie2, soapBwaSoapGsnap
WGBS/RRBSWGBS/RRBSWGBS/RRBSWGBS/RRBSWGBS/RRBSWGBS/RRBS
Adapter trimmingNoYesNoYesYes
Multi-coresYesYesYesYesYes
Directional/undirectionalYes/YesYes/YesYes/NoYes/YesYes/Yes
Single-end/paired-endYes/YesYes/YesNo/YesYes/YesYes/Yes
Programming languagePerlPythonPythonC++C and Perl

In addition, we examined the running time, differential behaviors at predefined genomic regions (e.g. CGI, CGI shore, promoter and gene body), the effect of read depth and other factors. Our study highlights differences in methylation profiling resulting from different software tools for analyzing RRBS data. We reveal several important features that influence the precision and recall of genome-wide methylation analysis and provide guidance to advance sequence-based methylation experimental design and data analysis for molecular biologists.

2 Materials and methods

2.1 Real RRBS data

Our real RRBS data were generated from 18 tumors and matched normal tissues from non-small cell lung cancer (NSCLC) patients. These 36 samples were treated with the RRBS library preparation protocols (Gu et al., 2010; Meissner et al., 2005). The RRBS libraries were pooled and sequenced on the Illumina HiSeq 2000 platform (100 bp, paired-end). Each library yielded, on average, 30 million reads after removing adapters and sequences with low quality (quality score <20) (Supplementary Table S1). The study protocol was approved by the institutional review board of Sir Run Run Shaw Hospital at Zhejiang University (Hangzhou, China). All samples used in the current study were obtained at the time of diagnosis before any treatment was administered. All raw RRBS sequencing data generated by this study have been submitted to SRA (SRP125064).

2.2 Analyzing real RRBS data

Before the analysis, the adapter sequences were first removed from the output reads from Illumina HiSeq 2000; sequences with low quality (base quality < 20) at both ends of reads were further trimmed. The trimmed reads were input in five commonly used software tools, Bismark (v0.14.1) (Krueger and Andrews, 2011), BS-Seeker2 (v2.0.9) (Guo et al., 2013), BSMAP (v2.89) (Xi et al., 2012; Xi and Li, 2009), GSNAP (v2016-11-07) (Wu and Nacu, 2010) and bwa-meth (v0.10) (Pedersen et al., 2014) (Table 1). These methylation sequence reads were mapped to the human reference genome (hg19) downloaded from the UCSC Genome Browser (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/). To assess potential effects of short read alignment on methylation profiling, two different aligners, bowtie (1.1.1) (Langmead et al., 2009) (BS-Seeker2-bowtie), bowtie2 (2.2.4) (Langmead and Salzberg, 2012) with local (BS-Seeker2-local) and end-to-end (BS-Seeker2-e2e) alignment modes, were used in BS-Seeker2 for analyzing the real RRBS data. In the analysis, we attempted to make all software tool parameter settings as similar as possible. Their detailed parameter settings were documented in the Supplementary Material.

On the basis of coverage, CpG sites in 36 RRBS libraries were grouped into three categories: lower than 10 reads coverage (<10×), 10 to 30 reads coverage (10×–30×) and higher than 30 reads coverage (>30×). According to methylation levels, they also were classified into three categories: methylation levels lower than 1/3 (<1/3), methylation levels between 1/3 and 2/3 (1/3–2/3) and methylation levels more than 2/3 (>2/3). The presented results were averaged across 36 RRBS datasets.

We divided genomic regions of interest into five genomic features including Transcription start site (TSS), promoter, gene body, CGI and CGI shore. Methylation patterns in these genomic features were characterized by different methylation analysis tools. TSS regions were defined as 1 kb upstream and downstream of transcript start site of a transcript (UCSC gene annotation, hg19). Predicted positions of promoter regions were obtained from the UCSC Genome Browser. The gene body was defined as a region between the start and end of a transcript. The genomic positions of CpG islands for hg19 were downloaded from the UCSC Genome Browser. CpG shores were defined as ±2 kb flanking the islands. Methylation levels of CpG sites were calculated as the number of unconverted cytosines (C) in CpG context, divided by total number of cytosines including converted cytosines (T) in CpG context (as C/(C+T)) in each RRBS library.

2.3 Generating simulation data using RRBSsim

We employed our recently developed RRBSsim tool to simulate realistic RRBS sequences on the Illumina platform. The RRBSsim simulator provides a user interface for choosing sequence depth, number of CpG sites, methylation level and MspI fragment size and its detailed workflow is depicted in the section of Results. We first created a file in the BED format for each type of genomic feature (TSS, promoter, gene body, CGI and CGI shore) based on the human reference genome (hg19). To comprehensively evaluate the performance of different software tools, a wide range of scenarios were simulated to generate RRBS sequence data. These scenarios varied among genomic features, sequence depths and methylation levels, and their combinations. Specifically, all of five genomic features were simulated to generate realistic RRBS sequences. We randomly selected 500 regions from each of the BED files of genomic features as simulations are time-consuming. The average size of each genomic feature is about 3.2 × 107 bp. The sequence depth was set as 10× (low sequence depth), 30× (medium sequence depth) and 50× (high sequence depth). The methylation level at CpG sites was set as a mean of 0.2 (low methylation level), 0.45 (medium methylation level) and 0.675 (high methylation level). A total of 100 simulations were replicated in each scenario, and results were averaged over 100 simulations.

The output files of RRBSsim are FASTQ files, which have the same format as those from real RRBS data. Identical to the real RRBS data analysis, the simulated sequence reads were trimmed and input to software tools for methylation analyses. Two metrics were used to assess the performance of bisulfite-seq mapping tools: recall and precision, which has been applied to comprehensive comparison of RNA-seq aligners recently (Baruzzo et al., 2017). In our study, recall measures the proportion of all CpG sites that were detected correctly, and precision measures the proportion of all detected CpG sites that were estimated correctly. Runtime is also a benchmark to measure the performance of bisulfite tools. Different from previous studies (Chatterjee et al., 2012; Kunde-Ramamoorthy et al., 2014), we divided running time into two parts: reads-mapping time and CpGs calling time.

3 Results

3.1 Real data analysis

As a pilot study, our RRBS datasets of 18 lung tumors and matched adjacent normal lung tissues were analyzed using the above-mentioned mapping software tools (Table 1). These RRBS libraries were pooled and sequenced on the Illumina HiSeq 2000 platform (100 bp, paired-end). Each library yielded, on average, 30 million reads after removing adapters and sequences with low quality (quality score < 20) (Supplementary Table S1).

First, we took an intuitive look at methylation levels of PAX7 gene [a member of the paired box (PAX) family], which is frequently inactivated in most of human cancers and promotes tumor growth (He et al., 2013) (Fig. 1). The aligners showed distinct methylation profiles in the promoter and other regions of PAX7. There are generally two types of discrepancies among their resulting methylation profiles: number of detected CpG sites and the methylation level estimated at CpG sites. For example, a few CpG sites were detected by some software tools but not by the others. Different aligners present substantial variability in estimation of methylation levels at some CpG sites.

DNA methylation profile of PAX7 revealed by different mapping algorithms. Visualization is made using the UCSC Genome Browser. These mappers resulted in differential methylation profiles for PAX7
Fig. 1.

DNA methylation profile of PAX7 revealed by different mapping algorithms. Visualization is made using the UCSC Genome Browser. These mappers resulted in differential methylation profiles for PAX7

Next, we had a global view of methylation profiles of our real RRBS data generated by the above-mentioned software tools. The proportion of CpG sites with at least 10 reads detected by any of two tools ranges from 78.3 to 97.2% with a mean of 89.2% (Fig. 2A). In particular, BS-Seeker2 using the bowtie aligner (i.e. BS-Seeker2-bowtie) detected a very distinct set of CpG sites from the other software tools. Less than 90.0% of the CpG sites obtained by BS-Seeker2-bowtie overlapped with any others. As compared with BS-Seeker2-bowtie, BS-Seeker2 using the bowtie2 (i.e. BS-Seeker2-local and BS-Seeker2-e2e) aligner often yielded results that were more consistent (about average 10% improvement) with the other aligners. This indicated that a short-read alignment algorithm had a significant impact on the number of recalled CpG sites and subsequent downstream analyses.

Comparison of methylation results from different software tools in real RRBS data. (A) Overlaps of identified CpG sites among different mapping aligners. (B) Distribution of concordant and discordant CpG sites under different sequencing depths. (C) Distribution of concordant CpG sites with different levels of DNA methylation. (D) Distribution of concordant and discordant CpG sites from different genomic regions (CGI, CGI shore, gene body, TSS and promoter). In concordant CpG sites, the differences in the methylation levels of the same CpG sites estimated from any two of these aligners are less than 0.05, and in discordant CpG sites, the differences are more than 0.05
Fig. 2.

Comparison of methylation results from different software tools in real RRBS data. (A) Overlaps of identified CpG sites among different mapping aligners. (B) Distribution of concordant and discordant CpG sites under different sequencing depths. (C) Distribution of concordant CpG sites with different levels of DNA methylation. (D) Distribution of concordant and discordant CpG sites from different genomic regions (CGI, CGI shore, gene body, TSS and promoter). In concordant CpG sites, the differences in the methylation levels of the same CpG sites estimated from any two of these aligners are less than 0.05, and in discordant CpG sites, the differences are more than 0.05

To investigate the characterization of CpG sites where different bisulfite mapping tools exhibited significant discrepancies in detecting and estimating methylation level, we defined two groups of CpG sites based on strict consensus rule, such as concordant and discordant CpG sites. The concordant CpG sites referred to those CpGs that were detected by all software tools and that the range of their estimated methylation level among different aligners was smaller than 0.05. The remaining CpGs were termed as discordant CpG sites. The selected threshold was often used to identify differentially methylated CpGs in epigenetic study of cancer and other investigation. If differences were more than the threshold, these CpGs were usually regarded as functional importance in the context of cancer and other diseases. It is reasonably conjectured that methylation values of concordant CpGs are more likely accurately estimated than discordant CpGs. Whereas, biological interpretation of the discordant CpGs should be treated with caution as their methylation levels tend to be biased estimated. Moreover, we also chose a less strict threshold of 0.1 to define concordant and discordant CpG sites. Both of them yielded very consistent results.

First, it was observed that the higher the sequencing depth of CpG sites, the more concordantly they were detected and estimated by different bisulfite aligners (Fig. 2B, Supplementary Fig. S1A). About 67.8% of all CpG sites with high (>30×) sequencing depth were concordantly identified and estimated. As for CpG sites with medium (10×–30×) sequencing depth, the concordant rate of detected CpG sites is 53.9%. In CpG sites with low (<10×) sequencing depth, only 33.3% of their detection and estimation were concordant. These findings suggested that sequencing depth of CpG sites affected CpGs calling and estimation of their methylation levels. Discordant CpGs were characterized by low coverage of depth, whereas concordant CpGs characterized by high coverage of depth. Furthermore, we examined the distribution of methylation level across these three categories of depth and found no significant differences among them, showing independent of methylation level of CpGs (Supplementary Fig. S2A).

Second, it was found that CpGs with low methylation level (<1/3) showed the most concordant calling and estimation, whereas CpGs with intermediate methylation level (1/3–2/3) were the least (Fig. 2C, Supplementary Fig. S1B). About 49.7% of CpG sites with low methylation level were concordantly called and estimated (Fig. 2C, Supplementary Fig. S1B). Conversely, 9.8% of CpG sites with intermediate methylation level presented concordant detection and methylation estimation. The results of CpG sites with high methylation level (≥2/3) were modest and 32.0% of their detection and estimation were concordant. These results showed that methylation level of CpGs played a key role in their concordant calling and estimation by different aligners. Discordant CpGs were characteristic of intermediate methylation level, whereas concordant CpGs were characteristic of low and high methylation value. Additionally, we investigated the distribution of sequencing depth across these three categories of methylation level and found no significant differences among them, indicating independent of sequencing coverage of CpGs (Supplementary Fig. S2B).

Finally, it was revealed that CpG sites located on the CGI shore and gene body tended to be less concordantly identified and estimated. We grouped genomic regions into five categories based on their genomic features: CGI, CGI shore, gene body, TSS and promoter (Irizarry et al., 2009; Jones, 2012) (Fig. 2D, Supplementary Fig. S1C). We evaluated the distribution of concordant and discordant CpG sites in these five regions. On the CGI shore or gene body, the proportion of concordant CpG sites were 66.8 and 64.5%, respectively. On the contrary, the percentage of concordant CpG sites was more than 80% for CGI, promoter and TSS regions. These discoveries supported that location of CpGs impacted calling and estimation of CpGs. Discordant CpGs squinted toward CGI shore and gene body, whereas concordant CpGs toward CGI, TSS and promoter.

In the next section, we simulated RRBS data with different sequencing depths, global methylation levels and genomic region sequences to find the best software tools under different scenarios and verify whether software tools perform better or worse in certain scenarios.

3.2 Simulations

Due to the lack of ground truth for methylation profiles of real RRBS data, it is difficult to determine which software tools perform best and which, if any, software tools perform adequately (Baruzzo et al., 2017; Bock et al., 2010). Simulated RRBS data is a more powerful and convenient method to perform evaluation and benchmarking analysis of their performance. Therefore, we specifically developed a comprehensive RRBS simulator, RRBSsim, providing paired-end bisulfite-sequencing reads with significant attributes of real data. The workflow of RRBSsim is outlined in Figure 3 and described as follows: (i) Simulation of a reference genome with sequence variation: variations including single nucleotide variation (SNV), insertion and deletion and structural variation can be randomly integrated into a reference genome provided by users (Mu et al., 2015). In addition, a file with sequence variant information can also be supported. (ii) Simulation of a reference genome with methylated CpGs: the methylation level of CpG site can be fitted using a Beta distribution (Supplementary Fig. S3), thus we generate methylation values of CpGs in the genome using a Beta model. Methylation levels of CpGs within a read are generated to account for the strong dependence of methylation among CpG sites within a genomic region in real data. (iii) Generation of simulated RRBS reads: the above modified reference genome is digested by a restriction enzyme (e.g. MspI) in silico and the position of generated fragments is marked (Gu et al., 2010). Single-end or paired-end sequencing reads were then obtained from these fragments with their coverage fitted with a Poisson distribution. Both a directional and a nondirectional library can be simulated. (iv) Sequencing error profile: the quality score of the base is related to its position for Illumina reads and so we produce an empirical error profile from enough large training datasets that it can be used to generate simulated Illumina reads (Hu et al., 2012). The base quality profile of simulated Illumina reads is similar to that of the training datasets (Supplementary Fig. S4). We also provide Python scripts for users to obtain their own sequencing error profiles. Moreover, our simulator also can simulate non-CpG sites (e.g. CHG and CHH) and their default values were estimated from our real RRBS dataset of lung cancer. Users can simulate other levels of non-CpG sites through an option in the program (e.g. ‘–CHG_level or –CHH_level’).

The overall workflow of RRBSsim for RRBS reads simulation
Fig. 3.

The overall workflow of RRBSsim for RRBS reads simulation

The RRBSsim provides a user interface for choosing the sequence depth, number and methylation level of CpG sites and MspI fragment sizes. First, we compared distributions of the depth, length and base quality score of reads as well as the methylation profiles between simulated and real RRBS data. Our simulated RRBS data had nearly identical read depth to the real RRBS data in CGI, CGI shore, gene body, promoter and TSS regions (Fig. 4A). Both simulated and real RRBS data had roughly similar distribution of read length (Fig. 4B). The methylation level of CpG sites of simulated data followed a bimodal distribution identical to the real RRBS data (Fig. 4C). As expected, the per-base sequence quality score of simulated RRBS reads generally decreased over the read length and is similar to that of real datasets (Supplementary Fig. S4). These comparisons indicated that our software can recapitulate significant attributes of real RRBS data and the simulation is robust and efficacious. Therefore, the conclusions to be drawn from RRBSsim simulated data below will be valid and can be extrapolated to real RRBS experiments. Of course, user can define arbitrary depth and length that they need, which is not necessary the same as the real RRBS data. It should be noted that sequencing length of simulated reads below was 100 bp, commonly used in current sequencer.

Similarity between RRBSsim-generated and real RRBS data. Real and simulated RRBS data were compared in terms of (A) read depth, (B) read length and (C) methylation levels
Fig. 4.

Similarity between RRBSsim-generated and real RRBS data. Real and simulated RRBS data were compared in terms of (A) read depth, (B) read length and (C) methylation levels

We then compared seven mapping algorithms for benchmarking analysis in simulated RRBS data under a wide range of scenarios. First, we evaluated the effects of sequence depth and genomic regions on the performance of these software tools for methylation analysis (Fig. 5, Supplementary Fig. S5). In general, the three-letter algorithms are comparable with the wild-card competitors in their performances. All aligners yielded increasing recall of identifying CpG sites and increasing precision of estimating methylation levels at CpG sites as the sequence depth grew. Moreover, these tools had higher recall and precision of profiling methylation in the CGI, TSS and promoter than in the CGI shore and gene body. The results are consistent with those from our real RRBS data. In all scenarios, Bismark, BS-Seeker2-local, BS-Seeker2-e2e, bwa-meth and GSNAP generally work better than the others in terms of recall and precision of profiling methylation. BSMAP showed the highest recall but the lowest precision among all of the software tools. BS-Seeker2-bowtie had quite high precision of estimating methylation levels at their identified CpG sites; however, their recall of detecting CpG sites was very low. It was observed that the alignment algorithm only captured about 20% of CpG sites that we simulated. In other words, a large majority of CpG sites will be missed using BS-Seeker2-bowtie for methylation analysis.

Recall and precision of different mapping algorithms for simulated RRBS datasets under different sequencing depth and genomic features. Recall measures the proportion of all CpG sites that were detected correctly, and precision measures the proportion of all detected CpG sits that were estimated correctly
Fig. 5.

Recall and precision of different mapping algorithms for simulated RRBS datasets under different sequencing depth and genomic features. Recall measures the proportion of all CpG sites that were detected correctly, and precision measures the proportion of all detected CpG sits that were estimated correctly

Second, we evaluated the performance of these software tools under different methylation levels of various genomic regions using simulated RRBS data (Fig. 6, Supplementary Fig. S6). Generally, all these tools perform best in regions with low methylation levels, whereas they perform worst in regions with medium methylation levels; their performance in regions with high methylation levels is between the former two scenarios. The CGI shore and gene body are regions with lower recall of detecting CpG sites and precision of estimating methylation levels at CpG sites. Similarly, Bismark, BS-Seeker2-local, BS-Seeker2-e2e, bwa-meth and GSNAP perform better than the others without regard to different genomic features and methylation levels. BS-Seeker2-bowtie failed in detecting a large proportion of CpG sites, although they could accurately estimate methylation levels at detected CpG sites. Then, we investigate how the capabilities of these software tools for methylation analysis were affected by sequencing read length together with sequencing depth. Three types of read length (50, 100 and 150 bp) were assessed, which were commonly used in current sequencing platforms. For the majority of algorithms, the longer of sequencing read, the more of the number of CpG sites to be detected, whereas increasing read size has little effect on improving the precision of estimating methylation values (Fig. 7). It is consistent with previous results that BSMAP and BS-Seeker2-bowtie perform the worst in terms of power and precision of identifying CpG sites under different sequencing depth and read length. These findings suggest that the longer read length and higher sequencing depth are required for reliable methylation study without regard to sequencing costs.

Recall and precision of different mapping algorithms for simulated RRBS datasets under different levels of methylation and genomic features
Fig. 6.

Recall and precision of different mapping algorithms for simulated RRBS datasets under different levels of methylation and genomic features

Recall and precision of different mapping algorithms for simulated RRBS datasets under different sequencing read length together with sequencing depth
Fig. 7.

Recall and precision of different mapping algorithms for simulated RRBS datasets under different sequencing read length together with sequencing depth

Finally, the running time of different software tools was also compared for analyzing the same RRBS data (Supplementary Fig. S7). Different from previous studies (Bock et al., 2010; Kunde-Ramamoorthy et al., 2014), we divided running time into two sections, read-mapping time and CpG-site calling time. Generally, the three-letter algorithms are faster than the wild-card ones for short read mapping. As for the whole running time, BS-Seeker2, bwa-meth and BSMAP are much faster than GSNAP and Bismark. Our comparison study was performed on a computing cluster node with 64-bit duo quad core Intel E5-2690v2 and Intel Xeon E7-8837 CPU and with 128 GB RAM on Red Hat 4.4.7-3. Two threads were used to run all the software tools for the comparison study through an option in program (Supplementary Material method).

4 Discussion

Previous investigation of bisulfite mapping tools mainly focused on mapping efficiency and overall methylation percentage using real bisulfite-sequence data (Chatterjee et al., 2012; Kunde-Ramamoorthy et al., 2014). The alignment efficiency measures the proportion of bisulfite-converted reads that are mapped on the reference genome and is one aspect evaluating performance of software tools. However, the mapping efficiency does not explicitly imply how many CpG sites of interest are detected and how precisely methylation levels at CpG sites are estimated as well as total methylation in a bisulfite-sequencing experiment. Investigators are more interested in the latter two measurements (i.e. recall and precision) to assess bisulfite aligners. They are also eager to know the circumstances in which tools suffer from mapping failure in CpG sites and bias in estimating methylation levels. Nevertheless, these important issues cannot be exhaustedly assessed in a bisulfite-sequencing experiment because true methylation profiles of these data remain unknown.

Thus, we specifically developed the first comprehensive RRBS simulator (called RRBSsim) that can produce paired-end bisulfite-sequencing reads with significant attributes of real data to tackle these significant problems. Both RRBS alignment algorithms and statistical methods of identifying differential methylated regions (DMRs) can be effectively evaluated by the software (Klein and Hebestreit, 2016; Shafi et al., 2017). We generated artificial RRBS reads in a way that is similar to RRBS library construction and sequencing. Moreover, RRBSsim adopted the empirical sequencing error model obtained from training of real sequencing data, accounted for the dependence of methylation among adjacent CpG sites, and introduced sequence variation in the simulated reads. It has been shown that the simulation is robust and efficient from the analogous distribution of methylation level and read length and coverage, as well as quality values between simulated and real RRBS data.

According to comprehensive benchmarking studies, our results showed that Bismark, BS-Seeker2 (bowtie2), bwa-meth and GSNAP generally preform best in terms of both CpGs recall and accuracy. The notable difference among them is almost runtime. The speed of BS-Seeker2 was dramatically improved by only indexing reduced representation genome, whereas bwa-meth introduces few temporary files and thus results in notably reduced runtime. In spite of the best recall, BSMAP resulted in incorrect estimation of methylation values at the greatest fraction of detected CpGs. This justifies the theoretical inference that wildcard strategy to address the C-T mapping asymmetry can achieve higher CpGs recall but introduce biased CpGs methylation (Bock, 2012). On the contrary, GSNAP is also a wildcard aligner and can achieve a reasonable balance between CpGs recall and accuracy similar to those three-letter aligners. Therefore, these differences in the performance are likely ascribed to what underlying alignment algorithms are used. BSMAP (using SOAP aligner) first roughly searches genomic regions of reads using a single q-mer seed and then determines their positions. Whereas GSNAP searches genomic regions of reads through merging and filtering position hash table from index of genome in a successively constrained manner. GSNAP achieves the better performance by balancing CpGs recall and precision but at the cost of dramatically increasing runtime. These findings indicate that it is perhaps an ill-advised strategy that integrating seed-based alignment approach with wildcard method to manage the C-T mapping asymmetry. Moreover, compared with BS-Seeker2 using bowtie2 (i.e. BS-Seeker2-local and BS-Seeker2-e2e), recall of BS-Seeker2-bowtie is the lowest. Bowtie2 is an advanced version of full-text minute index based bowtie and integrates the index-based alignment and SIMD-accelerated dynamic programming to achieve high precision and sensitivity. Hence, bowtie is short of sensitivity for mapping long reads and thus results in the lowest recall, which is consistent with our findings that recall of Bs-seeker2-bowtie is similar to the other tools when simulated reads are 50 bp in length but is much lower than the other tools for longer reads (Fig. 7). In total, both alignment algorithm and strategy of addressing the C-T mapping asymmetry can help to explain the differences among bisulfite mapping tools.

Furthermore, we observed the differences in performances of bisulfite-seq aligners among these scenarios including sequencing depth and length, CpGs methylation and distinct genomic regions on the basis of our empirical and simulated investigation. These findings were preserved and consistent among different bisulfite mapping tools. It has been shown that mapping efficiency is a function of read depth and length. Thus, it is reasonable to see that the deeper and the longer of reads, the larger of CpGs recall. It is also obvious that the estimated accuracy is predominantly affected by reads depth rather than size. This is confirmed by our observation that as read size increases, the accuracy of estimation does not improve.

It is revealed that bisulfite-seq alignment tools performed worst for intermediate methylation CpGs regardless of software tools, CpGs depth and genomic regions (Supplementary Figs S8 and S9). The underlying cause may be multifactorial and its explanation is a challenging work. It is worth noting that the sequencing depth of CpGs with intermediate methylation level was similar to those CpGs with low or high methylation level and thus reads covering these CpGs were not selectively discarded by aligners. Taking this point into account, we surmise that reads discarded by aligners have different influences on methylation estimation of CpGs with different magnitude of actual methylation levels. To illustrate these influences, an intuitive numeric example is given in Supplementary Figure S10. In theory, CpGs with intermediate methylation are expected to be unbiasedly estimated but with a large variance under large sample sizes if reads discarded occurs. Some of reads covering the same CpGs may be discarded by bisulfite-seq aligners because of bisulfite-converted error, single nucleotide variation, insertion and deletion, sequencing error and ambiguous alignment after bisulfite-conversion, etc. These discarded reads have a significant impact on methylation estimation of the CpGs with intermediate values. These CpGs with the expected large variation tend to be biasedly estimated based on a sample (i.e. a bisulfite-seq experiment). It is observed that the accuracy of estimation for these CpGs is lower than 0.5 in a single sampling when the discarding rate of reads is 5% and the read depth is 30× (Supplementary Fig. S11). Even though using 36 real datasets or 100 replicates of simulated samples, the large discrepancy in estimated methylation at these CpGs remains be observed. It should be also noted that the number of bisulfite-seq samples falls far short of our simulated sample size because of sequencing costs. Therefore, more care should be taken when investigating functional importance and relevance of these CpGs with intermediate methylation level in practice. Additionally, it is interesting to develop rigorous statistical approaches to quantify the uncertainty in measuring methylation of these CpGs and propagate this into downstream DMRs analysis in the future.

It is also showed in our simulation study that bisulfite mapping tools had a worse performance in CGI shore and genebody than in CGI, promoter and TSS regions regardless of software, sequence depth and read length, and methylation level of CpGs (Figs 5 and 6). This may be due to a higher percentage of segmental duplication or repetitive elements in CGI shore and genebody than CGI, promoter and TSS regions. Systematical investigation of these five genomic features confirmed our speculation. Additionally, sequences in CGI, promoter and TSS are functionally informative and unique across human genome that are less likely to be aligned to multiple positions even under bisulfite conversion. Low methylation level in CpG sites can also partially account for the excellent performance in lowly methylated regions (such as CGI, promoter and TSS). Interestingly, alteration of methylation in CGI shore and genebody has been strongly associated with gene regulation and tumorigenesis (Irizarry et al., 2009; Pathiraja et al., 2014; Rao et al., 2013). Notably, these regions are often significantly enriched with modest methylation in biological samples (Elliott et al., 2015; Libertini et al., 2016) (Supplementary Fig. S12). According to the above illustration, multiple factors collectively exert a tremendous influence on biased estimation of CpGs in these regions in practice. Therefore, biological interpretation of percentage methylation values calculated from these regions should be careful in practical bisulfite-seq experiments.

The motivation of our study is to assess the ability of correctly identifying CpG sites and unbiasedly estimating CpGs methylation levels for different bisulfite mapping tools in RRBS datasets. A majority of work in human epigenetics was focused on 5-methylcytosine (5mC) in the context of CpG dinucleotides, especially for epigenetic studies in cancer (Jones, 2012; Plongthongkum et al., 2014; Schubeler, 2015). The function of methylation of non-CpG sites remains unclear in human. Furthermore, RRBS uses restriction endonuclease such as MspI to digest DNA (5′-CCGG-3′) and enriches CpG-rich DNA fragments for sequencing (Gu et al., 2010). Hence, the number of non-CpG sites and corresponding methylation levels is very small relative to CpGs. The overall methylation levels of CpG sites and non-CpG sites are 0.44 and 0.03, respectively, which were estimated from our real RRBS lung cancer dataset. Therefore, non-CpG sites are not our mainly focused targets. Our simulator can simulate non-CG sites (e.g. CHG and CHH) and we had simulated non-CG sites as control cofactors in all our analysis.

It should also be mentioned that both our empirical evaluation and simulation study focused on paired end reads instead of single end reads. There are three reasons to use paired end sequencing for our RRBS study. First, single end read sequencing is now gradually becoming obsolete as the new Illumina sequencing platforms (e.g. HiSeq 3000/4000 and HiSeq X systems) were released. Second, in RRBS, the subset of DNA fragments is obtained by digesting genomic DNA with a restriction enzyme. The common enzyme is MspI which has a recognition sequence of 5′-CCGG-3′. The resulting MspI fragments include the majority of high CpG-rich promoters, as well as regions such as repeated sequences that are difficult to profile using conventional bisulfite sequencing approaches (Gu et al., 2010; Meissner et al., 2005). Paired end sequencing facilitates high quality of mapping of reads, particularly, in regions with repeat content. Third, in a regular DNA sequencing library, multiple segments (e.g. reads) resulted from random fragmentation by sonication or acoustic shearing are partially overlapping. Thus, the same genomic positions are covered and sequenced by multiple overlapping segments. On the contrary, MspI fragments are unique, non-overlapping. Short single end reads do not sufficiently cover large MspI fragments. For instance, HiSeq 3000/4000 sequencer only provides one running mode of single end reads with 50 bp in read length (1 × 50 bp). For a 220-bp MspI fragment, its middle 120-bp region cannot be covered and sequenced by such short single end reads. Whereas, HiSeq 3000/4000 sequencer provides several modes of paired end reads with up to 150 bp in read length (2 × 150 bp), which can cover whole MspI fragments of varied sizes. Therefore, paired end read sequencing is being commonly adapted in current RRBS studies. Our study focusing on paired-end sequencing reads are practical and timely.

5 Conclusions

In summary, sequencing depth, methylation level and positions of CpG sites have significant impacts on their ability to be detected and unbiased estimation through our analysis of real and simulated data. CpG sites in the CGI and promoter with higher sequencing depth and lower methylation level are more likely to be efficiently identified and estimated without bias. Among the software tools mentioned in our paper, bwa-meth and BS-Seeker2 (bowtie2) are our preference for RRBS data analysis at the current stage on the basis of their power to detect CpG sites, their precision in measuring methylation levels, and their running time. However, currently existing aligners cannot efficiently handle CpG sites with moderate methylation level and CpG sites on the CGI shore or gene body. Biological interpretation of DNA methylation level estimated from these vulnerable CpG sites using bisulfite-sequencing methods should be treated with caution. Further efforts are required to improve the existing aligners to overcome these challenges and our developed RRBSsim simulator can efficiently guide the development of these tools.

Acknowledgements

We thank the Bioinformatics Core Facility at Zhejiang University School of Medicine for providing computing capacity and three anonymous reviewers for reading and commenting on the manuscript.

Funding

This work was supported in part by National Key R&D program of China (2016YFC0902700 and 2016YFA0501800), National Natural Science Foundation of China (No. 31401125, 81472420, 81572256, 81772766 and 81372514), the Fundamental Research Funds for the Central Universities, the 1000 talent Plan of China and the American Heart Association (15SFRN23910002).

Conflict of Interest: none declared.

References

Baruzzo
 
G.
 et al. (
2017
)
Simulation-based comprehensive benchmarking of RNA-seq aligners
.
Nat. Methods
,
14
,
135
139
.

Bock
 
C.
(
2012
)
Analysing and interpreting DNA methylation data
.
Nat. Rev. Genet
.,
13
,
705
719
.

Bock
 
C.
 et al. (
2010
)
Quantitative comparison of genome-wide DNA methylation mapping technologies
.
Nat. Biotechnol
.,
28
,
1106
1114
.

Chatterjee
 
A.
 et al. (
2012
)
Comparison of alignment software for genome-wide bisulphite sequence data
.
Nucleic Acids Res
.,
40
,
e79
.

Cokus
 
S.J.
 et al. (
2008
)
Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning
.
Nature
,
452
,
215
219
.

Elliott
 
G.
 et al. (
2015
)
Intermediate DNA methylation is a conserved signature of genome regulation
.
Nat. Commun
.,
6
,
6363
.

Feinberg
 
A.P.
 et al. (
2016
)
Epigenetic modulators, modifiers and mediators in cancer aetiology and progression
.
Nat. Rev. Genet
.,
17
,
284
299
.

Gu
 
H.
 et al. (
2010
)
Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution
.
Nat. Methods
,
7
,
133
136
.

Guo
 
W.
 et al. (
2013
)
BS-Seeker2: a versatile aligning pipeline for bisulfite sequencing data
.
BMC Genomics
,
14
,
774.

He
 
W.A.
 et al. (
2013
)
NF-κB–mediated Pax7 dysregulation in the muscle microenviron ment promotes cancer cachexia
.
J. Clin. Investig
.,
123
,
4821
4835
.

Hu
 
X.
 et al. (
2012
)
pIRS: profile-based Illumina pair-end reads simulator
.
Bioinformatics
,
28
,
1533
1535
.

Irizarry
 
R.A.
 et al. (
2009
)
The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores
.
Nat. Genet
.,
41
,
178
186
.

Jones
 
P.A.
(
2012
)
Functions of DNA methylation: islands, start sites, gene bodies and beyond
.
Nat. Rev. Genet
.,
13
,
484
492
.

Klein
 
H.U.
,
Hebestreit
K.
(
2016
)
An evaluation of methods to test predefined genomic regions for differential methylation in bisulfite sequencing data
.
Brief. Bioinform
.,
17
,
796
807
.

Krueger
 
F.
,
Andrews
S.R.
(
2011
)
Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications
.
Bioinformatics
,
27
,
1571
1572
.

Kunde-Ramamoorthy
 
G.
 et al. (
2014
)
Comparison and quantitative verification of mapping algorithms for whole-genome bisulfite sequencing
.
Nucleic Acids Res
.,
42
,
e43
.

Lacey
 
M.R.
 et al. (
2013
)
Modeling, simulation and analysis of methylation profiles from reduced representation bisulfite sequencing experiments
.
Stat. Appl. Genet. Mol. Biol
.,
12
,
723
742
.

Laird
 
P.W.
(
2010
)
Principles and challenges of genomewide DNA methylation analysis
.
Nat. Rev. Genet
.,
11
,
191
203
.

Langmead
 
B.
,
Salzberg
S.L.
(
2012
)
Fast gapped-read alignment with Bowtie 2
.
Nat. Methods
,
9
,
357
359
.

Langmead
 
B.
 et al. (
2009
)
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
.
Genome Biol
.,
10
,
R25
.

Libertini
 
E.
 et al. (
2016
) Saturation analysis for whole-genome bisulfite sequencing data. Nature Biotechnol.,
34
,
691
693
.

Meissner
 
A.
 et al. (
2005
)
Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis
.
Nucleic Acids Res
.,
33
,
5868
5877
.

Mu
 
J.C.
 et al. (
2015
)
VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications
.
Bioinformatics
,
31
,
1469
1471
.

Orozco
 
L.D.
 et al. (
2015
)
Epigenome-wide association of liver methylation patterns and complex metabolic traits in mice
.
Cell Metab
.,
21
,
905
917
.

Pathiraja
 
T.N.
 et al. (
2014
)
Epigenetic reprogramming of HOXC10 in endocrine-resistant breast cancer
.
Sci. Transl. Med
.,
6
,
229ra41
229ra241
.

Pedersen
 
B.S.
 et al. (
2014
) Fast and accurate alignment of long bisulfite-seq reads, arXiv preprint arXiv: 1401.1129.

Plongthongkum
 
N.
 et al. (
2014
)
Advances in the profiling of DNA modifications: cytosine methylation and beyond
.
Nat. Rev. Genet
.,
15
,
647
661
.

Rackham
 
O.J.
 et al. (
2015
)
WGBSSuite: simulating whole-genome bisulphite sequencing data and benchmarking differential DNA methylation analysis tools
.
Bioinformatics
,
31
,
2371
2373
.

Rao
 
X.
 et al. (
2013
)
CpG island shore methylation regulates caveolin-1 expression in breast cancer
.
Oncogene
,
32
,
4519
4528
.

Schubeler
 
D.
(
2015
)
Function and information content of DNA methylation
.
Nature
,
517
,
321
326
.

Shafi
 
A.
 et al. (
2017
) A survey of the approaches for identifying differential methylation using bisulfite sequencing data. Brief. Bioinform., doi: 10.1093/bib/bbx013.

Wu
 
T.D.
,
Nacu
S.
(
2010
)
Fast and SNP-tolerant detection of complex variants and splicing in short reads
.
Bioinformatics
,
26
,
873
881
.

Xi
 
Y.
 et al. (
2012
)
RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing
.
Bioinformatics
,
28
,
430
432
.

Xi
 
Y.
,
Li
W.
(
2009
)
BSMAP: whole genome bisulfite sequence MAPping program
.
BMC Bioinformatics
,
10
,
232
.

Yin
 
D.
 et al. (
2016
)
High concordance between Illumina HiSeq2500 and NextSeq500 for reduced representation bisulfite sequencing (RRBS)
.
Genomics Data
,
10
,
97
100
.

Author notes

The authors wish it to be known that, in their opinion, Xiwei Sun and Yi Han authors should be regarded as Joint First Authors.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)
Associate Editor: Bonnie Berger
Bonnie Berger
Associate Editor
Search for other works by this author on:

Supplementary data