Identification of active regulatory regions from DNA methylation data

We have recently shown that transcription factor binding leads to defined reduction in DNA methylation, allowing for the identification of active regulatory regions from high-resolution methylomes. Here, we present MethylSeekR, a computational tool to accurately identify such footprints from bisulfite-sequencing data. Applying our method to a large number of published human methylomes, we demonstrate its broad applicability and generalize our previous findings from a neuronal differentiation system to many cell types and tissues. MethylSeekR is available as an R package at www.bioconductor.org.

For the discovery of UMRs and LMRs in mouse embryonic stem cell (ESC) and neural progenitor (NP) methylomes, we have previoulsy proposed a three-state Hidden Markov Model (from here on referred to as the HMM model) [1]. Briefly, in this model, the three states correspond to a fully-methylated state which emits background methylation levels around 80%, a low-methylated state which emits methylation levels around 30% and an unmethylated state that emits methylation levels around 0%. Regions in the low-methylated state correspond to LMRs and regions in the unmethylated state to UMRs.
To directly compare MethylSeekR with this previously proposed approach, we looked at the overlap of the identified regions (UMRs and LMRs) in mouse ESCs and NPs when using m = 50% and n = 3 (which corresponds to the smallest n such that the FDR < 5%) for MethylSeekR. Importantly, we cannot directly compare MethylSeekR to the HMM model on the human methylomes analysed in this study, as the HMM model does not include functionality to take into account differing amounts of noise nor to mask partially methylated domains (see below).
In ESCs/NPs, 89/90% of the regions identified by MethylSeekR overlap (by at least one nucleotide) with the regions identified by the HMM and 96/98% of the regions identified by the HMM overlap with the segments identified by MethylSeekR. On the level of single CpGs, 95/96% of the CpGs in the MethylSeekR regions overlap with CpGs in the HMM-derived regions and 98/99% of CpGs in the HMMderived regions overlap with the MethylSeekR regions. Outside of CpG islands, these numbers decrease to 94/94% and 86/87%. This shows that there is a very good agreement between the two methods. Generally, the MethylSeekR segmentation results in a smaller number of segments (n=55059/34746 versus n=60010/48769). This difference is mostly due to differences in the segmentation of unmethylated CpG islands. The MethylSeekR segmentation is not affected by moderately elevated methylation levels at a few CpGs within otherwise unmethylated CpG islands (which occur quite frequently, particularly in NPs), whereas the HMM is sensitive to such variations and tends to cut these CpG islands into smaller pieces, resulting in several UMRs compared to one single UMR (data not shown).
We believe that MethylSeekR has, in addition to its simplicity and computational efficiency, several important advantages over the HMM approach. First, only MethylSeekR can identify and filter partially methylated domains, which is a crucial and indispensable step in the identification of regulatory regions in PMD-containing methylomes. PMDs can cover up to 40% of the genome [6,7] and due to the heterogeneity in methylation levels, they contain a very large number of CpGs with reduced methylation levels that would wrongly be classified as UMRs or LMRs. Second, with regard to the segmentation of UMRs and LMRs, the method has fewer free parameters (only m, n, cut-off on the coverage and smoothing kernel width) and we show that the most important of these, m and n, can be determined via a FDR calculation scheme. The HMM on the other hand has many more parameters, such as transition and emission parameters, which can be trained via standard expectation-maximization, but need to be manually adjusted to reduce the FDR [1] and thus cannot easily (unlike MethylSeekR) be operated in an unsupervised manner.
Third, whereas MethylSeekR is free of parametric assumptions, the HMM makes assumptions about the shape of emission distributions as well as the length-distribution of the hypomethylated regions. In particular, the model structure implies a geometric length distribution in the number of CpGs, which is violated in the case of UMRs. Thus, even though there is a fairly clear separation in median methylation levels between LMRs and UMRs, the HMM cannot perform a clean separation of the two classes and tends to classify many regions with few CpGs as UMRs, even if they have residual methylation levels ( [1], data not shown).
Fourth, the HMM model separates UMRs and LMRs mainly via their difference in methylation levels. However, we believe that using the number of CpGs instead of the methylation levels for the classification of UMRs and LMRs is a more robust approach because the number of CpGs in a given region is fixed and determined by the DNA sequence whereas methylation levels can fluctuate due to variations in conversion efficiency, coverage, global changes in methylation levels as well as variability between cell types. This can be appreciated in Supplementary Figure 7, in which the number of CpGs per region is plotted against median methylation for all regions identified in the different methylomes. Whereas the average methylation in UMRs and LMRs globally change from methylome to methylome, with varying bimodality, the separation in the number of CpGs is always strong and stays roughly constant, allowing the use of a fixed cut-off for the classification.
Due to the imprecise separation of UMRs and LMRs because of the varying bimodality in methylation levels and the geometric length distribution constraint of the HMM model, the classification of regions into UMRs and LMRs via the number of CpGs per regions (as used in MethylSeekR) leads to a reassignment of hypomethylated regions to UMRs and LMRs. In the case of our previously published methylomes (mouse ESC and NP), this leads to a reclassification of 10632/5567 UMRs (as defined by the HMM) to LMRs (in contrast to the small number of LMRs reclassified as UMRs (1788/1076)). These novel LMRs have the same enrichment for enhancer chromatin marks as the LMRs identified by the HMM (Supplementary Figure 17) and additionally, the reclassification leads to a much clearer separation of UMRs and LMRs into proximal and distal regulatory elements (Supplementary Figure 18). These findings support the validity of the reclassification by MethylSeekR, leaving our previous conclusions [1] unchanged.
Finally and importantly, unlike our previous HMM approach, the MethylSeekR software is publicly available and we provide it as an easy-to-use and fully documented R package, which includes a detailed description of the analysis steps for an example methylome and should thus greatly facilitate the identification of regulatory regions from Bis-seq data.

Comparison of methylomes on region level
For the comparison of methylation levels on the level of regions (Supplementary Figure 11, Supplementary Figure 15 and Supplementary Figure 16), we defined, for each pair-wise comparison, the joint set of regions as the union of all regions of both methylomes. For comparisons involving PMD-containing methylomes, only the regions that did not overlap with PMDs in any of the two cell types were used. If two (or several) regions overlapped, they were fused into one larger region that contained all the nucleotides of the original regions. For both methylomes, methylation levels were then calculated for the joint set of regions to produce the final scatter plots. Similarity of methylation levels was assessed using the Pearson correlation coefficient.       Supplementary Figure 17: Enrichment of chromatin marks and TF binding sites in hypomethylated regions in mouse embryonic stem cells as identified by our previously proposed HMM [1] or MethylSeekR. The main difference between the two methods is that MethylSeekR reclassifies a set of UMRs, as identified by the HMM, as LMRs (Supplementary text). We thus separated the identified hypomethylated regions into UMRs (UMRs according to MethylSeekR), LMR (LMRs according to both the HMM and MethylSeekR) and MethylSeekR-LMR (LMRs according to MethylSeekR, but UMRs according to the HMM approach). Whereas the first set of regions ("UMR") is occupied by PolII and show enrichments for chromatin marks of active promoters (H3K4me3 and no H3K4me1), both the second ("LMR") and third set ("MethylSeekR-LMR") display the characteristics of active distal regulatory regions (enrichments for p300 and H3K4me1, no H3K4me3, no PolII), supporting the classification implemented in MethylSeekR. For details regarding the calculation of enrichments see [1].