SNP-guided identification of monoallelic DNA-methylation events from enrichment-based sequencing data

Monoallelic gene expression is typically initiated early in the development of an organism. Dysregulation of monoallelic gene expression has already been linked to several non-Mendelian inherited genetic disorders. In humans, DNA-methylation is deemed to be an important regulator of monoallelic gene expression, but only few examples are known. One important reason is that current, cost-affordable truly genome-wide methods to assess DNA-methylation are based on sequencing post-enrichment. Here, we present a new methodology based on classical population genetic theory, i.e. the Hardy–Weinberg theorem, that combines methylomic data from MethylCap-seq with associated SNP profiles to identify monoallelically methylated loci. Applied on 334 MethylCap-seq samples of very diverse origin, this resulted in the identification of 80 genomic regions featured by monoallelic DNA-methylation. Of these 80 loci, 49 are located in genic regions of which 25 have already been linked to imprinting. Further analysis revealed statistically significant enrichment of these loci in promoter regions, further establishing the relevance and usefulness of the method. Additional validation was done using both 14 whole-genome bisulfite sequencing data sets and 16 mRNA-seq data sets. Importantly, the developed approach can be easily applied to other enrichment-based sequencing technologies, like the ChIP-seq-based identification of monoallelic histone modifications.


INTRODUCTION
For diploid organisms, gene expression is denoted as monoallelic if only one allele is transcriptionally active. The expressed allele can be randomly selected (e.g. Xchromosome inactivation and some autosomal genes) or predetermined by parental imprinting (1)(2)(3). Erroneous monoallelic expression has been associated to several genetic disorders, like the Prader-Willi syndrome, as well as to certain forms of cancer, like Wilms' tumour. Both diseases are caused by loss of imprinting of some genes in the 15q11-q13 and 11p15.5 region, respectively (4).
Epigenetics is defined as the study of inheritable modifications on both chromatin and DNA that have an influence on gene expression without changing the underlying DNA sequence (5). Mammalian DNA-methylation is an epigenetic mark that is predominantly found in a CpG sequence context (6). This methylation mark has been linked with gene expression and when located in the promoter region, it generally leads to transcriptional silencing of the corresponding gene (7). As it is a defining feature of cellular identity and essential for normal development, its dysregulation is often associated with disease (4). Monoallelic DNA-methylation is likely to bare an important role in the regulation of monoallelic expression (8). In addition to DNA-methylation, histone modifications also contribute to the maintenance of monoallelic expression. The methylated, silenced allele is mostly sustained with the repressive histone modification histone H3 trimethylation at lysine 9 (H3K9me3) while the active allele is characterized by the permissive histone marker H3 trimethylation at lysine 4 (H3K4me3) (9).
An important example of monoallelic DNA-methylation is the regulation of the parental-dependent monoallelic expression at imprinted loci, where the silenced allele is significantly more methylated than the active, expressed allele (2). Although imprinting is a well-investigated topic and several studies already provided evidence (e.g. compu-tational predictions based on DNA sequence characteristics or detection of monoallelic expression) of some regions with monoallelic DNA-methylation (2,3,(10)(11)(12)(13)(14)(15)(16), only a few imprinted regions are well characterized in humans, like, for example, the IGF2/H19 region. Furthermore, monoallelic methylation has recently been recognized as very common at non-imprinted loci affecting autosomal genes, regulating, for example, the production of specific antibodies and receptors in the immune system as well as the selection of olfactory receptors (17,18). While monoallelic methylation has been shown to play an important role in the differentiation between tissues, little is known about the specific location of these loci as well as the genome-wide character of monoallelic DNA-methylation.
The recent advent of next-generation massively parallel sequencing platforms has introduced the possibility of genome-wide DNA-methylation profiling. Bisulfite sequencing, which combines bisulfite treatment of genomic DNA (gDNA) with the high-throughput sequencing of the entire genome, is the gold standard and allows to readily identify monoallelic methylated alleles (19), but is very costly and therefore outside the reach of smaller projects. Fortunately, cost-effective alternatives based on the specific enrichment of methylated portions of the genome (i.e. enrichment-based methods) such as methylated DNA immunoprecipitation followed by sequencing (MeDIP-seq) and methyl-CpG binding domain protein sequencing (MethylCap-seq) exist. Yet, these methods do neither provide single base pair (bp) resolution nor information regarding unmethylated alleles and are therefore not directly applicable to detect monoallelic events (20). While some approaches already tried to tackle this issue, they rely on the combination of multiple sequencing technologies, like, for example, the integrative method of Harris et al. (21), which tries to find regions with intermediate and potentially monoallelic events by combining data originating from MeDIP-seq, methylation-sensitive restriction enzyme sequencing (MRE-seq), ribonucleic acid sequencing (RNAseq) and chromatin immunoprecipitation followed by sequencing (ChIP-seq).
To circumvent these issues, we developed a data analytical framework that solely uses data from enrichment-based sequencing (like MethylCap-seq), which screens for regions that exhibit monoallelic DNA-methylation based on classical population genetic theory, i.e. the Hardy-Weinberg equilibrium, in a parental-independent and genome-wide manner. This theory states that in a large random-mating population with no selection, mutation or migration both the allele and genotype frequencies of a gene locus with two alleles are constant from generation to generation, and furthermore, that there is a simple relationship between these allele and genotype frequencies: if the alleles are A and a with frequencies p and q (= 1 − p), respectively, then at equilibrium the genotype frequencies of AA, Aa and aa are p 2 , 2pq and q 2 , respectively (22).
The developed pipeline first compares enrichment-based sequencing data of multiple samples to the public NCBI Single Nucleotide Polymorphism (SNP)-archive (dbSNP) in order to screen the obtained non-duplicate, uniquely mappable sequence reads for SNPs. Only SNP loci with an adequately coverage and allele frequency are retained and the effect of sequencing errors is further reduced by comparing the chance of a sequencing error with the chance of detecting genuine SNPs. For each single SNP locus, the Hardy-Weinberg theorem is then applied to evaluate whether the observed frequency of samples featured by a biallelic event is lower than randomly expected (22). Using a permutation approach, confidence limits are simulated and genomic regions with a P-value smaller than the P-value corresponding with a given false discovery rate (FDR) can be assumed to harbour a monoallelic event.
Starting from MethylCap-seq data of a mixture of 334 Caucasian human samples and an FDR of 0.1, this methodology allowed the identification of 80 monoallelically methylated loci, significantly more found than expected in promoter regions. Of these 80 loci, 25 have previously been linked to imprinting. Additional validation was done using both 14 whole-genome bisulfite sequencing (WGBS) data sets of diverse origin and mRNA-seq data of 16 normal tissues. Here, the analysis was performed on available samples originating from a variety of tissues, mostly cancer tumours, providing a challenging data set to identify monoallelic methylation events (see Discussion). However, even in this set-up generally known imprinted regions were identified as well as putative novel imprinted genes, demonstrating the robustness of our method. Finally, because of the general rationale of the developed approach, it can be applied to enrichment-based sequencing applications to detect monoallelic features other than DNAmethylation. A possible application could be ChIP-seq (23) to screen for monoallelic histone modifications (24)(25)(26)(27)(28).

Samples
A total of 334 human samples, mostly cancer samples of various tissues, was used to detect monoallelically methylated loci (Supplementary Table S1). Of these 334 samples, 215 samples were of female origin and only these were used to analyse the X-chromosome. gDNA was extracted from these samples with the Easy DNA kit (Invitrogen K1800-01) using protocol #4 from the manufacturer manual. The DNA concentration was measured on a Nanodrop ND-1000. Subsequently, the gDNA was sheared on the Covaris S2 with following settings: duty cycle 10%, intensity 5, 200 cycles per burst during 180 s to obtain fragments with an average length of 200 bp. The power mode was frequency sweeping, temperature 6 • C-8 • C and water level of 12. A total of 500 ng was loaded in 130 l TE (1:5) in a microtube with Adaptive Focused Acoustics (AFA) intensifier.

Methyl-CpG binding domain sequencing
Methyl-CpG binding domain protein sequencing (MethylCap-seq) (20), which combines enrichment of methylated DNA-fragments by methyl-CpG binding domain (MBD)-based affinity purification with massively parallel sequencing, was used to profile the DNAmethylation pattern of the 334 samples. The samples were sequenced according to the protocol described in the paper of De Meyer et al. (29) with some additional modifications: (i) After DNA fragmentation, the methylated fragments PAGE 3 OF 14 Nucleic Acids Research, 2014, Vol. 42, No. 20 e157 were captured using Diagenode's MethylCap kit starting from a DNA concentration of 500 ng instead of 200 ng. (ii) Paired-end sequencing was done on either the Illumina GAIIx or the HiSeq platform. Depending on the sequencing platform, the obtained paired-end sequence reads were 45 or 50 bp, respectively.

Data pre-processing
The rationale behind the proposed methodology is that biallelic DNA-methylation results in MethylCap-seq data which is in Hardy-Weinberg equilibrium for each locus, i.e. if SNPs are present for a locus, both homozygous and heterozygous subjects will be detected at a predictable rate (22). However, in case of monoallelic methylation, heterozygous samples will no longer be detected resulting in deviation from the Hardy-Weinberg equilibrium, which can be measured. For a detailed description of the statistical framework, see the Additional Methods. Figure 1 gives an overall representation of the workflow starting from MethylCapseq data.
Mapping. For each of the 334 samples, the MethylCapseq paired-end reads were mapped using BOWTIE (v1.0.0) (30). The mapping parameters were chosen so that only those paired-end reads that mapped uniquely on the human hg19/GRCh37 reference assembly within a maximum of 400 bp of each other were retained. In order to both reduce the presence of sequencing errors as well as to allow the occurrence of real SNPs, a maximum of three mismatches was allowed. As suggested by Stevenson et al. (31), allowing three mismatches during the alignment step significantly reduces the reference allele bias at SNP loci while still enabling an ample amount of uniquely mapped reads. Duplicate fragments, i.e. fragments with the exact same location of both paired-end reads, were disposed as these are most likely the result of amplification of the same sequence reads during the library preparation.
SNP tracing. The non-duplicate, uniquely mappable reads were subsequently screened for SNPs. Only positions that showed a mismatch in the mapping of one or more samples and that overlapped known single nucleotide variations (SNVs) of the dbSNP (version 137) were withheld. Not keeping all the mismatches reduces both the effect of sequencing errors (false positives) and the computational load in the further analyses. Also, for each locus, the coverage of each SNP variant was determined, and the allele frequencies were estimated.
Additional data filtering and correction. Both for computational reasons and as a first filtering step for sequencing errors, SNP loci with a very high major allele frequency (MAF) were filtered (threshold 0.9). Additionally, a minimal total coverage threshold, i.e. across all samples, for each SNP locus was imposed (350 ∼ 1× per sample). Note that loci not fulfilling both criteria are unlikely to provide sufficient power for the subsequent statistical analysis. As analysis of the X-chromosome involved fewer samples, the threshold for the coverage was set to a less stringent value, namely, 250 instead of 350, which roughly corresponded to the number of female subjects.
In this reduced data set, an additional sequencing error correction was performed. For computational reasons, a simple Bayesian methodology was implemented. Basically, for each sample and locus, the chance of obtaining a certain profile was calculated under (i) the assumption of heterozygosity and (ii) the assumption of homozygosity but with additional sequencing errors. The option with the largest a posteriori change was withheld (with alleles representing putative sequencing errors being removed from the data set). As the prior chances of homozygosity and heterozygosity were based on the allele frequencies, which are updated upon each round of the Bayesian algorithm, this method was performed twice (see Additional Methods Section 2.1.3). This approach can be considered to be conservative (i.e. to disfavour the presence of monoallelic DNAmethylation), as (i) only two rounds of correction were applied and (ii) the sequencing error estimate (0.25%, based on Quail et al. (32)) is on the lower bound of estimates reported and is based on the performance of the Illumina HiSeq, whereas also more error prone GAIIx data were included in this study.

Detection of monoallelically methylated loci
After additional filtering and data correction, the remaining data were used as input of the new data-analytical framework developed in the R statistical environment (R 2.15.2). The statistical strategy and practical implementation are elaborated in the Additional Methods. In summary, based on the observed allele frequencies, theoretically expected genotype frequencies can be calculated assuming Hardy-Weinberg equilibrium in the overall data set. If the observed frequency of heterozygote individuals is significantly reduced relative to Hardy-Weinberg expectations, this indicates significant monoallelic methylation. Null distributions were generated using random data with the same allele frequencies and sample coverages (for that locus) as in the original data. This approach accounts for the increased likelihood of erroneously calling loci with a low coverage homozygous. P-values were determined by comparison of the observed frequency of heterozygotes with the generated null distributions. Only loci that obtained a P-value smaller than or equal to 0.005 after the first iteration were kept as input for the second iteration. Thus, after the first iteration round, loci that were in all probability not monoallelically methylated, were filtered out as to reduce the computational time in the second iteration. At the end of the second iteration the algorithm obtained a P-value for each locus. If this P-value was smaller than the P-value corresponding with an FDR of 0.1, monoallelic methylation on this locus was called significant. This procedure was also performed two times, a first time with 1000 and a second time with 1 000 000 iterations. To summarize results, significant loci were visualized on a circular plot with the Circos tool (33).

Functional annotation and enrichment analysis
Successful completion of the monoallelically methylated loci detection pipeline resulted in a list of significant SNPs. The functional annotation (i.e. promoter, exon, intron and intergenic) of these SNP positions was determined using Figure 1. Overview of bioinformatics pipeline to detect putative monoallelically methylated SNP loci starting from MethylCap-seq data. After mapping with BOWTIE the non-duplicate, uniquely mapped reads are screened for SNPs using dbSNP. To reduce the computational load SNP loci with a too high MAF and/or a too low overall coverage are filtered. In this reduced data set, an additional sequencing error correction was performed with two iterations. The corrected data was next put in the newly developed data-analytical framework with 1000 and 1 000 000 iterations, respectively. Only loci that obtained a P-value smaller than or equal to 0.005 after the first iteration were kept as input for the second iteration. If the P-value obtained for a locus was smaller than the P-value corresponding with an FDR of 0.1 the monoallelic methylation on this locus was called significant. After determining the functional annotation of these SNP positions an enrichment analysis was performed. Finally, the resulting loci were validated using both literature and WGBS data.
Ensembl (release 66), wherein the promoter was defined as starting from 2000 bp upstream until the transcriptional start site.
We tested for enrichment in one or more of these functional categories. A null distribution was generated by random sampling from the total amount of detected SNPs (after filtering as specified in 'Additional data filtering and correction' of Materials and Methods) and counting the occurrences of the respective annotations. During this sampling procedure, the number of SNPs sampled for each chromosome was equal to the number of significant SNPs on that chromosome. This sampling was repeated 1000 times. With the null distribution obtained for each of these functional locations (i.e. promoter, exon, intron and intergenic), it was possible to calculate a two-sided P-value for each functional location. For loci that were featured by more than one functional annotation (i.e. overlapping genes and/or different transcripts and/or sense and antisense strand) the score for the functional location was divided by the amount of different functional locations that this locus has (the sum always being one). For example, if a locus is located in an exon on the sense strand but is also located in an intron on the other strand, both the exon and intron were attributed a score of 0.5.

Validation of putative loci using 14 WGBS data sets
In order to evaluate the loci detected by this novel methodology, an extra validation step was performed using 14 publicly available WGBS data sets comprising a range of tissue types. The WGBS data sets were downloaded from the Gene Expression Omnibus repository (34). A summary of the data sets including accession numbers is provided in Supplementary Table S3. The 14 samples were aligned in a window of 2000 bp (1000 bp upstream and 1000 bp downstream) around the candidate SNP positions (hg19/GRCh37 reference assembly) using BISMARK (35). A maximum of three non-bisulfite mismatches was permitted in the seed (70 bp) to (i) lower the presence of sequencing errors while still allowing the detection of real SNPs, but also to (ii) circumvent a possible bias alignment to the reference allele while keeping a substantial unique alignment rate (31). After excluding duplicates, only reads mapping onto one of the candidate monoallelically SNP positions were kept. Next, for each SNP position and each sample the methylation calls, i.e. methylated or unmethylated, of all CpGs were summarized from the mapped bisulfite reads per SNP allele (covered by the reads on the specific SNP position). To assess monoallelic DNA-methylation in the SNP loci a Pearson's chi-square test was performed. With a chi-square test, it could be assessed if each allele has an equal distribution of methylation calls, i.e. degree of methylation. Samples that were not covered or were homozygous for the particular locus were excluded. In summary, for each heterozygous sample a chi-square value was calculated based on the degree of (non-)methylation obtained for each SNP allele, with a high chi-square value indicating that the methylation degree is allele dependent (i.e. monoallelic methylation), i.e. one SNP allele featuring a high degree of methylation while the other allele is characterized with no or a low degree of methylation. Null distributions were made by a permutation approach (using the chisq.test function of the R Stats package) generating 2000 random chi-square values for each sample, making it possible to determine a sample-specific P-value for each SNP-loci. By summing the chi-square values over all heterozygous samples for a specific SNP locus and again generating null distributions of random chi-square values, also a global P-value for a SNP locus could be obtained. Note that this test does not require absolute absence of methylation of one allele, which would be too strict given the possibility of incomplete bisulfite conversion and the presence of both sequencing errors and sequencing bias.

Validation of allele-specific expression (ASE) using 16 RNAseq data sets
As an additional validation and to evaluate the effect of the found monoallelic methylated loci on gene expression, the results were combined with publicly available mRNAseq data sets from 16 normal human tissues (Illumina's Human BodyMap 2.0 project), including adipose, adrenal, brain, breast, colon, heart, kidney, liver, lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid and leukocyte cells originating from different individuals (15 Caucasians and 1 African American, see Supplementary Table S5). The data are accessible from ArrayExpress, ArrayExpress accession: E-MTAB-513, actual sequence files are in ENA archive with accession number: ERP000546 (linked from ArrayExpress page as 'ENA -ERP000546' tag in links section).
For each tissue, the raw paired-end sequence reads (2 × 50 bp) were aligned using the transcriptome mapper STAR (v2.3.1) (36). In order to tackle possible mapping bias to the reference allele, reads were aligned using the method of Degner et al. (37), i.e. using the human hg19/GRCh37 as a reference genome which was masked for known dbSNP positions. Reads mapping up to 10 places were allowed with a maximum of 8 mismatches per fragment, i.e. read pair. Only uniquely mapped reads were kept and duplicate fragments were removed with Picard's MarkDuplicates command-line tool (v1.97) (http://picard.sourceforge.net/).
By assessing if some of the found monoallelic methylated loci are associated with ASE, ASE was determined on a perheterozygote-site per-tissue basis. In a likewise manner as other ASE studies (31,38,39), Samtools mpileup/bcftools (v0.1.19) (40) was used to call possible variants in the nonduplicate, uniquely mapped reads, whereby variant sites with a raw read depth lower than 10 were filtered out and only bases with a minimum base quality (MAQ) of 13 were considered. Next, only SNP positions called by Samtools mpileup and present in dbSNP were kept. Additionally, known dbSNP sites called as homozygous for the reference allele, but where at least two high quality (MAQ ≥ 13) al-ternate (i.e. non-reference) bases mapped, were also added to the list of variant sites.
After observing the amount of high quality mapped reference versus non-reference bases for each variant site in each tissue, ASE was assessed by performing an exact binomial statistical test with the null hypothesis that each allele is equally expressed. To correct for multiple testing, an FDR of 1% was used. In a next step, the variant sites showing a significant deviation from the binomial distribution were mapped to their corresponding genes. Only genes with at least two significant variant sites were assumed to be allelespecific expressed.

Mapping
For the 334 samples the mapping resulted in 2 995 375 490 uniquely mapped reads and an average mapping percentage of 63.05% (Supplementary Table S1). After removing the duplicate fragments a total of 2 688 409 588 non-duplicate, uniquely mapped reads was acquired.

SNP tracing and data filtering
After parsing the mapping output for SNPs (= mapping mismatches), 19 850 891 SNPs overlapped with already known SNV positions from dbSNP. These 19 850 891 loci represent 41.61% of the total number of SNV present in db-SNP and only these SNPs were used in the remainder of the analysis. Supplementary Table S2 details the number of SNPs that overlapped with dbSNP per chromosome.
After pre-processing the data, the corresponding coverage and allele frequencies were calculated for each of the 19 850 891 loci and subsequently used to filter the data. Only positions with a frequency of the major allele smaller than 0.90 and coverage larger than or equal to 350 (250 for chromosome X) were retained. A total of 486 090 out of 19 850 891 loci (2.45%) complied with these thresholds. Supplementary Table S2 shows the number of SNP positions that were retained after filtering as well as the fraction per chromosome.

Detection of monoallelically methylated loci
Likely sequencing errors in the list of filtered loci were adjusted (see Materials and Methods 'Additional data filtering and correction' and Additional Methods Section 2.1.3). Corrected data (available as Additional Data) were subsequently analysed using the developed statistical methodology. If the P-value obtained for a locus was smaller than the P-value corresponding with an FDR of 0.1 (P-value = 0.000016), the monoallelic methylation on this locus was called significant. This was true for 80 loci (see Table 1). Figure 2 depicts the genomic distribution of these 80 monoallelically methylated loci.
Materials and Methods 'Functional annotation and enrichment analysis'), the analysis indicated a significant enrichment in promoter methylation (P-value = 0.002), but not in other functional locations.

Validation of putative loci using 14 WGBS data sets
After pre-processing the 14 WGBS data sets as outlined in Materials and Methods 'Validation of putative loci using 14 WGBS data sets', 44 out of the 80 significant loci were covered by at least one heterozygous sample. Table 4 summarizes both the global and the sample-specific P-values obtained for each of these 44 loci. Note that 29 loci (65.9%) had a global P-value lower than 0.05 of which 24 (54.5%) even had a global P-value virtually equal to 0 suggesting monoallelic methylation in at least one of the 14 samples.

Validation of ASE using 16 RNA-seq data sets
To validate if the found monoallelically methylated loci are associated with ASE, publicly available mRNA-seq data sets from 16 different individuals and tissues were searched for ASE. After pre-processing the data for each variant site, binomial tests were performed. Using an FDR of 1%, and requiring the presence of at least two significant variant sites per gene, in total, 19 840 genes showed ASE, ∼1190 genes per tissue (Supplementary Table S5), in line with the amount of loci identified in previous studies (1,(41)(42)(43). In a next step, it was examined if some genes from Table 2--the genes with one (or more) monoallelic methylated SNP(s) in their genic regions--were also characterized by ASE. Indeed, 21 of the 43 genes were covered by at least two variant sites in some tissues, of which 19 showed ASE in one or more tissues (Table 4). Of these 19 genes, 13 also showed biallelic expression in other tissues. For the remaining two genes, GNAS-AS1 and ADAMTS2--although covered by two or more variant sites in some tissues--ASE could not be validated and thus showed evidence of biallelic expression. Of the 19 allele-specific expressed genes, 6 have already been linked to imprinting (15,16) (http://www.geneimprint.com). The other 13 genes represent novel candidate imprinted genes. In addition, for WRB, NHP2L1, NAA60, ZNF331, H19 and GNAS the found monoallelic methylated loci were  Following parameters are indicated: Location (chromosome:location), (Ensembl) Gene ID, Gene symbol, Description and Biotype. *known imprinted gene; **predicted imprinted gene.   Results are shown for the 21 genes with one (or more) monoallelic methylated SNP(s) in their genic regions and reached the thresholds to investigate putative ASE. Six genes exclusively show ASE in one or multiple tissues, 13 genes have both ASE and biallelic expression (BE) in different tissues and 2 genes only show BE. Following columns are indicated: (Ensembl) Gene ID, Gene symbol, number and annotation of tissues for which ASE/BE could be examined (# Tissues and Tissues, respectively) and the Type of expression found for these tissues (ASE or BE). *known imprinted genomic region; **predicted imprinted genomic region. located in their respective promoter region. Supplementary  Table S6 also lists the results for genes called allele-specific expressed if at least one significant variant site was present.

DISCUSSION
Monoallelic gene expression is typically initiated early in the development of an organism and stably maintained. Erroneous monoallelic expression has been related to several non-Mendelian inherited genetic disorders. DNAmethylation plays a significant role in the regulation of monoallelic expression. The choice of the allele to be monoallelically expressed can be either random or a priori defined by imprinting. Here we introduced a methodology to screen for genes that exhibit monoallelic DNAmethylation and thus might regulate monoallelic expression.
Using MethylCap-seq, methylome profiles of 334 samples, mostly human cancer samples of diverse origin, were obtained. In summary, upon extra filtering and data correction, for each SNP locus the Hardy-Weinberg theorem was applied to evaluate whether the observed frequency of samples featured by biallelic methylation is lower than randomly expected. Using a permutation approach, loci with a P-value smaller than the P-value corresponding with a selected FDR of 0.1 were assumed to be monoallelically methylated. Finally, this resulted in the identification of 80 loci that showed significant monoallelic DNA-methylation.
Functional location of these monoallelic events might provide deeper insight in the unraveling of monoallelic mechanisms and are provided in Table 2 and Supplementary Table S4. It is common that imprinted genes are present within clusters and share common regulatory elements, such as non-coding RNAs and differentially methylated regions (DMRs). If these DMRs control the imprinting of one or more genes, these regions are called imprinting control regions (ICRs). It is known that many of these ICRs PAGE 11 OF 14 Nucleic Acids Research, 2014, Vol. 42, No. 20 e157 are located in intergenic regions. As some of the found loci are located in intergenic regions as well as in known (long) non-coding RNAs (lncRNAs) (see Table 2 and Supplementary Table S4, respectively), it is possible that these regions present new regulatory elements involved in imprinting. Furthermore, when we take a closer look at the intergenic regions, the SNP on chromosome 2 with position 207 122 438 also shows significant monoallelic methylation. This is interesting, because this locus falls in GPR1AS, a recently found imprinted lncRNA in the GPR1-ZDBF2 intergenic region (44), corroborating the outcome of this study and indicating that the so-called 'intergenic' regions are also of interest for further analyses. Of the 80 loci, 49 were located in genic regions of which 25 are already linked to imprinting. For example, on chromosome 11 (2.01-2.03 mega base (Mb)), the IGF2/H19 region was highlighted with 6 SNPs (Figure 2). This locus is a well-known imprinted region that is also linked to the Beckwith-Wiedemann syndrome and Wilms' tumour (4,(45)(46)(47). The H19 gene codes for a lncRNA of which expression is negatively correlated with the expression of the neighbouring gene insulin-like growth factor 2 (IGF2). Usually the paternal copy of H19 is methylated and silent, while the maternal copy is hypo-or unmethylated and expressed. The same is true for the imprinted region on chromosome 15 that is correlated to the Prader-Willi syndrome (20.7-30.3 Mb) (4).
In SNRPN, one of the genes in this region where loss of imprinting is linked to the Prader-Willi syndrome, 2 significant SNPs were identified. For a couple of genes (or regions), like, for example, H19, more than one significant SNP locus was found. Because some of these SNPs are at a distance of more than 400 bp (the cut-off length of sequence reads during mapping) of each other, these prove independently the presence of monoallelic DNA-methylation in that particular region. These SNPs thus provide 'multiple proof' in the identification of the particular monoallelically methylated region and lend added value to the results. Not unexpectedly, Figure 3 and the enrichment analysis clearly demonstrated enrichment for monoallelic methylation in promoter regions, although it should be noted that this enrichment is rather limited in absolute number.
In females, most of the genes on one X-chromosome are transcriptionally silenced by monoallelic epigenetic events like DNA-methylation and histone modifications. Early in development, each cell makes an independent, random choice which chromosome to inactivate. Once this decision is set, all further descendants of that cell keep the same pattern. As our method is designed to specifically detect a deviation from the Hardy-Weinberg equilibrium, it is necessary that for one sample, the same allele is (un)methylated for all cells--and thus not randomly chosen. In summary, as random monoallelic methylation would lead to the detection of more heterozygotes, the fact that no monoallelic methylated loci were found on the X-chromosome reassures the detection of stable monoallelic methylation with our method.
Further validation was performed using 14 publicly available WGBS data sets, comprising 10 normal samples and 4 samples derived from non-normal tissue, including two samples of cancerous origin (colon tumour tissue and a human liver carcinoma cell line) and two brain samples from Alzheimer patients. Of the 80 significant loci, 44 were cov-ered by a heterozygous sample and could thus be further examined. Twenty-nine of the 44 loci (65.9%) obtained a global P-value lower than 0.05 of which 24 (54.5%) had a global P-value of virtually 0, indicating monoallelic methylation in one or more samples. For only 9 of these 24 SNP loci evidence of imprinting already exists, so that with this subset of 14 WGBS samples at least 15 new monoallelically methylated regions, found with our new data-analytical framework, are validated. Furthermore, from the samplespecific P-values in Table 3 it can be seen that these 24 loci are mainly validated in samples of normal origin, whereby each locus is validated in multiple normal samples--and thus not or not only in cancerous/diseased samples.
In addition, with the ASE analysis of 16 different tissues of normal origin it was possible to investigate if (some of) the found monoallelically methylated regions are associated with ASE--and thus might regulate this monoallelic expression. Of the 43 genes featured by one or more of the detected monoallelic methylated loci, 21 had variants with sufficient coverage and base quality, and were further examined. Of these, 13 genes showed both allele-specific as well as biallelic expression in different tissues, while 6 genes were only featured by ASE ( Table 4). The remaining two genes could not be validated as allele-specific expressed in a tissue. Of the 19 genes featured by ASE, only 6 have already been linked to imprinting, suggesting the identification of novel candidate imprinted genes. In fact, for two of these 'novel' imprinted genes, WRB and NHP2L1, recent evidence by Docherty et al. strongly suggests that these are indeed putatively imprinted as (i) they show ASE in some tissues and (ii) their methylation patterns are consistent with allelic maternal methylation (48). Furthermore, for WRB and NHP2L1 as well as NAA60, ZNF331, H19 and GNAS monoallelic promoter methylation was found.
There are a couple of important remarks that come with the proposed methodology: (i) The basic assumption that MethylCap-seq data from biallelically methylated loci are generally in Hardy-Weinberg equilibrium only holds for samples originating from a panmictic population (i.e. a single population that is long-term randomly mating). If this is not the case and the samples are not panmictic, this could possibly give rise to some false positives. Thus, for samples that slightly deviate from the assumption of panmixia, an extra validation of the resulting loci is necessary to assure qualitative results (as was done in this study). (ii) The approach does not take into account that loci with monoallelic methylation will be picked up less efficiently than biallelic loci resulting in less power leading to a less efficient detection of monoallelically methylated loci. By consequence, the methodology is less sensitive and thus too conservative, though this has no effect on the reliability of those results deemed significant. (iii) It is known that aligning to a reference genome at sites of DNA-variants generates a bias towards higher mapping rates of the reference allele compared with the alternative allele (37). Recently, Stevenson et al. (31) showed that increasing the number of mismatches sig- nificantly improved measures of allelic abundance, and demonstrated that a maximum of three mismatches provides a good trade-off, as implemented in this study. Also, as the SNP density in the human genome is approximately one SNP per kilobase (49), this trade-off is deemed to be sufficient to both allow the occurrence of real SNPs as well as to lower the presence of possible sequencing errors. By not calculating the observed allele frequencies from sample-specific allelic abundances but from the observed genotypes (see Additional Methods), the possible influence of an allelic bias is further minimized. In conclusion, these precautions together with the fact that our method is conservative ensure that our method is not affected by a possible bias alignment. Indeed, an extra quality control of the monoallelically methylated loci showed no difference in sample coverage between samples homozygous for the reference and the alternative allele (Wilcoxon rank-sum test, P-value = 0.21, data not shown). (iv) To eliminate sequencing errors as well as to reduce the computational time and effort a filtering step was performed. Consequently, some data will not be analysed and this could interfere with the detection of monoallelic DNA-methylation. However, the benefits of filtering outweigh the possible drawbacks: the computational load reduces significantly and the power to detect loci that do not pass the filter cut-off will be typically insufficient. (v) The approach used to correct for possible sequencing errors disfavours the presence of monoallelic DNAmethylation: only two correction rounds were performed and the sequencing error estimate of 0.25% is the lowest estimate reported (32). But although the correction method can be considered a bit too stringent, it will assure a better quality of the obtained results and will not give rise to more false positives. In fact, it will possibly reduce the amount of false negatives and thus allows a more sensitive identification of monoallelically methylated loci that would otherwise not have been detected. (vi) The analysis was performed with samples originating from different tissues that were mostly cancer tumours. The fact that tumours are epigenetically less stable than healthy tissue (50), makes it probably more difficult to detect monoallelic methylation. On the other hand, it is known that chromosomal deletions and loss of heterozygosity frequently happen in cancer, both leading to possible 'monoallelic' methylation events. However, as a mixture of different cancer tumours was used and it is very unlikely that these are all characterized by the same chromosomal deletion, this latter phenomenon will have had little effect on our stringent analysis. To justify this, additional analyses were performed on 14 WGBS data sets (of which 10 were of normal origin) as well as on mRNA-seq data of 16 tissues of normal origin, validating a notable number of the identified loci and detecting putatively imprinted genes.
Although the employed experimental set-up to test our methodology is somewhat challenging--using a mixture of samples originating from different tissues, mostly cancer tumours--the proposed methodology allowed the identification of loci known to be generally imprinted and involved in genetic and/or imprinting disorders (e.g. IGF2/H19, KCNQ10T1, SNURF/SNRPN, GNAS, . . . ) demonstrating the robustness and biological relevance of our method. Additionally, the extra ASE analysis identified monoallelic methylated loci associated with ASE, thereby identifying 6 known and 13 novel candidate imprinted genes (e.g. WRB, NHP2L1, . . . ). As we opted to use a stringent approach, the outcome further demonstrates that our methodology is still sensitive enough and produces satisfying results.
As recent evidence suggests that monoallelic DNAmethylation is often tissue-or cell-type specific (19,51), it would be particularly interesting to apply the methodology on MethylCap-seq samples of normal, single-tissue origin, ideally from a single population. Next to MethylCap-seq, our approach also opens the door to other applications, like ChIP-seq-based detection of monoallelic protein-DNA binding events and histone modifications.

AVAILABILITY
Additional Data (filtered and sequence error corrected SNP-loci of the 334 samples, i.e. starting data of the dataanalytical framework) as well as corresponding scripts of the developed statistical methodology are available on the authors website: http://www.biobix.be/MAM/.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.