Base-resolution methylation patterns accurately predict transcription factor bindings in vivo

Detecting in vivo transcription factor (TF) binding is important for understanding gene regulatory circuitries. ChIP-seq is a powerful technique to empirically define TF binding in vivo. However, the multitude of distinct TFs makes genome-wide profiling for them all labor-intensive and costly. Algorithms for in silico prediction of TF binding have been developed, based mostly on histone modification or DNase I hypersensitivity data in conjunction with DNA motif and other genomic features. However, technical limitations of these methods prevent them from being applied broadly, especially in clinical settings. We conducted a comprehensive survey involving multiple cell lines, TFs, and methylation types and found that there are intimate relationships between TF binding and methylation level changes around the binding sites. Exploiting the connection between DNA methylation and TF binding, we proposed a novel supervised learning approach to predict TF–DNA interaction using data from base-resolution whole-genome methylation sequencing experiments. We devised beta-binomial models to characterize methylation data around TF binding sites and the background. Along with other static genomic features, we adopted a random forest framework to predict TF–DNA interaction. After conducting comprehensive tests, we saw that the proposed method accurately predicts TF binding and performs favorably versus competing methods.


Candidate TFBS selection
Candidate transcription factor binding sites were determined by sequence motif scores represented by the position-specific weight matrices (PWM). The PWM for CTCF, MAX, SIX5, USF1, BCL11A, EGR1,  NANOG, RAD21, RFX5, SRF, USF2, GABP, NRSF, YY1, CJUN, JUND , OCT4, TCF12  Candidate TF sites located within CpG islands are not considered by default. This is due to the widespread hypomethylation known to occur at the CpG islands, which could potentially biase the methylation model estimation. Since CpG islands only account for less than 1% of the genome, such exclusion shouldn't affect the result much. Methylphet software provides an option for users to choose to include candidate sites within CpG islands.  Figure S1 shows the profiles of CpG methylation (CG), CpG hydroxyl-methylation (5hmC) and CpH methylation (CH) from H1-hESC for a number of TFs.     3.2 Methylation profile (CG, CH) in IMR90. Figure S2 shows the profiles of CG and CH from IMR90 for a number of TFs. Because the TAB-seq data are not available, the 5hmC profiles are not shown.  3.3 Methylation profile (CG, 5hmC and CH) in mESC. Figure S3 shows the profiles of CG, 5hmC and CH methylation from mESC for a number of TFs.

Methylphet Model
Beta-binomial model is used to characterize methylation (including 5mC, 5hmC and CH methylation) patterns at a genomic region. For a certain candidate site, 21 window with 30bps each centered at the center of the motif. The methylation reads and total reads in these 21 windows are used to capture methylation patterns for TF binding sites as well as background methylation level. Inside each window, if there is at least one CG dinucleotide that is covered by at least one read (either methylated or not), we recorded the total number of methylated and un-methylated reads. Assume there are n candidate sites. In the jth window ( = 1,2, … ,21) of the ith candidate site ( = 1,2, … ), we use !" and !" to denote the number of methylated and unmethylated reads and let !" = !" + !" . The counts, given underlying "true" methylation levels are assumed to follow binomial distribution The methylation levels ! 's are assumed to follow a beta distributions, but with different parameters at TF binding sites and background. For candidate sites that are bound by TFs, it is shown ( Figure 2) that methylation levels dip toward the motif site from both directions. For candidate sites that are not bound by TFs, the methylation levels from all 21 windows are assumed to be identical and similar to those from the genomic background (close to fully-methylated). Therefore we assume that each ! follows a different Beta distribution. Define indicator ! to denote binding ( ! = 1) or not ( ! = 0) for candidate site i, we have: To build the predictive model, parameters of ! , ! , ′ , ′ will be estimated using the training data.
Here, we used method of moments to obtain the estimate for these parameters. It is convenient to consider an alternative representation of beta binomial compound model. We define: Then the beta binomial model can be reparameterized as: The expectation and variance for are: The sample mean and sample variance for are: By method of moments, we have: Hence, we can obtain the point estimate for ! , ! and further to obtain the point estimate for ! , ! by    Table S3.

Evaluation of Feature Importance
The importance of each feature is measured by the Gini Importance. This measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. Therefore, the more decrease in node impurity the feature causes, the more important this feature is. In this binary classification case, the node impurity is measured by the Gini index (or Gini coefficient, Gini ratio). It is calculated as: Where ! and ! are the fraction of items labeled as 0 and 1 respectively.
In Figure S7

Model convergence and tree number
In the random forest model, Out-of-bag error (OOB error, details in the manual of R package 'randomForest') was used to evaluate model convergence as the number of trees grows. In our model, we examine the model convergence as the number of trees used in training grows from 0 to 200. In general, the model becomes stable when number of trees reaches <100. (Figure S7-1

Choice of bin sizes in constructing methylation model
In constructing methylation model, we take a 300-bp window at each candidate sites (centered at the center of the motif) and divide them into 30-bp windows. The data in the windows are used to estimate methylation patterns. The window size is configurable in the software and can be specified by the users. We set the default window size to be 30-bp because it provides a good balance of the model estimation precision and spatial resolution of the results based on the data we tested.
Smaller window could lead to high uncertainty in model estimation due to potential lower coverage of CpG sites. On the other hand, larger window will hurt the spatial resolution of the results. If the read coverage of the data are deep, one can afford to use a smaller window because there will be enough reads to cover the CpG sites. Otherwise, using a relatively larger window can provide more stable results at the cost of sacrificing a little spatial resolution. We compared the prediction results from Methylphet using 30 bp and 20 bp windows for a number of proteins. Both the ROC and precision recall curves are provided. In general, using 30 bp provides slightly better results overall.  8.5 qPCR of selected Methylphet-predicted sites qPCR was performed on the selected positive and negative site from Methylphet prediction. The specificity of qPCR reaction for selected sites was confirmed by agarose gel electrophoresis. Figure S14. qPCR of selected Methylphet-predicted sites. DG, dentate gyrus; REST=NRSF; Pos, predicted positive site; Neg, predicted negative site. qPCR was performed in duplicates; error bars=std. error To investigate the TF-specificity of Methlphet model, we conducted cross-TF training. We firstly trained Methylphet using methylation data and other features of 9 TFs in H1-hESC as training set. Then we use these models to predict CTCF binding in IMR90 cell line. The ROC curves of this cross-TF cross-cell line prediction are shown in Figure S15.
From this result we see that the CTCF model showed better performance than all of the other models.
In addition, RAD21 model shows up as a close second which is reasonable because RAD21 has very similar motif pattern as CTCF. Models trained from TFs other than CTCF and RAD21 perform much worse.
This result clearly demonstrates the TF-specificity of the Methlphet model. The methylation profile and other genomic characteristics of TF binding are important in the Methylphet model and better to be modeled in a TF-specific manner. Figure S15. ROC curves of Cross-TF-trained models, with H1-hESC data as training set, IMR90 data as testing set.