DEEP: a general computational framework for predicting enhancers

Transcription regulation in multicellular eukaryotes is orchestrated by a number of DNA functional elements located at gene regulatory regions. Some regulatory regions (e.g. enhancers) are located far away from the gene they affect. Identification of distal regulatory elements is a challenge for the bioinformatics research. Although existing methodologies increased the number of computationally predicted enhancers, performance inconsistency of computational models across different cell-lines, class imbalance within the learning sets and ad hoc rules for selecting enhancer candidates for supervised learning, are some key questions that require further examination. In this study we developed DEEP, a novel ensemble prediction framework. DEEP integrates three components with diverse characteristics that streamline the analysis of enhancer's properties in a great variety of cellular conditions. In our method we train many individual classification models that we combine to classify DNA regions as enhancers or non-enhancers. DEEP uses features derived from histone modification marks or attributes coming from sequence characteristics. Experimental results indicate that DEEP performs better than four state-of-the-art methods on the ENCODE data. We report the first computational enhancer prediction results on FANTOM5 data where DEEP achieves 90.2% accuracy and 90% geometric mean (GM) of specificity and sensitivity across 36 different tissues. We further present results derived using in vivo-derived enhancer data from VISTA database. DEEP-VISTA, when tested on an independent test set, achieved GM of 80.1% and accuracy of 89.64%. DEEP framework is publicly available at http://cbrc.kaust.edu.sa/deep/.


Selection of cell-lines and tissues: Number of positive and negative samples
In ENCODE repository there are 6 cell-lines from tier 1 and tier 2 experiments that are characterized by the same 11 histone mark data (with the exception of H3K9me1). For the training of our model we chose randomly 2 cell-lines from tier 1 experiments and other 2 from tier 2. These 4 cell-lines are: Gm12878, H1-Hesc, Hep-G2 and Huvec. The other two celllines, K562 (tier 1) and Hela (tier 2), are used for testing. We did not extend our experiments to tier 3 data since these data are of much lower quality. For these 6 cell-lines annotation maps from Hoffman et al 2013 are available. The number of positive and negative bins is presented in Supplementary Table 1. DEEP-FANTOM5 component implements tissue-specific models coming from different FANTOM5 tissues. Specifically, we chose 5 tissues from vital organs for training. All other were used for testing. Supplementary Table 2 presents the number of positive samples as well as characteristics related to these tissue-specific candidate enhancer regions. Note that for the number of negative samples we chose 10 times the number of positive regions and we created tissue-specific negative datasets. During the negative dataset generation we preserved the positive dataset distribution meaning that we constructed negative samples randomly with the same minimum, maximum and mean length as the positive enhancer regions.

Selecting relevant features
To build efficient models usually requires removing redundant features and may result in the selection of a small and optimized feature set. The small number of more relevant features can sometimes improve the prediction performance of the deployed models and result in faster, more reliable and more cost-effective classification. For the DEEP-ENCODE model, we initially tested the classification performance using three different feature subsets. One subset consisting of 11 histone marks, the other one that contains 351 sequence-derived attributes, and the third that contains all the available features (362 in total). However, when we used all features or when we used only the sequence-derived features we achieved much lower performance compared to the one obtained with feature set that contains only 11 histone marks. For this reason the DEEP-ENCODE model is trained using this small subset of features. In addition, the relative small number of features (only 11 attributes) allows for the application of an exhaustive feature selection procedure. The exhaustive feature selection measures the classification performance for all possible combinations of features, which in our case are 2048 combinations.
For each cell-line we chose randomly 20% of the original data (we preserved the ratio 1:10 between positive and negative classes) and we performed classification using the same algorithm as the one deployed in DEEP-ENCODE component. We chose the combination of features that maximizes the geometric mean (GM) of sensitivity and specificity.
Supplementary Table 5 presents the results. Note that the whole process of exhaustive search took around 7 days to finish (sequential implementation in a Intel Xeon 2.5 GHz workstation).
A closer look to the previous feature subsets shows that different sets of features appear optimal for different cell-lines. We also observe that attribute H3K4me1 is always selected. This is consistent with the experimental evidence that associate this histone mark with enhancers. Apart from that, H3K9ac is selected in 5 out of 6 cell-lines, whereas H3K4me2, H3K4me3 and H3K36me3 are selected in 4 out of 6 cell-lines.
For the DEEP-FANTOM5 component it was impossible to perform exhaustive search over 351 characteristics. Instead of that, we utilized a wrapper-based feature selection tool (http://www.cbrc.kaust.edu.sa/dwfs/ that utilizes a GA for identifying relevant features. We chose the default setting and we selected features using Naïve Bayes classifier since it is much faster than other classification techniques available in this tool. Similarly to the DEEP-ENCODE component, selecting features for different tissues results in different relevant feature subsets. However, the performance we obtained using these selected features was always lower than using the performance achieved using the original feature vector. For this reason, we decided to use the original feature vector containing 351 features for the rest of our experimentation. However, identifying a more compact feature subset for this specific component remains a challenging task to be done in future.

Feature vector description used for exhausting search
A description of the 11 histone marks used for exhaustive search of the best feature combination for DEEP-ENCODE is presented in Supplementary Table 3.

Feature vector for DEEP-FANTOM5 and DEEP-VISTA
The DEEP-FANTOM5 and DEEP-VISTA components use 351 features derived from the DNA sequence itself. Supplementary Table 4 describes the feature categories. Note that ChIP-seq histone modification data for the FANTOM5/VISTA tissues are not currently available.

Tuning the ratio between testing and training
Since we chose the simple holdout approach (2-fold cross validation) for assessing the classification performance of the cell-line/tissue specific models, one important question that arises is how many positive and negative samples are sufficient to use for training. To resolve this issue we utilized an internal trial-error approach. Specifically, we performed multiple trainings with different ratios starting from 10% for training and 90% for testing, and progressively decreasing the number of testing samples (and increasing the number of negative). We considered the geometric mean of specificity and sensitivity and the positive predictive value as the most indicative performance metrics for this task. Supplementary Tables 6-11 presents the classification results. For DEEP-ENCODE component we observe that selecting the 20% of the data for training and 80% of the data for testing is most suitable and also the training time is reduced. Note that we were training thousands of SVM models. Similarly, for DEEP-FANTOM5 component we observe that 40% of the data for the training and 60% of the data for testing appears to be a suitable choice.

Testing the effectiveness of other decision-making mechanisms
In the ensemble techniques, there are many ways to combine decisions from individual models and draw the final predictions. We experimented with the simple voting schema applied to the DEEP-ENCODE component as follows: First, we applied predefined thresholds for phases 1 and 2. We considered different proportions of voting for making a final decision and we reported results by applying various decision thresholds starting from 10% up 100% of total votes, meaning that we required at least 100,200,300,400,500,600,700,800,900,950,960,970,980,990 and 1000 classifiers to vote for the positive class. Supplementary  Figure 1 presents the ROC performance curve. It becomes apparent that the ANN based decision mechanism is more sophisticated and has significant advantages over this simple voting schema. Thus, we applied the ANN decision mechanism for the experimentation presented in the manuscript.

Individual ENCODE cell-line specific models achieve poor Positive Predictive Values (PPV)
In order to examine further the capabilities of models for predicting enhancers trained on single cell-lines, we generated Precision-Recall curves for several cell-lines and tissues. Supplementary figure 2 presents these results. Overall, we show that cell-specific models achieve inconsistent performance across different cell-lines and their precision is very low compared to DEEP-ENCODE component.

Classification performance of DEEP-ENCODE with different feature subsets
To measure the effectiveness of different feature subsets used for the DEEP-ENCODE model, we calculated the average classification performance achieved using sequence-derived features and feature set that contains all the 362 attributes. We also measured the performance using different ratios of positive and negative samples. Supplementary Tables  14,15,16,17,18,19 present these results.

TABLES AND FIGURE LEGENDS
Supplementary  Detects Acetylation H3K27me3 Detects trimethylation of Lysine 27 H3K36me3 Marks actively transcribed regions H3K4me1 Associated with enhancers

H3K4me2
Marks promoters and enhancers H3K4me3 Associated with active promoters H3K79me2 Marks transcriptional transition regions H3K9ac Marks promoters in chromatin regions

H3K9me3
Associated with silenced chromatin H4K20me1 Associated with active and accessible regions     For convenience we plot the same ROC curve using the ANN decision-making mechanism.