- Split View
-
Views
-
Cite
Cite
Arshdeep Sekhon, Ritambhara Singh, Yanjun Qi, DeepDiff: DEEP-learning for predicting DIFFerential gene expression from histone modifications, Bioinformatics, Volume 34, Issue 17, September 2018, Pages i891–i900, https://doi.org/10.1093/bioinformatics/bty612
- Share Icon Share
Abstract
Computational methods that predict differential gene expression from histone modification signals are highly desirable for understanding how histone modifications control the functional heterogeneity of cells through influencing differential gene regulation. Recent studies either failed to capture combinatorial effects on differential prediction or primarily only focused on cell type-specific analysis. In this paper we develop a novel attention-based deep learning architecture, DeepDiff, that provides a unified and end-to-end solution to model and to interpret how dependencies among histone modifications control the differential patterns of gene regulation. DeepDiff uses a hierarchy of multiple Long Short-Term Memory (LSTM) modules to encode the spatial structure of input signals and to model how various histone modifications cooperate automatically. We introduce and train two levels of attention jointly with the target prediction, enabling DeepDiff to attend differentially to relevant modifications and to locate important genome positions for each modification. Additionally, DeepDiff introduces a novel deep-learning based multi-task formulation to use the cell-type-specific gene expression predictions as auxiliary tasks, encouraging richer feature embeddings in our primary task of differential expression prediction.
Using data from Roadmap Epigenomics Project (REMC) for ten different pairs of cell types, we show that DeepDiff significantly outperforms the state-of-the-art baselines for differential gene expression prediction. The learned attention weights are validated by observations from previous studies about how epigenetic mechanisms connect to differential gene expression.
Codes and results are available at deepchrome.org.
Supplementary data are available at Bioinformatics online.
1 Introduction
Gene regulation is the process of controlling gene expression. The human body contains hundreds of different cell types. Although these cells include the same set of DNA information, their functions are different. Cells resort to a host of mechanisms to regulate genes differently. Many factors, especially those in the epigenome, can affect how cells express genes differently. As reviewed in Egger et al. (2004) and Meisner and Reif (2015), epigenomics studies how gene expression is altered by a set of chemical reactions over the chromatin that do not alter the DNA sequence.
Histone modification (HM) is one set of critical chemical reactions over the chromatin that plays a crucial role in regulating gene transcription. DNA strings are wrapped around ‘bead’-like structures called nucleosomes that are composed of histone proteins. These histone proteins are prone to a variety of modifications (e.g. methylation, acetylation, phosphorylation, etc.) that can modify the spatial orientation of the DNA structure. Such modifications impact the binding behavior of transcription factor proteins (to DNA) and thus generate different forms of gene regulation. The significant role of histone modifications in influencing gene regulation was evidenced in studies like connecting anomalous histone modification profiles to cancer occurrences (Bannister and Kouzarides, 2011). Contrary to DNA mutations, such epigenetic changes are potentially reversible (Bannister and Kouzarides, 2011). This vital feature has brought histone modifications to the center stage of epigenetic therapy.
Recent advances in next-generation sequencing have allowed researchers to measure gene expression and genome-wide histone modification patterns as read counts across many cell types. These datasets have been made available through large-scale repositories, one latest being the Roadmap Epigenome Project (REMC, publicly available) (Kundaje et al., 2015). REMC has released thousands of genome-wide datasets including gene expression reads (RNA-Seq datasets), and HM reads across 100 different human cells/tissues (Kundaje et al., 2015). Multiple recent papers tried to understand gene regulation by predicting gene expression from large-scale HM signals. Related studies (summarized in Supplementary Table S2) have mostly focused on the formulation under a single cell condition, even though gene regulation undergoes differential changes by environmental triggers, from one tissue type to another, or under different cell development stages.
Differential gene expression, or difference in expression levels of the same gene in two cell conditions, controls functional and structural heterogeneity of cells and has also been implicated in a number of diseases, providing valuable tools for the discovery of therapeutic targets and diagnostic markers. Differential gene expression has been linked to aberrant HM profiles in the literature. For example, Gjoneska et al. (2015) showed the correlation between differential gene expression in different stages of Alzheimer’s disease-like neurodegeneration in mice. Further, the authors observed that the changes in HM patterns associate with the differentially regulated genes. Koch et al. (2007) reported coordinated changes between HM profiles and differential gene expression across the lymphoblastoid cell line GM06990, K562 and HeLa-S3 cell lines. As another example, Weng et al. (2012) showed links between differential expression of naive T cells versus memory T cells, ascribed mainly to changes in histone modifications.
This paper proposes an attention-based deep learning architecture to learn from datasets like REMC, how different histone modifications work together to influence genes’ differential expression pattern between two different cell-types. We argue that such differential analysis and differential understanding of gene regulation from HMs can enable new insights into principles of life and diseases, will allow for the interrogation of previously unexplored regulation spaces, and will become an important mode of epigenomics analysis in the future. Four fundamental data challenges exist when modeling such tasks through machine-learning:
Genome-wide HM signals are spatially structured and may have long-range dependency. For instance, to quantify the influence of a histone modification mark, learning methods typically need to use as input features all of the signals covering a DNA region of length 10 000 base pair (bp) centered at the transcription start site (TSS) of each gene. These signals are sequentially ordered along the genome direction. To develop ‘epigenetic’ drugs, it is important to recognize how an HM mark’s influence varies over different genomic locations.
The core aim is to understand what the relevant HM factors are and how they work together to control differential expression. Various types of HM marks exist in human chromatin that can influence gene regulation. For example, each of the five standard histone proteins can be simultaneously modified with various kinds of chemical modifications, resulting in a large number of varying histone modification marks. As shown in Figure 1, we build a feature vector representing signals of each HM mark surrounding a gene’s TSS position. When modeling genome-wide signal reads from multiple marks, learning algorithms should take into account the modular nature of such feature inputs, where each mark functions as a module. We want to understand how the interactions among these modules influence the prediction (differential gene expression).
Since the fundamental goal of such analysis is to understand how HMs affect gene regulation, it requires the modeling techniques to provide a degree of interpretability and allowing for automatically discovering what features are essential for predictions.
There exist a small number of genes exhibiting a significant change of gene expression (differential patterns) across two human cell types like A and B. This makes the prediction task using differential gene expression as outputs much harder than predicting gene expression directly in a single condition like A alone or B alone.
In this paper, we propose an attention-based deep learning model, DeepDiff, that learns to predict the log-fold change of a gene’s expression across two different cell conditions (assuming cell type A and cell type B in the rest of the paper). Here, and , where M represents the number of HM signals and T represents the number of bins. For each gene, its input signals consist of (the histone modification signals from A), (the histone modification signals from B) and (the difference matrix between HM signals of these two conditions). All three cover the gene’s neighboring 10 000 base pair regions centered at the transcription start site (TSS). (i) To tackle the first challenge of modeling spatially structured input signals, we use the Long Short-term Memory (LSTM) (Section 3.4) deep-learning module that can represent interactions among signals at the different positions of a chromatin mark. Because we model multiple HM marks, resulting in multiple LSTMs learning to embed various HM marks. (ii) To handle the second challenge of modeling how HM marks work together, we use a second-level LSTM to learn complex dependencies among different marks. (iii) For the third challenge of interpretability, we borrow ideas from the AttentiveChrome (Singh et al., 2017) that focuses on cell-specific predictions. We train two levels of ‘soft’ attention weights, to attend to the most relevant regions of a chromatin mark, and to recognize and attend to the critical marks for each differential expression prediction. Through predicting and attending in one unified architecture, DeepDiff allows users to understand how chromatin marks control differential gene regulation between two cell types. (iv) For the last challenge of difficult label situation for differential expression prediction, we design a novel multi-task framework to use the cell-type-specific prediction network as auxiliary tasks to regularize our primary task of differential expression prediction. The cell-type specific system (one for cell A and another one for cell B) also uses attention plus the hierarchical LSTMs formulation. Further, we introduce an additional auxiliary loss term that encourages the learned embeddings of HM inputs and to be far apart for differentially expressed genes.
In summary, DeepDiff provides the following technical contributions:
DeepDiff uses a hierarchy of two levels of gene-specific attention mechanisms to identify salient features at both the bin level and the HM levels. It can model highly modular inputs where each module is highly structured. Attention weights enable our model to explain its decisions naturally by providing ‘what’ and ‘where’ in HM signal inputs is important for the differential gene expression output. This flexibility and interpretability make this model an ideal technique for handling large-scale epigenomic data analysis.
We introduce an auxiliary task and auxiliary loss formulation to aid the main task of differential gene expression prediction. The proposed multitasking framework couples the two related tasks of cell-type-specific predictions and the main task of differential expression prediction. This auxiliary formulation provides the model with additional information from the auxiliary evidence, encouraging richer feature embeddings compared to only difference HM features. It helps DeepDiff to build on top of the state-of-the-art AttentiveChrome (Singh et al., 2017) and can borrow auxiliary features from AttentiveChrome tasks. Further, we introduce a novel auxiliary loss term inspired by the contrastive loss (Hadsell et al., 2006) of the Siamese architecture formulation. This loss term encourages the model to learn embeddings whose neighborhood structures in the model’s representation space are more consistent with the differential gene expression pattern.
To the authors’ best knowledge, DeepDiff is the first deep learning based architecture for relating histone modification and differential gene expression patterns. DeepDiff provides more accurate predictions than state-of-the-art baselines. Using datasets from REMC, we evaluate DeepDiff on ten different pairs of cell types. We validate the learned attention weights using previous observations obtained from HM enrichment analysis across differentially regulated genes.
2 Previous works
Multiple computational methods have been proposed to employ HMs for predicting gene expression using large-scale histone modification datasets. Recent methods in the literature can be roughly grouped into three categories with respect to the formulation of outputs: regression, classification or ranking. (i) The regression-based models include linear regression (Costa et al., 2011; Karlić et al., 2010) and Support Vector Regression (SVR) (Cheng and Gerstein, 2012). Cheng and Gerstein (2012) divided the DNA regions around TSS (transcription start site) and TTS (transcription terminal site) into small bins of 100 base pairs and used a multiple bin-specific Support Vector Regression (SVR) to model HM signals for gene expression prediction. They extended this SVR model to predict differential gene expression between mouse embryonic stem cell and neural progenitor cell using the difference of the HM signals as features/inputs. This study also uses a two-layer SVR model to integrate information from multiple bins. The first layer is a bin-specific SVR model for histone modification features. A second layer takes as input the predictions from the first layer across all bin positions and predicts a single regression output for differential gene expression. (ii) Frasca and Pavesi (2013) proposed a ranking based cell type-specific model that formulates gene expression prediction as a ranking task such that high ranks correspond to high levels of gene expression and low rank corresponds to low expression. (iii) Multiple studies used classification-based formulation to model the gene expression prediction from HM inputs. This includes support vector machines (Cheng and Gerstein, 2012), random forests (Dong et al., 2012; Li et al., 2015a), rule-based learning (Ho et al., 2015) and deep learning frameworks like DeepChrome (Singh et al., 2016) and AttentiveChrome (Singh et al., 2017). (Dong et al., 2012) used a random forest classifier to predict genes as silent or transcribed, while the authors also used linear and multivariate regression to predict gene expression values. Li et al. (2015a) presented a two-step process—feature selection, then followed by prediction for differential gene expression. It used the so-called ReliefF (Kononenko et al., 1997) based feature selection and Random Forest Classification. DeepChrome (Singh et al., 2016) and Attentive Chrome (Singh et al., 2017) are deep learning based frameworks for cell-type specific gene expression prediction. DeepChrome (Singh et al., 2016) used Convolution Neural Nets (CNN). Differently, AttentiveChrome (Singh et al., 2017) used a hierarchical attention-based deep learning architecture to predict gene expression from HM reads. Supplementary Table S2 compares all the aforementioned related studies for gene expression prediction.
3 Materials and methods
3.1 Background: Recurrent Neural Networks and Long Short-Term Memory (LSTM)
A variant of the RNN, LSTM (Hochreiter and Schmidhuber, 1997), further improves upon the basic RNN [Eq. (1)] to model long-term dependencies. In addition to the hidden state-to-state recurrent component in an RNN, an LSTM layer has a recurrent cell state updating function and gating functions. The gating functions control what information needs to be added or removed from the cell state. This combination of cell state and gating functions allows the LSTM to learn long-term dependencies while avoiding vanishing and exploding gradients. Similar to a basic RNN, when given input vector and the state from previous time step , an LSTM module also produces a new state vector . For our task, we call each bin position on the genome coordinate a ‘time step’.
3.2 Attention-based deep-learning models
Deep neural networks augmented with attention mechanisms have obtained great success on multiple artificial intelligence topics such as machine translation (Dzmitry et al., 2014), object recognition (Jimmy et al., 2014; Volodymyr et al., 2014), image caption generation (Xu et al., 2015), question answering (Ilya et al., 2014), text document classification (Zichao et al., 2016), video description generation (Li et al., 2015b), visual question answering (Huijuan and Kate, 2016), or solving discrete optimization (Oriol et al., 2015). The idea of attention in deep learning is inspired by the properties of the human visual system. When perceiving a scene, the human vision fixates more on some areas over others, depending on the task at hand (Corbetta and Shulman, 2002). Augmenting deep learning models with attention allows them to focus selectively on only relevant features for a prediction. Different attention mechanisms have been proposed in the literature, including ‘soft’ attention (Dzmitry et al., 2014), ‘hard attention’ (Xu et al., 2015; Minh-Thang et al., 2015), or ‘location-aware attention’ (Chorowski et al., 2015). Soft attention (Dzmitry et al., 2014) calculates a ‘soft’ weighting scheme over all the components of an input. These weights indicate the relative importance of each feature component for a given prediction. The weights are then used to compute a summary representation of the input as a weighted combination of the components. The magnitude of an attention weight correlates highly with the degree of significance of the corresponding component to the prediction. This property is particularly ideal for adapting deep learning to biology tasks, as it gives users interpretable information regarding how features contribute to a prediction. Recently, AttentiveChrome (Singh et al., 2017) introduced two levels of attention at the bin and the histone-modification level, enabling the user to get information about which features were responsible for each prediction at the sample level.
3.3 Input generation
We focus on the predictive modeling of differential gene expression given the histone modification profiles of a gene in two cell-types. We formulate the output as the log fold change in expression given the histone modification profiles for the two cell-types under consideration. Similar to DeepChrome (Singh et al., 2016) and AttentiveChrome (Singh et al., 2017), we divided the 20 000 basepair (bp) DNA region (±10 000 bp) around the transcription start site (TSS) of each gene into bins of length 100 bp. Each bin includes 100 bp long adjacent positions flanking the TSS of a gene.We consider five core histone modification marks that have been uniformly profiled across multiple cell types in the REMC database (Kundaje et al., 2015). Supplementary Table S4 summarizes the 5 HMs we use and their associated functional regions on the genome. Figure 1(a) summarizes our input matrix generation strategy. More concretely, our input for each gene includes two 5 × 200 matrices, each matrix corresponding to each of the two cell types under consideration [depicted in Fig. 1(b)]. Columns and rows in each matrix represent bins and histone modifications, respectively. Thanks to the capability of neural networks for learning meaningful representations, we do not perform any feature selection before feeding the matrices in Figure 1(b) to the proposed DeepDiff model variations, a hierarchical attention-based DNN. Compared to previous studies (Supplementary Table S2), DeepDiff does not need to explore the best-bin, averaged or other feature selection strategies. As an end-to-end strategy (raw features to predictions), DeepDiff eliminates the need to evaluate different feature selection strategies.
3.4 DeepDiff: learning meaningful representations through two levels of embedding and attention modules
Notations: Our training set consists of Nsamp gene samples in the form of pairs, where . Given two cell-types A and B, and a gene g under consideration, the HM profile of gene g in A and B is denoted as and , respectively. We consider M = 5 HM marks for each gene. Each HM signal across the T = 200 bins in cell-type A is represented by a row vector in . Similarly, for cell-type B, each HM signal is represented by a row vector in .
We do not perform any feature selection on the raw histone modification features. Instead, we use deep learning modules to learn sensible features. Our raw features have two important properties (as depicted in Fig. 1): (i) In addition to the raw HM signals from the two cell types under consideration, we use difference and concatenation of the raw HM signals. This results in modular raw input features with four possible input matrices, rows corresponding to HM vectors [as shown in Fig. 1(b)]. (ii) These HM vectors are spatially structured along the genome coordinate. Considering the spatially structured raw features and their modular property, we use two levels of basic embedding modules: Level I and Level II embedding units coupled with two levels of attention modules. These basic modules used in DeepDiff variations are illustrated in Figure 2. Now, we explain how we use deep learning to learn the representation of each matrix in Figure 1(b).
Level I embedding (f1): The Level I Embedding module consists of a bin-level LSTM for learning the embedding of every HM, followed by a bin level attention mechanism. The bin level LSTM sequentially models the signal at each bin position. We name LSTMs, for all input HMs, put together, as the Level I Embedding module. The Level I Embedding module f1 consists of multiple bidirectional LSTMs, one for each input HM. The LSTMj corresponding to HM j, takes as input the jth HM, i.e, jth row vector in matrix X, . A bidirectional LSTM has one LSTM in each direction. The forward LSTM models dependencies in in the direction 1 to T, i.e. where is of hidden state size D. The backward LSTM models the dependencies from T to 1: that is . Hence, . The output of the bidirectional LSTM is a concatenation of the hidden state output of the forward and backward LSTMs at each t position: , where is of size .
Level II embedding (f2): To efficiently represent the combinatorial dependencies between the various HMs, we use another LSTM as a second level embedding module. This LSTM takes as input the Level I Embedding output where . In detail, the jth HM embedding from Level I Embedding module is used as input to the jth time step in a bidirectional LSTM. The LSTM will generate an embedding vector for jth time step: .
This attention weight βj intuitively represents the relative contribution of the HM to the summary representation of the whole matrix X. To get a summarized embedding of the HMs, we use the outputs at all time steps of the HM level LSTM weighted by its attention score as the final embedding of describing X i.e. . Including the bin level as well as HM-level attention weights representing X allows us to interpret which bins in which HMs were relatively more important for the current prediction.
3.5 DeepDiff main task: an end-to-end deep-learning architecture for regression
We formulate the differential gene expression prediction (our main task) as a regression task. The target label for a gene is the log fold change of its expression in Cell-type A versus its expression in Cell-type B. In DeepDiff, the learned representation vector v from Level II Embedding is fed into a multi-layer perceptron (MLP) module to learn a regression function, a mapping from HM profiles to the target real value representing differential pattern. In detail, this prediction module comprises a standard, fully connected multi-layer perceptron network with multiple alternating linear and non-linear layers. Each layer learns to map its input to a hidden feature space, and the last output layer learns the mapping from the hidden space to the output label space.
3.6 DeepDiff with multitasking: learning better representations with auxiliary tasks
We then extend the basic DeepDiff mentioned above into a multi-task learning formulation, in order to learn better joint representations informed by auxiliary tasks (details below). Multi-task learning (Multitasking) was initially proposed by Caruana (1997) to find common feature representations across multiple relevant tasks. Most of the multitasking studies have focused on neural networks (Caruana, 1997), where some hidden layers are shared between various tasks. If different tasks are sufficiently related, multitasking can lead to better generalization. In this paper, we consider two such related tasks as auxiliary tasks for multi-task learning with our DeepDiff main task:
Cell-Specific Auxiliary – (Auxiliary-Task-A + Auxiliary-Task-B): We posit the cell-type specific gene expression prediction in each of the two cell-types (A and B) as the Cell-Specific Auxiliary tasks to regulate the main DeepDiff in the training phase. We call these two tasks in cell-types A and B as Auxiliary-Task-A and Auxiliary-Task-B, respectively. Since our main DeepDiff task uses the log-fold change of the counts of gene expression as the target y, we use the log of counts of gene expression in cell-type A and cell-type B as the target value in Auxiliary-Task-A and Auxiliary-Task-B, respectively. To handle zero values of expression (log 0 gives NA), we add 1 to all counts. Besides, we also evaluate using binarized cell-type specific gene expression as the label for the auxiliary tasks (i.e. as classification as opposed to regression. See details in Supplementary Material). For Auxiliary-Task-A, is passed through two levels of embedding and attention modules that are specific for cell-type A. The output embedding of the Level II Embedding unit is passed through an MLP layer . is used to map (from ) to the target value for the gene in Cell-type A. Similarly, for the Auxiliary-Task-B, , obtained by passing through two levels of embedding modules, is passed through an MLP layer for Cell-type B specific gene expression prediction. By jointly training cell-type specific gene expression with differential gene expression, we can improve the main DeepDiff task performance (see section 4 for experimental results). We train both of these auxiliary tasks using the sum of the MSE loss averaged over the training set (Eq. (6)) between the gene expression target and predicted values for both cell-types. We denote this cumulative Cell-Specific Auxiliary loss for both cell-types as .
Many possible variations of DeepDiff exist through the various combinations of the auxiliary tasks in the multi-tasking framework as well as the variations of raw HM features. For example, we can use all auxiliary tasks to multi-task with the main task. This means the sum of the main and auxiliary task losses is used as part of the training objective: (i) Differential expression prediction loss for the main task (), (ii) Auxiliary task loss () from the Cell-specific prediction tasks and (iii) Contrastive Siamese Loss () from the Siamese-Auxiliary Task. The network is trained using similar steps outlined in the main DeepDiff task with the sum of these losses as training objective.
In our experiments, we have evaluated the following DeepDiff variations:
(Raw: d) Raw Difference Features;
(Raw: c) Concatenation of Raw HM features;
(Raw) Concatenation and difference of raw HM features;
(Aux) Auxiliary Embeddings as Features;
(Raw+Aux) Concatenation and Difference of HMs + Embeddings from Auxiliary tasks;
(Aux+Siamese) Auxiliary Features with Siamese Contrastive Loss;
(Raw+Aux+Siamese) Raw and Auxiliary Features with Siamese Contrastive Loss
Figure 3 and Supplementary Table S1 show the different auxiliary tasks and resulting different DeepDiff variations through various combinations with the raw HM features. Due to space limitation, we have the detailed description of each variation in Supplementary Section S1.3. In the rest of the paper, we use the short names enclosed in parantheses above to refer to each variation.
4 Experimental setup and results
4.1 Dataset
We downloaded gene expression and HM signal data of five core histone modification signals for ten different cell types from the REMC database (Kundaje et al., 2015). Supplementary Table S5 summarizes the IDs and information of the ten cell-types we use that have been extensively profiled by the Roadmap Epigenomics Project. Supplementary Table S4 describes the five core histone modifications marks we use along with their known important roles in gene regulation. For the regression labels, we use the log fold change of raw counts in the two cell-types under consideration as the target label for the main task and the log of the raw counts in each cell type for the two cell-specific auxiliary tasks. In total, we apply DeepDiff variations and baselines (see Section 4.2) on ten pairs of cell-types from REMC. Supplementary Table S6 provides a list of the ten cell-type pairs. For each cell type pair, we have a sample set of total 18 460 genes. This set was divided into 3 separate folds: training (10 000 genes), validation (2360 genes) and test (6100 genes) folds.
4.2 Baselines
We compare DeepDiff to two variations of the Support Vector Regression (SVR) (Cheng and Gerstein, 2012), the ReliefF based feature selection followed by Random Forest Regression (Li et al., 2015a) and the AttentiveChrome (Singh et al., 2017). In details:
Single Layer SVR (Cheng and Gerstein, 2012,): The authors selected 160 bins from regions flanking each gene TSS and TTS. Each bin uses a separate SVR, resulting in 160 different bin-specific SVR models. The radial basis kernel is used for the SVR. Cheng and Gerstein (2012) proposed to use the best-bin strategy. Therefore, by using cross-validation, we pick the best bin based on Pearson Correlation Coefficient (PCC) used by Cheng and Gerstein (2012). The best bin model is then used for prediction on the test set.
Two Layer SVR (Cheng and Gerstein, 2012,): The two-layer model in Cheng and Gerstein (2012) seeked to combine the signals of all HMs across all the 160 bins. In the first layer, it predicts expression levels in each of the bins using a bin-specific SVR model in each individual bin. Then the expression levels predicted by each bin are combined in the second layer using another SVR model to make a final prediction. The radial basis kernel is used for both layers of the SVR.
ReliefF Feature Selection + Random Forest (Li et al., 2015a,): We implement the best performing combination in Li et al. (2015a): ReliefF algorithm for feature selection followed by random forest. While Li et al. (2015a) treat the problem as a binary classification with two classes of upregulated versus downregulated genes, we used this baseline for our regression formulation. The number of features for ReliefF is selected from the set of {50, 70, 100}. The number of trees for Random Forest is selected from the set of .
AttentiveChrome (Singh et al., 2017): We also compare our models to the differential patterns derived from predictions made by AttentiveChrome. We train AttentiveChrome for cell type specific gene expression as a regression task. Then we calculate the differential gene expression prediction as the log fold change between the predicted expression from the two trained models that are specific to the two cell-types under consideration. In detail, for each pair of cell-types, we train two cell-specific AttentiveChrome models independently from each other. We then use the cell type specific predictions of each model to calculate differential gene expression. Clearly, this baseline is not an end-to-end solution for differential gene expression prediction.
We implemented SVR and Random Forest baselines using the scikit-learn (Pedregosa et al., 2011) package. We implemented AttentiveChrome and DeepDiff models in Pytorch. We use Pearson Correlation Coefficient to evaluate our models. We train all the variations of DeepDiff (summarized in Supplementary Table S1) were trained using our training set and tune the hyperparameter on the validation set. The best performing models were then evaluated on the test set. The details about evaluation metric and hyperparameters for DeepDiff and baselines are in the Supplementary Section S1.5.
4.3 Performance evaluation
Figure 4 shows the Pearson Correlation Coefficient (y-axis) for all DeepDiff variations versus the baselines. The x-axis shows the ten cases (cell-type pairs) in our experiments. The deep learning based models outperform both the SVR as well as Random Forest baselines. The two-layer SVR model performs better than one layer SVR. When comparing DeepDiff variations to the AttentiveChrome baseline, DeepDiff also outperforms, indicating the need for modeling differential gene expression prediction. Among the DeepDiff variations, the Raw: d model performs the worst in 9 out of the 10 cases. This indicates that simply taking the difference of the HM signals is not enough to model the combinatorial interactions of HMs for differential regulation. Instead, using all HM features, instead of only the difference, is clearly helpful, as indicated by the higher PCC of Raw: c in comparison to Raw: d across all 10 cases. The results of Raw+Aux show that adding cell-type specific gene expression prediction in the two cell-types as auxiliary tasks clearly helps in improving the prediction performance. Combining Aux and Raw features gives better performance than only Aux and Raw in 7 and 6 out of the 10 cell type pairs, respectively. Aux+Siamese is the best performing model in 5 out of the 10 cases. This shows that adding the Siamese-based contrastive loss improves the prediction performance. Table 1 shows the mean and median of the relative performance (%) with respect to PCC when being compared to the best performing baselines: two-layer SVR and AttentiveChrome. For example, as shown in Table 1, when averaged across the 10 cases, combining the raw and auxiliary features (Raw+Aux variation) results in a relative PCC with respect to the two-layer SVR baseline of 162.23%.
Method . | Two-layer SVR . | AttentiveChrome . | ||
---|---|---|---|---|
. | Mean . | Median . | Mean . | Median . |
Raw: d | 153.72 | 156.81 | 108.90 | 102.14 |
Raw: c | 163.59 | 166.64 | 115.75 | 107.64 |
Raw | 162.94 | 166.55 | 115.26 | 107.71 |
Aux | 157.64 | 166.58 | 111.20 | 106.13 |
Raw+Aux | 162.23 | 164.73 | 114.61 | 106.62 |
Aux+Siamese | 161.85 | 168.90 | 114.19 | 107.87 |
Raw+Aux+Siamese | 161.95 | 166.07 | 114.44 | 107.56 |
Method . | Two-layer SVR . | AttentiveChrome . | ||
---|---|---|---|---|
. | Mean . | Median . | Mean . | Median . |
Raw: d | 153.72 | 156.81 | 108.90 | 102.14 |
Raw: c | 163.59 | 166.64 | 115.75 | 107.64 |
Raw | 162.94 | 166.55 | 115.26 | 107.71 |
Aux | 157.64 | 166.58 | 111.20 | 106.13 |
Raw+Aux | 162.23 | 164.73 | 114.61 | 106.62 |
Aux+Siamese | 161.85 | 168.90 | 114.19 | 107.87 |
Raw+Aux+Siamese | 161.95 | 166.07 | 114.44 | 107.56 |
Method . | Two-layer SVR . | AttentiveChrome . | ||
---|---|---|---|---|
. | Mean . | Median . | Mean . | Median . |
Raw: d | 153.72 | 156.81 | 108.90 | 102.14 |
Raw: c | 163.59 | 166.64 | 115.75 | 107.64 |
Raw | 162.94 | 166.55 | 115.26 | 107.71 |
Aux | 157.64 | 166.58 | 111.20 | 106.13 |
Raw+Aux | 162.23 | 164.73 | 114.61 | 106.62 |
Aux+Siamese | 161.85 | 168.90 | 114.19 | 107.87 |
Raw+Aux+Siamese | 161.95 | 166.07 | 114.44 | 107.56 |
Method . | Two-layer SVR . | AttentiveChrome . | ||
---|---|---|---|---|
. | Mean . | Median . | Mean . | Median . |
Raw: d | 153.72 | 156.81 | 108.90 | 102.14 |
Raw: c | 163.59 | 166.64 | 115.75 | 107.64 |
Raw | 162.94 | 166.55 | 115.26 | 107.71 |
Aux | 157.64 | 166.58 | 111.20 | 106.13 |
Raw+Aux | 162.23 | 164.73 | 114.61 | 106.62 |
Aux+Siamese | 161.85 | 168.90 | 114.19 | 107.87 |
Raw+Aux+Siamese | 161.95 | 166.07 | 114.44 | 107.56 |
4.4 Interpreting differential regulation using attention
Finally, we analyze the attention weights of Level II Embedding for one of the best performing case: cell type pair E116 and E123. Here, E116 represents ‘normal’ blood cell (GM12878) whereas E123 represents the leukemia cell (K562). We aim to validate that the learned attention weights can provide some insights into the differential gene regulation across these two selected cell types: a normal and a diseased (cancer) cell state. We only use the attention weights obtained on the test set for this analysis. First, we get a list of down-regulated genes (with log fold change ) and a list of up-regulated genes (with log fold change >8.0) from the test set. Figure 5 plots the average attention weights of each of the 5 HMs when considering all of our selected up-regulated (black bars) and down-regulated genes (gray bars). We observe that for the top upregulated genes, H3K4me1 (enhancer associated) and H3K4me3 (promoter associated) get the highest weights (i.e. contributing more importance). This is consistent with observations by Grégoire et al. (2016) that upregulated genes are more enriched for H3K4me1 and H3K4me3 when under cancer condition.
For top down-regulated genes (Fig. 5), H3K27me3 (Polycomb Repression associated) gets a comparatively higher weight (the second highest attention weight). Grégoire et al. (2016) also reported this trend showing that down-regulated genes are more enriched with the repressive H3K27me3 under the cancer condition whereas up-regulated genes do not show variation of this HM between normal and cancer conditions. Figure 5 shows that H3K9me3 (heterochromatin linked) gets low attention weights for both up-regulated and down-regulated genes. Grégoire et al. (2016) also reported this trend that the divergence of expression is less likely due to this HM. We want to emphasize again that our model learns the importance of HMs for the differentially expressed genes in an end-to-end manner.
5 Conclusion
We have presented DeepDiff, a deep learning framework for differential gene expression prediction using histone modifications (HMs). DeepDiff is an attention-based deep learning architecture designed to understand how different HMs work together to influence changes in expression patterns of a gene between two different cell types. DeepDiff uses a modular architecture to represent the spatially structured and long-range HM signals. It incorporates a two-level attention mechanism that gives it the ability to find salient features at the bin level as well as the HM level. Additionally, to deal with fewer differentially expressed genes between two cell types, we design a novel multi-task framework to use the cell-type-specific prediction network as auxiliary tasks to regularize our primary task of differential expression prediction. We also incorporate a Siamese contrastive loss term to further improve the learned representations. For the future work, we will evaluate the performance and the attention scores of DeepDiff on more cell type pairs. We will incorporate additional epigenomic signals that may relate to differential gene expression. We would also like to explore different ways to interpret and validate the attention weights. In summary, leveraging deep learning’s ability to extract rich representations from data can enhance our understanding of gene regulation by HMs, thus enabling insights into principles of gene regulation through epigenetic factors.
Conflict of Interest: none declared.
Funding
This work was partly supported by the National Science Foundation under NSF CAREER award No. 1453580. Any Opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the National Science Foundation.
References