Abstract

Motivation

Nucleosome positioning plays a significant role in proper genome packing and in the accessibility of DNA for transcription regulation. Despite the multitude of nucleosome positioning resources available online, including experimental datasets of genome-wide nucleosome occupancy profiles and computational tools for analyzing these data, the complex language of eukaryotic nucleosome positioning remains incompletely understood.

Results

Here, we address this challenge using a state-of-the-art machine learning approach. We present a novel convolutional neural network (CNN) for understanding nucleosome positioning, combining Inception-like modules with a gating mechanism to respond to multiple sequence patterns and to capture long-range associations in DNA sequences. Based on this CNN, we developed the open-source package LeNup to predict nucleosome positioning in the Homo sapiens, Caenorhabditis elegans, Drosophila melanogaster and Saccharomyces cerevisiae genomes. We trained LeNup on four benchmark datasets, and it achieved greater predictive accuracy than previously published methods.

Availability and implementation

LeNup is freely available as Python and Lua source code under a BSD-style license from https://github.com/biomedBit/LeNup.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Nucleosome positioning broadly indicates where nucleosomes are located with respect to the genomic DNA sequence (Struhl and Segal, 2013). Composed of DNA and a protein core, nucleosomes are about 10 nm in diameter and are the fundamental repeating unit of the chromatin structure of eukaryotic DNA (Kornberg and Lorch, 1999; Richmond and Davey, 2003). The core is an octamer containing two copies each of histones H2A, H2B, H3 and H4. The histone octamer forms a wedge-shaped disk, around which 147 base pairs of DNA are tightly wrapped in approximately 1.7 turns of a left-handed superhelix (Luger et al., 1997). The DNA segment connecting two adjacent nucleosomes is referred to as a linker. Nucleosome positioning is critical to various biological processes, primarily because precise positioning modulates the accessibility of the underlying genomic sequence to DNA-binding proteins, thereby regulating transcription (Liu M. et al., 2015; Schones et al., 2008; Tilgner et al., 2009; Whitehouse et al., 2007), DNA replication (Eaton et al., 2010; Liu et al., 2017; Vasseur et al., 2016) and recombination (Pulivarthy et al., 2016; Smagulova et al., 2011). Therefore, identifying nucleosome positioning along genomic sequences may allow an in-depth understanding of various biological outcomes. Although many studies support the view that the genome-wide pattern of nucleosome positioning is associated with DNA sequence, nucleosome remodelers and transcription factors, including activators, components of the preinitiation complex and elongating Pol II (Segal and Widom, 2009; Struhl and Segal, 2013), the determinants of nucleosome positioning are still far from quantitatively understood. The intrinsic DNA sequence preferences of nucleosomes may play a dominant role in nucleosome organization (Kaplan et al., 2009). Early discoveries indicated that distinctive sequence motifs play an important role in nucleosome positioning, including the roughly 10-bp periodic repetition of AA/TT/TA dinucleotides (Ioshikhes et al., 1996; Satchwell et al., 1986) and the TATAAACGCC repeat sequence (Widlund et al., 1999). Some studies established that nucleosome organization is encoded in eukaryotic genomes and that this intrinsic organization can explain approximately 50% of in vivo nucleosome positions (Segal et al., 2006). Another study claimed that about 75% of nucleosomes are characterized by sequence (Ioshikhes et al., 2011).

In the last decade, high-throughput genome-wide data on nucleosome positioning have come from a number of related techniques, such as MNase-seq (Jiang and Pugh, 2009; Kaplan et al., 2009; Weiner et al., 2009), DNase-seq (Bell et al., 2011; Guertin and Lis, 2013; Liu et al., 2016; Zhong et al., 2016) and ChIP-seq (Schones et al., 2008). These techniques share a common idea: cut the DNA between nucleosomes and map the protected DNA regions. High-resolution genome-wide nucleosome maps have been obtained for several organisms, including yeast (Brogaard et al., 2012; Lee et al., 2007; Yuan et al., 2005), Drosophila (Mavrich et al., 2008), Caenorhabditis elegans (Valouev et al., 2008) and human (Barski et al., 2007; Schones et al., 2008; Valouev et al., 2011). These high-resolution data have greatly promoted the development of computational methods for accurately predicting nucleosome positioning (Awazu, 2017; Chen et al., 2012; Guo et al., 2014; Gupta et al., 2008; Morozov et al., 2009; Segal et al., 2006; van der Heijden et al., 2012; Wang et al., 2012; Xi et al., 2010).

Assuming that each 147-bp sequence favoring histone-DNA interaction is generated by a Markov chain, Segal et al. proposed a probabilistic model (Segal et al., 2006) to predict genome-wide nucleosome positioning in yeast. The model was later improved by incorporating information from linker sequences (Field et al., 2008). N-score (Yuan and Liu, 2008) distinguishes nucleosomal from non-nucleosomal sequences by combining a wavelet-analysis-based model with logistic regression to predict nucleosome positions from DNA sequence. NuPoP (Xi et al., 2010) models the DNA sequence with a duration hidden Markov model of two alternating states: nucleosome (N) and linker (L). A fourth-order time-dependent Markov chain was trained for the N state and a homogeneous fourth-order Markov chain for the L state; NuPoP outputs a nucleosome occupancy score and a nucleosome affinity score. Inspired by the PseAAC approach (Chou, 2001; Chou, 2005), the sequence-based predictor iNuc-PseKNC (Guo et al., 2014), which represents nucleosome positioning in genomes with pseudo k-tuple nucleotide composition, was proposed. DNA sequence samples were formulated using six basic DNA local structural properties, and a support vector machine (SVM) classifier was trained on datasets from H. sapiens, C. elegans and Drosophila melanogaster. iNuc-PseKNC was shown to predict nucleosome positioning better than previously developed predictors. Using a methodology similar to iNuc-PseKNC, improved models were more recently developed for the prediction of nucleosome positioning (Awazu, 2017; Chen et al., 2016).

These computational methods and tools have advanced the understanding of nucleosome positioning. However, most of these algorithms depend heavily on either recognizing the distribution of nucleotides in nucleosome sequences (Awazu, 2017; Guo et al., 2014; Segal et al., 2006; van der Heijden et al., 2012; Wang et al., 2012; Xi et al., 2010) or measuring biophysical and/or physicochemical properties (Chen et al., 2012; Minary and Levitt, 2014). As nucleosome positioning is strongly affected by DNA sequence (Gonzalez, 2016; Miele et al., 2008; Segal and Widom, 2009; Struhl and Segal, 2013; Zhang et al., 2009), computers may automatically learn a representation of nucleosome positioning from the DNA sequences themselves. This idea can be realized by deep learning (Hinton and Salakhutdinov, 2006; Kelley et al., 2016; LeCun et al., 2015; Leung et al., 2014), which allows computational models to learn representations of data (Bengio et al., 2013) at multiple levels of abstraction. Deep learning has produced extremely promising results in image recognition (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), natural language understanding (Collobert et al., 2011a), genetic variant scoring (Xiong et al., 2015), the game of Go (Silver et al., 2016) and cancer classification (Esteva et al., 2017).

In this study, a novel nucleosome positioning predictor was developed based on convolutional neural networks (CNNs). We designed a deep-learning network composed mainly of GoogLeNet Inception-style modules (Szegedy et al., 2016) and gated convolutional networks (Dauphin et al., 2016). After training, the performance of the system was measured on a separate set of examples called a test set. The predictor outperformed recently developed predictors on the same benchmark datasets of human, worm, fly and yeast genomes.

2 Materials and methods

To learn nucleosome positioning, we introduce a new deep convolutional architecture composed of the Inception deep convolutional architecture (Szegedy et al., 2015) and gated convolutional networks (Dauphin et al., 2016). Gated convolutional networks were originally introduced for language modeling, where they outperformed strong recurrent models.

2.1 Benchmark datasets of nucleosome positioning and nucleosome-disfavoring sequences

The benchmark datasets of nucleosome positioning and nucleosome-disfavoring sequences were downloaded from the Supplementary Material of two published papers (Chen et al., 2016; Guo et al., 2014). These datasets involve H. sapiens, C. elegans, D. melanogaster (Guo et al., 2014) and Saccharomyces cerevisiae (Chen et al., 2016). Only the low-biased benchmark datasets were used to train and test LeNup in this study.

2.2 Principle of a deep learning network for one-dimensional sequences

In a deep learning network, one processing step is usually called a layer, which could be a convolution layer, a ReLU layer, a pooling layer, a dropout layer, a normalization layer, a fully connected layer, a loss layer, etc. Unlike the three-dimensional feature tensor of an image, a one-dimensional DNA sequence has only a two-dimensional feature matrix, whose depth and width correspond to its number of rows and columns, respectively. A 2D matrix has no height, but we say its height equals 1 to be consistent with 3D feature tensors. Figure 1 illustrates the process of applying multiple convolutions, ReLU and pooling to the sequence feature matrix. Suppose we are considering the $l$th layer, whose input is a two-dimensional feature matrix $X^l \in \mathbb{R}^{m \times N}$. Assuming $D$ filters are used, each with spatial span $m \times k$ (for instance, $k = 3$), we pad the feature matrix by adding $\lfloor k/2 \rfloor$ all-zero columns to the head and tail of the matrix; the width of the new features after a convolution with stride 1 is therefore still $N$. The rectified linear unit (ReLU), $f(z) = \max(0, z)$, is applied in the networks. Pooling then merges every two adjacent elements into one. As shown in Figure 1, the output is the two-dimensional feature matrix $X^{l+1} \in \mathbb{R}^{D \times N/2}$.
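To make the pipeline concrete, the following minimal NumPy sketch (an illustration, not the LeNup implementation; all names are hypothetical) one-hot encodes a 147-bp sequence and applies a zero-padded stride-1 convolution, ReLU and width-2 max pooling, reproducing the shapes described above:

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA string as a 4 x N matrix (rows: A, C, G, T)."""
    idx = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    X = np.zeros((4, len(seq)))
    for j, base in enumerate(seq):
        X[idx[base], j] = 1.0
    return X

def conv1d_relu(X, W, b):
    """Zero-padded stride-1 convolution followed by ReLU.
    X: (m, N) features; W: (D, m, k) filters; b: (D,) biases. Returns (D, N)."""
    D, m, k = W.shape
    N = X.shape[1]
    Xp = np.pad(X, ((0, 0), (k // 2, k // 2)))    # all-zero columns at head and tail
    out = np.empty((D, N))
    for d in range(D):
        for i in range(N):
            out[d, i] = np.sum(W[d] * Xp[:, i:i + k]) + b[d]
    return np.maximum(out, 0.0)                   # ReLU: f(z) = max(0, z)

def max_pool(X, w=2, s=2):
    """Merge each window of w adjacent elements per row into one value."""
    n_out = (X.shape[1] - w) // s + 1
    return np.stack([X[:, i * s:i * s + w].max(axis=1) for i in range(n_out)], axis=1)

X = one_hot(("ACGT" * 37)[:147])                  # a toy 147-bp sequence
W = np.random.randn(8, 4, 3) * 0.1                # D = 8 filters of span 4 x 3
b = np.zeros(8)
print(max_pool(conv1d_relu(X, W, b)).shape)       # (8, 73): D rows, ~N/2 columns
```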

Fig. 1.

The schema of convolution, ReLU and pooling in a deep learning network for one-dimensional sequences

2.3 Inception networks

Starting in 2014, the quality of network architectures improved significantly through the use of deeper and wider networks. The Inception architecture of GoogLeNet (Szegedy et al., 2015) performed well even under strict constraints on memory and computational budget. Inception models used to be trained in a partitioned manner, with each replica partitioned into multiple sub-networks so that the whole model could fit in memory (Szegedy et al., 2016). A practically useful aspect of Inception-style networks is that they align with the intuition that nucleosome positioning information should be processed at various scales and then aggregated, so that the next stage can abstract features from different scales simultaneously. Figure 2 shows the original Inception module (Szegedy et al., 2015) used in visual recognition.

Fig. 2.

Original Inception module for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014

2.4 Gated convolutional networks

Gates allow a network to control which information should be propagated through the hierarchy of layers (Dauphin et al., 2016). This mechanism makes it easier to capture long-range dependencies in language modeling, as it allows the model to select which words are relevant for predicting the next word.

Figure 3 shows that a DNA sequence is converted to a 'one-hot' representation, where each position has a four-element vector with one component set to one and the others set to zero. The encoded sequences are then converted to HDF5 format. Figure 3 illustrates that we compute the hidden layers $H_0, \ldots, H_L$ as
$$X^{l+1} = (W^l * X^l + b^l) \otimes \sigma(V^l * X^l + c^l)$$
(1)
where $X^l \in \mathbb{R}^{m \times N}$ is the input of layer $H_l$, i.e. either the vector sequence converted from the nucleosome nucleotide sequence or the output of the previous layer; $W^l \in \mathbb{R}^{m \times k \times D}$, $b^l \in \mathbb{R}^{D}$, $V^l \in \mathbb{R}^{m \times k \times D}$ and $c^l \in \mathbb{R}^{D}$ are the learned parameters; $\sigma$ is the sigmoid function; $\otimes$ is the element-wise product between matrices; $m \times k$ is the size of the filters (convolutional kernels); and $D$ is the number of filters. Initially, $N$ is the length of the nucleosome DNA sequence and $m = 4$. The pooling operation maps all elements in a window of width $w$ into a single value. A row of new features obtained by the convolution is pooled by maximum or average pooling with width $w$ and stride $s$ as follows:
$$\text{maximum pooling: } Y_i = \max_{0 \le j < w} X_{i \times s + j}, \qquad \text{average pooling: } Y_i = \frac{1}{w} \sum_{0 \le j < w} X_{i \times s + j}$$
(2)
where $0 \le i < (N - w)/s + 1$ for an unpadded feature matrix and $0 \le i < N/s$ for a padded one. Figure 1 shows that the concatenation operator puts all the convolution and pooling results together to form a new feature matrix with $D$ rows and $N/2$ columns; here the pooling stride is $s = 2$.
Fig. 3.

Each nucleotide of the sequence is converted to a four-element vector with one element set to one and the others set to zero. The output of each hidden layer is a gated convolutional layer in which one convolution layer is modulated by another convolution layer through a sigmoid gate
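As a concrete illustration of Equation (1), the following NumPy sketch (a toy reimplementation, not LeNup's Torch7 code; all sizes are illustrative) gates one convolution path with the sigmoid of a second, parallel path:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d(X, W, b):
    # Zero-padded stride-1 convolution as in Section 2.2.
    # X: (m, N); W: (D, m, k); b: (D,). Returns a (D, N) feature matrix.
    D, m, k = W.shape
    N = X.shape[1]
    Xp = np.pad(X, ((0, 0), (k // 2, k // 2)))
    return np.array([[np.sum(W[d] * Xp[:, i:i + k]) + b[d] for i in range(N)]
                     for d in range(D)])

def gated_conv(X, W, b, V, c):
    # Equation (1): one convolution path is modulated, element-wise,
    # by the sigmoid of the other.
    return conv1d(X, W, b) * sigmoid(conv1d(X, V, c))

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(4, 147)).astype(float)    # stand-in one-hot input
W, V = rng.normal(0, 0.1, (16, 4, 3)), rng.normal(0, 0.1, (16, 4, 3))
b, c = np.zeros(16), np.zeros(16)
print(gated_conv(X, W, b, V, c).shape)                 # (16, 147)
```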

2.5 Gated inception networks

The advantages of the Inception models and gated convolutional networks inspired us to design a new network that fuses them. This architecture enables the predictor to capture local motifs of nucleosome DNA sequences as well as long-range associations between nucleotides. We tried dozens of versions of the network structure; the version finally selected as the network of LeNup is depicted in Figure 5, and Figure 4a-c shows the detailed components used in Figure 5. Some details of the LeNup structure shown in Figures 4 and 5 are summarized as follows (a code sketch of a gated Inception-style block follows the list):

Fig. 4.

The schema of Gated Inception blocks used in Figure 5: (a) Gated convolution-A block; (b) Gated convolution-B block; (c) Gated convolution-C block

Fig. 5.

The overall schema of LeNup. Please refer to Figure 4a-c for the detailed structure of the various components

  1. Each input tensor is 147 in width, 1 in height and 4 in depth.

  2. Convolution operations come in pairs: as a gating mechanism, the output of one convolution passes through a sigmoid function to modulate the output of the other.

  3. All convolution results in Figure 4a-c are passed through the rectified linear unit (ReLU) for activation; these ReLUs are not shown in the figures.

  4. m × 1, such as 1 × 1 or 3 × 1, means that the filter is m in width and 1 in height. The depth of the filters, which equals the number of filters used in the previous layer, is not shown. The number 128 or 96 in parentheses beneath m × 1 denotes a convolution with 128 or 96 filters; the depth of the output of that layer is therefore 128 or 96.

  5. An average pooling layer or a maximum pooling layer with m × 1 means that the pooling stride is m.

  6. The block of Filter Concat in all figures means that the operation stacks all features from each branch together. For instance, Figure 4a shows that each gated convolutional subnetwork produces 128 features. We get 128 × 3 = 384 features through the filter concatenation.

  7. The output such as 73 × 1 × 384 in Figure 5 means that the dimension of feature maps is 73 in width, 1 in height and 384 in depth.

  8. A dropout layer with a dropout rate of 30% follows each pooling operation in LeNup.

  9. We used a linear layer with sigmoid loss as the classifier.
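The following PyTorch sketch illustrates a gated Inception-style block in the spirit of Figure 4a. It is a simplified reconstruction, not LeNup's exact design: the branch filter spans (1, 3 and 5) and the placement of ReLU before the gate are assumptions.

```python
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    """A pair of 1D convolutions; one output gates the other (Equation 1)."""
    def __init__(self, in_ch, out_ch, k):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, k, padding=k // 2)
        self.gate = nn.Conv1d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x):
        # ReLU on the convolution path (item 3), sigmoid on the gate (item 2);
        # the exact ordering is an assumption.
        return torch.relu(self.conv(x)) * torch.sigmoid(self.gate(x))

class GatedInceptionBlock(nn.Module):
    """Parallel gated-convolution branches of different spans whose outputs
    are stacked by filter concatenation (item 6)."""
    def __init__(self, in_ch, branch_ch=128, spans=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            [GatedConv1d(in_ch, branch_ch, k) for k in spans])

    def forward(self, x):
        return torch.cat([b(x) for b in self.branches], dim=1)  # Filter Concat

x = torch.zeros(1, 4, 147)          # a one-hot batch: 147 bp, 4 channels (item 1)
block = GatedInceptionBlock(4)      # 3 branches x 128 filters = 384 features
print(block(x).shape)               # torch.Size([1, 384, 147])
```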

2.6 Training methodology

We trained our networks on a single NVidia Quadro M5000 GPU and implemented our models with stochastic gradient descent with momentum in Torch7 (http://torch.ch), a versatile numeric computing framework and machine learning library that extends Lua. We explored the hyperparameter space of models to identify a compact model with good generalization performance. Our experiments used momentum with a decay of 0.98 and an initial learning rate of 0.002, decayed every epoch by an exponential factor of 0.97.
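The paper's implementation is in Torch7/Lua; an equivalent optimizer configuration in PyTorch might look like the sketch below, where the model, batch contents and number of epochs are placeholders and only the momentum, learning rate and decay factor come from the text:

```python
import torch

model = torch.nn.Linear(147 * 4, 1)   # hypothetical stand-in for LeNup
optimizer = torch.optim.SGD(model.parameters(), lr=0.002, momentum=0.98)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.97)
loss_fn = torch.nn.BCEWithLogitsLoss()

for epoch in range(30):                          # number of epochs: assumed
    x = torch.randn(32, 147 * 4)                 # placeholder mini-batch
    y = torch.randint(0, 2, (32, 1)).float()
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()                             # exponential decay each epoch
```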

3 Results and discussion

3.1 Performance evaluation criteria

We used the training datasets to train the predictor based on our gated Inception networks. To survey generalization performance, the predictor was evaluated on test datasets independent of the training datasets. We defined nucleosome-forming sequences as positive samples and nucleosome-inhibiting sequences as negative samples. In this work, we adopted the sensitivity (Sn), specificity (Sp), accuracy (ACC) and Matthew's correlation coefficient (MCC) to score the predictive performance of each method. They are defined as follows:
$$Sn = \frac{Tp}{Tp + Fn}, \quad Sp = \frac{Tn}{Tn + Fp}, \quad ACC = \frac{Tp + Tn}{Tp + Fn + Tn + Fp}, \quad MCC = \frac{Tp \times Tn - Fp \times Fn}{\sqrt{(Tn + Fn)(Tn + Fp)(Tp + Fn)(Tp + Fp)}}$$
(3)
where Tp, Tn, Fp and Fn are the numbers of true positives, true negatives, false positives and false negatives, respectively. Sn ∈ [0, 1], Sp ∈ [0, 1], ACC ∈ [0, 1] and MCC ∈ [–1, 1]. Sn = 0 means that all positive samples are predicted as negatives. When all predictions are incorrect, so that Tp = 0 and Tn = 0, we have Sn = 0, Sp = 0, ACC = 0 and MCC = –1. When all predictions are correct, so that Fp = 0 and Fn = 0, we have Sn = 1, Sp = 1, ACC = 1 and MCC = 1. When all positives are correctly predicted and all negative predictions are wrong, we have Sn = 1, Sp = 0 and MCC = 0. When all negatives are correctly predicted and all positive predictions are wrong, we have Sp = 1, Sn = 0 and MCC = 0. When Tp/Fn = Fp/Tn = 1, we have Sn = Sp = ACC = 0.5 and MCC = 0, and the predictor is no better than a random choice. We calculated all evaluation indices from the test results.
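These definitions translate directly into code; the following Python sketch computes the four metrics and illustrates the random-choice case (Tp/Fn = Fp/Tn = 1) discussed above:

```python
import math

def scores(tp, tn, fp, fn):
    """Sn, Sp, ACC and MCC as defined in Equation (3)."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc

# Tp/Fn = Fp/Tn = 1: the predictor is no better than random choice
print(scores(tp=50, tn=50, fp=50, fn=50))   # (0.5, 0.5, 0.5, 0.0)
```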

We also used the receiver operating characteristic (ROC) curve to illustrate the performance of the binary classifier LeNup. The curve is created by plotting the sensitivity (Sn) against the false-positive rate (1 – Sp) at various threshold settings. The area under the curve (AUC) represents the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. Usually 0.5 < AUC < 1; the closer the AUC value is to 1, the better the classifier performs.

3.2 LeNup performance

We used 20-fold cross-validation to evaluate our predictor. The benchmark datasets of each organism (Guo et al., 2014), that is, H. sapiens, C. elegans and D. melanogaster, were randomly divided into 20 subsets of approximately equal size. We trained the network 20 times; for each run, one of the 20 subsets was used as the test dataset and the others were combined to form the training dataset. All evaluation indices of our predictor, that is, Sn, Sp, ACC, MCC and AUC, were calculated from the test results. The average values of the four metrics Sn, Sp, ACC and MCC defined in Equation (3) over the 20 test datasets are listed in Table 1. Figure 6 shows the ROC curves; the area under the curves, or AUC, is 0.9412, 0.9653 and 0.9401 for H. sapiens, C. elegans and D. melanogaster, respectively.

Table 1.

LeNup performance measured via 20-fold cross-validation

Species            Sn      Sp      ACC     MCC     AUC
H. sapiens         0.9212  0.8562  0.8889  0.7906  0.9412
C. elegans         0.9339  0.9041  0.9188  0.8444  0.9653
D. melanogaster    0.8974  0.8713  0.8847  0.7828  0.9401

Note: The datasets were downloaded from the Supplementary Material of Guo et al. (2014).

Fig. 6.

ROC curves obtained from 20-fold cross-validation tests using the genome datasets of H. sapiens, C. elegans and D. melanogaster

3.3 Comparison of LeNup predictions to other algorithms

We compared the performance of our predictor with recently published predictors on the same benchmark datasets. 3LS (Awazu, 2017) is based on a linear regression model; iNuc-PseKNC is based on an SVM. Table 2 shows that LeNup performs substantially better than 3LS, iNuc-STNC (Tahir and Hayat, 2016) and iNuc-PseKNC for C. elegans and D. melanogaster. For D. melanogaster, compared with 3LS and iNuc-PseKNC, the Matthew's correlation coefficient (MCC) increased by 17.1% and 30.4%, respectively, and the accuracy (ACC) increased by 6.06% and 10.63%, respectively. For C. elegans, compared with 3LS and iNuc-PseKNC, MCC increased by 11.4% and 14.1%, respectively, and ACC increased by 4.57% and 5.73%, respectively. For H. sapiens, compared with iNuc-PseKNC, MCC increased by 8.3% and ACC by 3.03%; compared with 3LS, LeNup performs slightly worse, with MCC and ACC each about 1.24% lower. On the benchmark dataset of the yeast genome (Chen et al., 2016), LeNup exhibited perfect performance, achieving Sn = Sp = ACC = MCC = 1.0 under 20-fold cross-validation. On the same dataset, the predictor based on DNA deformation energy (Chen et al., 2016) achieved Sn = 0.982, Sp = 0.980, ACC = 0.981 and MCC = 0.963.

Table 2.

Comparison of LeNup predictions to other predictors

Species            Predictor     Sn      Sp      ACC     MCC     AUC
H. sapiens         LeNup         0.9212  0.8562  0.8889  0.7906  0.9412
                   3LS           0.9169  0.8835  0.9001  0.8006  0.9588
                   iNuc-PseKNC   0.8786  0.8470  0.8627  0.73    0.925
                   iNuc-STNC     0.8931  0.8591  0.8760  0.75    —
C. elegans         LeNup         0.9339  0.9041  0.9188  0.8444  0.9663
                   3LS           0.8654  0.8921  0.8786  0.7576  0.9505
                   iNuc-PseKNC   0.9030  0.8355  0.8690  0.74    0.935
                   iNuc-STNC     0.9162  0.8666  0.8862  0.77    —
D. melanogaster    LeNup         0.8974  0.8713  0.8847  0.7828  0.9401
                   3LS           0.8407  0.8274  0.8341  0.6682  0.9147
                   iNuc-PseKNC   0.7831  0.8165  0.7997  0.60    0.874
                   iNuc-STNC     0.7976  0.8361  0.8167  0.63    —

3.4 Impact of cross-validation

3LS and iNuc-PseKNC used the jackknife test for cross-validation. In a jackknife test, each sequence is singled out in turn as a test sample, and the remaining sequences are used as the training set to predict its class. The convolutional neural network shown in Figures 4 and 5 includes 2 026 880 filter parameters; training them is very time-consuming, so it is unrealistic to adopt the jackknife test for cross-validation. The jackknife test used by 3LS and iNuc-PseKNC is the extreme case of k-fold cross-validation in which k equals the total number of sequences in the dataset. We chose k = 5, 10, 20 and 40 to survey the effect of k in k-fold cross-validation. Table 3 shows the performance of LeNup with different k for H. sapiens. As expected, the performance gradually improves with increasing k, because the training dataset includes more training samples for a larger k. With 40-fold cross-validation, the Matthew's correlation coefficient of LeNup is 2.32% higher than that of 3LS and 12.2% higher than that of iNuc-PseKNC. It is highly possible that the performance of LeNup could be further improved by expanding the training dataset.
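For reference, a generic k-fold partition of this kind can be set up as follows (a scikit-learn sketch with a placeholder dataset; the paper's own partitioning code is not shown):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.random.rand(1000, 147 * 4)    # placeholder encoded dataset
y = np.random.randint(0, 2, 1000)    # placeholder labels

for k in (5, 10, 20, 40):
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        # train on (X[train_idx], y[train_idx]); score on the held-out fold;
        # a larger k leaves more samples in each training split
        pass
```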

Table 3.

LeNup performance measured by four metrics via 5-, 10-, 20- and 40-fold cross-validation

k     Sn      Sp      ACC     MCC
5     0.9024  0.8511  0.8768  0.7695
10    0.9092  0.8486  0.8786  0.7726
20    0.9212  0.8562  0.8889  0.7906
40    0.9335  0.8756  0.9045  0.8192

3.5 SVM classification and Jackknife test

The support vector machine is a powerful classifier with excellent generalization ability. However, if we used an SVM as the classifier instead of the sigmoid function during the training of LeNup, training the model could take several years. Instead, we can use LeNup to output the final features of a DNA fragment and then employ an SVM to classify those features. We therefore used LeNup as a tool to automatically extract features from 147-bp DNA fragments. For every DNA fragment in the benchmark datasets of nucleosome-forming and nucleosome-inhibiting sequences (Awazu, 2017; Guo et al., 2014), all 384 features (Supplementary Material) were output from the fully connected layer shown in Figure 5 once the prediction accuracy converged or overfitting occurred on the test dataset. Overfitting here means that the classification accuracy on the training dataset is much better than on the test dataset when running LeNup with k-fold cross-validation (k = 20 here). The LIBSVM 3.22 package (Fan et al., 2005) was then employed as the SVM implementation, with a Gaussian kernel function. The jackknife test was adopted to examine performance: each feature vector in the dataset was in turn singled out as an independent test sample, and the model was trained on the remaining data. Sn, Sp, ACC and MCC were evaluated for the human, worm and fly genome benchmark datasets (Table 4). The evaluation indices in Table 4 indicate that the performance of LeNup combined with an SVM far exceeds that of the recently proposed predictors shown in Table 2.
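A minimal sketch of this two-stage scheme follows, assuming the 384-dimensional feature vectors have already been extracted by LeNup and using scikit-learn's LIBSVM-based SVC in place of the LIBSVM command-line tools used in the paper; the data here are random placeholders.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut

# Placeholders: in the paper, `features` holds the 384-dimensional vectors
# output by LeNup's fully connected layer for each 147-bp fragment.
features = np.random.rand(200, 384)
labels = np.random.randint(0, 2, 200)

clf = SVC(kernel='rbf')                 # Gaussian (RBF) kernel, as in the text
correct = 0
for train_idx, test_idx in LeaveOneOut().split(features):   # jackknife test
    clf.fit(features[train_idx], labels[train_idx])
    correct += int(clf.predict(features[test_idx])[0] == labels[test_idx][0])
print("jackknife accuracy:", correct / len(labels))
```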

Table 4.

SVM classification and Jackknife test

Species            Sn      Sp      ACC     MCC
H. sapiens         0.9825  0.9827  0.9826  0.9653
C. elegans         0.9961  0.9938  0.9949  0.9899
D. melanogaster    0.9949  0.9943  0.9942  0.9884

3.6 Robustness of LeNup prediction

A benchmark dataset is randomly partitioned into k subsets of approximately equal size to generate the training and test datasets. The randomness of this partition perturbs the training and test datasets between different batches of data partitioning. To survey the effect of random data partitioning, we produced five batches of datasets, each comprising 20 subsets, from the benchmark dataset of H. sapiens. We trained and tested LeNup on each batch with 20-fold cross-validation. Supplementary Table S1 shows the sample variances of Sn, Sp, ACC and MCC. These variances are five to six orders of magnitude smaller than the average values shown in Tables 1-3. Therefore, we believe that the effect of the randomness of the data partition can be ignored.

3.7 LeNup validates the preference of nucleotide and dinucleotide in nucleosome regions

We scanned the human genome with the previously trained LeNup. The sequence of human chromosome 20 (hg18) was scanned with stride 1, yielding 62 435 819 DNA fragments of 147 bp each. LeNup outputs a probability for each DNA fragment. We regard a DNA fragment as a nucleosome-preferred position if the probability is greater than 0.85, and as a nucleosome-undesirable position if the probability is less than 0.15. We obtained 25 250 319 nucleosome-preferred fragments and 26 950 803 nucleosome-undesirable fragments, and calculated the percentage of every nucleotide and dinucleotide in the two classes of fragments (Fig. 7). The content of A, T, G and C in the nucleosome-preferred fragments is 25.32%, 25.22%, 24.87% and 24.58%, respectively, while in the nucleosome-undesirable fragments it is 30.18%, 30.65%, 19.61% and 19.65%, respectively (Fig. 7a). Figure 7 indicates that nucleosomes preferentially associate with DNA segments of high G, C, GG and CC content, with some degree of exclusion from A-, T-, AA- and TT-rich regions. These predictions are consistent with other publications (Bernstein et al., 2004; Valouev et al., 2011).

Fig. 7.

Nucleotide and dinucleotide preferences in nucleosome-preferred regions and nucleosome-undesirable regions
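The nucleotide and dinucleotide percentages of Figure 7 can be computed with a simple counting routine such as the following sketch (the fragment list here is a toy placeholder):

```python
from collections import Counter

def base_and_dinucleotide_content(fragments):
    """Percentage of each nucleotide and dinucleotide over a set of
    DNA fragments (as computed for Figure 7)."""
    bases, dinucs = Counter(), Counter()
    for seq in fragments:
        bases.update(seq)
        dinucs.update(seq[i:i + 2] for i in range(len(seq) - 1))
    n_b, n_d = sum(bases.values()), sum(dinucs.values())
    return ({b: 100 * c / n_b for b, c in bases.items()},
            {d: 100 * c / n_d for d, c in dinucs.items()})

# toy usage with two short hypothetical fragments
print(base_and_dinucleotide_content(["ACGTAC", "GGCCAA"]))
```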

3.8 LeNup predicts nucleosomes near transcription start sites

We downloaded 1215 transcription start sites (TSSs) of human chromosome 20 from the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgTables). The nucleosome distribution profiles from –1000 bp to +1000 bp around each TSS were predicted by the previously trained LeNup. Supplementary Figure S1 shows the average nucleosome distribution profile near the TSSs. LeNup predicts that nucleosomes are depleted in the region around the TSSs (Rach et al., 2011); this region may be nucleosome-free for expressed genes (Lee et al., 2004; Valouev et al., 2011). Supplementary Figure S1 also shows a well-positioned +1 nucleosome in the promoter regions. The prediction indicates that the –1 nucleosome probability is markedly lower than the +1 nucleosome probability; both nucleosomes have been confirmed to exist simultaneously in active and inactive promoters (Schones et al., 2008), suggesting a potential role in maintaining the nucleosome-free region.

3.9 LeNup was further tested against MNase-seq results

We compared our model predictions with MNase-seq results (Supplementary Table S2). The nucleosome score profile of human chromosome 20 in activated CD4 cells was downloaded from https://dir.nhlbi.nih.gov/papers/lmi/epigenomes/hgtcellnucleosomes.aspx. There is one score per 10-base span from nucleotide position 8006 to 62 435 275. In general, the higher the score, the more likely the position is occupied by a nucleosome. We obtained nucleosome-preferring sequences by scanning the score profile with a score threshold, mapping the local peak positions of the profile to the genome sequence to obtain the centers of the nucleosome sequences, and extending 73 bases on each side of the centers. With score thresholds of 2, 5, 10, 15, 20, 25 and 30, we obtained 143 189, 119 857, 67 569, 28 706, 11 209, 3807 and 1375 nucleosome-preferring sequences, respectively. We input these sequences to LeNup trained on the dataset mentioned in Table 1. Of these input sequences, our model predicted 109 930, 99 107, 61 730, 27 467, 10 794, 3744 and 1362 to be nucleosome-preferring, giving ratios of predictions to MNase-seq calls of 0.768, 0.827, 0.914, 0.957, 0.963, 0.984 and 0.991, respectively.

We also scanned the zero-score regions of the score profile and mapped them to the genome to obtain nucleosome-inhibiting sequences, yielding 80 563 sequences of 147 bp. We input these sequences to the same model; it classified 66 421 of them as nucleosome-inhibiting, giving a ratio of model predictions to MNase-seq results of 0.825.
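A simplified sketch of the peak-extraction step described above (one score per 10-bp bin, local maxima above a threshold mapped to 147-bp regions) might look like the following; the exact peak-calling details of the paper are assumptions here:

```python
def nucleosome_candidates(scores, threshold, span=10, flank=73):
    """Map local maxima of a binned nucleosome score profile (one score per
    `span` bases) above `threshold` to 147-bp candidate regions."""
    regions = []
    for i in range(1, len(scores) - 1):
        if scores[i] >= threshold and scores[i - 1] <= scores[i] >= scores[i + 1]:
            center = i * span                                     # bin -> genomic offset
            regions.append((center - flank, center + flank + 1))  # 147 bases
    return regions

# toy profile with one peak above threshold 5 at bin 9 -> region (17, 164)
profile = [0] * 8 + [6, 9, 6] + [0] * 8
print(nucleosome_candidates(profile, threshold=5))
```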

4 Conclusion

Our results yield solid evidence that an Inception-like convolutional neural network with a gating mechanism is a viable method for improving the prediction of nucleosome positioning. The main advantage of this method is that it automatically learns the feature representation, in contrast to classification algorithms such as the support vector machine that depend on external feature extraction. Furthermore, our method has a competitive advantage over other recently published methods. This success suggests promising opportunities for understanding the genetic determinants of nucleosome positioning.

Acknowledgement

The authors would like to thank the three anonymous reviewers for their constructive comments.

Conflict of Interest: none declared.

References

Awazu A. (2017) Prediction of nucleosome positioning by the incorporation of frequencies and distributions of three different nucleotide segment lengths into a general pseudo k-tuple nucleotide composition. Bioinformatics, 33, 42–48.

Barski A. et al. (2007) High-resolution profiling of histone methylations in the human genome. Cell, 129, 823–837.

Bell O. et al. (2011) Determinants and dynamics of genome accessibility. Nat. Rev. Genet., 12, 554–564.

Bengio Y. et al. (2013) Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35, 1798–1828.

Bernstein B.E. et al. (2004) Global nucleosome occupancy in yeast. Genome Biol., 5, R62.

Brogaard K. et al. (2012) A map of nucleosome positions in yeast at base-pair resolution. Nature, 486, 496–501.

Chen W. et al. (2012) iNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties. PLoS One, 7, e4784.

Chen W. et al. (2016) Using deformation energy to analyze nucleosome positioning in genomes. Genomics, 107, 69–75.

Chou K.C. (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins Struct. Funct. Genet., 43, 246–255.

Chou K.C. (2005) Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics, 21, 10–19.

Collobert R. et al. (2011a) Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12, 2493–2537.

Dauphin Y.N. et al. (2016) Language modeling with gated convolutional networks. arXiv: 1612.08083v1.

Eaton M.L. et al. (2010) Conserved nucleosome positioning defines replication origins. Genes Dev., 24, 748–753.

Esteva A. et al. (2017) Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542, 115–118.

Fan R.E. et al. (2005) Working set selection using second order information for training support vector machines. J. Mach. Learn. Res., 6, 1889–1918.

Field Y. et al. (2008) Distinct modes of regulation by chromatin encoded through nucleosome positioning signals. PLoS Comput. Biol., 4, e1000216.

Guertin M.J., Lis J.T. (2013) Mechanisms by which transcription factors gain access to target sequence elements in chromatin. Curr. Opin. Genet. Dev., 23, 116–123.

Guo S. et al. (2014) iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics, 30, 1522–1529.

Gupta S. et al. (2008) Predicting human nucleosome occupancy from primary sequence. PLoS Comput. Biol., 4, e1000134.

Gonzalez S. (2016) Nucleosomal signatures impose nucleosome positioning in coding and noncoding sequences in the genome. Genome Res., 26, 1532–1543.

Hinton G., Salakhutdinov R. (2006) Reducing the dimensionality of data with neural networks. Science, 313, 504–507.

Hinton G. et al. (2012) Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process. Mag., 29, 82–97.

Ioshikhes I. et al. (1996) Nucleosome DNA sequence pattern revealed by multiple alignment of experimentally mapped sequences. J. Mol. Biol., 262, 129–139.

Ioshikhes I. et al. (2011) Variety of genomic DNA patterns for nucleosome positioning. Genome Res., 21, 1863–1871.

Jiang C., Pugh B.F. (2009) Nucleosome positioning and gene regulation: advances through genomics. Nat. Rev. Genet., 10, 161–172.

Kaplan N. et al. (2009) The DNA-encoded nucleosome organization of a eukaryotic genome. Nature, 458, 362–366.

Kelley D.R. et al. (2016) Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res., 26, 990–999.

Kornberg R.D., Lorch Y. (1999) Twenty-five years of the nucleosome, fundamental particle of the eukaryote chromosome. Cell, 98, 285–294.

Krizhevsky A. et al. (2012) ImageNet classification with deep convolutional neural networks. Proc. Adv. Neural Inf. Process. Syst., 25, 1090–1098.

LeCun Y. et al. (2015) Deep learning. Nature, 521, 436–444.

Lee C.K. et al. (2004) Evidence for nucleosome depletion at active regulatory regions genome-wide. Nat. Genet., 36, 900–905.

Lee W. et al. (2007) A high-resolution atlas of nucleosome occupancy in yeast. Nat. Genet., 39, 1235–1244.

Leung M.K. et al. (2014) Deep learning of the tissue-regulated splicing code. Bioinformatics, 30, i121–i129.

Liu B. et al. (2016) iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble. Bioinformatics, 32, 2411–2418.

Liu M. et al. (2015) Determinants of nucleosome positioning and their influence on plant gene expression. Genome Res., 25, 1182–1195.

Liu S. et al. (2017) RPA binds histone H3-H4 and functions in DNA replication-coupled nucleosome assembly. Science, 355, 415–420.

Luger K. et al. (1997) Crystal structure of the nucleosome core particle at 2.8 Å resolution. Nature, 389, 251–260.

Mavrich T.N. et al. (2008) Nucleosome organization in the Drosophila genome. Nature, 453, 358–362.

Miele V. et al. (2008) DNA physical properties determine nucleosome occupancy from yeast to fly. Nucleic Acids Res., 11, 3746–3756.

Minary P., Levitt M. (2014) Training-free atomistic prediction of nucleosome occupancy. Proc. Natl. Acad. Sci. USA, 111, 6293–6298.

Morozov A.V. et al. (2009) Using DNA mechanics to predict in vitro nucleosome positions and formation energies. Nucleic Acids Res., 37, 4707–4722.

Pulivarthy S.R. et al. (2016) Regulated large-scale nucleosome density patterns and precise nucleosome positioning correlate with V(D)J recombination. Proc. Natl. Acad. Sci. USA, 113, E6427–E6436.

Rach E.A. et al. (2011) Transcription initiation patterns indicate divergent strategies for gene regulation at the chromatin level. PLoS Genet., 7, e1001274.

Richmond T.J., Davey C.A. (2003) The structure of DNA in the nucleosome core. Nature, 423, 145–150.

Satchwell S.C. et al. (1986) Sequence periodicities in chicken nucleosome core DNA. J. Mol. Biol., 191, 659–675.

Schones D.E. et al. (2008) Dynamic regulation of nucleosome positioning in the human genome. Cell, 132, 887–898.

Segal E. et al. (2006) A genomic code for nucleosome positioning. Nature, 442, 772–778.

Segal E., Widom J. (2009) What controls nucleosome positions? Trends Genet., 25, 335–343.

Silver D. et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484–489.

Smagulova F. et al. (2011) Genome-wide analysis reveals novel molecular features of mouse recombination hotspots. Nature, 472, 375–378.

Struhl K., Segal E. (2013) Determinants of nucleosome positioning. Nat. Struct. Mol. Biol., 20, 267–273.

Szegedy C. et al. (2015) Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, MA, USA, pp. 1–9.

Szegedy C. et al. (2016) Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv: 1602.06271v1.

Tahir M., Hayat M. (2016) iNuc-STNC: a sequence-based predictor for identification of nucleosome positioning in genomes by extending the concept of SAAC and Chou's PseAAC. Mol. Biosyst., 12, 2587.

Tilgner H. et al. (2009) Nucleosome positioning as a determinant of exon recognition. Nat. Struct. Mol. Biol., 16, 996–1001.

Valouev A. et al. (2008) A high-resolution, nucleosome position map of C. elegans reveals lack of universal sequence-dictated positioning. Genome Res., 18, 1051–1063.

Valouev A. et al. (2011) Determinants of nucleosome organization in primary human cells. Nature, 474, 516–520.

van der Heijden T. et al. (2012) Sequence-based prediction of single nucleosome positioning and genome-wide nucleosome occupancy. Proc. Natl. Acad. Sci. USA, 109, E2514–E2522.

Vasseur P. et al. (2016) Dynamics of nucleosome positioning maturation following genomic replication. Cell Rep., 16, 2651–2665.

Wang J.Y. et al. (2012) Calculation of nucleosomal DNA deformation energy: its implication for nucleosome positioning. Chromosome Res., 20, 889–902.

Weiner A. et al. (2009) High-resolution nucleosome mapping reveals transcription-dependent promoter packaging. Genome Res., 20, 90–100.

Widlund H.R. et al. (1999) Nucleosome structural features and intrinsic properties of the TATAAACGCC repeat sequence. J. Biol. Chem., 274, 31847–31852.

Whitehouse I. et al. (2007) Chromatin remodelling at promoters suppresses antisense transcription. Nature, 450, 1031–1035.

Xi L.Q. et al. (2010) Predicting nucleosome positioning using a duration hidden Markov model. BMC Bioinformatics, 11, 1–9.

Xiong H.Y. et al. (2015) The human splicing code reveals new insights into the genetic determinants of disease. Science, 347, 1254806.

Yuan G.C. et al. (2005) Genome-scale identification of nucleosome positions in S. cerevisiae. Science, 309, 626–630.

Yuan G.C., Liu J.S. (2008) Genomic sequence is highly predictive of local nucleosome depletion. PLoS Comput. Biol., 4, e13.

Zhang Y. et al. (2009) Intrinsic histone-DNA interactions are not the major determinant of nucleosome positions in vivo. Nat. Struct. Mol. Biol., 16, 847–852.

Zhong J. et al. (2016) Mapping nucleosome positions using DNase-seq. Genome Res., 26, 351–364.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
Associate Editor: John Hancock