Abstract

Recognizing binding sites of DNA-binding proteins is a key factor for elucidating transcriptional regulation in organisms. ChIP-exo enables researchers to delineate genome-wide binding landscapes of DNA-binding proteins with near single base-pair resolution. However, the peak calling step hinders ChIP-exo application since the published algorithms tend to generate false-positive and false-negative predictions. Here, we report the development of DEOCSU (DEep-learning Optimized ChIP-exo peak calling SUite), a novel machine learning-based ChIP-exo peak calling suite. DEOCSU entails the deep convolutional neural network model which was trained with curated ChIP-exo peak data to distinguish the visualized data of bona fide peaks from false ones. Performance validation of the trained deep-learning model indicated its high accuracy, high precision and high recall of over 95%. Applying the new suite to both in-house and publicly available ChIP-exo datasets obtained from bacteria, eukaryotes and archaea revealed an accurate prediction of peaks containing canonical motifs, highlighting the versatility and efficiency of DEOCSU. Furthermore, DEOCSU can be executed on a cloud computing platform or the local environment. With visualization software included in the suite, adjustable options such as the threshold of peak probability, and iterable updating of the pre-trained model, DEOCSU can be optimized for users’ specific needs.

Introduction

Chromatin immunoprecipitation (ChIP) has been a widely used approach to investigate DNA-binding proteins, such as transcription factors (TFs) or transcriptional machinery, and their binding locations at the genome scale level [1–8]. ChIP-exonuclease (ChIP-exo) outperforms other ChIP assays such as ChIP-chip (microarray) or ChIP-seq (deep sequencing) by overcoming hybridization bias in microarray and size heterogeneity issue in deep sequencing, respectively [9]. By digesting immunoprecipitated DNA using exonuclease, shorter-length peak pairs can be detected as compared with those generated by ChIP-seq, and their spatial resolutions are enhanced to near single-base pair level [10]. Hence, the DNA-binding motifs information can be updated more precisely than results from other assays.

Although ChIP-exo increases sensitivity and allows researchers to identify high-resolution binding sites, the peak calling step for selecting bona fide peaks could be error-prone, time-consuming and labor-intensive which is a major rate-limiting step of ChIP-exo data analysis. However, several computational tools have been developed to analyze ChIP-exo data and predict binding sites of DNA-binding proteins. MACE is one of the first developed tools specialized for ChIP-exo data using border peak detection by Chebyshev Inequality and Gale-Shapley’s stable matching algorithm [11]. It provides border pairs with good sensitivity, allowing most of the true-positive peaks to be conserved after the peak calling process. While MACE performance has been evaluated on ChIP-exo datasets of mammalian and yeast cells, MACE outputs in ChIP-exo datasets of bacteria require further curation to remove false-positive peaks and to add true peaks that were otherwise omitted by the peak calling software. Even though MACE offers an option to specify high-quality pairs called elite border pairs (MACE-elite), the curation step is still needed to identify bona fide peaks. Tools such as PeakXus [12] and ChExMix [13] were developed to discover peaks overlapping with motif sequences specific to DNA-binding proteins. PeakXus was able to determine candidate peaks from both ChIP-exo and ChIP-nexus experiments and was also applicable for investigating single nucleotide polymorphism (SNP)-induced affinity change of TF binding. However, weaker binding patterns might be overlooked by PeakXus as it only reports peak candidates fitting to high coverage sites. ChExMix has demonstrated its applicability in both ChIP-exo and ChIP-seq data using 393 ChIP-seq experiments performed in K562 cells. It also provides multiple modes for validating replicate consistency analysis and for discovering binding event subtypes even in lower-resolution data such as ChIP-seq. However, ChExMix only verified its performance for eukaryotic ChIP-exo dataset such as human and yeast.

Biological applications of machine learning methods including deep learning-based models have advanced genomic data analysis. There are also several cases such as DeepBind [14], DeepSEA [15] and D-AEDNet [16] reported for their prediction of DNA-binding sites of proteins across genomes. However, these models were trained and validated with eukaryotic ChIP-derived datasets. Moreover, DeepBind and DeepSEA are developed not for detecting peaks but for predicting DNA-binding sites of proteins using 1D images of one-hot encoded DNA sequence to feed the convolutional neural network (CNN). In the case of D-AEDNet, the tool demands sufficient labeled data to detect binding sites of DNA-binding proteins. Therefore, it is still necessary to develop a peak calling tool exploiting the unique ChIP-exo peak pattern that can be applied to unlabeled ChIP-exo data of the whole three-domain system.

In this study, we develop a DEep-learning Optimized ChIP-exo peak calling SUite (DEOCSU), which uses CNN to predict bona fide peaks in ChIP-exo data. DEOCSU was trained with image-converted peaks from ChIP-exo biological replicates of sigma factors (RpoD, RpoH, RpoN and RpoS) in Escherichia coli str. K-12 substr. MG1655. The peak calling performance of DEOCSU was compared with published tools including MACE, MACE-elite, ChExMix and PeakXus using the in-house RpoN ChIP-exo datasets of six E. coli strains [BL21(DE3), CFT073, Crooks, O157:H7 EDL933, HS and K-12 str. W3110] and other bacteria (Klebsiella and Shigella). Then, its performance was validated using the curated ChIP-exo peaks of prokaryotic TFs obtained from the public database proChIPdb [17]. Subsequently, the extensibility of DEOCSU application to eukaryotes and archaea was demonstrated using the ChIP-exo datasets from NCBI or EMBL-EBI. To this end, it was shown that DEOCSU predicted bona fide ChIP-exo peaks from data of different organisms with high accuracy, sensitivity and low negative predictive values.

Materials and methods

Materials

Public data resources

Except for our in-house ChIP-exo datasets, we collected a number of bacteria, eukaryotes and archaea ChIP-exo data from NCBI (Gene Expression Omnibus, Sequence Read Archive) and EMBL-EBI (ArrayExpress). Curated binding peak information for public bacteria ChIP-exo data was obtained from proChIPdb. The detailed description of each ChIP-exo data such as target TF, reference genome and accession number was described in Supplementary Table S1.

Methods

ChIP-exo data processing

Sequencing reads were mapped onto each reference genome using bowtie with the default option [18]. Subsequent sorting and indexing of the BAM files were conducted using SAMtools [19]. Peak callings by MACE, MACE-elite, PeakXus and ChExMix were performed using their respective default options. Genomic annotations, ChIP-exo signals and peak calling outputs were visualized for direct comparisons using MetaScope (https://github.com/SBML-Kimlab/MetaScope, Supplementary Figure S1).

Motif search and analysis

Motif analysis was conducted using the MEME suite [20]. Binding motifs were calculated from at least 90% of the input sequences. For ChIP-exo datasets of sigma factors, binding sites were extended to include 15 bp at each end, and motifs with a width of 25 bp were searched. For ChIP-exo datasets of eukaryotes and archaea, we extended binding sites to include 20 bp on either side. The remaining ChIP-exo datasets were analyzed according to the settings specified in proChIPdb. Position-dependent letter probability matrix of each motif was obtained from the MEME outputs. For principal component analysis (PCA) of RpoN and Cra motifs, the positions of core motifs were identified from raw matrices listed in Supplementary Table S2.

Basic concept of DEOCSU (DEep-learning Optimized ChIP-exo peak calling SUite)

The overall framework of DEOCSU consists of three steps which are (1) border peak candidate detection and dimension conversion from signal to spatio, (2) bona fide peak prediction using CNN, and (3) peak position optimization and estimation of the sizes of binding sites.

Border peak candidates detection and dimension conversion from signal to spatio

ChIP-exo datasets of sigma factors in E. coli str. K-12 substr. MG1655 were used to generate the input training data (Supplementary Table S1). Border peak candidate detection was performed, and a total of 42 900 candidates for training data were labeled by curation. To determine bona fide peaks, the calculated signal-to-noise value and the shape of each candidate peak were considered. It is expected that the ChIP-exo signal coverage at a candidate binding site is significantly higher than the background signals. Therefore, comparing mapping scores of peak candidates with the neighboring regions allows the positions of local maximum points to be detected from each strand. We filtered the range of neighboring sequences to 400 bp using the signal.find_peaks function in scipy. Supplementary Figure S2A shows the representative example of peak detection for the ChIP-exo signal of Cra. Mapping scores of 128 bp upstream and downstream from each detected local point were determined on both forward and reverse strands of two biological replicates. The concatenated scores were converted into (256, 4, 1)-dimensional spatial channels that were fed into the CNN framework to be classified into true peaks or not. Supplementary Figure S2B and C show representative examples for images of true and false peaks, respectively.

Deep learning framework for prediction of bona fide peaks

In this paper, to distinguish true peaks from peak candidates, a CNN was proposed. It captures the relationship between consolidated peak images and labels expressed as Boolean values. There are three main types of layers to a CNN architecture: convolution layers, pooling layers and fully connected layers. The convolution layers are used for extracting information from input images and generating the feature maps. The pooling layer is proposed to reduce the tensor size of computed feature maps. The fully connected layers classify the images, which are the results of convolution and pooling layers, into defined labels.

In our CNN, the convolution layers map the relations between the patterns of border peaks from replicate data and predicted labels which are expressed with Boolean values. The convolutional network consists of two convolution operations with the ReLU activation function and one layer of Maxpooling. After two layers of convolutional networks, fully connected layers were applied. Finally, the dropout rate was specified to 25 and 50% in order (Supplementary Figure S3A). The model was trained for 10 iterations to avoid overfitting, and the accuracy curve and loss plot are shown in Supplementary Figure S3B and C.

Optimization of peak position

A true ChIP-exo signal is expected to be significantly higher (or significantly different) from its flanking regions. Based on this, the start and end positions of each peak were recalculated to further refine the binding width of a protein. Along the length of each binding site, we computed the discrete differences between mapping scores of adjacent nucleotide positions. This resulted in an array of differences in which the maximum value (largest positive difference) and the minimum value (largest negative difference) were determined. This calculation was repeated for both strands of two ChIP-exo replicates. The average of positions for maximum values was assigned as the new starting point, and the average of positions for minimum values was assigned as the new ending point of a binding site. Optimal binding width is the distance between the new start and the new end point. It leads to single nucleotide resolution without curation.

Performance evaluation

The DEOCSU was trained with the ChIP-exo data of sigma factors (RpoD, RpoH, RpoN and RpoS) in E. coli str. K-12 substr. MG1655. The dataset was randomly split into a training set (70%), validation set (21%) and test set (9%), respectively. To evaluate the performance of DEOCSU, two metrics were used which include the percent accuracy for model fidelity and the false negative rate (FNR). FNR is the proportion of false predicted peaks that were labeled during curation as true labels for model reliability (Equation (1)).
(1)

Result

DEOCSU predicts bona fide ChIP-exo peak more accurately than other existing tools

DEOCSU is an automatic ChIP-exo peak calling pipeline which utilizes deep CNN framework. The deep CNN was trained with curated image-converted peaks of the ChIP-exo dataset of sigma factors including RpoD, RpoN, RpoS and RpoH (both under heat stress or not) in E. coli str. K-12 substr. MG1655 (MG1655). The pipeline processes ChIP-exo sequence reads in three steps as follows: (i) detection of border peak candidates and consolidation of image-converted peaks from ChIP-exo data; (ii) deep CNN feeding and (iii) classification of bona fide peaks (Figure 1). Sequence reads generated from ChIP-exo were first mapped onto the reference genome and processed to obtain the peak mapping scores at each nucleotide position. As the first step of DEOCSU, it detects the local maxima of peak candidates and extends the width by 128 bp to both sides of each local maximum point (Supplementary Figure S2A). Mapping scores of 256 bp were normalized to range of 0–1 using MinMaxScaler followed by data reshaping to a one-dimensional (1D) array. Border detection of peaks and mapping score normalization are repeated on both strands of two ChIP-exo replicates, generating four 1D arrays for each peak candidate. These arrays are then consolidated and converted to an image with the dimensions of (256, 4, 1). The images for every peak are resized to the same size using image decimation and ready to feed deep CNN.

Overview of DEOCSU. A multilayer convolutional layer is used to predict bona fide peaks from image-converted ChIP-exo data. The workflow begins with the processing of ChIP-exo sequencing reads followed by aligning of reads to the reference genome. Peak position detection is performed to calculate the local maxima. Mapping scores of upstream (128 bp) and downstream (128 bp) local maxima are normalized and reshaped into a 1D array. The process is repeated for both forward(+) and reverse(−) strands of two biological replicates (Sample1 and Sample2) to generate four 1D arrays for each peak. The resulting arrays are consolidated as input for deep CNN. The CNN classifies if a peak is bona fide or not. As the final step of DEOCSU, binding width was optimized by recalculating the start and end positions of each predicted peak. The calculation is based on the differences between mapping scores of adjacent nucleotides along the length of the binding site. Genome-wide distribution of bona fide peaks can be visualized using MetaScope along with the raw ChIP-exo signals and genome annotations.
Figure 1

Overview of DEOCSU. A multilayer convolutional layer is used to predict bona fide peaks from image-converted ChIP-exo data. The workflow begins with the processing of ChIP-exo sequencing reads followed by aligning of reads to the reference genome. Peak position detection is performed to calculate the local maxima. Mapping scores of upstream (128 bp) and downstream (128 bp) local maxima are normalized and reshaped into a 1D array. The process is repeated for both forward(+) and reverse(−) strands of two biological replicates (Sample1 and Sample2) to generate four 1D arrays for each peak. The resulting arrays are consolidated as input for deep CNN. The CNN classifies if a peak is bona fide or not. As the final step of DEOCSU, binding width was optimized by recalculating the start and end positions of each predicted peak. The calculation is based on the differences between mapping scores of adjacent nucleotides along the length of the binding site. Genome-wide distribution of bona fide peaks can be visualized using MetaScope along with the raw ChIP-exo signals and genome annotations.

Performance validation of the deep CNN model in detecting bona fide peaks was displayed in the box-whisker plot (Figure 2A) and confusion matrix (Figure 2B). The results indicated its high accuracy (96.2%), high precision (95.1%) and high recall (96.2%). Additionally, peak prediction for each sigma factor revealed the average accuracy of at least 92.9% when randomly resampling with 10 iterations (Figure 2A). The average false negative rate (FNR) of 10 resamplings was 2.9% for deep CNN model. For dataset of each sigma factor, FNRs reported were approximately 3% (RpoH with heat stress: 2.5%; RpoD: 1.9%; RpoN: 2.6%; RpoS: 3.9%), except for RpoH without heat stress (12.4%) where there were only about 10 true labels. Thus, DEOCSU can efficiently distinguish the pattern for consolidated images of true peaks (Supplementary Figure S2B) from those of false ones (Supplementary Figure S2C).

Performance validation of DEOCSU using the ChIP-exo data of sigma factors in Escherichia coli K-12 str. MG1655. (A) Box-whisker plots showing the test accuracy of the pre-trained model in DEOCSU and test accuracy of each sigma factor. (B) Confusion matrix indicating the performance of DEOCSU in bona fide peak prediction for the binding sites of E. coli MG1655 sigma factors.
Figure 2

Performance validation of DEOCSU using the ChIP-exo data of sigma factors in Escherichia coli K-12 str. MG1655. (A) Box-whisker plots showing the test accuracy of the pre-trained model in DEOCSU and test accuracy of each sigma factor. (B) Confusion matrix indicating the performance of DEOCSU in bona fide peak prediction for the binding sites of E. coli MG1655 sigma factors.

The performance of DEOCSU was compared with the existing tools, which include MACE, MACE-elite, PeakXus and ChExMix. Peak calling by these five models was performed on the RpoN ChIP-exo data of E. coli wild-type strain W3110. RpoN-binding motifs identified from peaks selected by each model were compared with the previously reported one. All of them showed the representative RpoN motif sequence of ttGGcacgnnnntTGC (n = uncharacterized) [21]. However, the motif calculated from peaks predicted by DEOCSU showed the highest enrichment (Figure 3A) with the lowest E-value of 7.7e−502 as compared with ChExMix (5.7e−0.67), MACE elite (2.1e−246), MACE (2.0e+75) and PeakXus (3.3e+93).

De novo motif comparison between DEOCSU and competing tools using RpoN ChIP-exo datasets. (A) The RpoN motifs in Escherichia coli K-12 str. W3110 calculated from peaks predicted by DEOCSU and other published tools. The number of sites containing each motif sequence is indicated in parentheses. (B) Principal component analysis (PCA) plot based on the letter-probability values of RpoN motifs in different E. coli strains, BL21(DE3), CFT073, Crooks, O157:H7 EDL933, HS and K-12 str. W3110, and other bacteria, Klebsiella pneumoniae MGH78578 (KP) and Shigella flexneri 2457T (SF). The bacterial strains are specified by the shape of each data point (up-pointing triangle: BL21(DE3), left-pointing triangle: CFT073, right-pointing triangle: K-12 str. W3110, down-pointing triangle: HS, square: Crooks, star: O157:H7 EDL933, pentagon: KP and hexagon: SF). Except for PeakXus, a data point indicates the motif result from both ChIP-exo replicates of a bacterial strain. As PeakXus considers only one replicate for its peak calling, there are two data points for PeakXus where each of them represents the motif result from one replicate. (C) Bar graphs comparing the intracluster distance of motifs resulting from the prediction by different tools. The lower value calculated for each bar graph indicates the more consistent sequence of motifs. The peak calling tools are color coded as follows: DEOCSU: blue, ChExMix: yellow, MACE-elite: green, MACE: orange and PeakXus: purple.
Figure 3

De novo motif comparison between DEOCSU and competing tools using RpoN ChIP-exo datasets. (A) The RpoN motifs in Escherichia coli K-12 str. W3110 calculated from peaks predicted by DEOCSU and other published tools. The number of sites containing each motif sequence is indicated in parentheses. (B) Principal component analysis (PCA) plot based on the letter-probability values of RpoN motifs in different E. coli strains, BL21(DE3), CFT073, Crooks, O157:H7 EDL933, HS and K-12 str. W3110, and other bacteria, Klebsiella pneumoniae MGH78578 (KP) and Shigella flexneri 2457T (SF). The bacterial strains are specified by the shape of each data point (up-pointing triangle: BL21(DE3), left-pointing triangle: CFT073, right-pointing triangle: K-12 str. W3110, down-pointing triangle: HS, square: Crooks, star: O157:H7 EDL933, pentagon: KP and hexagon: SF). Except for PeakXus, a data point indicates the motif result from both ChIP-exo replicates of a bacterial strain. As PeakXus considers only one replicate for its peak calling, there are two data points for PeakXus where each of them represents the motif result from one replicate. (C) Bar graphs comparing the intracluster distance of motifs resulting from the prediction by different tools. The lower value calculated for each bar graph indicates the more consistent sequence of motifs. The peak calling tools are color coded as follows: DEOCSU: blue, ChExMix: yellow, MACE-elite: green, MACE: orange and PeakXus: purple.

To expand the comparative analysis of tools’ performance and assess DEOCSU robustness to datasets from different bacterial origins, we used additional in-house RpoN ChIP-exo data of several E. coli wild-type strains (BL21(DE3), CFT073, Crooks, O157:H7 EDL933 and HS), Klebsiella pneumoniae MGH78578 and Shigella flexneri 2457T. RpoN amino acid sequence is well conserved in these species with complete consensus sequence at the DNA-binding H-T-H motif (Supplementary Figure S4). Therefore, we expected their binding motifs to be highly similar. To compare the uniformity of identified RpoN motifs among tools, we performed 2D PCA using values in letter-probability matrices of each RpoN motif (Figure 3B). The cluster formed by motifs calculated from DEOCSU predicted peaks was distinct, having the smallest intracluster distance of 0.75 as compared with other existing tools (Figure 3C). In other words, peak calling by DEOCSU resulted in the most consistent binding motifs of RpoN, reflecting the conservative nature of RpoN sequences among the tested bacteria. These results suggested that DEOCSU is a robust peak calling model which identifies bona fide ChIP-exo peaks of sigma factors including RpoN from data of different bacterial strains and species.

DEOCSU can predict the binding sites of unexplored prokaryotic DNA-binding proteins

The generalizability of DEOCSU to other DNA-binding proteins was assessed using the publicly available prokaryotic ChIP-exo information obtained from proChIPdb—a database for prokaryotic ChIP-derived data. For this analysis, the ChIP-exo data for catabolite repressor/activator (Cra) in E. coli MG1655 and ferric uptake regulator (Fur) in several strains were selected as model input (Supplementary Table S1). Besides the motif sequences, the distribution of binding site widths was considered in the evaluation of peak calling performance. It was observed that the width of predicted peaks for each TF was highly distributed over a specific length (Figure 4 and Supplementary Figure S5). DEOCSU identified peaks with binding sites of consistent width from each ChIP-exo data that conforms with the specific binding of TFs across the bacterial genome.

Results of DEOCSU application to ChIP-exo data of unexplored prokaryotic transcription factors. (A) PCA plot based on the letter-probability values of Cra motifs on four different carbon source conditions (star: EcoCyc, x: acetate, +: glucose, down-pointing triangle: galactose and circle: fructose). The motifs identified from DEOCSU prediction are compared with the references and motifs identified from predictions by other tools. The color coded as follows: EcoCyc: red, proChIPdb: black, DEOCSU: blue, ChExMix: yellow, MACE: orange, MACE-elite: green and PeakXus: purple. (B) Binding width distributions and Fur boxes identified from DEOCSU prediction for Fur ChIP-exo data of Klebsiella, Salmonella and Yersinia.
Figure 4

Results of DEOCSU application to ChIP-exo data of unexplored prokaryotic transcription factors. (A) PCA plot based on the letter-probability values of Cra motifs on four different carbon source conditions (star: EcoCyc, x: acetate, +: glucose, down-pointing triangle: galactose and circle: fructose). The motifs identified from DEOCSU prediction are compared with the references and motifs identified from predictions by other tools. The color coded as follows: EcoCyc: red, proChIPdb: black, DEOCSU: blue, ChExMix: yellow, MACE: orange, MACE-elite: green and PeakXus: purple. (B) Binding width distributions and Fur boxes identified from DEOCSU prediction for Fur ChIP-exo data of Klebsiella, Salmonella and Yersinia.

ChIP-exo data of Cra in E. coli MG1655 were obtained under four different sole carbon source conditions, including acetate, fructose, galactose and glucose [8]. The Cra motifs identified from peak calling by DEOCSU and by other existing tools were compared with the known Cra motifs retrieved from proChIPdb and EcoCyc (tGCTGAATCGATTCAGCa, https://ecocyc.org). The PCA plot showed that only the Cra motifs resulting from DEOCSU predictions formed a well-defined cluster that was close to the known Cra motif from EcoCyc. This cluster also overlapped with those motifs reported by proChIPdb on acetate, galactose and glucose. In case of fructose condition, proChIPdb provided a different binding pattern of Cra from those on other carbon sources. Interestingly, when we reduced the peak threshold to 0.5, the motif on fructose from DEOCSU peaks was similar to that of proChIPdb (Supplementary Figure S5B). Therefore, it can be inferred that DEOCSU outperforms other tools on unlabeled ChIP-exo data of TF in E. coli, emphasizing its applicability in determining the genome-wide binding events of E. coli DNA-binding proteins under different conditions.

Next, we confirmed the generalizability of DEOCSU on unlabeled ChIP-exo data of TF in diverse bacterial species using the data of Fur in E. coli, Klebsiella, Salmonella and Yersinia. Although proChIPdb also provides ChIP-exo of Fur in Shigella and Pseudomonas, we excluded them in our analyses due to their poor quality. The dataset of Shigella showed a high level of background noise, and there was inconsistent dataset information provided for Pseudomonas in proChIPdb and NCBI. For the remaining datasets, we identified Fur binding motifs containing a consensus sequence resembling the known Fur box [4] (Figure 4B). The binding widths were normally distributed. Furthermore, most of the binding sites predicted by DEOCSU overlapped with those reported in the reference database (Supplementary Figure S6), indicating the reliability of DEOCSU outputs.

Collectively, DEOCSU not only predicted a lower number of false positives than other competing tools but also achieved accurate bona fide peak prediction on datasets of different prokaryotic DNA-binding proteins in several species, suggesting its versatility in the genome-wide studies of prokaryotes.

DEOCSU is generalizable for ChIP-exo data including eukaryotes and archaea

The remaining question was whether DEOCSU applies to the ‘three domains of life’. To address this, DEOCSU performance was evaluated on ChIP-exo data of yeast Saccharomyces cerevisiae TF Reb1, mouse Forkhead box A1 (FoxA1), human CCCTC-binding factor (CTCF), archaeon Saccharolobus solfataricus transcription factor B (TFB) and Sulfolobus solfataricus transcription factor E beta subunit (TFEβ). Overall, the binding width distributions took the shape of a leptokurtic distribution, indicating the highly uniform length of the predicted binding sites (Figure 5). More importantly, features such as motif sequence or distance from transcription start site (TSS) were consistent with published findings. For instance, the predicted binding sites of yeast Reb1 showed a consistent width (35.69 bp on average) and the binding motifs identified were highly similar to the previously elucidated sequence containing the consensus TACCCG [22] (Figure 5A). Considering that the yeast ChIP-exo were produced by the state-of-art ChIP-exo 5.0 protocol [10], it is likely that DEOCSU is applicable for the analysis of ChIP-exo data generated by various protocols from the original protocol to the cutting-edge version.

Results of DEOCSU application to eukaryotic ChIP-exo data. Binding width distribution (top) and the corresponding motifs identified from DEOCSU prediction (bottom) for ChIP-exo data of (A) yeast (S. cerevisiae Reb 1), (B) mouse (Mus musculus FoxA1) and (C) human (Homo sapiens CTCF).
Figure 5

Results of DEOCSU application to eukaryotic ChIP-exo data. Binding width distribution (top) and the corresponding motifs identified from DEOCSU prediction (bottom) for ChIP-exo data of (A) yeast (S. cerevisiae Reb 1), (B) mouse (Mus musculus FoxA1) and (C) human (Homo sapiens CTCF).

Previously, ChIP-exo of mouse FoxA1 revealed that the most common binding width of this TF to be 33 bp [23]. Motif search from DEOCSU outputs not only showed a similar average binding width of 35.31 bp but also their Forkhead motif-containing features such as RYAAAYA (R = A/G, Y=C/T) [24] (Figure 5B). Out of six ChIP-exo datasets of three biological replicates from mouse liver samples, we randomly selected two datasets [Accession number: ERR336935 (Sample ID: do2435) and ERR336942 (Sample ID: do2160)] as the model input. Therefore, the results suggested that DEOCSU can maintain its accuracy in determining bona fide peaks from mouse ChIP-exo data and also its robustness to account for the variations among biological replicates.

As compared to the previous result which established 35 161 peaks with an average width of 52 bp for human CTCF [9], DEOCSU predicted 15 815 peaks with an average binding width estimated to be 47.59 bp (Figure 5C) while about 78% of peaks (12 345/15 815) were found to contain the combinatorial modules for CTCF-bound locations. When we lowered the threshold of peak probability to 0.5, the total number of peaks increased to 24 376. Of these, 18 467 peaks were identified to contain the motif. In addition, not only the modules 2 and 3, which were observed at approximately 97% (34 061/35 161) of the total peaks in the reference study, but also the minor modules 1 and 4 were identified when extending the window size of motif to 50 bp (Supplementary Figure S7). These results indicate that DEOCSU can certainly be useful for investigating the characteristics of human transcriptional activator and repressor.

Lastly, we validated the applicability of DEOCSU for archaeal studies using S. solfataricus transcription initiation factors (TFB and TFEβ) ChIP-exo datasets. Binding width distribution for TFB (Supplementary Figure S8A) and TFEβ (Supplementary Figure S8C) were positive kurtosis with similar average size of 51.43 and 47.71 bp, respectively. Both genome-wide occupancy of the TFB (Supplementary Figure S8B) and TFEβ (Supplementary Figure S8D) were found mostly around the TSS, which agrees with the previous evidence [25]. Furthermore, the aggregate profiles of predicted peaks location from TSS were highly similar between two transcription initiation factors in which they were both asymmetric with positive skew, which are well-matched with their characteristics of ternary assembly to form pre-initiation complex together with RNA polymerase and predominant occupancy in the promoter region, respectively.

Collectively, it can be inferred that DEOCSU can be used to determine the binding sites of DNA-binding proteins from novel ChIP-exo data regardless of its organism type without additional curation.

Discussion

Despite its informative advantages to identify binding sites of DNA-interacting proteins with high resolution, ChIP-exo requires time-consuming and labor-intensive peak curation steps, which limits its use. Despite the development of several peak calling tools, the issue of false peak prediction leading to failure of motif search and additional rounds of curation still remains. To overcome these hurdles, we developed DEOCSU—a deep CNN-based model for bona fide peak prediction which ultimately aims to relieve this bottleneck of ChIP-exo analysis. The model trained the consolidated image patterns of curated ChIP-exo peaks from E. coli MG1655 sigma factors. The performance of DEOCSU was validated and compared with other competing tools. Subsequently, its extensibility was evaluated on ChIP-exo data of different DNA-binding proteins in several species of prokaryote, eukaryote and archaea.

With the RpoN ChIP-exo datasets of E. coli strains and other bacteria belonging to Enterobacteriaceae family such as Klebsiella and Shigella, DEOCSU demonstrated a better performance than other tools. Its resulting motifs shared the most consistent sequence, reflecting the complete conservation of the H-T-H motif found in the multiple sequence alignment of RpoN orthologs. In the PCA cluster of DEOCSU RpoN motifs, the data point of Klebsiella was located relatively further from those of other bacteria. This could be due to the higher level of mismatches of Klebsiella RpoN sequence as shown in the protein alignment (Supplementary Figure S4). Thus, Klebsiella RpoN might have a distinct binding pattern from the other species, resulting in diverse gene regulation by different transcription initiation. The data points of E. coli pathogenic strains (CFT073, EDL933) clumped together and stood separately from the other non-pathogenic E. coli strains. Shigella that has been known as pathogen was not grouped together. Further investigation into the roles of RpoN in regulating pathogenic gene expression in E. coli might provide interesting insights. Indeed, RpoN has been reported as a regulator of virulence factors among various pathogens including E. coli [2, 26].

With the public ChIP-exo data of prokaryotes (Cra, Fur), eukaryotes (yeast Reb1, mouse FoxA1 and human CTCF) and archaea (TFB and TFEβ), DEOCSU has proven its generalizability in these unlabeled datasets by reproducing the canonical motifs with high accuracy. Although DEOCSU was only trained with the ChIP-exo data of sigma factors in E. coli MG1655, DEOCSU can distinguish the nuances between image patterns of bona fide peak from that of false peak arising from the aligning errors to the reference genome. Therefore, DEOCSU can maintain its performance on data of other DNA-binding proteins originating from diverse species. However, its peak calling performance might still be dependent on the quality of ChIP-exo data. DEOCSU can perform peak calling even without biological replicates by data duplication of the single data. However, the number of peak candidates might be overestimated by the model. Indeed, we observed the increase in peak numbers and the decrease in peak resolution (increase the average value of binding width) when simulating single replicate was used for analysis with E. coli K-12 W3110 RpoN ChIP-exo data (Supplementary Figure S10).

On top of its accurate peak calling performance, DEOCSU has several additional advantages. Firstly, DEOCSU provides the probability value for true peak confidence where its threshold can be adjusted. As shown in the analysis of Cra ChIP-exo data, the number of predicted peaks could be increased by lowering the threshold for bona fide peaks prediction from the default parameter value of 0.8–0.5 (Supplementary Table S3). The resulting Cra binding motif on fructose changed to the similar motif pattern reported in proChIPdb (Supplementary Figure S5B). Therefore, this customizable feature allows researchers to be flexible with their ChIP-exo analysis purposes. Secondly, it is possible to optimize DEOCSU performance to detect more diverse binding patterns by iterative training novel data. To analyze single ChIP-exo data without replicates, it is possible to do peak calling with DEOCSU by duplicating images of each peak candidate. As of now, the performance of DEOCSU has only been evaluated for proteins binding to regulatory elements in form of monomer or homodimers, DEOCSU has not been optimized for ChIP-seq data or ChIP-exo data of multimeric binding DNA-binding proteins. Nevertheless, DEOCSU has demonstrated its usefulness in elucidating the in vivo binding locations and their sequence motifs of monomeric or homodimeric DNA-binding proteins in organisms across different domains.

In summary, DEOCSU provides a highly accurate prediction of DNA binding sites from ChIP-exo datasets by distinguishing image patterns of bona fide peaks using the deep CNN approach. The transferability of DEOCSU across different organisms was demonstrated in this study. Its peak calling allowed the identification of enriched and consistent motif instances from various pro-/eukaryotic and archaea ChIP-exo data. Additional merits such as customizable probability threshold, fine-tuning the built-in classification model through transfer learning and no limit on the number of datasets for peak calling highlight the user-friendly features of DEOCSU with tremendous potential in investigating DNA–protein interactions.

Key Points
  • DEOCSU entails a deep convolutional neural network model that was trained with curated ChIP-exo peak data to distinguish true-positive peaks from false peaks. Performance validation of DEOCSU indicated high accuracy, precision and recall.

  • While DEOCSU was primarily trained with ChIP-exo data of sigma factors in E. coli MG1655, the model maintained its peak calling efficacy on ChIP-exo datasets of different DNA-binding proteins in other bacterial strains, eukaryotes and archaea. As demonstrated in this paper, DEOCSU outperforms the competing tools as it accurately captured the true-positive peaks, allowing the most consistent binding motif sequence to be identified.

  • Features such as adjustable parameters, iterable updating of the pre-trained model and permitting peak calling of single ChIP-exo replicate by data duplication have enabled flexible optimization of DEOCSU for users’ specific needs. DEOCSU can be executed on a cloud computing platform or a local environment. The suite was developed with visualization software to facilitate the visual assessment of peak calling outputs against the raw ChIP-exo signals and genome annotations.

Acknowledgments

We gratefully acknowledge Deocsu Park for his abundant support and encouragement for the completion of this paper.

Authors’ contributions

I.B., S-M.L. and S.P designed the tool. I.B. generated and acquired data. S-M.L. and I.B. analyzed data. S.P. conceptualized and built the deep learning framework. I.B., S-M.L., S.P. and J.Y.P. prepared figures. S-M.L. drafted the paper. I.B., S-M.L., S.P., L.K.N., B.O.P. and D.K. edited the manuscript. D.K. provided mentorship, resources and guidance in planning and implementation. All authors participated in writing the paper.

Funding

National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (MSIT) [2022M3A9I5018934, 2021M3A9I402 4840]; UNIST Center for Waste Plastics Carbon Cycling (UWCC) funded by The Circle Foundation, Republic of Korea.

Data availability

The source code and documentation for DEOCSU is freely available at the public GitHub repository (https://github.com/SBML-Kimlab/DEOCSU).

Author Biographies

Ina Bang is a PhD student at the Ulsan National Institute of Science and Technology. Her research interests focus on transcriptional regulation and systems biology.

Sang-Mok Lee is a postdoctoral researcher at the Ulsan National Institute of Science and Technology. His research interests focus on studying transcriptional regulation of bacteria based on systems biology approach.

Seojoung Park is a PhD student at the Ulsan National Institute of Science and Technology. Her research interests focus on developing machine learning frameworks and analytical methods.

Joon Young Park is a PhD student at the Ulsan National Institute of Science and Technology. His research interests focus on systems biology.

Linh Khanh Nong is a PhD student at the Ulsan National Institute of Science and Technology. Her research interests focus on bacterial transcription regulation.

Ye Gao is a postdoctoral researcher at the Department of Bioengineering, University of California San Diego.

Bernhard O. Palsson is a professor at the Department of Bioengineering, University of California San Diego.

Donghyuk Kim is an associate professor at the School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology.

References

1.

Seo
SW
,
Kim
D
,
Szubin
R
, et al.
Genome-wide reconstruction of OxyR and SoxRS transcriptional regulatory networks under oxidative stress in Escherichia coli K-12 MG1655
.
Cell Rep
2015
;
12
:
1289
99
.

2.

Gao
Y
,
Yurkovich
JT
,
Seo
SW
, et al.
Systematic discovery of uncharacterized transcription factors in Escherichia coli K-12 MG1655
.
Nucleic Acids Res
2018
;
46
:
10682
96
.

3.

Seo
SW
,
Kim
D
,
O'Brien
EJ
, et al.
Decoding genome-wide GadEWX-transcriptional regulatory networks reveals multifaceted cellular responses to acid stress in Escherichia coli
.
Nat Commun
2015
;
6
:
7970
.

4.

Seo
SW
,
Kim
D
,
Latif
H
, et al.
Deciphering Fur transcriptional regulatory network highlights its complex role beyond iron metabolism in Escherichia coli
.
Nat Commun
2014
;
5
:
4910
.

5.

Park
JY
,
Rimal
H
,
Bang
I
, et al.
Genome-wide identification of DNA-protein interaction to reconstruct bacterial transcription regulatory network
.
Biotechnol Bioprocess Eng
2020
;
25
:
944
54
.

6.

Nguyen-Vo
TP
,
Ko
S
,
Ryu
H
, et al.
Systems evaluation reveals novel transporter YohJK renders 3-hydroxypropionate tolerance in Escherichia coli
.
Sci Rep
2020
;
10
:
19064
.

7.

Sastry
AV
,
Gao
Y
,
Szubin
R
, et al.
The Escherichia coli transcriptome mostly consists of independently regulated modules
.
Nat Commun
2019
;
10
:1–14.

8.

Kim
D
,
Seo
SW
,
Gao
Y
, et al.
Systems assessment of transcriptional regulation on central carbon metabolism by Cra and CRP
.
Nucleic Acids Res
2018
;
46
:
2901
17
.

9.

Rhee
HS
,
Pugh
BF
.
Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution
.
Cell
2011
;
147
:
1408
19
.

10.

Rossi
MJ
,
Lai
WKM
,
Pugh
BF
.
Simplified ChIP-exo assays
.
Nat Commun
2018
;
9
:
2842
.

11.

Wang
L
,
Chen
J
,
Wang
C
, et al.
MACE: model based analysis of ChIP-exo
.
Nucleic Acids Res
2014
;
42
:
e156
.

12.

Hartonen
T
,
Sahu
B
,
Dave
K
, et al.
PeakXus: comprehensive transcription factor binding site discovery from ChIP-Nexus and ChIP-exo experiments
.
Bioinformatics
2016
;
32
:
i629
38
.

13.

Yamada
N
,
Kuntala
PK
,
Pugh
BF
, et al.
ChExMix: a method for identifying and classifying protein-DNA interaction subtypes
.
J Comput Biol
2020
;
27
:
429
35
.

14.

Alipanahi
B
,
Delong
A
,
Weirauch
MT
, et al.
Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning
.
Nat Biotechnol
2015
;
33
:
831
8
.

15.

Zhou
J
,
Troyanskaya
OG
.
Predicting effects of noncoding variants with deep learning-based sequence model
.
Nat Methods
2015
;
12
:
931
4
.

16.

Zhang
Y
,
Wang
Z
,
Zeng
Y
, et al.
High-resolution transcription factor binding sites prediction improved performance and interpretability by deep learning method
.
Brief Bioinform
2021
;
22
:bbab273.

17.

Decker
KT
,
Gao
Y
,
Rychel
K
, et al.
proChIPdb: a chromatin immunoprecipitation database for prokaryotic organisms
.
Nucleic Acids Res
2022
;
50
:
D1077
84
.

18.

Langmead
B
,
Trapnell
C
,
Pop
M
, et al.
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
.
Genome Biol
2009
;
10
:1–10.

19.

Li
H
,
Handsaker
B
,
Wysoker
A
, et al.
The sequence alignment/map format and SAMtools
.
Bioinformatics
2009
;
25
:
2078
9
.

20.

Bailey
TL
,
Boden
M
,
Buske
FA
, et al.
MEME SUITE: tools for motif discovery and searching
.
Nucleic Acids Res
2009
;
37
:
W202
8
.

21.

Bonocora
RP
,
Smith
C
,
Lapierre
P
, et al.
Genome-scale mapping of Escherichia coli sigma54 reveals widespread
.
conserved intragenic binding. PLoS Genet
2015
;
11
:
e1005552
.

22.

Grau
J
,
Posch
S
,
Grosse
I
, et al.
A general approach for discriminative de novo motif discovery from high-throughput data
.
Nucleic Acids Res
2013
;
41
:
e197
.

23.

Serandour
AA
,
Brown
GD
,
Cohen
JD
, et al.
Development of an Illumina-based ChIP-exonuclease method provides insight into FoxA1-DNA binding properties
.
Genome Biol
2013
;
14
:1–9.

24.

Li
J
,
Dai
S
,
Chen
X
, et al.
Mechanism of forkhead transcription factors binding to a novel palindromic DNA site
.
Nucleic Acids Res
2021
;
49
:
3573
83
.

25.

Blombach
F
,
Fouqueau
T
,
Matelska
D
, et al.
Promoter-proximal elongation regulates transcription in archaea
.
Nat Commun
2021
;
12
:
5524
.

26.

Riordan
JT
,
Tietjen
JA
,
Walsh
CW
, et al.
Inactivation of alternative sigma factor 54 (RpoN) leads to increased acid resistance, and alters locus of enterocyte effacement (LEE) expression in Escherichia coli O157: H7
.
Microbiology (Reading)
2010
;
156
:
719
30
.

Author notes

Ina Bang, Sang-Mok Lee and Seojoung Park contributed equally to this work.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)