Fu-Ying Dao, Hao Lv, Dan Zhang, Zi-Mei Zhang, Li Liu, Hao Lin, DeepYY1: a deep learning approach to identify YY1-mediated chromatin loops, Briefings in Bioinformatics, Volume 22, Issue 4, July 2021, bbaa356, https://doi.org/10.1093/bib/bbaa356
Abstract
The protein Yin Yang 1 (YY1) can form dimers that facilitate the interaction between active enhancers and promoter-proximal elements. YY1-mediated enhancer–promoter interaction is a general feature of mammalian gene control. Recently, some computational methods have been developed to characterize the interactions between DNA elements by elucidating important features of chromatin folding; however, no computational method has been developed for identifying YY1-mediated chromatin loops. In this study, we developed a deep learning algorithm named DeepYY1, based on word2vec, to determine whether a pair of YY1 motifs would form a loop. The proposed models showed high prediction performance (AUCs ≥ 0.93) on both training and testing datasets in different cell types, demonstrating that DeepYY1 performs excellently in the identification of YY1-mediated chromatin loops. Our study also suggested that sequences play an important role in the formation of YY1-mediated chromatin loops. Furthermore, we briefly discussed the distribution of replication origin sites in the loops. Finally, a user-friendly web server was established; it can be freely accessed at http://lin-group.cn/server/DeepYY1.
Introduction
Higher-order chromatin structure is critically important for basic biological processes, such as cell differentiation, transcriptional regulation, genome replication and DNA repair [1]. The spatial organization of the genome is hierarchical, and its most basic structural unit is called the topologically associating domain (TAD) [2]. Specific DNA interactions between proximal promoters and distal regulatory elements form chromatin loops within these relatively stable chromatin domains [3]. Large DNA loops encompassing genes and their regulatory elements depend on interactions of the chromatin architectural protein CCCTC-binding factor (CTCF). CTCF and the associated cohesin complex play a central role in insulator function and higher-order chromatin organization of mammalian genomes, as shown in Figure 1A. However, most enhancer–promoter interactions do not employ the structural protein CTCF. The protein Yin Yang 1 (YY1) [4, 5] has been reported to bind hypomethylated DNA sequences and form homodimers, and it can contribute to enhancer–promoter interactions in a manner analogous to DNA looping mediated by CTCF [6] (Figure 1A). YY1 preferentially occupies interacting enhancers and promoters, and the loop structures mediated by YY1 are related to gene activation and repression and to gene dysregulation in cancer.

The outline of DeepYY1. (A) YY1 is a structural regulator of enhancer–promoter interactions and facilitates gene expression. (B) Visualization of the detailed architecture of DeepYY1.
Precise identification of interactions between regulatory elements is critically important not only for elucidating transcriptional regulation but also for understanding the mechanisms underlying complex human diseases [7]. Dissecting the factors underlying the formation and maintenance of genome structure requires high-throughput sequencing. Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA–PET) [8] and high-throughput chromosome conformation capture (Hi-C) [9, 10] techniques can capture chromosome interaction regions, and massive sets of experimental data have been accumulated. In addition, recent advances in deep learning have provided novel alternative approaches for chromatin interaction identification [7, 11–13]. Artificial intelligence can complement wet-lab experiments for characterizing chromatin interactions [14, 15].
To date, many computational methods have been developed to detect chromatin interactions at the level of regulatory elements [11–13, 16–22]. Most of these studies are based on genomic signature information and CTCF binding site information in large DNA loops. Weintraub et al. [6] found that YY1 is a structural regulator of enhancer–promoter loops in a manner analogous to the DNA interactions mediated by CTCF. YY1 preferentially occupies interacting enhancers and promoters, whereas CTCF preferentially occupies sites distal from these regulatory elements that tend to form larger loops and participate in insulation (Figure 1A). Therefore, the use of YY1 binding site information is important for the identification of small DNA loops.
Here, we developed a deep learning algorithm based on word embedding, called DeepYY1, to predict whether a pair of YY1 motifs would form a loop using YY1 binding site sequence features. DeepYY1 achieves high AUC values, indicating strong predictive ability. We show that our method can produce models with high discriminating ability and enhance our understanding of the factors controlling YY1-mediated chromatin loops. More generally, we emphasize that the DNA sequence contains important information about complex chromatin architectures.
Materials and methods
Data collection and preprocessing
Mumbach et al. [23] presented a protein-centric chromatin conformation method named HiChIP, which combined the advantages of Hi-C and ChIA–PET to obtain higher-resolution three-dimensional (3D) chromatin structural information with smaller datasets. Weintraub et al. [6] conducted YY1 HiChIP experiments in two different human cell types, including K562 and HCT116. They found that YY1 generally occupies sites involved in enhancer–promoter interactions in mammalian cells. Therefore, we downloaded YY1 HiChIP and YY1 ChIP-seq data of K562 and HCT116 cell types, respectively.
We defined a positive sample as a HiChIP chromatin loop (confidence probability ≥ 0.9) with either side of the paired regions containing a unique YY1 ChIP-seq peak. We defined a negative sample as a HiChIP loop with confidence probability = 0. For each pair in both the positive and negative samples, a sequence of 506 bp with the YY1 motif at its center was extracted. The YY1 motif position weight matrix (PWM) was obtained from JASPAR (ID: MA0095.1) [24, 25].
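Locating motif occurrences with a PWM amounts to sliding the matrix along the sequence and scoring each window. The sketch below illustrates the idea with a made-up 4-position probability matrix; it is NOT the actual JASPAR MA0095.1 profile, and the scoring scheme (log-odds against a uniform background) is one common convention, not necessarily the one used in the paper.

```python
import math

# Toy position probability matrix: TOY_PPM[pos][base] — illustrative only,
# not the real YY1 (MA0095.1) profile.
TOY_PPM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
BACKGROUND = 0.25  # uniform base frequencies

def log_odds_score(kmer, ppm=TOY_PPM, bg=BACKGROUND):
    """Log-odds score of a k-mer against the PPM (higher = better match)."""
    return sum(math.log2(ppm[i][b] / bg) for i, b in enumerate(kmer))

def best_hit(seq, ppm=TOY_PPM):
    """Return (start, score) of the highest-scoring window in seq."""
    k = len(ppm)
    hits = [(i, log_odds_score(seq[i:i + k], ppm))
            for i in range(len(seq) - k + 1)]
    return max(hits, key=lambda h: h[1])

pos, score = best_hit("GGATGCATGC")  # "ATGC" matches the toy matrix best
```

Once the best-scoring motif position is found, the 506 bp window follows by taking 250 bp of flank on each side of the 6 bp motif.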
To objectively evaluate the proposed models, we separated the benchmark dataset into two parts, the independent (testing) dataset and the training dataset, at a ratio of 3:7 [26]. The training dataset was used to train the model with a 10-fold cross-validation strategy, whereas the independent dataset was used to test the resulting model. Details on these benchmark datasets are provided in Supplementary Table S1 available online at https://academic.oup.com/bib.
Feature description
Formulating DNA sequences as an effective mathematical expression is a key step in building a classification model [27, 28], and choosing discriminative feature vectors helps generate a model with high generalization ability.
Word2vec
The popular word2vec [29] method is a two-layer neural network model that uses distributed representations to reduce word vectors from a high-dimensional to a low-dimensional space, avoiding the curse of dimensionality often encountered in deep learning. In natural language processing (NLP) [30], the quality of a model depends largely on the quality of its word vectors. Word2vec converts discrete words into dense vectors. In such a model, words that appear in the same contexts share semantic meaning; that is, words are embedded in a continuous vector space where ‘semantically similar’ words have closer vectors.
Each YY1 motif sequence in both the positive and negative samples was extended by 250 bp upstream and downstream, yielding the 506 bp sequences described above. Each 506 bp sequence was then scanned by a k-sized sliding window with a step size of 1 bp, converting it into multiple overlapping subsequences of fixed length k [31]. We treated these subsequences as DNA ‘words’. The collection of all possible ‘words’ in the 506 bp DNA sequences was defined as the vocabulary.
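The tokenization of a sequence into DNA ‘words’ can be sketched in a few lines; the function name here is our own, but the logic follows the sliding-window procedure described above.

```python
def dna_words(seq, k=6):
    """Split a DNA sequence into overlapping k-mers (the DNA 'words'),
    using a k-sized window with a step size of 1 bp."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

words = dna_words("ACGTACGT", k=6)
# a 506 bp sequence yields 506 - 6 + 1 = 501 words of length 6
```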
We used the gensim software package [33], a Python-based NLP library for extracting semantic information from documents, to train the DNA word vectors. When training the word vectors with the CBOW model, the sliding window size was set to 20 and the dimension of the target vector to 100. Because each word vector carries important information, the minimum filtering frequency was set to 1 (i.e. no words were filtered), and the number of training iterations was 50.
After training the word2vec model, we obtained a hash table whose keys are DNA words and whose values are vectors. We then encoded each DNA sequence by looking up the embedding vector of every DNA word in the DNA ‘sentence’ and taking the average of these vectors as the vector for the sequence. In this study, we set the word length to 6.
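The encoding step can be sketched as follows. The paper trains embeddings with gensim's CBOW model, roughly `Word2Vec(sentences, sg=0, window=20, vector_size=100, min_count=1, epochs=50)` (gensim ≥ 4 parameter names; older versions use `size=`/`iter=`). To keep the sketch self-contained, a tiny hand-made embedding table stands in for the trained model.

```python
import numpy as np

# Toy stand-in for the trained word2vec hash table (keys = DNA words,
# values = vectors); the real table maps 6-mers to 100-d vectors.
EMB = {
    "ACGTAC": np.array([1.0, 0.0]),
    "CGTACG": np.array([0.0, 1.0]),
    "GTACGT": np.array([1.0, 1.0]),
}

def encode_sequence(seq, emb, k=6):
    """Average the embedding vectors of all k-mer 'words' in the sequence."""
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    vectors = [emb[w] for w in words if w in emb]
    return np.mean(vectors, axis=0)

vec = encode_sequence("ACGTACGT", EMB)  # mean of the three word vectors
```

Averaging discards word order but keeps the sequence vector at a fixed dimension regardless of sequence length, which is what makes the downstream model input uniform.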
Convolutional neural networks
Because deep learning [34] can efficiently capture local contextual features from unstructured data, the convolutional neural network (CNN) [35] was used in this study to construct the YY1-mediated chromatin loop prediction model. As illustrated in Figure 1B, the basic architecture of the CNN consists of six layers: one convolutional layer, one pooling layer, three fully connected layers and one softmax layer. First, the encoding array generated by word2vec connects directly to the convolutional layer, which scans the array and produces a feature vector. Second, a max-pooling layer selects the maximum neuron output of the local patterns as its output. Third, the fully connected layers transform this vector in turn. Finally, the output vector of the last fully connected layer is fed into the softmax layer, which assigns the corresponding classification probability to a query sequence.
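A minimal Keras sketch of this six-layer architecture is given below. The layer sizes, kernel size and input shape are illustrative placeholders, not the published hyperparameters (those are listed on the DeepYY1 download page).

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_deepyy1_like(input_len=200, emb_dim=100):
    """Illustrative CNN: conv -> max-pool -> three dense layers -> softmax."""
    model = keras.Sequential([
        layers.Input(shape=(input_len, emb_dim)),          # word2vec encoding array
        layers.Conv1D(filters=64, kernel_size=8, activation="relu"),
        layers.MaxPooling1D(pool_size=4),                  # keep strongest local patterns
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(2, activation="softmax"),             # loop vs non-loop probability
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model

model = build_deepyy1_like()
probs = model.predict(np.zeros((1, 200, 100)), verbose=0)  # shape (1, 2)
```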
This newly constructed YY1-mediated chromatin loops prediction model was implemented with Python programming language and the TensorFlow library [36]. Detailed information on the parameters of this CNN strategy is available at http://lin-group.cn/server/DeepYY1/download.html.
Results and discussion
Compare word2vec with other feature extraction methods
According to the data and features described in the Materials and methods, we developed a machine learning approach to predict whether a pair of YY1 motifs can form a chromatin loop. Figure 1 illustrates the workflow of our algorithm, named DeepYY1. For each training dataset of the different cell types, we first extracted feature vectors with the word2vec algorithm (k = 6) to generate a feature set. Next, prediction models were trained using a CNN in a 10-fold cross-validation test [37]. We compared word2vec with other feature extraction methods, such as nucleic acid composition (NAC), dinucleotide composition (DNC), trinucleotide composition (TNC), k-mer, electron–ion interaction pseudopotential (EIIP) value and natural vector (NV) [38, 39]. We used AUC values [40] as the evaluation index because they describe the overall performance of a model. The receiver operating characteristic (ROC) curves in Supplementary Figure S1 available online at https://academic.oup.com/bib show the prediction performances of the different feature extraction methods based on a CNN with 10-fold cross-validation. Word2vec consistently produced superior results (AUCs ≥ 0.93) in both cell types. In addition, we noticed that the k-tuple nucleotide composition features (NAC, DNC, TNC and 4-mer) also performed well; for example, the AUC value of 4-mer was better than that of word2vec in the HCT116 cell line. Therefore, we further analyzed the models generated by word2vec and the traditional k-mer method for the same k values.
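For readers unfamiliar with the metric, AUC can be computed without plotting a full ROC curve via the Mann-Whitney U statistic: it equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one (ties count half). A small self-contained sketch:

```python
def auc(scores, labels):
    """AUC from prediction scores and binary labels (1 = loop, 0 = non-loop),
    computed as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # perfect ranking -> 1.0
```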
For a more intuitive comparison, the AUC values and program running times of the two algorithms are shown in Figure 2 to further demonstrate word2vec's ability to encode DNA sequences. From a comprehensive view of the two cell lines, we can draw the following conclusions. First, in the K562 and HCT116 cell lines, the word2vec method led to stable model performance across different k-mer word lengths, as all AUC values were greater than 0.93 (dashed lines). In contrast, the performance of the models generated by the traditional k-mer method fluctuated sharply as the feature dimension increased; these models performed worst when k = 6. Thus, word2vec can capture more information from short-range sequences while keeping model performance stable, whereas performance decreases under the traditional k-mer method when longer short-range sequence information is extracted, resulting in a non-robust model. These results also indicate the effectiveness of the word2vec method for extracting motifs with long segments. Second, according to the running time curves in Figure 2A and B, when k ≤ 4 the computation times of the two feature extraction methods were similar. However, when k > 4, the running time increased sharply for the model based on the traditional k-mer method, while word2vec remained computationally efficient and saved computing resources.

Comparison of model performance and running time based on word2vec and traditional k-mer methods in K562 (A) and HCT116 (B).
In essence, word2vec performs a dimensionality reduction: the generated dimension (equal to the number of hidden-layer nodes) is generally much smaller than the vocabulary size. In contrast, the vector dimension produced by the traditional k-mer method grows exponentially with k, which greatly consumes computational resources. Moreover, word2vec better expresses similarity and analogy relationships between different words and captures information at the semantic level, so it can better uncover the connections between words; it therefore represents sequence information better than the traditional k-mer method. This finding also demonstrates the importance of considering the computational efficiency of feature-encoding methods while ensuring high model performance. Taken together, these results suggest that the sequence-based features obtained by word2vec are informative for predicting YY1 chromatin loops.
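The exponential growth is easy to make concrete: a k-mer count vector has one dimension per possible k-mer over the 4-letter alphabet, i.e. 4**k dimensions, while the word2vec embedding stays at the fixed dimension chosen at training time (100 in this study) for any k.

```python
W2V_DIM = 100  # embedding dimension used in this study

# Compare the feature dimension of k-mer counting vs word2vec as k grows.
for k in range(1, 9):
    kmer_dim = 4 ** k  # one dimension per possible k-mer over {A, C, G, T}
    print(f"k={k}: k-mer dim = {kmer_dim:>6}, word2vec dim = {W2V_DIM}")
# at k = 6 the k-mer vector already has 4096 dimensions vs 100
```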
Analysis of the features extracted by word2vec
We trained a CNN with 10-fold cross-validation on the sequence-based feature sets encoded by word2vec; the resulting model was named DeepYY1. Based on the above analysis of model performance, DeepYY1 yields a robust classification model. However, how the feature set from word2vec contributes to improving model performance remained unclear. To answer this question, we visualized the distribution of the samples in the vector space based on the features obtained from word2vec. t-distributed stochastic neighbor embedding (t-SNE) assigns each data point a location in a two-dimensional (2D) or 3D map and is particularly well suited for the visualization of high-dimensional datasets [41].
We applied t-SNE to reduce the feature space to 2D and plotted the result on a 2D plane. As shown in Figure 3, the x-axis and y-axis form the 2D plane, and each point represents a sample, with its color representing the label (loop or non-loop). We found that most of the positive and negative samples in the feature space fell into two clear clusters, suggesting that word2vec can learn representative features of loops and non-loops and thereby construct an efficient vector space for loop/non-loop classification. This also explains the feature selection result in Supplementary Figure S2 available online at https://academic.oup.com/bib, in which model performance improves as the feature number increases in the incremental feature selection (IFS) procedure based on the F-score [42]. Therefore, such nearly perfect features do not require further feature selection.
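The visualization step can be sketched with scikit-learn's t-SNE implementation. The feature matrix below is a random placeholder standing in for the real 100-dimensional word2vec features, so the coordinates are illustrative only.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(60, 100))   # placeholder: 60 samples, 100-d word2vec features
labels = np.array([1] * 30 + [0] * 30)  # 1 = loop, 0 = non-loop

# Project the high-dimensional features to a 2D map.
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(features)
# coords[:, 0] / coords[:, 1] give the x / y positions to scatter-plot,
# colored by `labels`, as in Figure 3
```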

Visualization using t-SNE based on features from word2vec that were used in DeepYY1.

Independent dataset validation and cross-cell validation were used to analyze the robustness and reliability of the proposed models. (A) ROC curves of the independent datasets for different cell types; AUC values are 0.9338 and 0.9485 for K562 and HCT116, respectively. (B) Heat map showing AUC values for cross-cell-type prediction. Each cell-specific model, trained on its own training dataset (rows), was validated on data from the same cell type as well as on independent data from the other datasets (columns).

The distribution of ORIs in chromatin loops. (A) ORIs tended to be located at the base of the CTCF-mediated chromatin loops. (B) ORIs were enriched in the initial portion of the YY1-mediated chromatin loops.
Finally, several machine learning algorithms, including the artificial neural network (ANN) [43], K-nearest neighbor (K-NN) [44], Naive Bayes (NB) [45], random forest (RF) [46] and AdaBoost [47], were compared with the CNN to find the best model based on the same features produced by word2vec. As shown in Supplementary Figure S3 available online at https://academic.oup.com/bib, the CNN is superior to the other classification algorithms in both cell types. Hence, the optimal classification models were obtained by combining word2vec and a CNN.
The robustness and reliability analysis
To further investigate the robustness and stability of the proposed models, we designed the following experiments.
Independent datasets have often been used to further test the robustness and stability of the proposed classification models [48]. Minimal over-fitting on the independent dataset indicates that the model has a strong generalization ability. We established independent datasets for each cell type as shown in Supplementary Table S1 available online at https://academic.oup.com/bib. The corresponding results for the different independent datasets are shown in Figure 4A using ROC curves [49]. The AUC values of the independent test in K562 and HCT116 were 0.9338 and 0.9485, respectively. These results indicate that our classification models based on DeepYY1 were sufficient for identifying the YY1-mediated chromatin loops for unknown sequences.

A screenshot showing the DeepYY1 web server page. The website is available at http://lin-group.cn/server/DeepYY1/.
In addition, we conducted cross-cell validation based on the models described above, to see whether a model trained with data from one cell type could recognize the interaction between two sequences in another cell type. The idea of transfer learning [50] was used to design an experiment studying the relationships between the interacting cell types. The two cell-specific models were first constructed from the training datasets of the two different cell types, respectively. Subsequently, for each model, the two cells' training datasets were regarded as independent testing datasets to evaluate model performance. A heat map of the obtained AUCs is shown in Figure 4B; the models in the rows were tested on the datasets in the columns. Each model achieves the best accuracy on its own dataset (AUCs > 0.96) but performs poorly on the dataset of the other cell type, indicating clear specificity of the YY1 binding site sequences among the different cell types. Therefore, a model for one cell type cannot be used to identify YY1-mediated chromatin loops in another cell type.
Distribution of ORIs on chromatin loops
Much recent work has focused on the mechanism of DNA replication in the spatial organization of the genome [51–53]. We are highly familiar with research relating to DNA replication and are interested in the relationship between DNA replication origin sites (ORIs) and chromatin loops in the 3D genome. To address this question, we examined the distribution of ORIs on loops from the K562 cell line. ORI data were downloaded from DeOri [54], one of the most popular and well-known databases of ORIs in eukaryotic genomes. For a more comprehensive comparison, we also obtained CTCF-mediated chromatin loop data for the K562 cell line from [55]. The distribution curves of ORIs in both CTCF-mediated and YY1-mediated chromatin loops are shown in Figure 5.
ORIs were mainly concentrated at the two ends of the CTCF-mediated loops; that is, ORIs were located at the bases of the CTCF-mediated loops (Figure 5A). This result indicates that multiple ORIs in a single replication domain may share DNA replication activation factors, facilitating synchronous activation [51, 56]. However, a similar distribution was not observed in Figure 5B in which the ORIs were enriched in the initial portions of the YY1-mediated loops. These findings show that the distributions of ORIs between the CTCF-mediated loop and YY1-mediated loop differed; nevertheless, the specific mechanism driving these differences requires additional study.
Web server guide
Based on the proposed models, we built a user-friendly online web server, called DeepYY1, to identify whether a pair of DNA sequences would form a chromatin loop. The web server is freely available at http://lin-group.cn/server/DeepYY1. When users click the ‘Web-Server’ button, the page jumps to the interface shown in Figure 6. Following the instructions, users can upload query sequences in two ways. The first is to paste the DNA sequences into the two input boxes; note that the pasted sequences should not be in FASTA format, but plain fragments of DNA sequence. Users can click the ‘example’ button to check the correct sequence format. The second is to upload a sequence file directly; for the file format, please refer to the instructions or click the link to view it. Subsequently, clicking the ‘submit’ button returns the predicted result. We recommend the second way of uploading sequences, because the first can query only one pair of DNA sequences at a time, whereas the second can identify interactions for multiple pairs of sequences.
Conclusion
In this study, a CNN-based model, named DeepYY1, was developed to predict YY1-mediated chromatin loops using sequence-based features captured by the word2vec model. The results of both the 10-fold cross-validation and independent testing demonstrated that word2vec features alone can predict loop-forming YY1 motif pairs with high performance. Overall, we believe that the DeepYY1 method is an important step toward improving our understanding of the principles governing YY1-mediated chromatin loops. We hope that this work can be used in future studies focused on decoding the information in genome sequences that determines complex chromatin architectures. In addition, the mechanism of DNA replication in the 3D genome requires further study. We also hope that the results of this study motivate future work in this field.
We provide the first study to identify YY1-mediated chromatin loops with sequence-based features captured by word2vec.
The CNN-based model, named DeepYY1, which combines word2vec and CNNs, offers a new algorithm-combination perspective.
The excellent performance of DeepYY1 was verified by 10-fold cross-validation and independent testing.
Relevant analysis of the mechanism of DNA replication in the 3D genome is provided.
Author’s Contribution
F.-Y.D. contributed to the methodology, coding and writing during the original draft preparation. H.Lv contributed to the data curation and writing. D.Z. contributed to the investigation and coding. Z.-M.Z. contributed to the investigation and methodology. L.L. contributed to the validation and conceptualization. H.Lin contributed to the conceptualization, writing, reviewing and editing.
Funding
This work was supported by the National Natural Science Foundation of China (61772119, 61961031) and the Sichuan Provincial Science Fund for Distinguished Young Scholars (2020JDJQ0012).
Declaration of interests
The authors declare that they have no competing interests.
Data Availability
We provide the benchmark datasets and Python source code of DeepYY1, which are freely available at http://lin-group.cn/server/DeepYY1/download.html.
Fu-Ying Dao is a PhD candidate of the Center for Informational Biology at the University of Electronic Science and Technology of China. Her research interests are bioinformatics, machine learning and DNA replication regulation.
Hao Lv is a PhD candidate of the Center for Informational Biology at the University of Electronic Science and Technology of China. His research interests are bioinformatics, machine learning and DNA and RNA modification.
Dan Zhang is a MS candidate of the Center for Informational Biology at the University of Electronic Science and Technology of China. Her research interests are bioinformatics and machine learning.
Zi-Mei Zhang is a MS candidate of the Center for Informational Biology at the University of Electronic Science and Technology of China. Her research interest is bioinformatics.
Li Liu is a lecturer of the Laboratory of Theoretical Biophysics at the Inner Mongolia University. His research field is bioinformatics.
Hao Lin is a professor of the Center for Informational Biology at the University of Electronic Science and Technology of China. His research is in the areas of bioinformatics and system biology.