Chen-Chen Li, Bin Liu, MotifCNN-fold: protein fold recognition based on fold-specific features extracted by motif-based convolutional neural networks, Briefings in Bioinformatics, Volume 21, Issue 6, November 2020, Pages 2133–2141, https://doi.org/10.1093/bib/bbz133
Abstract
Protein fold recognition is one of the most critical tasks for exploring the structures and functions of proteins based on their primary sequence information. Existing protein fold recognition approaches rely on features reflecting the characteristics of protein folds; however, feature extraction remains the bottleneck for improving their performance. In this paper, we propose two new feature extraction methods, MotifCNN and MotifDCNN, which construct motif-based convolutional neural networks (CNNs) from structural motif kernels to extract more discriminative fold-specific features. The pairwise sequence similarity scores calculated from these fold-specific features are then fed into support vector machines to construct a predictor for fold recognition, called MotifCNN-fold. Experimental results on the benchmark dataset showed that MotifCNN-fold clearly outperformed all the other competing methods. In particular, the fold-specific features extracted by MotifCNN and MotifDCNN are more discriminative than those extracted by other deep learning techniques, indicating that incorporating structural motifs into the CNN captures the characteristics of protein folds.
Introduction
Protein fold recognition is one of the most important tasks for exploring the structures and functions of proteins based on their sequence information. However, identifying protein folds from sequence information alone remains challenging because of the low sequence similarities involved (less than 25% sequence identity) [1, 2]. In this regard, researchers have explored new approaches to this important and difficult task and proposed computational methods that fall mainly into three categories: alignment methods, machine learning methods and ensemble methods. Alignment methods [3, 4] focus on detecting local and global pairwise sequence similarities, including sequence–sequence alignment methods, profile–sequence alignment methods (see [5]) and profile–profile alignment methods (see [6, 7]). Machine learning methods treat protein fold recognition as a fold-level classification task using classical machine learning techniques (see [8–11]) or deep learning techniques (see [12]). Ensemble methods adopt a consensus strategy to integrate multiple recognition methods (see [12–18]).
All the aforementioned computational methods rely mainly on traditional sequence-based or structure-based features, or on features extracted by deep learning techniques, and have contributed to the development of protein fold recognition. However, feature extraction remains the bottleneck for improving the performance of machine learning-based methods [1, 19–23]. It is highly desirable to combine intelligent protein representations with biological structure information to automatically extract discriminative features from protein sequences.
A protein structural motif is a supersecondary structure [24] describing the connectivity between secondary structural elements. Because protein structures are more conserved than their sequences, two proteins with low sequence similarity may still share similar structural motifs. Protein structural motifs can therefore be considered a bridge between protein structures and their sequences. In this study, we propose new fold-specific features by incorporating structural motifs into convolutional neural networks (CNNs) to explore more discriminative features for protein fold recognition. We introduce two motif-based CNN models to extract more discriminative fold-specific features with biological attributes, considering the evolutionary information from position-specific frequency matrices (PSFMs) and the structural information from residue–residue contacts (CCMs). Based on these fold-specific features, we constructed the feature vector via the pairwise sequence similarity scores calculated from the fold-specific features, following a recent study [25]. Combined with support vector machines (SVMs), a new computational predictor called MotifCNN-fold was established for protein fold recognition. Experimental results on a widely used benchmark dataset (LE) showed that MotifCNN-fold outperformed other competing predictors. Furthermore, we analyzed the fold-specific features and showed that those extracted by the motif-based CNN models are more discriminative than the fold-specific features extracted by the traditional neural network models reported in [25], such as the convolutional neural network-bidirectional long short-term memory (CNN-BLSTM) and the deep convolutional neural network-bidirectional long short-term memory (DCNN-BLSTM).
Materials and methods
Benchmark dataset
The benchmark dataset for protein fold recognition should be constructed following a rigorous criterion: proteins in the training set and the test set may be in the same fold but must come from different superfamilies [15]. The LE dataset [26] is a widely used rigorous dataset based on the SCOP database; it contains 976 sequences covering 330 folds derived from SCOP, sharing pairwise sequence identity of less than 40%. In this dataset, 321 proteins have at least one match at the fold level. To simulate the protein fold recognition task rigorously, the LE dataset was partitioned into two subsets at the fold level to ensure that two proteins from different subsets may belong to the same fold but must be in different superfamilies [1, 14].
Protein representations
In this study, following a previous study [25], two protein representations were employed, including residue–residue contacts (CCMs) generated by running CCMpred [30], and PSFMs [5]. The CCM contains the predicted structural information of proteins, which is a widely used representation for protein fold recognition [1, 13, 31]. The PSFM is a profile-based representation containing the evolutionary information, which is useful for analyzing proteins sharing low sequence similarities [32].
To generate the CCM, the target protein was searched against the uniprot20_2016_02 database with the HHblits tool [33] to generate its multiple sequence alignment (MSA). The MSA was then analyzed by running CCMpred [30] with default parameters to analyze the coevolution between residues and assign a contact probability to each residue–residue pair.
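The CCM generation pipeline described above can be sketched as command lines. This is a minimal illustration: the flags follow the HHblits and CCMpred documentation, but the file names (`query.a3m`, `query.aln`, `query.mat`) are assumptions, and converting the a3m alignment to CCMpred's plain alignment format is left out.

```python
def hhblits_cmd(fasta, db="uniprot20_2016_02", a3m="query.a3m", n_iter=3):
    """Build the HHblits command line to generate the MSA of the target
    protein (flags per the HHblits docs; file names are illustrative)."""
    return ["hhblits", "-i", fasta, "-d", db, "-oa3m", a3m, "-n", str(n_iter)]

def ccmpred_cmd(aln="query.aln", mat="query.mat"):
    """Build the CCMpred command line, run with default parameters as in
    the text; CCMpred expects a plain one-sequence-per-line alignment."""
    return ["ccmpred", aln, mat]

# The resulting lists can be passed to subprocess.run() once the tools
# are installed and the a3m file has been converted to .aln format.
```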
The PSFM was calculated from the MSA generated by PSI-BLAST [5] with an E-value of 0.001 and three iterations, searching against the nrdb90 database.
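A PSFM simply records, for each alignment column, the frequency of each amino acid among the aligned sequences. The following is a minimal pure-Python sketch of that idea; the function name and gap handling are illustrative, not the PSI-BLAST implementation.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def psfm_from_msa(msa):
    """Compute a position-specific frequency matrix from an MSA.

    msa: list of equal-length aligned sequences (gaps as '-').
    Returns one dict per column mapping residue -> observed frequency
    among the non-gap characters of that column.
    """
    profile = []
    for col in range(len(msa[0])):
        counts = Counter(seq[col] for seq in msa if seq[col] != "-")
        total = sum(counts.values()) or 1  # avoid division by zero
        profile.append({aa: counts.get(aa, 0) / total for aa in AMINO_ACIDS})
    return profile

# Toy example: three aligned 4-residue sequences.
profile = psfm_from_msa(["ACDA", "ACD-", "GCDA"])
# Column 0 holds A, A, G, so profile[0]["A"] is 2/3.
```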
Extracting fold-specific features by motif-based CNNs
Previous studies [1, 2, 13] showed that extracting features associated with fold types is critical for improving fold recognition performance; furthermore, the fold-specific features automatically generated by deep learning techniques are more accurate than traditional features. An improved method called DeepSVM-fold [25] constructs three different deep neural networks to extract fold-specific features from evolutionary information (position-specific scoring matrices (PSSMs) and position-specific frequency matrices (PSFMs)) and structural information (residue–residue contacts (CCMs)). Deep neural networks learn the features associated with fold types by adjusting the connection strengths of their neurons. This process filters out noise and retains only the most relevant features.
However, the learning process of deep neural networks is a black box, and the learned features lack biological evidence and interpretability. How to incorporate biological attributes into deep learning to improve the biological evidence and interpretability of these features is the main motivation of this study. In this regard, we introduce two motif-based CNN models (MotifCNN and MotifDCNN) that incorporate structural motifs into the CNN, aiming to extract more discriminative fold-specific features with biological attributes, considering the evolutionary information from PSFMs and the structural information from CCMs.
To design the motif-based CNN models, specific structural motifs were selected, and the corresponding motif-based convolutional kernels were designed from them. The motif-based CNN models (MotifCNN and MotifDCNN) were then constructed from these kernels to extract fold-specific features from PSFMs and CCMs, respectively. MotifCNN extracts the evolutionary information from PSFMs, aiming to detect whether a k-mer in a protein matches a structural motif of a specific fold. MotifDCNN extracts the structural information from CCMs, aiming to detect whether a k-mer is related to a structural motif of a specific fold.
Figure 1. The network architecture of MotifCNN to extract fold-specific features from PSFMs.
In order to extract the fold-specific features of the proteins on the benchmark dataset, a comprehensive database [25] was employed to train the models, which consists of 19 772 proteins covering 1211 fold types. All the proteins in the sequence database share <40% sequence similarity with the proteins in the LE dataset. The motif-based CNN models were implemented by the TensorFlow framework. Batch normalization [34] and the dropout technique [35] were used.
Protein structural motif selection
Structural motifs are important for protein folds because they form the common structural cores that maintain the particular spatial patterns of protein folds [36]; the prominent features of motifs in protein folds are helices or sheets. In this study, the protein structural motifs were selected from MegaMotifBase [36], a structural motif database for protein families and superfamilies, available at http://caps.ncbs.res.in/MegaMotifbase/sflist.html. The MegaMotifBase database provides the key properties of structural motifs, including solvent inaccessibility, solvent accessibility, alpha helix, beta strand, 3-10 helix, hydrogen bond to main-chain amide, hydrogen bond to main-chain carbonyl, disulfide bond, positive phi, etc. The structural motifs were selected according to the superfamilies in the LE dataset, and their corresponding MSAs were extracted from the MegaMotifBase database [36]. In total, we obtained 128 structural motifs; for detailed information on them, please refer to Supplementary Information S1.
Extracting fold-specific features from PSFMs by MotifCNN
Feature extraction is a very important step in constructing a computational predictor [23, 37, 38]. To extract more discriminative fold-specific features from evolutionary information profiles (PSFMs), we introduce the protein structural motif kernel and use it to design a CNN called MotifCNN. The MotifCNN architecture is shown in Figure 1. MotifCNN contains a randomly initialized convolutional layer and a motif-based convolutional layer, each followed by a Max-pooling layer. The pooling results are concatenated and connected to the fully connected layer to learn the fold-specific features. The randomly initialized convolutional layer identifies local patterns in protein sequences, and the motif-based convolutional layer identifies whether a protein sequence contains particular structural motifs and their occurrence probabilities. The fold-specific features are then extracted by capturing the dependency information between the motif feature patterns in the fully connected layer. For detailed parameters of MotifCNN, please refer to Supplementary Information S1.
The convolutional layer is built from multiple convolutional kernels, local perceptual domains and shared weights, and comprises a randomly initialized convolutional layer and a motif-based convolutional layer. In the convolutional layer, the input matrix is divided into different local perceptual domains, and the information in these domains is connected to different convolutional kernels. Specifically, the randomly initialized convolutional layer initializes different convolutional kernels at random and automatically adjusts their weights by gradient descent to detect local patterns in the protein sequence.
Figure 2. The network architecture of MotifDCNN to extract fold-specific features from CCMs.
Max-pooling layers then extract the maximum value from each feature map to form the motif features, which are fed into the fully connected layer to extract the fold-specific features of the protein.
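The motif-based convolution and Max-pooling steps above can be sketched as follows: a fixed (non-trainable) kernel derived from a motif's own frequency profile is slid over the protein's PSFM, and Max-pooling keeps the best match along the sequence. This is a simplified illustration under assumed data layouts, not the exact MotifCNN kernel design.

```python
def motif_conv_scores(psfm, motif_kernel):
    """Slide a fixed motif-derived kernel over a protein PSFM.

    psfm: list of L dicts (residue -> frequency), one per position.
    motif_kernel: list of k dicts, the motif's own frequency profile,
    used as non-trainable convolution weights.
    Returns one matching score per k-mer window.
    """
    k = len(motif_kernel)
    scores = []
    for start in range(len(psfm) - k + 1):
        # Dot product between each window column and the kernel column.
        score = sum(
            psfm[start + j].get(aa, 0.0) * w
            for j, col in enumerate(motif_kernel)
            for aa, w in col.items()
        )
        scores.append(score)
    return scores

def max_pool(scores):
    """Max-pooling: keep only the best motif match along the sequence."""
    return max(scores) if scores else 0.0
```

A high pooled score for a given motif kernel thus indicates that some k-mer in the protein closely matches that structural motif's profile.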
Extracting fold-specific features from CCMs by MotifDCNN
Different from the PSFM profile, the CCM contains protein tertiary structure information, describing the contact likelihood among residues. To extract more discriminative fold-specific features from CCMs, we propose a motif-based convolutional kernel consistent with the CCM representation, on which a deep convolutional network called MotifDCNN is built. The network architecture is shown in Figure 2. For detailed parameters of MotifDCNN, please refer to Supplementary Information S1.
The differences between MotifDCNN and MotifCNN are as follows: (i) the randomly initialized convolutional layer of MotifDCNN is deeper than that of MotifCNN, so as to identify local patterns in CCMs more accurately, and (ii) the convolutional kernels of MotifDCNN's motif-based convolutional layer are designed from the CCMs of the structural motifs, so as to detect the structural correlations between k-mers and structural motifs along the protein.
The convolutional layer connects the local perceptual domain $\mathbf{D}_i$ with multiple different randomly initialized convolutional kernels and multiple different motif-based convolutional kernels. The convolutional kernel works as shown in Eqs. (4–6).
Therefore, the protein input matrix passes through the convolutional layers, including a randomly initialized deep convolutional layer and a motif-based convolutional layer, to obtain two feature maps, RF and MF, as shown in Eqs. (7–8). Max-pooling layers then extract the maximum value from each feature map to form the motif features, which are fed into the fully connected layer to extract the fold-specific features of the protein.
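Analogously to the PSFM case, MotifDCNN's motif-based kernels operate on contact maps. A minimal sketch, assuming the kernel is the motif's own k×k contact map slid along the diagonal of the protein's L×L CCM (the paper's actual kernel construction may differ):

```python
def motif_conv2d_scores(ccm, motif_ccm):
    """Slide a fixed k x k motif contact-map kernel along the diagonal of
    a protein's L x L CCM, scoring how strongly each k-mer's internal
    contact pattern correlates with the motif's contact pattern.

    ccm: L x L list of lists of contact probabilities.
    motif_ccm: k x k list of lists, the motif's contact map, used as
    non-trainable convolution weights.
    """
    k = len(motif_ccm)
    scores = []
    for s in range(len(ccm) - k + 1):
        # Element-wise product of the k x k window with the motif map.
        score = sum(
            ccm[s + i][s + j] * motif_ccm[i][j]
            for i in range(k)
            for j in range(k)
        )
        scores.append(score)
    return scores
```

As in MotifCNN, these per-window scores would then be Max-pooled and passed to the fully connected layer.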
Combining fold-specific features with SVMs
As shown by DeepSVM-fold [25], the vectorization strategy based on pairwise sequence similarity scores is effective for fold recognition on the LE dataset. Therefore, we followed the DeepSVM-fold process to construct the predictor. The detailed process is as follows.
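As a sketch of this vectorization strategy, each query protein can be represented by its similarity scores against every training protein, computed on the fold-specific feature vectors. Cosine similarity is used here purely for illustration; the exact scoring function follows [25].

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_vector(query_features, train_features):
    """Represent a query protein as its pairwise similarity scores to
    every training protein; this vector is what the SVM consumes."""
    return [cosine(query_features, t) for t in train_features]
```

The dimension of the resulting vector equals the number of training proteins, so every query is embedded in the same fixed-length space regardless of its sequence length.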
Finally, an SVM [40] was used as an effective and fast supervised learning model to construct the multi-class classifier; SVMs have been widely used in bioinformatics [22, 41, 42]. In this study, the SVM classifier was implemented with the Python package scikit-learn [43] as 'svm.SVC(random_state=0, max_iter=1000, kernel="linear", decision_function_shape="ovr", gamma="auto", C=15)'.
Table 1. The performance comparison of MotifCNN-fold and DeepSVM-fold on the LE dataset

| Method | Modelsᵃ | Accuracy (%) |
|---|---|---|
| DeepSVM-fold (PSFM) | CNN-BLSTM | 48.9 |
| MotifCNN-fold (PSFM) | MotifCNN | 61.0 |
| DeepSVM-fold (CCM) | DCNN-BLSTM | 56.7 |
| MotifCNN-fold (CCM) | MotifDCNN | 60.05 |
| DeepSVM-fold (CCM, PSFM) | CNN-BLSTM, DCNN-BLSTM | 67.3 |
| MotifCNN-fold (CCM, PSFM) | MotifCNN, MotifDCNN | 72.55 |
ᵃThe parameters of MotifCNN and MotifDCNN are given in Supplementary Table S1 and Supplementary Table S2 in Supplementary Information S1, respectively. The parameters of CNN-BLSTM and DCNN-BLSTM were reported in [25].
Table 2. Performance of various methods for protein fold recognition on the LE dataset via 2-fold cross-validation

| Methods | P | M | S | Accuracy | Source |
|---|---|---|---|---|---|
| PSI-BLAST | − | + | − | 4.0% | [26] |
| HMMER | − | + | − | 4.4% | [26] |
| SAM-T98 | − | + | − | 3.4% | [26] |
| BLASTLINK | − | + | − | 6.9% | [26] |
| SSEARCH | + | − | − | 5.6% | [26] |
| SSHMM | − | − | + | 6.9% | [26] |
| THREADER | − | − | + | 14.6% | [26] |
| FUGUE | − | − | + | 12.5% | [9] |
| RAPTOR | − | + | + | 25.4% | [9] |
| SPARKS | − | + | + | 28.7% | [9] |
| SP3 | − | + | + | 30.8% | [9] |
| FOLDpro | + | + | + | 26.5% | [9] |
| HHpred | − | + | + | 25.2% | [16] |
| SP4 | − | + | + | 30.8% | [16] |
| SP5 | − | + | + | 37.9% | [16] |
| BoostThreader | − | + | + | 42.6% | [16] |
| SPARKS-X | − | + | + | 45.2% | [16] |
| RF-Fold | + | + | + | 40.8% | [16] |
| DN-Fold | + | + | + | 33.6% | [16] |
| RFDN-Fold | + | + | + | 37.7% | [16] |
| DN-FoldS | + | + | + | 33.3% | [16] |
| DN-FoldR | + | + | + | 27.4% | [16] |
| FFAS-3D | − | + | + | 35.8% | [14] |
| HH-fold | − | + | − | 42.1% | [14] |
| TA-fold | − | + | + | 53.9% | [14] |
| dRHP-PseRA | − | + | − | 34.9% | [15] |
| MT-fold | + | + | + | 59.1% | [15] |
| DeepFR (Strategy1) | − | − | + | 44.5% | [13] |
| DeepFR (Strategy2) | − | − | + | 56.1% | [13] |
| DeepFRpro (Strategy1) | − | − | + | 57.6% | [13] |
| DeepFRpro (Strategy2) | − | − | + | 66.0% | [13] |
| DeepSVM-fold | − | + | + | 67.3% | [25] |
| MotifCNN-foldᵃ | − | + | + | 72.55% | This study |
Note: The first column gives the methods. Columns 2 to 4 denote whether the method belongs to a class ('+') or not ('−'), where the classes are 'P' for sequence-based features, 'M' for MSA-based features and 'S' for predicted structure-based features. The fifth column reports the overall accuracy, and the last column reports the source of each result.
ᵃRefers to MotifCNN-fold (PSFM, CCM) in Table 1.
Figure 3. The cluster analysis chart of fold-specific features extracted by MotifDCNN based on CCMs on the LE dataset. Similar fold-specific features are clustered into similar sub-trees.
Evaluation strategies
Results and discussion
Performance of the MotifCNN-fold on LE dataset
The performance of the MotifCNN-fold approach based on different pairwise sequence similarity scores and their combinations on the LE dataset is shown in Table 1. Compared with the results of DeepSVM-fold [25], the fold-specific features extracted by the motif-based CNN models (MotifCNN and MotifDCNN) are clearly more discriminative than those extracted by the deep neural networks (CNN-BLSTM and DCNN-BLSTM) employed by DeepSVM-fold. In particular, the accuracy of MotifCNN-fold (PSFM) is 61.0%, outperforming DeepSVM-fold (PSFM) by 12.1%, and MotifCNN-fold (CCM) outperforms DeepSVM-fold (CCM) by 3.35% in terms of accuracy. The improvement of MotifCNN-fold (PSFM) is larger than that of MotifCNN-fold (CCM) because CCMs already contain the structural information of proteins. It is therefore not surprising that MotifCNN-fold achieves the best performance when combining the PSFM and CCM.
Performance comparison with competing methods on LE dataset
Thirty-two existing state-of-the-art methods were compared with the proposed method, including alignment methods (PSI-BLAST [5], HMMER [44], SAM-T98 [44], BLASTLINK [26], SSEARCH [45], SSHMM [46], THREADER [47], FUGUE [48], RAPTOR [49], SPARKS [50], SPARKS-X [51], SP3 [52], SP4 [53], SP5 [54], HHpred [55], BoostThreader [56], FFAS-3D [57], HH-fold [14] and dRHP-PseRA [58]), machine learning methods (RF-Fold [8], FOLDpro [9], DN-Fold [16] and DeepFR [13]) and ensemble methods (RFDN-Fold [16], DN-FoldS [16], DN-FoldR [16], TA-fold [14], MT-fold [15], DeepFRpro [13] and DeepSVM-fold [25]). These methods apply different predicted features for fold recognition, including sequence-based features, MSA-based features and predicted structure-based features. Their performance was evaluated on the widely used LE dataset [25] via 2-fold cross-validation, and the predictive results are shown in Table 2. From this table, we can see that MotifCNN-fold outperforms all the other competing methods in terms of accuracy; in particular, it outperforms DeepSVM-fold [25] by 5.25%. Both DeepSVM-fold and MotifCNN-fold are based on the same input features, but MotifCNN-fold clearly outperformed DeepSVM-fold, indicating that the proposed MotifCNN-based feature extraction method is the main contributor to the performance improvement.
Analysis of fold-specific features
To investigate whether the features extracted by MotifCNN-fold are truly fold-specific (that is, proteins with the same fold type share similar characteristics, while proteins with different fold types have distinct characteristics), we ran bi-clustering on the fold-specific features extracted by MotifDCNN based on CCMs.
To highlight the discriminative power of the fold-specific features extracted by the motif-based CNN models, we compared the clustering analysis of the fold-specific features for the same four fold types reported in [25]: Fold 1_23 [four-helical up-and-down bundle (47161)], Fold 1_4 [DNA/RNA-binding three-helical bundle (46688)], Fold 7_3 [Knottins (small inhibitors, toxins, lectins) (57015)] and Fold 3_1 [TIM beta/alpha-barrel (51350)]. As shown in Figure 3, the proteins in folds 1_23, 1_4, 7_3 and 3_1 are clustered into distinct sub-trees by the fold-specific features of MotifCNN-fold, whereas the proteins in folds 1_23 and 1_4 are clustered into the same sub-tree by the fold-specific features of DeepSVM-fold (Figure 4 in [25]), indicating that the fold-specific features extracted by MotifDCNN are more effective and discriminative than those extracted by DCNN-BLSTM (CCM). Furthermore, the fold-specific features tend to show similar values for similar protein folds, indicating that they are able to capture the characteristics of protein folds. For example, the 79 fold-specific features (columns 520 to 599 in Figure 4) show high scores for fold 1_23 but low scores for other folds, indicating that they incorporate the information of three structural motifs: No. 47170, No. 47175 and No. 47195 (see Table S3 in Supplementary Information S1). These structural motifs are all alpha helix structures, which capture the structural characteristics of Fold 1_23 [four-helical up-and-down bundle (47161)] [36]. The 38 important fold-specific features (columns 813 to 851 in Figure 4) for fold 3_1 incorporate the information of the following structural motifs: No. 51658, No. 51366, No. 51719, No. 51445, No. 51556, etc. (see Table S3 in Supplementary Information S1).
These motifs are all beta strand structures, reflecting the structure of fold 3_1 [TIM beta/alpha-barrel (51350)].
Conclusion
There are many types of motifs in proteins. Among them, structural motifs are highly related to protein fold recognition [36]. Therefore, in this study, the MotifCNN and MotifDCNN models were proposed to detect the structural motifs related to protein folds, based on which fold-specific features were extracted from the predicted structural information (CCMs) and from PSFMs. Experimental results showed that the fold-specific features based on structural motifs outperformed other features. These features are more discriminative because the MotifCNN and MotifDCNN models incorporate the information of structural motifs. As shown in Figure 3, some fold-specific features are highly related to specific protein folds, indicating that they can more accurately capture the patterns of protein folds and structural motifs. Feature extraction is one of the keys to constructing computational predictors in bioinformatics. Because the proposed motif-based CNN models capture the information of structural motifs, it can be anticipated that they will be applied to improve the predictive performance of other protein structure and sequence analysis tasks, such as protein remote homology detection, protein disordered region prediction, protein–protein interaction networks [59], etc.
Because the existing fold-specific features lack biological evidence and interpretability, feature extraction remains the bottleneck for improving the performance of machine learning-based methods. It is therefore important to develop efficient feature extraction methods that improve the discriminative power of the fold-specific features.
Two new feature extraction methods (MotifCNN and MotifDCNN) were proposed by incorporating the structural motifs into the CNNs, aiming to extract the more discriminative fold-specific features with biological attributes considering the evolutionary information from PSFMs and the structure information from CCMs. A new predictor called MotifCNN-fold was proposed by combining SVMs with the pairwise sequence similarity scores based on fold-specific features.
Experimental results on the benchmark dataset showed that MotifCNN-fold clearly outperformed all the other competing methods, indicating that the new feature extraction methods effectively incorporate protein structural motif information into deep neural networks.
Acknowledgements
The authors are very much indebted to the three anonymous reviewers, whose constructive comments are very helpful in strengthening the presentation of this article.
Funding
This work was supported by the National Natural Science Foundation of China (61672184, 61732012, 61822306), Fok Ying-Tung Education Foundation for Young Teachers in the Higher Education Institutions of China (161063) and Scientific Research Foundation in Shenzhen (JCYJ20180306172207178).
Chen-Chen Li is a master student at the School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China, and Harbin Institute of Technology, Shenzhen, Guangdong, China. Her expertise is in bioinformatics.
Bin Liu is a professor at the School of Computer Science and Technology, and Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China. His expertise is in bioinformatics, nature language processing and machine learning.


