Abstract

Protein fold recognition is one of the most critical tasks for exploring the structures and functions of proteins based on their primary sequence information. The existing protein fold recognition approaches rely on features reflecting the characteristics of protein folds. However, the feature extraction methods are still the bottleneck of the performance improvement of these methods. In this paper, we proposed two new feature extraction methods, called MotifCNN and MotifDCNN, which construct motif-based convolutional neural networks (CNNs) with structural motif kernels to extract more discriminative fold-specific features. The pairwise sequence similarity scores calculated based on fold-specific features are then fed into support vector machines to construct the predictor for fold recognition, and a predictor called MotifCNN-fold has been proposed. Experimental results on the benchmark dataset showed that MotifCNN-fold clearly outperformed all the other competing methods. In particular, the fold-specific features extracted by MotifCNN and MotifDCNN are more discriminative than the fold-specific features extracted by other deep learning techniques, indicating that incorporating the structural motifs into the CNN is able to capture the characteristics of protein folds.

Introduction

Protein fold recognition is one of the most important tasks for exploring the structures and functions of proteins based on their sequence information. However, identifying protein folds based only on sequence information remains a challenging problem due to their low sequence similarities (less than 25% sequence identity) [1, 2]. In this regard, researchers are exploring new approaches to solve this important and difficult task and have proposed some computational methods, which are mainly divided into three categories: alignment methods, machine learning methods and ensemble methods. Alignment methods [3, 4] focus on detecting the local and global pairwise sequence similarities, including sequence–sequence alignment methods, profile–sequence alignment methods (see [5]) and profile–profile alignment methods (see [6, 7]). Machine learning methods treat protein fold recognition as a fold-level classification task by using classical machine learning techniques (see [8–11]) or deep learning techniques (see [12]). Ensemble methods adopt the consensus strategy to integrate multiple recognition methods (see [12–18]).

All the aforementioned computational methods mainly rely on the traditional sequence-based or structure-based features or the features extracted by deep learning techniques, and have contributed to the developments of protein fold recognition. However, the feature extraction method is still the bottleneck for the performance improvement of the machine learning-based methods [1, 19–23]. It is highly desirable to combine the intelligent representation of proteins with protein biological structure information to automatically extract the discriminative features of the protein sequences.

A protein structural motif is a supersecondary structure [24], describing the connectivity between secondary structural elements. Because protein structures are more conserved than their sequences, two proteins with low sequence similarity may still share similar structural motifs. Therefore, protein structural motifs can be considered a bridge between protein structures and their sequences. In this study, we propose new fold-specific features by incorporating structural motifs into convolutional neural networks (CNNs) to explore more discriminative features for protein fold recognition. We introduce two motif-based CNN models that extract more discriminative fold-specific features with biological attributes, capturing the evolutionary information from position-specific frequency matrices (PSFMs) and the structural information from residue–residue contact maps (CCMs). Based on these fold-specific features, we constructed the feature vector from the pairwise sequence similarity scores following a recent study [25]. Combined with support vector machines (SVMs), a new computational predictor called MotifCNN-fold was established for protein fold recognition. Experimental results on a widely used benchmark dataset (LE) showed that MotifCNN-fold outperformed other competing predictors. Furthermore, we analyzed the fold-specific features and showed that the new fold-specific features extracted by motif-based CNN models are more discriminative than the fold-specific features extracted by the traditional neural network models reported in [25], such as the convolutional neural network-bidirectional long short-term memory (CNN-BLSTM) and the deep convolutional neural network-bidirectional long short-term memory (DCNN-BLSTM).

Materials and methods

Benchmark dataset

The benchmark dataset for protein fold recognition should be constructed following a rigorous criterion: proteins in the training set and test set can be in the same fold but should be from different superfamilies [15]. The LE dataset [26] is a widely used rigorous dataset based on the SCOP database, containing 976 sequences from 330 folds derived from SCOP that share less than 40% pairwise sequence identity. In this dataset, 321 proteins have at least one match at the fold level. In order to simulate the protein fold recognition task rigorously, the LE dataset was partitioned into two subsets at the fold level to ensure that any two proteins from different subsets can belong to the same fold but are from different superfamilies [1, 14].

Protein representations

For a given protein sequence P of length L, it can be expressed as follows [27–29]:
$$\begin{equation} \mathbf{P}={\mathrm{R}}_1,{\mathrm{R}}_2,\dots, {\mathrm{R}}_L, \end{equation}$$
(1)
where ${\mathrm{R}}_i$ represents the $i$-th residue and $L$ represents the length of the protein sequence.

In this study, following a previous study [25], two protein representations were employed, including residue–residue contacts (CCMs) generated by running CCMpred [30], and PSFMs [5]. The CCM contains the predicted structural information of proteins, which is a widely used representation for protein fold recognition [1, 13, 31]. The PSFM is a profile-based representation containing the evolutionary information, which is useful for analyzing proteins sharing low sequence similarities [32].

In order to generate the CCM, the target protein was searched against the uniprot20_2016_02 database with the HHblits tool [33] to generate its multiple sequence alignment (MSA). The MSA was then processed by CCMpred [30] with default parameters to analyze the coevolution between residues and to assign a contact probability to each residue–residue pair.

The PSFM was calculated from the MSA generated by PSI-BLAST [5] (E-value of 0.001, three iterations) by searching against the nrdb90 database.
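As a rough illustration of this profile-generation pipeline, the following Python sketch wraps the HHblits, CCMpred and PSI-BLAST command lines with subprocess. The file names, the alignment-conversion step and the exact option values are assumptions for illustration, not the authors' scripts.

import subprocess

def make_profiles(fasta="query.fasta"):
    # 1) Build an MSA with HHblits against uniprot20_2016_02 (three iterations assumed).
    subprocess.run(["hhblits", "-i", fasta, "-d", "uniprot20_2016_02/uniprot20_2016_02",
                    "-oa3m", "query.a3m", "-n", "3"], check=True)

    # 2) CCMpred expects a plain alignment (one aligned sequence per line);
    #    a converter from query.a3m to query.aln is assumed to have been run beforehand.
    subprocess.run(["ccmpred", "query.aln", "query.mat"], check=True)   # residue-residue contact map (CCM)

    # 3) PSI-BLAST profile against nrdb90; the 0.001 threshold is assumed to be the inclusion E-value.
    subprocess.run(["psiblast", "-query", fasta, "-db", "nrdb90",
                    "-num_iterations", "3", "-inclusion_ethresh", "0.001",
                    "-out_ascii_pssm", "query.pssm"], check=True)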

Extracting fold-specific features by motif-based CNNs

Previous studies [1, 2, 13] showed that extracting features associated with fold types is critical for improving the performance of fold recognition and that the fold-specific features automatically generated by deep learning techniques are more discriminative than traditional features. An improved method called DeepSVM-fold [25] has been proposed, which constructs three different deep neural networks to extract the fold-specific features from evolutionary information (position-specific scoring matrices (PSSMs) and position-specific frequency matrices (PSFMs)) and structural information (residue–residue contact maps (CCMs)). Deep neural networks learn the features associated with fold types by adjusting the connection strengths of their neurons, a process that filters out noise and retains only the most relevant features.

However, the learning process of deep neural networks is like a black box, and the resulting features lack biological evidence and interpretability. How to incorporate biological attributes into deep learning techniques to improve the biological evidence and interpretability of these features is the main driving force of this study. In this regard, we introduced two motif-based CNN models (MotifCNN and MotifDCNN) by incorporating the structural motifs into the CNN, aiming to extract more discriminative fold-specific features with biological attributes, considering the evolutionary information from PSFMs and the structural information from CCMs.

In order to design the motif-based CNN models, specific structural motifs were first selected, and the corresponding motif-based convolutional kernels were designed from them. The motif-based CNN models (MotifCNN and MotifDCNN) were constructed based on these motif-based convolutional kernels to extract the fold-specific features from PSFMs and CCMs, respectively. MotifCNN was established to extract the evolutionary information from PSFMs, aiming to detect whether a k-mer in a protein matches a structural motif for a specific fold. MotifDCNN was constructed to extract the structural information from CCMs, aiming to detect whether a k-mer is related to a structural motif for a specific fold.

Figure 1

The network architecture of MotifCNN to extract fold-specific features from PSFMs.

In order to extract the fold-specific features of the proteins on the benchmark dataset, a comprehensive database [25] was employed to train the models, which consists of 19 772 proteins covering 1211 fold types. All the proteins in the sequence database share <40% sequence similarity with the proteins in the LE dataset. The motif-based CNN models were implemented by the TensorFlow framework. Batch normalization [34] and the dropout technique [35] were used.

Proteins structural motif selection

Structural motifs are important for protein folds because these motifs form the common structural cores that maintain the particular spatial patterns of protein folds [36], and the prominent features of motifs in protein folds are helices or sheets. In this study, the protein structural motifs were selected from MegaMotifBase [36], a database of structural motifs in protein families and superfamilies, available at http://caps.ncbs.res.in/MegaMotifbase/sflist.html. The MegaMotifBase database provides the key properties of structural motifs, including solvent inaccessible, solvent accessible, alpha helix, beta strand, 3-10 helix, hydrogen bond to main chain amide, hydrogen bond to main chain carbonyl, disulfide bond, positive phi, etc. According to the superfamilies in the LE dataset, the structural motifs were selected and their corresponding MSAs were extracted from the MegaMotifBase database [36]. Finally, we obtained 128 structural motifs. For detailed information on the 128 structural motifs, please refer to Supplementary Information S1.

Extracting fold-specific features from PSFMs by MotifCNN

Feature extraction is a very important step in constructing a computational predictor [23, 37, 38]. In order to extract more discriminative fold-specific features from evolutionary information profiles (PSFMs), we introduced the protein structural motif kernel and designed a CNN called MotifCNN. The MotifCNN architecture is shown in Figure 1. MotifCNN contains a randomly initialized convolutional layer and a motif-based convolutional layer, whose outputs are processed by separate max-pooling layers. The pooling results are then concatenated and fed into the fully connected layer to learn the fold-specific features. The randomly initialized convolutional layer identifies local patterns in protein sequences, whereas the motif-based convolutional layer identifies whether a protein sequence contains particular structural motifs and the occurrence probability of those motifs. The protein fold-specific features are further extracted by capturing the dependency information between the motif feature patterns through the fully connected layer. For detailed parameters of MotifCNN, please refer to Supplementary Information S1.

The PSFM (see Protein representations) is fed into the input layer of MotifCNN. The convolutional layer extracts the local patterns of the protein sequence through the interaction between the local perceptual domains and multiple convolutional kernels. The local perceptual domain can be represented as Eq. (2) [39]:
$$\begin{equation} {\mathbf{D}}_i=\left[{\mathbf{v}}_i\ \ {\mathbf{v}}_{i+1}\ \ \cdots\ \ {\mathbf{v}}_{i+k-1}\right],\quad \left(0\le i\le L-k+1\right), \end{equation}$$
(2)
where $i$ is the starting position of the protein subsequence in the $i$-th local perceptual domain in the convolutional layer, and ${\mathbf{v}}_i$ represents the vector of the $i$-th residue in the protein sequence with dimension 20. $k$ is the length of the convolutional kernel and $L$ is the length of the protein sequence.

The convolution is achieved through multiple convolutional kernels, local perceptual domains and shared weights, and comprises a randomly initialized convolutional layer and a motif-based convolutional layer. In a convolutional layer, the input matrix is divided into different local perceptual domains, and the information in these perceptual domains is connected to different convolutional kernels. Specifically, the randomly initialized convolutional layer randomly initializes different convolutional kernels and automatically adjusts their weights by gradient descent to detect the local patterns of the protein sequence.

Different from the randomly initialized convolutional kernels, the convolutional kernels of the motif-based convolutional layer are converted from the protein structural motifs. The MSA of each protein structural motif is converted into an $m\times 20$ frequency matrix, which serves as the convolutional kernel $\mathbf{M}_f$ of the motif-based convolutional layer, as shown in Eq. (3):
$$\begin{equation} {\mathbf{M}}_f=\left[\begin{array}{cccc} {g}_{1,1} & {g}_{1,2} & \cdots & {g}_{1,20}\\ {g}_{2,1} & {g}_{2,2} & \cdots & {g}_{2,20}\\ \vdots & \vdots & \ddots & \vdots\\ {g}_{m,1} & {g}_{m,2} & \cdots & {g}_{m,20} \end{array}\right], \end{equation}$$
(3)
where 20 represents the number of standard amino acids, $m$ represents the length of the MSA of the motif and the element ${g}_{i,j}$ represents the normalized frequency of the $j$-th amino acid at the $i$-th position of the protein structural motif during evolution.
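As a minimal sketch of Eq. (3), the following Python code converts a motif MSA into the $m\times 20$ normalized frequency matrix used as a motif-based kernel; the amino-acid ordering and the treatment of gaps are illustrative assumptions.

import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard amino acids

def motif_msa_to_kernel(msa):
    """msa: list of equal-length aligned motif sequences; returns an (m, 20) frequency matrix."""
    m = len(msa[0])
    M = np.zeros((m, 20))
    for seq in msa:
        for i, res in enumerate(seq):
            if res in AA:                      # gaps and non-standard residues are ignored here
                M[i, AA.index(res)] += 1.0
    row_sums = M.sum(axis=1, keepdims=True)
    return M / np.maximum(row_sums, 1.0)       # normalized frequencies g_{i,j}

# Example: a toy three-sequence MSA of a length-4 motif
kernel = motif_msa_to_kernel(["ACDE", "ACDF", "SCDE"])   # shape (4, 20)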
The convolutional layer connects the local perceptual domain ${\mathbf{D}}_i$ with multiple different randomly initialized convolutional kernels and multiple different motif-based convolutional kernels. The convolutional kernels work as shown in Eqs. (4)–(6) [39]:
$$\begin{equation} {l}_i^f=\sigma \left({\mathbf{W}}_f\ast{\mathbf{D}}_i+{b}_f^1\right), \end{equation}$$
(4)
$$\begin{equation} {c}_i^f=\sigma \left({\mathbf{M}}_f\ast{\mathbf{D}}_i+{b}_f^2\right), \end{equation}$$
(5)
$$\begin{equation} \sigma (z)=\max \left(0,z\right), \end{equation}$$
(6)
where ${\mathbf{W}}_f$ is the weight of the $f$-th randomly initialized convolutional kernel, ${\mathbf{M}}_f$ is the weight of the $f$-th motif-based convolutional kernel and ${b}_f^1$ and ${b}_f^2$ are the corresponding biases. $\sigma$ is the activation function; in this work, the convolutional layers use the ReLU activation. ${l}_i^f$ represents the output of the $i$-th protein subsequence through the $f$-th randomly initialized convolutional kernel, which represents the importance of the $i$-th protein subsequence. ${c}_i^f$ represents the output of the $i$-th protein subsequence through the $f$-th motif-based convolutional kernel, representing the importance of the protein structural motif $f$ in the $i$-th protein subsequence.
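The following NumPy sketch illustrates Eqs. (4)–(6) for a single kernel: the kernel (randomly initialized or motif-based) slides over the PSFM, and each window produces one ReLU-activated score. The shapes and the bias value are illustrative assumptions.

import numpy as np

def conv_scores(psfm, kernel, bias=0.0):
    """psfm: (L, 20) profile; kernel: (k, 20) weight matrix W_f or motif kernel M_f.
    Returns the L-k+1 scores l_i^f (or c_i^f) of Eqs. (4)-(6)."""
    L, k = psfm.shape[0], kernel.shape[0]
    scores = np.empty(L - k + 1)
    for i in range(L - k + 1):
        window = psfm[i:i + k]                                  # local perceptual domain D_i of Eq. (2)
        scores[i] = max(0.0, np.sum(window * kernel) + bias)    # ReLU of Eq. (6)
    return scores

# Example with assumed sizes: a length-50 PSFM and a length-7 kernel give 44 scores
scores = conv_scores(np.random.rand(50, 20), np.random.rand(7, 20))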
Figure 2

The network architecture of MotifDCNN to extract fold-specific features from CCMs.

Therefore, the protein input matrix is passed through the convolutional layers, including a randomly initialized convolutional layer and a motif-based convolutional layer, to obtain two feature maps, RF and MF, as shown in Eqs. (7) and (8) [39]:
$$\begin{equation} \mathbf{RF}=\left[\begin{array}{@{}ccc@{}}{l}_1^1& \cdots & {l}_{L-k+1}^1\\{}\vdots & \ddots & \vdots \\{}\ {l}_1^N& \cdots &\ {l}_{L-k+1}^N\end{array}\right], \end{equation}$$
(7)
$$\begin{equation} \mathbf{MF}=\left[\begin{array}{@{}ccc@{}}{c}_1^1& \cdots & {c}_{L-k+1}^1\\{}\vdots & \ddots & \vdots \\{}\ {c}_1^S& \cdots &\ {c}_{L-k+1}^S\end{array}\right], \end{equation}$$
(8)
where $N$ represents the number of randomly initialized convolutional kernels and $S$ represents the number of motif-based convolutional kernels. Each column in a feature map is the feature vector of a certain protein subsequence. The values of RF are the importance scores of the subsequences calculated by the different randomly initialized convolutional kernels, and the values of MF reflect the occurrence probabilities of the protein structural motifs computed by the motif-based convolutional kernels.

Max-pooling layers then extract the maximum values from the two feature maps to form the motif features, which are fed into the fully connected layer to extract the fold-specific features of the protein.
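To make the architecture concrete, here is a hedged Keras sketch of a MotifCNN-like model: one trainable, randomly initialized Conv1D branch and one frozen Conv1D branch whose kernels are set to the motif frequency matrices, followed by max pooling, concatenation and a fully connected layer. The filter counts, kernel lengths, padding of motifs to a common length m, dropout rate and layer sizes are assumptions; the authors' exact parameters are given in Supplementary Information S1.

import numpy as np
import tensorflow as tf

N, k = 64, 7          # assumed number and length of randomly initialized kernels
S, m = 128, 9         # assumed number of motif kernels, padded/truncated to a common length m
motif_kernels = np.random.rand(m, 20, S).astype("float32")   # stand-in for the Eq. (3) frequency matrices

inputs = tf.keras.Input(shape=(None, 20))                              # PSFM of a variable-length protein
rand_branch = tf.keras.layers.Conv1D(N, k, activation="relu")(inputs)  # randomly initialized convolutional layer
motif_layer = tf.keras.layers.Conv1D(S, m, activation="relu", trainable=False)
motif_branch = motif_layer(inputs)                                     # motif-based convolutional layer (frozen)
pooled = tf.keras.layers.Concatenate()([
    tf.keras.layers.GlobalMaxPooling1D()(rand_branch),
    tf.keras.layers.GlobalMaxPooling1D()(motif_branch),
])
pooled = tf.keras.layers.Dropout(0.5)(pooled)                          # dropout [35]; batch normalization [34] could also be added
features = tf.keras.layers.Dense(512, activation="relu", name="fold_specific_features")(pooled)
outputs = tf.keras.layers.Dense(1211, activation="softmax")(features)  # 1211 fold types in the training database
model = tf.keras.Model(inputs, outputs)

# Set the motif-branch kernels to the structural-motif frequency matrices (bias = 0) and keep them frozen.
motif_layer.set_weights([motif_kernels, np.zeros(S, dtype="float32")])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

After training on the fold-level classification task, the activations of the "fold_specific_features" layer would serve as the fold-specific features of a protein.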

Extracting fold-specific features from CCMs by MotifDCNN

Different from the PSFM profile, the CCM contains protein tertiary structure information, describing the contact likelihood between residues. In order to extract more discriminative fold-specific features from CCMs, a motif-based convolutional kernel for CCMs was proposed, which is consistent with the CCM representation, and a deep convolutional network called MotifDCNN was built on it. The network architecture is shown in Figure 2. For detailed parameters of MotifDCNN, please refer to Supplementary Information S1.

The differences between MotifDCNN and MotifCNN are as follows: (i) the randomly initialized convolutional layers of MotifDCNN are deeper than those of MotifCNN so as to more accurately identify local patterns in CCMs and (ii) MotifDCNN designs the convolutional kernels of the motif-based convolutional layer based on the CCMs of the structural motifs so as to detect the structural correlations between k-mers and structural motifs along the protein.

The CCM (see Protein representations) is fed into the input layer of MotifDCNN, and then the convolutional layer extracts the local patterns of the protein through the interaction between the local perceptual domains and multiple convolutional kernels. The expression of the local perceptual domain is shown in Eq. (9) [39]:
$$\begin{equation} {\mathbf{D}}_i=\left[\begin{array}{@{}ccc@{}}{h}_{i,j}& \cdots & {h}_{i,j+k-1}\\{}\vdots & \ddots & \vdots \\{}{h}_{i+k-1,j}& \cdots & {h}_{i+k-1,j+k-1}\end{array}\right],\left(0\le i\le L-k+1\right), \end{equation}$$
(9)
where $i$ is the starting position of the protein subsequence in the $i$-th local perceptual domain in the convolutional layer, and ${h}_{i,j}$ represents the contact likelihood between the $i$-th residue and the $j$-th residue in the protein sequence P (Eq. (1)). $k$ is the size of the convolutional kernel and $L$ is the length of the protein sequence.
To ensure efficient convolution over the residue–residue contact map (CCM), the MSA of each structural motif was converted into an $m\times m$ CCM by running CCMpred [30] with default parameters, and the resulting CCM was treated as the convolutional kernel $\mathbf{M}_f$ of the motif-based convolutional layer in MotifDCNN, as shown in Eq. (10) [39]:
$$\begin{equation} {\mathbf{M}}_f=\left[\begin{array}{cccc} {g}_{1,1} & {g}_{1,2} & \cdots & {g}_{1,m}\\ {g}_{2,1} & {g}_{2,2} & \cdots & {g}_{2,m}\\ \vdots & \vdots & \ddots & \vdots\\ {g}_{m,1} & {g}_{m,2} & \cdots & {g}_{m,m} \end{array}\right], \end{equation}$$
(10)
where $m$ represents the length of the MSA of the motif, and the element ${g}_{i,j}$ represents the contact likelihood between the $i$-th residue and the $j$-th residue of the protein structural motif during evolution.

The convolutional layer connects the local perceptual domain ${\mathbf{D}}_i$ with multiple different randomly initialized convolutional kernels and multiple different motif-based convolutional kernels. The convolutional kernels work as shown in Eqs. (4)–(6).

Therefore, the protein input matrix is passed through the convolutional layers, including a randomly initialized deep convolutional layer and a motif-based convolutional layer, to obtain the two feature maps RF and MF, as shown in Eqs. (7) and (8). Max-pooling layers then extract the maximum values from the two feature maps to form the motif features, which are fed into the fully connected layer to extract the fold-specific features of the protein.
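Analogously, here is a minimal NumPy sketch of the motif-based convolution over a CCM: an $m\times m$ motif contact-map kernel (for example, loaded from a CCMpred output file) is slid over the $L\times L$ protein CCM and each window produces one ReLU-activated score; file names and shapes are illustrative assumptions.

import numpy as np

def motif_ccm_scores(ccm, motif_ccm, bias=0.0):
    """ccm: (L, L) protein contact map; motif_ccm: (m, m) motif kernel of Eq. (10).
    Returns an (L-m+1, L-m+1) map of ReLU-activated correlation scores."""
    L, m = ccm.shape[0], motif_ccm.shape[0]
    out = np.empty((L - m + 1, L - m + 1))
    for i in range(L - m + 1):
        for j in range(L - m + 1):
            window = ccm[i:i + m, j:j + m]                        # local perceptual domain of Eq. (9)
            out[i, j] = max(0.0, np.sum(window * motif_ccm) + bias)
    return out

# A CCMpred output matrix is whitespace-separated text and can be loaded directly, e.g.:
# motif_ccm = np.loadtxt("motif_47170.mat")   # hypothetical file name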

Combining fold-specific features with SVMs

DeepSVM-fold [25] showed that the vectorization strategy based on pairwise sequence similarity scores is effective for fold recognition on the LE dataset. Therefore, we followed the DeepSVM-fold procedure to construct the predictor. The detailed process is as follows.

First, pairwise sequence similarity scores based on the fold-specific features extracted by the motif-based CNN models were calculated on the LE dataset. The pairwise sequence similarity score of any two proteins was measured by the cosine of their fold-specific feature vectors [13, 25]:
$$\begin{equation} {S}_r\left(q,p\right)=\frac{f_q {\cdot}\, {f}_p}{\left\Vert{f}_q\right\Vert \left\Vert{f}_p\right\Vert }, \end{equation}$$
(11)
where ${f}_q$ and ${f}_p$ are the fold-specific features of protein $q$ and protein $p$ extracted by the motif-based CNN models.
Second, the vectorization strategy reported in [25] was employed to combine the fold-specific information with biological attributes, considering the evolutionary information from PSFMs and the structural information from CCMs. Given a benchmark dataset with $n$ sequences ${\{{\mathbf{X}}^i,{\mathbf{y}}^i\}}_{i=1}^n$ belonging to $c$ fold types, where ${\mathbf{X}}^i$ is the feature of the $i$-th protein sequence and ${\mathbf{y}}^i$ is the corresponding protein fold type, the feature of a given protein $q$ can be represented as a vector $\mathbf{X}$:
$$\begin{equation} \mathbf{X}=\left[{\mathbf{X}}_{\mathrm{PSFM}},{\mathbf{X}}_{\mathrm{CCM}}\right], \end{equation}$$
(12)
where ${\mathbf{X}}_r$ is defined as
$$\begin{equation} {\mathbf{X}}_r=\left[{s}_r\left(q,{p}_1\right),\cdots, {s}_r\left(q,{p}_j\right),\cdots, {s}_r\left(q,{p}_n\right)\right]\quad \left(r\in \left[\mathrm{PSFM},\mathrm{CCM}\right],\ j\in \left[1,\cdots, n\right]\right), \end{equation}$$
(13)
where ${p}_j\ (1\le j\le n)$ represents the $j$-th training sample, which may be in the same protein fold as the query sequence $q$ from the test set. ${\mathbf{X}}_r$ represents the similarity score vector computed from PSFMs or CCMs based on Eq. (11). The feature matrix of the training set consists of the feature vector $\mathbf{X}$ of each training sequence, and the feature matrix of the test set consists of the feature vector $\mathbf{X}$ of each test sequence (see [25]). When a sample is aligned with itself, the corresponding feature value was set to 1.
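A brief sketch of Eqs. (11)–(13): the fold-specific feature of a query protein is compared against every training protein by cosine similarity, and the PSFM-based and CCM-based similarity vectors are concatenated into the final feature vector. Variable names are illustrative assumptions.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))   # Eq. (11)

def similarity_vector(query_feat, train_feats, self_index=None):
    """Eq. (13): similarity scores of one query against all n training proteins."""
    x = np.array([cosine(query_feat, f) for f in train_feats])
    if self_index is not None:          # when a training sample is aligned with itself
        x[self_index] = 1.0
    return x

def build_feature(query_psfm_feat, query_ccm_feat, train_psfm_feats, train_ccm_feats):
    """Eq. (12): X = [X_PSFM, X_CCM]."""
    return np.concatenate([similarity_vector(query_psfm_feat, train_psfm_feats),
                           similarity_vector(query_ccm_feat, train_ccm_feats)])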

Finally, an SVM [40], an effective and fast supervised learning model widely used in bioinformatics [22, 41, 42], was used to construct the multi-class classifier. In this study, the SVM classifier was implemented with the Python package scikit-learn [43] as ‘svm.SVC(C=15, kernel="linear", gamma="auto", decision_function_shape="ovr", max_iter=1000, random_state=0)’.
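A hedged usage sketch of this classifier; the toy similarity feature matrices and fold labels below are stand-ins for the vectors of Eq. (12).

import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score

# Toy stand-ins for the similarity feature matrices (Eq. 12) and fold labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((100, 200)), rng.integers(0, 5, 100)
X_test, y_test = rng.random((40, 200)), rng.integers(0, 5, 40)

clf = svm.SVC(C=15, kernel="linear", gamma="auto",
              decision_function_shape="ovr", max_iter=1000, random_state=0)
clf.fit(X_train, y_train)                      # rows: proteins; columns: similarity scores
print(accuracy_score(y_test, clf.predict(X_test)))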

Table 1

The performance comparison of MotifCNN-fold and DeepSVM-fold on LE dataset

Method                        Models^a                  Accuracy (%)
DeepSVM-fold (PSFM)           CNN-BLSTM                 48.9
MotifCNN-fold (PSFM)          MotifCNN                  61.0
DeepSVM-fold (CCM)            DCNN-BLSTM                56.7
MotifCNN-fold (CCM)           MotifDCNN                 60.05
DeepSVM-fold (CCM, PSFM)      CNN-BLSTM, DCNN-BLSTM     67.3
MotifCNN-fold (CCM, PSFM)     MotifCNN, MotifDCNN       72.55

^a The parameters of MotifCNN and MotifDCNN are respectively given in Supplementary Table S1 and Supplementary Table S2 in Supplementary Information S1. The parameters of CNN-BLSTM and DCNN-BLSTM were reported in [25].


Table 2

Performance of various methods for protein fold recognition on LE dataset via 2-fold cross-validation

Methods                   Features   Accuracy   Source
PSI-BLAST                 +          4.0%       [26]
HMMER                     +          4.4%       [26]
SAM-T98                   +          3.4%       [26]
BLASTLINK                 +          6.9%       [26]
SSEARCH                   +          5.6%       [26]
SSHMM                     +          6.9%       [26]
THREADER                  +          14.6%      [26]
FUGUE                     +          12.5%      [9]
RAPTOR                    ++         25.4%      [9]
SPARKS                    ++         28.7%      [9]
SP3                       ++         30.8%      [9]
FOLDpro                   +++        26.5%      [9]
HHpred                    ++         25.2%      [16]
SP4                       ++         30.8%      [16]
SP5                       ++         37.9%      [16]
BoostThreader             ++         42.6%      [16]
SPARKS-X                  ++         45.2%      [16]
RF-Fold                   +++        40.8%      [16]
DN-Fold                   +++        33.6%      [16]
RFDN-Fold                 +++        37.7%      [16]
DN-FoldS                  +++        33.3%      [16]
DN-FoldR                  +++        27.4%      [16]
FFAS-3D                   ++         35.8%      [14]
HH-fold                   +          42.1%      [14]
TA-fold                   ++         53.9%      [14]
dRHP-PseRA                +          34.9%      [15]
MT-fold                   +++        59.1%      [15]
DeepFR (Strategy1)        +          44.5%      [13]
DeepFR (Strategy2)        +          56.1%      [13]
DeepFRpro (Strategy1)     +          57.6%      [13]
DeepFRpro (Strategy2)     +          66.0%      [13]
DeepSVM-fold              ++         67.3%      [25]
MotifCNN-fold^a           ++         72.55%     This study

Note: The first column gives the methods. The Features column marks with '+' the feature classes used by each method, among 'P' for sequence-based features, 'M' for MSA-based features and 'S' for predicted structure-based features. The Accuracy column reports the overall accuracy, and the last column reports the result source of each method.

^a Refers to the MotifCNN-fold (PSFM, CCM) in Table 1.


Figure 3

The cluster analysis chart of fold-specific features extracted by MotifDCNN based on CCMs on LE dataset. Similar fold-specific features will be clustered into similar sub-trees.

Evaluation strategies

To rigorously simulate the fold recognition task, it must be ensured that any two proteins from different subsets can be in the same fold but not in the same superfamily. Therefore, the LE dataset was partitioned into two subsets at the fold level so that any two proteins from different subsets could belong to the same fold but be from different superfamilies [14, 15]. Then, 2-fold cross-validation was employed to evaluate the performance of the various methods on the LE dataset [25]. The performance of the various methods was evaluated by the overall accuracy [15, 25]:
$$\begin{align} \mathrm{Accuracy}=\frac{CN}{N}\times 100\%, \end{align}$$
(14)
where CN is the number of protein samples correctly classified into their fold types and N is the size of the test dataset.
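As a small sketch of this protocol, assuming the two fold-level subsets and a training/prediction routine are already available, the two subsets are swapped as training and test sets and the overall accuracy of Eq. (14) is averaged; the helper function is hypothetical.

def overall_accuracy(predicted, true):
    """Eq. (14): percentage of test proteins assigned the correct fold type."""
    correct = sum(p == t for p, t in zip(predicted, true))
    return 100.0 * correct / len(true)

def two_fold_cv(subset_a, subset_b, train_and_predict):
    """subset_a / subset_b: (features, fold_labels) built at the fold level;
    train_and_predict(train_set, test_features) is an assumed helper returning predicted folds."""
    acc_1 = overall_accuracy(train_and_predict(subset_a, subset_b[0]), subset_b[1])
    acc_2 = overall_accuracy(train_and_predict(subset_b, subset_a[0]), subset_a[1])
    return (acc_1 + acc_2) / 2.0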

Results and discussion

Performance of the MotifCNN-fold on LE dataset

The performance of the MotifCNN-fold approach based on different pairwise sequence similarity scores and their combinations on the LE dataset is shown in Table 1. Compared with the results of DeepSVM-fold [25], the fold-specific features extracted by the motif-based CNN models (MotifCNN and MotifDCNN) are clearly more discriminative than the fold-specific features extracted by the deep neural networks (CNN-BLSTM and DCNN-BLSTM) employed by DeepSVM-fold. In particular, the accuracy of MotifCNN-fold (PSFM) is 61%, outperforming DeepSVM-fold (PSFM) by 12.1%, and MotifCNN-fold (CCM) outperforms DeepSVM-fold (CCM) by 3.35% in terms of accuracy. Furthermore, the improvement of MotifCNN-fold (PSFM) is more obvious than that of MotifCNN-fold (CCM); the reason is that CCMs already contain the structural information of proteins. It is therefore not surprising that MotifCNN-fold achieves the best performance when combining the PSFM and CCM.

Performance comparison with competing methods on LE dataset

Thirty-two existing state-of-the-art methods were compared with the proposed method, including alignment methods (PSI-BLAST [5], HMMER [44], SAM-T98 [44], BLASTLINK [26], SSEARCH [45], SSHMM [46], THREADER [47], FUGUE [48], RAPTOR [49], SPARKS [50], SPARKS-X [51], SP3 [52], SP4 [53], SP5 [54], HHpred [55], BoostThreader [56], FFAS-3D [57], HH-fold [14] and dRHP-PseRA [58]), machine learning methods (RF-Fold [8], FOLDpro [9], DN-Fold [16] and DeepFR [13]) and ensemble methods (RFDN-Fold [16], DN-FoldS [16], DN-FoldR [16], TA-fold [14], MT-fold [15], DeepFRpro [13] and DeepSVM-fold [25]). These methods apply different predicted features for fold recognition, including sequence-based features, MSA-based features and predicted structure-based features. Their performance was evaluated on the widely used LE dataset [25]. Evaluated by 2-fold cross-validation, their predictive results on the LE dataset are shown in Table 2. From this table, we can see that MotifCNN-fold outperforms all the other competing methods in terms of accuracy. Furthermore, MotifCNN-fold outperforms DeepSVM-fold [25] by 5.25%. Both DeepSVM-fold and MotifCNN-fold are based on the same protein representations, but the proposed MotifCNN-fold clearly outperformed DeepSVM-fold, indicating that the proposed feature extraction method based on the motif-based CNNs mainly contributes to the performance improvement.

Analysis of fold-specific features

To investigate whether the fold-specific features extracted by MotifCNN-fold are indeed fold-specific, that is, proteins with the same fold type share similar characteristics while proteins with different fold types have distinct characteristics, we ran bi-clustering on the fold-specific features extracted by MotifDCNN based on CCMs.

In order to highlight the discriminative power of the fold-specific features extracted by the motif-based CNN models, we compared, for comparison purposes, the clustering analysis of the fold-specific features of the same four fold types reported in [25], including Fold 1_23 [four-helical up-and-down bundle (47161)], Fold 1_4 [DNA/RNA-binding three-helical bundle (46688)], Fold 7_3 [Knottins (small inhibitors, toxins, lectins) (57015)] and Fold 3_1 [TIM beta/alpha-barrel (51350)]. As shown in Figure 3, the proteins in folds 1_23, 1_4, 7_3 and 3_1 are clustered into distinct sub-trees based on the MotifCNN-fold fold-specific features, whereas the proteins in folds 1_23 and 1_4 are clustered into the same sub-tree based on the DeepSVM-fold fold-specific features (Figure 4 in [25]), indicating that the fold-specific features extracted by MotifDCNN are more effective and discriminative than those extracted by DCNN-BLSTM (CCM). Furthermore, the fold-specific features tend to show similar values for similar protein folds, indicating that they are able to capture the characteristics of protein folds; for example, the 79 fold-specific features (from column 520 to 599 in Figure 4) show high scores for fold 1_23 but low scores for the other folds, indicating that these fold-specific features incorporate the information of three structural motifs: No. 47170, No. 47175 and No. 47195 (see Table S3 in Supplementary Information S1). These structural motifs are all alpha-helix structures, which are able to capture the structural characteristics of Fold 1_23 [four-helical up-and-down bundle (47161)] [36]. The 38 important fold-specific features (from column 813 to 851 in Figure 4) for fold 3_1 incorporate the information of the following structural motifs: No. 51658, No. 51366, No. 51719, No. 51445, No. 51556, etc. (see Table S3 in Supplementary Information S1). These motifs are all beta-strand structures, reflecting the structure of fold 3_1 [TIM beta/alpha-barrel (51350)].

Conclusion

There are many types of motifs in proteins. Among these different motif types, the structural motifs are highly related to protein fold recognition [36]. Therefore, in this study, the MotifCNN and MotifDCNN models were proposed to detect the structural motifs related to the protein folds, based on which the fold-specific features were extracted from the predicted structural information (CCMs) and the evolutionary profiles (PSFMs). Experimental results showed that the fold-specific features based on structural motifs outperformed other features. The reason for the stronger discriminative power of these fold-specific features is that the features extracted by the MotifCNN and MotifDCNN models incorporate the information of structural motifs. As shown in Figure 3, some fold-specific features are highly related to specific protein folds, indicating that these features can more accurately capture the patterns of protein folds and structural motifs. Feature extraction is one of the keys in constructing computational predictors in the field of bioinformatics. Because the proposed motif-based CNN models are able to capture the information of the structural motifs, it can be anticipated that they will be applied to improve the predictive performance of other protein structure and sequence analysis tasks, such as protein remote homology detection, protein disordered region prediction, protein–protein interaction networks [59], etc.

Key Points
  • Because the existing fold-specific features lack biological evidence and interpretability, the feature extraction method is still the bottleneck for the performance improvement of the machine learning-based methods. Therefore, it is important to develop efficient feature extraction methods to improve the discriminative power of the fold-specific features.

  • Two new feature extraction methods (MotifCNN and MotifDCNN) were proposed by incorporating the structural motifs into the CNNs, aiming to extract the more discriminative fold-specific features with biological attributes considering the evolutionary information from PSFMs and the structure information from CCMs. A new predictor called MotifCNN-fold was proposed by combining SVMs with the pairwise sequence similarity scores based on fold-specific features.

  • Experimental results on the benchmark dataset showed that MotifCNN-fold obviously outperformed all the other competing methods, indicating that new feature extraction methods are effective for incorporating the protein structural motif information with deep neural networks.

Acknowledgements

The authors are very much indebted to the three anonymous reviewers, whose constructive comments are very helpful in strengthening the presentation of this article.

Funding

This work was supported by the National Natural Science Foundation of China (61672184, 61732012, 61822306), Fok Ying-Tung Education Foundation for Young Teachers in the Higher Education Institutions of China (161063) and Scientific Research Foundation in Shenzhen (JCYJ20180306172207178).

Chen-Chen Li is a master student at the School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China, and Harbin Institute of Technology, Shenzhen, Guangdong, China. Her expertise is in bioinformatics.

Bin Liu is a professor at the School of Computer Science and Technology, and Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China. His expertise is in bioinformatics, nature language processing and machine learning.

References

1. Liu B, Li S. ProtDet-CCH: protein remote homology detection by combining long short-term memory and ranking methods. IEEE/ACM Trans Comput Biol Bioinform 2019;16:1203–10.
2. Liu B, Zhu Y. ProtDec-LTR3.0: protein remote homology detection by incorporating profile-based features into learning to rank. IEEE Access 2019;7:102499–507.
3. Chen J, Guo M, Wang X, et al. A comprehensive review and comparison of different computational methods for protein remote homology detection. Brief Bioinform 2018;9:231–44.
4. Zou Q, Hu Q, Guo M, et al. HAlign: fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. Bioinformatics 2015;31:2475–81.
5. Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–402.
6. Soding J. Protein homology detection by HMM–HMM comparison. Bioinformatics 2005;21:951–60.
7. Ma JZ, Wang S, Wang ZY, et al. MRFalign: protein homology detection through alignment of Markov random fields. PLoS Comput Biol 2014;10:e1003500.
8. Jo T, Cheng J. Improving protein fold recognition by random forest. BMC Bioinformatics 2014;15(Suppl 11):S14.
9. Cheng J, Baldi P. A machine learning information retrieval approach to protein fold recognition. Bioinformatics 2006;22:1456–63.
10. Liu B, Chen J, Guo M, et al. Protein remote homology detection and fold recognition based on sequence-order frequency matrix. IEEE/ACM Trans Comput Biol Bioinform 2019;16:292–300.
11. Wei L, Zou Q. Recent progress in machine learning-based methods for protein fold recognition. Int J Mol Sci 2016;17:2118.
12. Hou J, Adhikari B, Cheng J. DeepSF: deep convolutional neural network for mapping protein sequences to folds. Bioinformatics 2018;34:1295–303.
13. Zhu J, Zhang H, Li SC, et al. Improving protein fold recognition by extracting fold-specific features from predicted residue–residue contacts. Bioinformatics 2017;33:3749–57.
14. Xia JQ, Peng ZL, Qi DW, et al. An ensemble approach to protein fold classification by integration of template-based assignment and support vector machine classifier. Bioinformatics 2017;33:863–70.
15. Yan K, Fang X, Xu Y, et al. Protein fold recognition based on multi-view modeling. Bioinformatics 2019;35:2985–90.
16. Jo T, Hou J, Eickholt J, et al. Improving protein fold recognition by deep learning networks. Sci Rep 2015;5:17573.
17. Lin C, Zou Y, Qin J, et al. Hierarchical classification of protein folds using a novel ensemble classifier. PLoS One 2013;8:e56499.
18. Chen W, Liu X, Huang Y, et al. Improved method for predicting protein fold patterns with ensemble classifiers. Genet Mol Res 2012;11:174–81.
19. Liu B, Jiang S, Zou Q. HITS-PR-HHblits: protein remote homology detection by combining PageRank and hyperlink-induced topic search. Brief Bioinform, doi: 10.1093/bib/bby104.
20. Xu A, Chen J, Peng H, et al. Simultaneous interrogation of cancer omics to identify subtypes with significant clinical differences. Front Genet 2019;10:236.
21. Chen J, Han G, Xu A, et al. Identification of multidimensional regulatory modules through multi-graph matching with network constraints. IEEE Trans Biomed Eng, doi: 10.1109/TBME.2019.2927157.
22. Liu B, Li K. iPromoter-2L2.0: identifying promoters and their types by combining smoothing cutting window algorithm and sequence-based features. Mol Ther Nucleic Acids, doi: 10.1016/j.omtn.2019.08.008.
23. Liu B, Chen S, Yan K, et al. iRO-PsekGCC: identify DNA replication origins based on pseudo k-tuple GC composition. Front Genet, doi: 10.3389/fgene.2019.00842.
24. Chiang YS, Gelfand TI, Kister AE, et al. New classification of supersecondary structures of sandwich-like proteins uncovers strict patterns of strand assemblage. Proteins 2007;68:915–21.
25. Liu B, Yi J, SV A, et al. QChIPat: a quantitative method to identify distinct binding patterns for two biological ChIP-seq samples in different experimental conditions. BMC Genomics 2013;14:S3.
26. Lindahl E, Elofsson A. Identification of related proteins on family, superfamily and fold level. J Mol Biol 2000;295:613–25.
27. Liu B, Gao X, Zhang H. BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA, and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Res 2019;47:e127.
28. Yang H, Tang H, Chen XX, et al. Identification of secretory proteins in Mycobacterium tuberculosis using pseudo amino acid composition. Biomed Res Int 2016;2016:5413903.
29. Chen XX, Tang H, Li WC, et al. Identification of bacterial cell wall lyases via pseudo amino acid composition. Biomed Res Int 2016;2016:1654623.
30. Seemayer S, Gruber M, Soding J. CCMpred—fast and precise prediction of protein residue–residue contacts from correlated mutations. Bioinformatics 2014;30:3128–30.
31. Liu B. BioSeq-Analysis: a platform for DNA, RNA, and protein sequence analysis based on machine learning approaches. Brief Bioinform, doi: 10.1093/bib/bbx165.
32. Rangwala H, Karypis G. Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics 2005;21:4239–47.
33. Remmert M, Biegert A, Hauser A, et al. HHblits: lightning-fast iterative protein sequence searching by HMM–HMM alignment. Nat Methods 2011;9:173–5.
34. Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, 2015, pp. 448–56. ACM Digital Library, New York, USA.
35. Srivastava N, Hinton G, Krizhevsky A, et al. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014;15:1929–58.
36. Pugalenthi G, Suganthan PN, Sowdhamini R, et al. MegaMotifBase: a database of structural motifs in protein families and superfamilies. Nucleic Acids Res 2008;36:D218–21.
37. Tan JX, Li SH, Zhang ZM, et al. Identification of hormone-binding proteins based on machine learning methods. Math Biosci Eng 2019;16:2466–80.
38. Tang H, Chen W, Lin H. Identification of immunoglobulins using Chou's pseudo amino acid composition with feature selection technique. Mol BioSyst 2016;12:1269–75.
39. Liu B, Fang L, Liu F, et al. iMiRNA-PseDPC: microRNA precursor identification with a pseudo distance-pair composition approach. J Biomol Struct Dyn 2016;34:220–32.
40. Suykens JAK, Vandewalle J. Least squares support vector machine classifiers. Neural Process Lett 1999;9:293–300.
41. Li D, Ju Y, Zou Q. Protein folds prediction with hierarchical structured SVM. Curr Proteomics 2016;13:79–85.
42. Chen W, Lv H, Nie F, et al. i6mA-Pred: identifying DNA N6-methyladenine sites in the rice genome. Bioinformatics 2019;35:2796–2800.
43. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12:2825–30.
44. Karplus K, Barrett C, Hughey R. Hidden Markov models for detecting remote protein homologies. Bioinformatics 1998;14:846–56.
45. Pearson WR. Comparison of methods for searching protein sequence databases. Protein Sci 1995;4:1145–60.
46. Hargbo J, Elofsson A. Hidden Markov models that use predicted secondary structures for fold recognition. Proteins 1999;36:68–76.
47. Jones DT, Taylor WR, Thornton JM. A new approach to protein fold recognition. Nature 1992;358:86–9.
48. Shi JY, Blundell TL, Mizuguchi K. FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol 2001;310:243–57.
49. Xu J, Li M, Kim D, et al. RAPTOR: optimal protein threading by linear programming. J Bioinform Comput Biol 2003;1:95–117.
50. Zhou H, Zhou Y. Single-body residue-level knowledge-based energy score combined with sequence profile and secondary structure information for fold recognition. Proteins 2004;55:1005–13.
51. Yang YD, Faraggi E, Zhao HY, et al. Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of query and corresponding native properties of templates. Bioinformatics 2011;27:2076–82.
52. Zhou H, Zhou Y. Fold recognition by combining sequence profiles derived from evolution and from depth-dependent structural alignment of fragments. Proteins 2005;58:321–8.
53. Liu S, Zhang C, Liang SD, et al. Fold recognition by concurrent use of solvent accessibility and residue depth. Proteins 2007;68:636–45.
54. Zhang W, Liu S, Zhou Y. SP5: improving protein fold recognition by using torsion angle profiles and profile-based gap penalty model. PLoS One 2008;3:e2325.
55. Soding J, Biegert A, Lupas AN. The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res 2005;33:W244–8.
56. Peng J, Xu J. Boosting protein threading accuracy. Res Comput Mol Biol 2009;5541:31–45.
57. Xu D, Jaroszewski L, Li Z, et al. FFAS-3D: improving fold recognition by including optimized structural features and template re-ranking. Bioinformatics 2014;30:660–7.
58. Chen J, Long R, Wang XL, et al. dRHP-PseRA: detecting remote homology proteins using profile-based pseudo protein sequence and rank aggregation. Sci Rep 2016;6:32333.
59. Chen J, Peng H, Han G, et al. HOGMMNC: a higher-order graph matching with multiple network constraints model for gene–drug regulatory modules identification. Bioinformatics 2019;35:602–10.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
