Loris Nanni, Sheryl Brahnam, Set of approaches based on 3D structure and position specific-scoring matrix for predicting DNA-binding proteins, Bioinformatics, Volume 35, Issue 11, June 2019, Pages 1844–1851, https://doi.org/10.1093/bioinformatics/bty912
Abstract
Because DNA-binding proteins (DNA-BPs) play a vital role in all aspects of genetic activity, the development of reliable and efficient systems for automatic DNA-BP classification is becoming a crucial proteomic technology. Key to this technology is the discovery of powerful protein representations and feature extraction methods. The goal of this article is to develop experimentally a system for automatic DNA-BP classification by comparing and combining different descriptors taken from different types of protein representations.
The descriptors we evaluate include those starting from the position-specific scoring matrix (PSSM) of proteins, those derived from the amino-acid sequence (AAS), various matrix representations of proteins and features taken from the three-dimensional tertiary structure of proteins. We also introduce some new variants of protein descriptors. Each descriptor is used to train a separate support vector machine (SVM), and results are combined by sum rule. Our final system obtains state-of-the-art results on three benchmark DNA-BP datasets.
The MATLAB code for replicating the experiments presented in this paper is available at https://github.com/LorisNanni.
1 Introduction
DNA-binding proteins (DNA-BPs) play key roles in all aspects of genetic activity. Identifying these proteins, however, poses a major challenge, as traditional methods are time consuming and expensive. The development of automatic machine learning (ML) methods that quickly and accurately identify such proteins is rapidly becoming a critical proteomic technology.
According to Chou (2011), for the sake of clarity and practicality, a ML process designed to predict a protein should be documented in terms of the following five procedures: (i) the construction of a benchmark dataset for testing and training ML predictors, (ii) the formulation of discrete protein representations appropriate for the prediction problem, (iii) the development of ML approaches to perform the prediction, (iv) the evaluation of the accuracy of the proposed methods according to fair testing protocols and (v) the establishment of user-friendly web-servers that are accessible to the public.
Because the performance of a computational model is highly dependent on the power of its feature representation and because many different protein representations have been proposed in the literature, it is important to investigate which of these and their combinations are most suited for DNA-BP prediction. For this reason, the focus of this study is on Chou’s second procedure, the formulation of powerful discrete numerical representations of proteins.
ML feature representations of proteins are broadly classified based on two types of protein models: sequence-based models and structure-based models. Representations based on structural models rely on such information as the high-resolution 3D structure of protein sequences. Nimrod et al. (2010), for example, demonstrate how structural characteristics of proteins can be computed for DNA-BP identification from their average surface electrostatic potentials, dipole moments and cluster-based amino acid conservation patterns. For a recent survey of representations based on structural models, see Xiong et al. (2018). Many structure-based methods also combine structural information with sequential features for identifying DNA-BP. In Zhang and Liu (2017), for instance, different features extracted from the position-specific frequency matrix are tested for DNA-BP classification. Unfortunately, structural information of proteins is not always available, making methods based on this model unsuitable for predicting protein sequences without known structural information.
Representations based on sequence models are usually easier to extract since they are based on extracting features from the simple amino acid composition (AAC). The vector representation of AAC (Nakashima et al., 1986) is of length 20 since it represents proteins as the normalized occurrence frequencies of the 20 native amino acids. Pseudo amino acid composition (PseAAC) (Chou, 2001, 2009), which has become one of the most popular protein vector representations, expands AAC by retaining additional information embedded in protein sequences, such as the protein's sequential order, using, among other modes, a series of rank-different correlation factors along a protein chain. Sixteen variants of PseAAC are described in Chou (2009), along with a historical account and practical guide for using them. In terms of DNA-BP identification, Rahman et al. (2018) recently used random forest to rank PseAAC features and then recursive feature elimination to extract an optimal set of features that are trained using a support vector machine (SVM) with a linear kernel, and Liu et al. (2014) combined PseAAC features with a physicochemical distance transformation (Liu et al., 2015) using a reduced alphabet approach to improve prediction and decrease computation time.
Additional approaches based on the AAC discrete model include dipeptide (Lin and Ding, 2011; Nanni and Lumini, 2008; Waris et al., 2016), tripeptide (Ding et al., 2012) and tetrapeptide (Lin et al., 2013) composition, where each protein is represented by a vector of length 20^n (e.g. 400 for dipeptides) that contains the normalized occurrence frequencies of each possible n-peptide. Nanni and Lumini (2009) developed a multi-classifier approach based on grouped weight and a genetic algorithm for selecting a set of reduced alphabets taken directly from the AAC. There are also protein representations that rely on physicochemical properties (Liu et al., 2015; Song et al., 2014; Xu et al., 2014).
One other important approach based on the AAC model incorporates the evolutionary information embedded in sequence profiles that are automatically generated by position-specific iterated BLAST (PSI-BLAST) (Altschul et al., 1997). Many studies have demonstrated the importance of including evolutionary features for improving DNA-BP prediction (Chowdhury et al., 2017; Liu et al., 2015; Wei et al., 2017; Xu et al., 2015), with Liu et al. (2015) integrating PseAAC with profile-based evolutionary information retrieved by PSI-BLAST and discovering that negative samples in the training model improve prediction.
Several powerful features have been derived from the position-specific scoring matrix (PSSM) (Gribskov et al., 1987). PSSM describes a protein starting from the evolutionary information contained in a PSI-BLAST similarity search; see Nanni et al. (2013) for a survey of research using protein descriptors extracted from PSSM. In Waris et al. (2016), for instance, DNA-BP prediction is improved using a combination of features extracted from dipeptide composition, split AAC and PSSM. Wang et al. (2017) show improved performance by combining a 200-dimensional normalized Moreau-Broto autocorrelation feature vector (Feng and Zhang, 2000) with a 1040-dimensional feature vector called PSSM-DWT (PSSM compressed by a discrete wavelet transform) and a 100-dimensional PSSM-DCT (PSSM compressed by a discrete cosine transform). Nanni et al. (2013) combine multiple matrix representations starting from PSSM into a high-performing general protein ensemble.
Finally, several studies have demonstrated the power of extracting texture descriptors from matrix representations of proteins that are treated as images (Kavianpour and Vasighi, 2017; Li et al., 2018; Nanni et al., 2010b, 2012). In Nanni et al. (2010b), for instance, the authors extract and combine sets of well-known texture descriptors (e.g. variants of local binary patterns and features based on the Radon transform and Haralick descriptors) from the protein backbone image, demonstrating a significant improvement in classification rates on datasets for protein fold recognition, DNA-BP recognition, and biological process and molecular function recognition. In Nanni et al. (2012), the authors analyze and compare several feature extraction methods used in protein classification that are based on the calculation of texture descriptors starting from a wavelet representation of the protein.
The main objective of this study is to search for an ensemble of protein features that works well across different DNA-BP classification datasets. To accomplish this objective, we investigate several state-of-the-art protein descriptors: those based on the AAC model, different types of matrix representations of the protein (such as PSSM) and the 3D tertiary structure of the protein. As the result of our investigation, we develop a powerful ensemble of representations for DNA-BP identification, obtaining state-of-the-art performances across the benchmark datasets.
Although all MATLAB source codes used in this article are provided (see abstract for the URL), it should also be noted that most of the protein feature extraction methods used in this article can also be generated using such user-friendly tools as Pse-in-One 2.0 (Liu et al., 2017), BioSeq-Analysis (Liu, 2017) and Pse-Analysis (Liu et al., 2017). These tools take a given benchmark dataset as input and construct an optimized predictor based on samples taken from the dataset. Of note is the web-server Pse-in-One 2.0, which provides a powerful series of feature analysis approaches. Some additional PseAAC web servers include PseAAC-Builder (Du et al., 2012), propy (Cao et al., 2013) and PseAAC-General (Du et al., 2014).
2 Materials and methods
2.1 Machine learning approach for protein classification
As noted in the introduction, much recent research has focused on finding a compact and effective representation of proteins, one that ideally produces a fixed-length descriptor so that the classification problem can be solved by a ML approach. In this work, we explore several solutions evaluated based on a general representation that can be used with an ensemble of general-purpose classifiers, such as an ensemble of SVMs.
From each of the protein representations (described below in Section 2.2), several different types of features (as detailed in Section 2.3) are extracted. Some feature extraction methods are applied multiple times, once for each of the physicochemical properties considered in the extraction process. The set of physicochemical properties is obtained from the amino acid index database (Kawashima and Kanehisa, 1999) available at http://www.genome.jp/dbget/aaindex.html. An amino acid index is a set of 20 numerical values representing a given physicochemical property of the amino acids. We ignore properties where the amino acids have a value of 0 or 1. The amino acid index database currently contains 566 indices and 94 substitution matrices; but, as noted below, a reduced number of properties is sufficient for a protein classification task.
Once a descriptor is extracted from a protein representation, it is used to train an SVM as implemented in the LibSVM toolbox (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). In the experiments reported in this work, all features used for training an SVM are linearly normalized to [0, 1] based on the training data. In each dataset, the SVM is tuned considering only the training data using a grid search approach, which means, in effect, that the test is blind.
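The train-only normalization step can be sketched as follows (an illustrative Python sketch, not the paper's MATLAB code; the function names are our own):

```python
import numpy as np

def fit_minmax(train):
    """Learn per-feature minimum and range from the training data only."""
    lo = train.min(axis=0)
    rng = train.max(axis=0) - lo
    rng[rng == 0] = 1.0  # guard against division by zero for constant features
    return lo, rng

def apply_minmax(X, lo, rng):
    """Map features toward [0, 1] using training statistics; test values may fall outside."""
    return (X - lo) / rng

train = np.array([[0.0, 10.0], [2.0, 30.0], [4.0, 20.0]])
test = np.array([[1.0, 40.0]])
lo, rng = fit_minmax(train)
print(apply_minmax(train, lo, rng))
print(apply_minmax(test, lo, rng))  # second feature exceeds 1: statistics come from training only
```

Because the minimum and range are fitted on the training set alone, the test remains blind, as described above.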
The ensemble approaches presented here are based on the fusion of different descriptors; the final decision is obtained by combining the pool of SVMs by sum rule or weighted sum rule. In the standard sum rule, all methods have equal weight; in the weighted sum rule, selected approaches are given larger weights.
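The two fusion rules can be sketched as follows (illustrative Python, not the authors' MATLAB implementation; the example scores are invented):

```python
import numpy as np

def weighted_sum_rule(score_lists, weights=None):
    """Combine per-classifier score vectors; equal weights give the plain sum rule."""
    S = np.asarray(score_lists, dtype=float)  # shape: (n_classifiers, n_samples)
    if weights is None:
        weights = np.ones(len(S))
    return np.tensordot(weights, S, axes=1)

svm_a = [0.9, 0.2, 0.6]
svm_b = [0.7, 0.4, 0.1]
print(weighted_sum_rule([svm_a, svm_b]))            # plain sum rule
print(weighted_sum_rule([svm_a, svm_b], [1, 0.5]))  # e.g. an "A + 0.5 x B" weighting
```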
2.2 Protein representations
2.2.1 Amino acid sequence
As noted in the introduction, representations based on the sequential model are derived from the AAS, which can be described as the linear sequence P = p1 p2 … pN, where each pi ∈ A, with A the set of 20 native amino acid types. Many studies (see Kawashima and Kanehisa, 1999) have shown that the AAS coupled with other information related to the physicochemical properties of amino acids produces many powerful descriptors.
2.2.2 Matrix representation: position-specific scoring matrix
PSSM, first proposed in Gribskov et al. (1987), is a matrix representation for proteins that is obtained from a group of sequences previously aligned by structural or sequence similarity. PSSM is calculated using PSI-BLAST, an application that compares PSSM profiles for detecting remotely related homologous proteins or DNA.
PSSM considers the following parameters:
Position, which is the index of each amino acid residue in a sequence after multiple sequence alignments.
Probe, which is a group of typical sequences of functionally related proteins that have already been aligned by structural or sequence similarity.
Profile, which is a matrix of 20 columns that correspond to the 20 amino acids.
Consensus, which is the sequence of amino acid residues that are most similar to all the alignment residues of the probes at each position; the consensus sequence is calculated by selecting the highest score at each position in the profile.
The PSSM representation for a given protein of length N is a matrix of dimension N × 20, where each element E(i, j) represents the occurrence probability of the jth amino acid at position i of the protein sequence. The rows of the matrix represent the positions of the sequence, and the columns represent the 20 types of the original amino acids. The elements of E are calculated as E(i, j) = Σ_{k=1..20} w(i, k) × Y(k, j), where w(i, k) is the ratio between the frequency of the kth amino acid at position i of the probe and the total number of probes, and Y(k, j) is the value of Dayhoff's mutation matrix between the kth and jth amino acids (in other words, Y is a substitution matrix, i.e. a matrix that describes the rate at which one character in the protein changes into another over time).
PSSM scores are normally positive or negative integers. Small values of E(i, j) indicate weakly conserved positions and large values indicate strongly conserved positions. Thus, the elements of a PSSM profile can be used to approximate the occurrence probability of the corresponding amino acid at a specific position.
In this study, PSI-BLAST is called from MATLAB to create the PSSM scores for each protein sequence using the following command (where input.txt is the protein sequence, and output.txt contains the PSSM matrix):
system('blastpgp.exe -i input.txt -d swissprot -Q output.txt -j 3')
2.2.3 Matrix representation: substitution matrix representation
In Nanni et al. (2013), a variant of the substitution matrix representation (SMR) proposed by Yu et al. (2011) is presented, where the SMR for a given protein of length N is an N × 20 matrix obtained as SMR(i, j) = M(p_i, j), with 1 ≤ i ≤ N and 1 ≤ j ≤ 20, where M is a substitution matrix whose element M(a, b) represents the probability of amino acid a mutating to amino acid b during the evolution process. In the experiments reported below, 25 physicochemical properties are randomly selected to create an ensemble of SMR-based predictors.
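A minimal sketch of building an SMR-style matrix, assuming one fixed amino acid ordering and a toy substitution matrix (illustrative Python; the paper's code is in MATLAB, and a real matrix such as BLOSUM62 would be used in practice):

```python
import numpy as np

AA = "ARNDCQEGHILKMFPSTWYV"  # the 20 native amino acids, in one fixed ordering

def smr(sequence, M):
    """N x 20 substitution matrix representation: row i is the
    substitution-matrix row of the i-th residue of the protein."""
    idx = {a: k for k, a in enumerate(AA)}
    return np.array([M[idx[a]] for a in sequence])

M = np.eye(20)        # toy stand-in for a real 20 x 20 substitution matrix
rep = smr("ARND", M)
print(rep.shape)      # (4, 20)
```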
2.2.4 Matrix representation: physicochemical property response matrix
Property response (PR), proposed in Nanni et al. (2012), is a matrix representation based on a protein's physicochemical properties. For a given protein of length N, the PR matrix is obtained by first selecting a physicochemical property d. The value of the element PR(i, j) is set to the sum of the values of property d for the amino acid in position i and the amino acid in position j of the protein, such that PR(i, j) = f_d(p_i) + f_d(p_j), where 1 ≤ i, j ≤ N and f_d(a) returns the value of the property d for the amino acid a.
The matrix PR is treated as an image that, if originally larger than a fixed size, is resized via cubic interpolation (Keys, 1981) to obtain the final matrix. In the experimental section, 25 random physicochemical properties are selected to create an ensemble of PR-based predictors.
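The PR construction above can be sketched as follows (illustrative Python; the property values shown are toy numbers, whereas the real values come from the AAindex database):

```python
import numpy as np

def pr_matrix(sequence, prop):
    """Physicochemical property response matrix: PR(i, j) = d(p_i) + d(p_j),
    where d maps an amino acid to the value of the chosen property."""
    v = np.array([prop[a] for a in sequence])
    return v[:, None] + v[None, :]  # symmetric N x N matrix

prop = {"A": 1.8, "G": -0.4, "L": 3.8}  # invented values for illustration
P = pr_matrix("AGL", prop)
print(P)  # P[0, 1] = 1.8 + (-0.4) = 1.4
```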
2.2.5 Matrix representation: wavelet (WAVE)
Methods for extracting features from wavelets have been proposed by Li and Li (2008), Nanni et al. (2010b, 2012), Qiu et al. (2009), Shi et al. (2011) and Wen et al. (2005). Wavelet encoding requires a numerical representation, meaning that the protein sequence must first be numerically encoded by substituting each amino acid with the value corresponding to a given physicochemical property d. As in Li and Li (2008), the Meyer continuous wavelet is then applied to obtain the wavelet transform coefficients, and features are extracted considering 100 decomposition scales. Twenty-five physicochemical properties are randomly selected to create a WAVE ensemble composed of 25 wavelet-based predictors.
2.2.6 Matrix representation: 3D tertiary structure (DM)
DM is based on the distances between atoms and residues in a PDB structure; it creates a heat map of inter-residue distances. If the size of the map exceeds 250 × 250, it is resized to 250 × 250 to reduce the computation time of the feature extraction step. As with the other protein matrix representations described in this article, DM is regarded as a gray-scale image from which texture descriptors are extracted.
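A minimal sketch of computing such a distance map from Cα coordinates (illustrative Python; the coordinates are invented):

```python
import numpy as np

def distance_map(coords):
    """Inter-residue Euclidean distance map from an N x 3 array of C-alpha coordinates."""
    c = np.asarray(coords, dtype=float)
    diff = c[:, None, :] - c[None, :, :]   # pairwise coordinate differences
    return np.sqrt((diff ** 2).sum(axis=-1))

coords = [[0, 0, 0], [3, 4, 0], [0, 0, 5]]
D = distance_map(coords)
print(D[0, 1], D[0, 2])  # 5.0 5.0
```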
2.2.7 Matrix representation: RC
RC is our label for the protein representation proposed by Zacharaki (2017). RC represents a protein structure in the form of two sets of feature maps: (i) the local distributions of two torsion angles (ϕ and ψ) per amino acid, which expresses the shape of the protein backbone, and (ii) the distance between the amino acid building blocks. The feature maps, described in more detail below, are then treated as sets of images from which different texture descriptors are extracted.
2.2.8 Protein structure: torsion angles density
The two torsion angles ϕ and ψ describe the rotations of the polypeptide backbone around the bonds between N–Cα and Cα–C, respectively. The amino acids in the protein are grouped according to their type, and the density of the two torsion angles (ϕ, ψ) is computed from the 2D sample histogram of the angles, often referred to as the Ramachandran diagram. The histogram has equally sized bins and is not normalized. The resulting torsion angle density feature maps are smoothed by convolving the density maps with a 2D Gaussian kernel.
2.2.9 Protein structure: density of amino acid distances
For each amino acid type of a given protein, the distances to the amino acids of another type are calculated based on the coordinates of the Cα atoms of their residues. The distances are stored as an array whose length varies across proteins and is thus not directly comparable. To standardize the arrays, the sample histogram of the distances is extracted using equally sized bins and smoothed, as above, by convolving the histogram with a 1D Gaussian kernel. Processing all pairs of amino acid types produces feature maps whose dimension depends on the number of histogram bins.
In this article, we apply the two RC protein representations as follows: (i) the torsion angle densities build a map that we treat as 19 separate images; and (ii) the density of amino acid distances builds a map that we treat as eight separate images. For each image, different descriptors are extracted and trained by separate SVMs, which are combined by sum rule.
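The histogram-plus-smoothing step for the distance densities can be sketched as follows (illustrative Python; the bin count, distance range and kernel width here are assumptions, not the values used by Zacharaki (2017)):

```python
import numpy as np

def distance_density(dists, n_bins=8, lo=0.0, hi=40.0, sigma=1.0):
    """Fixed-length histogram of inter-residue distances, smoothed by
    convolution with a normalized 1D Gaussian kernel."""
    hist, _ = np.histogram(dists, bins=n_bins, range=(lo, hi))
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    return np.convolve(hist.astype(float), kernel, mode="same")

d = distance_density([5.2, 5.9, 18.0, 18.5, 31.0])
print(d.shape)  # (8,)
```

The fixed bin count is what makes arrays of different lengths comparable across proteins, as the text above notes.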
2.3 Protein feature extraction
In this section, we describe the different approaches used to extract descriptors from the protein representations introduced in Section 2.2. The descriptors extracted from the primary AAC representation are mostly based on substituting the letter of an amino acid with its value for a fixed physicochemical property. To make the result independent of the selected property, 25 properties are selected and used to train an ensemble of SVM classifiers.
2.3.1 Primary representation: amino acid composition (AS)
AS is one of the simplest methods for extracting features since it merely counts the fraction of each amino acid as f(a) = N(a)/L, where N(a) counts the number of occurrences of amino acid a in a protein sequence of length L.
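The composition vector can be computed in a few lines (illustrative Python; the toy sequence is invented):

```python
from collections import Counter

def aac(sequence, alphabet="ARNDCQEGHILKMFPSTWYV"):
    """Amino acid composition: normalized occurrence frequency of each of the 20 residues."""
    counts = Counter(sequence)
    L = len(sequence)
    return [counts[a] / L for a in alphabet]

v = aac("AARN")
print(v[0], v[1])  # 0.5 0.25 (A occurs twice, R once, in a length-4 sequence)
```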
2.3.2 Primary representation: quasi residue couple
First proposed by Nanni and Lumini (2006) and inspired by Chou’s quasi-sequence-order model and Yuan’s Markov chain model (Guo et al., 2005), QRC is a method for extracting features from the primary sequence of a protein (Nanni et al., 2010). The original residue couple model was designed to represent information contained in both the AAC and the order of the amino acids in the protein sequences. The quasi residue couple (QRC) descriptor is obtained by selecting a physicochemical property d and by combining its values with each non-zero entry in the residue couple. A parameter m is called the order of the residue couple model, and values m ≤ 3 are considered sufficient for representing a sequence.
In the experimental section, the features are extracted for m in the range 1–3 and concatenated into a 1200-dimensional vector. Moreover, 25 physicochemical properties are randomly selected to create an ensemble of QRC descriptors.
2.3.3 Primary representation: autocovariance approach
The autocovariance (AC) descriptor measures the correlation of a given physicochemical property between residues separated by a fixed lag along the sequence. In the experimental section, 25 random physicochemical properties are selected to create an ensemble of 25 AC descriptors.
2.3.4 Matrix-based descriptor: pseudo PSSM
Pseudo PSSM (PP) is a widely used matrix descriptor for proteins (Fan and Li, 2011; Jeong et al., 2011) that is normally applied to the PSSM matrix representation. This protein descriptor is designed to retain information about the amino acid sequence by considering the PseAAC.
2.3.5 Matrix-based descriptor: N-gram features (NGR)
NGR is typically extracted from the primary protein sequence (described above in Section 2.2). However, Sharma et al. (2013) extract this descriptor directly from the PSSM matrix by accumulating the probabilities of each of the N-grams according to the probability information contained in PSSM.
Given an input matrix P representing a given protein of length L, the frequency of occurrence of the transition from the ith amino acid to the jth amino acid is calculated for 2-grams (BGR) as BGR(i, j) = Σ_{k=1..L−1} P(k, i) × P(k+1, j), where 1 ≤ i, j ≤ 20.
The frequency of occurrence of the transition from the ith to the jth to the kth amino acid is calculated for 3-grams (TGR) as TGR(i, j, k) = Σ_{m=1..L−2} P(m, i) × P(m+1, j) × P(m+2, k), where 1 ≤ i, j, k ≤ 20.
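The 2-gram accumulation can be written compactly as a matrix product (illustrative Python sketch of the bigram formulation above; a random matrix stands in for a real PSSM):

```python
import numpy as np

def bgr(pssm):
    """2-gram features from an L x 20 PSSM: accumulate, over all adjacent
    positions, the product of the probabilities of amino acids i and j."""
    P = np.asarray(pssm, dtype=float)
    return P[:-1].T @ P[1:]  # 20 x 20 matrix, i.e. 400 features when flattened

pssm = np.random.rand(50, 20)  # random stand-in for a real PSSM profile
print(bgr(pssm).shape)  # (20, 20)
```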
2.3.6 Matrix-based descriptor: texture descriptors
In this work, we combine many different texture descriptors as follows: for each texture descriptor, a different SVM is trained, and sets of SVMs are then combined by sum rule. The following descriptors are tested in this article:
LBP (Ojala et al., 2002): uniform LBP with two configurations (radius R, number of neighbors P): (1, 8) and (2, 16).
WLD (Chen et al., 2010): Weber law descriptor code computed within a 3 × 3 block with the following parameter configurations: BETA = 5, ALPHA = 3, and number of neighbors = 8.
CLBP (Guo et al., 2010): completed LBP with two configurations (R, P): (1, 8) and (2, 16).
RIC (Nosaka and Fukui, 2014): multiscale rotation invariant co-occurrence of adjacent LBP with R ∈ {1, 2, 4}.
MORPH (Strandmark et al., 2012): a set of MORPHological features, i.e. a set of measures that includes the aspect ratio, number of objects, area, perimeter, eccentricity and other measures extracted from a segmented version of the image.
HASH (San Biagio et al., 2013): default values of the heterogeneous auto-similarities of characteristics features.
3 Results and discussion
In the present study, we evaluate our approach across three benchmark datasets: PDB1075, PDB594 and PDB186. These DNA-BPs were selected from the Protein Data Bank located at http://www.rcsb.org/pdb/home/home.do. All protein sequences containing fewer than 50 amino acids or the character "X" were removed, as were all sequences having more than 25% similarity with any other sequence. The PDB1075 dataset (Liu et al., 2014) contains 525 DNA-BPs and 550 DNA-non-BPs. The PDB594 dataset (Lou et al., 2014) contains 297 DNA-BPs and 297 DNA-non-BPs. The PDB186 dataset is designed as an independent testing dataset derived from Lou et al. (2014); it contains 93 DNA-BPs and 93 DNA-non-BPs.
In accordance with Chou’s procedure (2011), we perform the following testing protocols for a fair comparison with the literature:
Jack: Jackknife test on the PDB1075 dataset.
IND1: training on the PDB1075 dataset and testing on the independent PDB186 dataset.
IND2: training on the PDB594 dataset and testing on the independent PDB186 dataset.
All results are reported using two performance indicators: (i) classification accuracy and (ii) area under the ROC curve (AUC). Accuracy is computed as the ratio between the number of samples correctly classified and the total number of samples. The ROC curve is a graphical plot of the sensitivity of a binary classifier vs the false positive rate (1 − specificity) as its discrimination threshold varies. AUC (Fawcett, 2004) is a scalar measure representing the probability that the classifier will assign a higher score to a randomly picked positive pattern than to a randomly picked negative pattern. Before each fusion of different methods, their scores are normalized to mean 0 and SD 1.
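The score normalization applied before fusion can be sketched as follows (illustrative Python; the example scores are invented):

```python
import numpy as np

def zscore(scores):
    """Normalize one classifier's scores to mean 0 and SD 1 before fusion,
    so that no single classifier dominates the sum rule."""
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / s.std()

a = zscore([0.9, 0.2, 0.6, 0.3])
print(a.mean(), a.std())  # mean ~ 0, SD ~ 1
```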
The aim of the first set of experiments is to compare all the descriptors detailed in Section 2.3. For each testing protocol, the following tables report two rows of values: the first is the accuracy and the second is the AUC. In Table 1, the performance of representations based on the sequence model and their fusions is reported. The method named "AC + K × QRC" denotes the weighted sum rule between AC (weight 1) and QRC (weight K).
Performance of representations that are sequence-based

| Sequence based | AC | QRC | AC + QRC | AC + 0.5×QRC |
|---|---|---|---|---|
| IND1 (acc.) | 79.03% | 75.81% | 81.18% | 79.57% |
| IND1 (AUC) | 84.76% | 96.55% | 89.92% | 88.14% |
| Jack (acc.) | 76.74% | 72.47% | 76.28% | 77.02% |
| Jack (AUC) | 84.92% | 82.32% | 85.55% | 85.25% |
| IND2 (acc.) | 65.59% | 62.90% | 66.13% | 65.59% |
| IND2 (AUC) | 69.30% | 68.94% | 70.99% | 70.48% |

Bold values are highest performance.
In Table 2, we report the performance obtained using two different matrix representations (PSSM and SMR) coupled with the NGR protein descriptors. The third column is the fusion by sum rule of the classifiers trained with PSSM and SMR.
Performance obtained using the NGR representation

| NGR | PSSM | SMR | FUS_NG |
|---|---|---|---|
| IND1 (acc.) | 80.11% | 78.49% | 80.65% |
| IND1 (AUC) | 87.21% | 93.16% | 91.28% |
| Jack (acc.) | 73.21% | 74.33% | 73.21% |
| Jack (AUC) | 81.75% | 81.85% | 81.75% |
| IND2 (acc.) | 67.74% | 63.98% | 66.13% |
| IND2 (AUC) | 70.49% | 69.14% | 71.46% |

Bold values are highest performance.
In Table 3, we report the performance obtained using the set of texture (TXT) descriptors listed in Section 2.3.6 to describe the different matrix protein representations, as well as the performance of PP applied to each matrix representation. The column named FUS is a fusion by sum rule among the different approaches PSSM, SMR, WAVE, DM and RC. The column labeled FUS_noPDB reports the performance of the fusion of the methods not based on PDB structure, i.e. PSSM, SMR and WAVE.
TXT and PP descriptors for describing the different matrix protein representations

| TXT | PSSM | SMR | PR | WAVE | DM | RC | FUS_noPDB | FUS |
|---|---|---|---|---|---|---|---|---|
| IND1 (acc.) | 83.87% | 80.65% | 86.02% | 82.26% | 73.66% | 80.11% | 82.80% | 82.80% |
| IND1 (AUC) | 96.25% | 93.50% | 94.10% | 93.20% | 83.19% | 88.35% | 96.51% | 94.84% |
| Jack (acc.) | 72.93% | 77.21% | 69.30% | 67.44% | 76.19% | 72.28% | 75.26% | 76.28% |
| Jack (AUC) | 80.54% | 85.44% | 78.05% | 77.25% | 84.01% | 80.56% | 83.83% | 86.19% |
| IND2 (acc.) | 66.67% | 64.52% | 61.29% | 61.29% | 67.74% | 60.75% | 65.59% | 67.74% |
| IND2 (AUC) | 68.46% | 70.01% | 67.65% | 66.61% | 72.33% | 65.94% | 69.94% | 71.17% |
| PP | PSSM | SMR | PR | WAVE | DM | RC | FUS_noPDB | FUS |
| IND1 (acc.) | 82.26% | 82.26% | 86.02% | 87.10% | 77.42% | 79.03% | 82.26% | 84.41% |
| IND1 (AUC) | 92.37% | 92.67% | 93.16% | 93.41% | 85.92% | 90.87% | 96.67% | 96.90% |
| Jack (acc.) | 76.93% | 75.53% | 65.67% | 65.77% | 71.16% | 73.95% | 78.88% | 80.74% |
| Jack (AUC) | 84.89% | 83.71% | 70.98% | 72.05% | 80.86% | 82.64% | 88.41% | 89.67% |
| IND2 (acc.) | 69.89% | 66.67% | 55.91% | 55.38% | 55.91% | 60.22% | 71.51% | 69.89% |
| IND2 (AUC) | 76.64% | 74.05% | 57.13% | 58.12% | 62.64% | 66.66% | 79.49% | 77.80% |

Bold values are highest performance.
From the results reported in Table 3, the following conclusions can be drawn:

- PP obtains the best performance.
- NGR and the sequence-based representations are clearly worse than PP and TXT.
- The fusion among different feature descriptors extracted from the same matrix representation is useful.
- The representations related to the PDB protein format boost performance, although not remarkably.
- The fusion clearly outperforms all the stand-alone methods.
In Table 4, we report the performance obtained by combining PP(FUS) and TXT(FUS) by weighted sum rule. Unfortunately, the fusion only weakly boosts the performance of the base approaches. We did not run a parameter selection to determine the weights, nor did we overfit them: we simply tested reasonable values. A weight of 0.5 means that the approach is weighted by half with respect to the other method, and by a quarter if the value is 0.25. The performance of the 0.50 and 0.25 weightings of TXT(FUS) is very similar.
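The weighted sum rule described above can be sketched in a few lines. The original code linked in the abstract is MATLAB; the following is only an illustrative Python sketch, in which the score values, the min-max normalization step, and the decision threshold are hypothetical choices for the example, not taken from the paper.

```python
def minmax_norm(scores):
    """Scale one classifier's scores to [0, 1] so the two sets are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def weighted_sum_rule(scores_a, scores_b, weight_b=0.5):
    """Fuse two sets of SVM decision scores by weighted sum rule.

    weight_b = 1.0 is the plain sum rule; 0.5 weights the second
    classifier by half, 0.25 by a quarter.
    """
    a = minmax_norm(scores_a)
    b = minmax_norm(scores_b)
    return [x + weight_b * y for x, y in zip(a, b)]

# Toy example: decision scores from two descriptor-specific SVMs
pp_scores = [1.2, -0.4, 0.8, -1.1]    # hypothetical PP(FUS) outputs
txt_scores = [0.9, -0.2, -0.5, -0.9]  # hypothetical TXT(FUS) outputs
fused = weighted_sum_rule(pp_scores, txt_scores, weight_b=0.5)
predictions = [1 if s >= 0.75 else 0 for s in fused]  # hypothetical threshold
```

With `weight_b = 1.0` both classifiers contribute equally, recovering the unweighted sum rule used for the stand-alone fusions.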
Weighted sum rule between PP(FUS2) and TXT(FUS2) (upper row of each pair: accuracy; lower row: AUC)

| | TXT(FUS) | PP(FUS) | PP(FUS) + TXT(FUS) | PP(FUS) + 0.50 TXT(FUS) | PP(FUS) + 0.25 TXT(FUS) |
|---|---|---|---|---|---|
| IND1 | 82.80% | 84.41% | 83.87% | 84.95% | 84.95% |
|  | 94.84% | 96.90% | 96.39% | 96.76% | 96.90% |
| Jack | 76.28% | 80.74% | 78.70% | 79.44% | 80.19% |
|  | 86.19% | 89.67% | 89.28% | 89.81% | 89.90% |
| IND2 | 67.74% | 69.89% | 70.43% | 69.35% | 69.89% |
|  | 71.17% | 77.80% | 76.41% | 77.13% | 77.59% |

Bold values are highest performance.
Finally, in Tables 5 and 6, we compare our approach with the literature. Our approach obtains state-of-the-art performance on IND1 and Jack; on IND2 it obtains the second-best performance. In Table 6, the performance is obtained using the whole dataset for selecting the features and retaining the number of features that maximizes the performance (the results of that testing protocol are not comparable with ours). Also, in Zhang and Liu (2017), the AUC is reported for IND1 (88.03%) and Jack (87.12%); note that our approach outperforms both results.
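Since the comparisons above mix accuracy with AUC, it may help to recall how the AUC values are obtained from raw decision scores. A minimal sketch (not the paper's MATLAB implementation; the label and score values are invented for illustration) uses the rank-based Mann-Whitney formulation:

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney statistic: the
    probability that a random positive scores above a random negative,
    counting ties as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Perfectly separated toy scores give AUC = 1.0
labels = [1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.1]
print(auc(labels, scores))  # -> 1.0
```

Unlike accuracy, this quantity does not depend on a decision threshold, which is why the two metrics can rank methods differently in the tables.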
Comparison using accuracy with the literature

| | IND1 | Jack |
|---|---|---|
| PP(FUS_noPDB) | 82.26% | 78.88% |
| Here: PP(FUS) + TXT(FUS) | 84.95% | 80.19% |
| iDNA-Prot\|dis (Liu et al., 2014) | 72.0% | 72.0% |
| PseDNA-Pro (Liu et al., 2015) | — | 76.55% |
| iDNAPro-PseAAC (Liu et al., 2015) | 71.5% | 76.56% |
| Kmer1 + ACC (Dong et al., 2015) | 71.0% | 75.23% |
| Local-DPP (Wei et al., 2017) | 79.0% | 79.20% |
| (Wang et al., 2017) | 76.3% | 86.23%* |
| (Chowdhury et al., 2017) | 80.6% | 90.18%* |
| PSFM-DBT (Zhang and Liu, 2017) | 80.65% | 81.02% |

Bold values are highest performance.
Comparison with the literature, results from Lou et al. (2014)

| IND2 | Accuracy | AUC |
|---|---|---|
| PP(FUS_noPDB) | 71.5% | 79.5% |
| Here: PP(FUS) + TXT(FUS) | 69.9% | 77.6% |
| iDNA-Prot (Liu et al., 2014) | 67.2% | — |
| DNA-Prot (Kumar et al., 2009) | 61.8% | — |
| DNAbinder (Kumar et al., 2007) | 60.8% | 60.7% |
| DNABIND (Szilágyi and Skolnick, 2006) | 67.7% | 69.4% |
| DBD-Threader (Gao and Skolnick, 2009) | 59.7% | — |
| (Lou et al., 2014) | 76.9% | 79.1% |

Bold values are highest performance.
We also report in Tables 5 and 6 our best ensemble without PDB-related protein descriptors: PP(FUS_noPDB). It obtains the highest AUC on IND2. It should also be noted that many of the reported state-of-the-art approaches are based on features extracted from the PSSM and/or other sequence-related descriptors. In contrast, our best approach needs the 3D structure of a protein, which is not always available. Nonetheless, the number of proteins in the PDB repository is increasing year by year (see https://www.rcsb.org/stats/growth/).
4 Conclusion
The purpose of the present study was to experimentally evaluate the performance of many powerful protein representations and feature extraction methods to determine which descriptors and their combinations are most useful for DNA-BP identification. Experiments show that representations based on PDB clearly boost classification performance, with our best ensemble obtaining state-of-the-art performance across the benchmark datasets. We also show that the texture descriptors extracted from matrix-based representations perform similarly to PP for representing a protein.
In the future, we plan to experimentally evaluate more descriptors, including those that can be extracted from convolutional neural networks. To further improve the performance of our methods, we also plan to test additional classification approaches (e.g. AdaBoost and Rotation Forest).
Conflict of Interest: none declared.
References