Abstract

Motivation

Because DNA-binding proteins (DNA-BPs) play a vital role in all aspects of genetic activity, the development of reliable and efficient systems for automatic DNA-BP classification is becoming a crucial proteomic technology. Key to this technology is the discovery of powerful protein representations and feature extraction methods. The goal of this article is to develop experimentally a system for automatic DNA-BP classification by comparing and combining different descriptors taken from different types of protein representations.

Results

The descriptors we evaluate include those starting from the position-specific scoring matrix (PSSM) of proteins, those derived from the amino-acid sequence (AAS), various matrix representations of proteins and features taken from the three-dimensional tertiary structure of proteins. We also introduce some new variants of protein descriptors. Each descriptor is used to train a separate support vector machine (SVM), and results are combined by sum rule. Our final system obtains state-of-the-art results on three benchmark DNA-BP datasets.

Availability and implementation

The MATLAB code for replicating the experiments presented in this paper is available at https://github.com/LorisNanni.

1 Introduction

DNA-binding proteins (DNA-BPs) play key roles in all aspects of genetic activity. Identifying these proteins, however, poses a major challenge, as traditional methods are time consuming and expensive. The development of automatic machine learning (ML) methods that quickly and accurately identify such proteins is rapidly becoming a critical proteomic technology.

According to Chou (2011), for the sake of clarity and practicality, a ML process designed to predict a protein should be documented in terms of the following five procedures: (i) the construction of a benchmark dataset for testing and training ML predictors, (ii) the formulation of discrete protein representations appropriate for the prediction problem, (iii) the development of ML approaches to perform the prediction, (iv) the evaluation of the accuracy of the proposed methods according to fair testing protocols and (v) the establishment of user-friendly web-servers that are accessible to the public.

Because the performance of a computational model is highly dependent on the power of its feature representation and because many different protein representations have been proposed in the literature, it is important to investigate which of these and their combinations are most suited for DNA-BP prediction. For this reason, the focus of this study is on Chou’s second procedure, the formulation of powerful discrete numerical representations of proteins.

ML feature representations of proteins are broadly classified based on two types of protein models: sequence-based models and structure-based models. Representations based on structural models rely on such information as the high-resolution 3D structure of protein sequences. Nimrod et al. (2010), for example, demonstrate how structural characteristics of proteins can be computed for DNA-BP identification from their average surface electrostatic potentials, dipole moments and cluster-based amino acid conservation patterns. For a recent survey of representations based on structural models, see Xiong et al. (2018). Many structure-based methods also combine structural information with sequential features for identifying DNA-BP. In Zhang and Liu (2017), for instance, different features extracted from the position-specific frequency matrix are tested for DNA-BP classification. Unfortunately, structural information of proteins is not always available, making methods based on this model unsuitable for predicting protein sequences without known structural information.

Representations based on sequence models are usually easier to extract since they are based on extracting features from the simple amino acid composition (AAC). The vector representation of AAC (Nakashima et al., 1986) is of length 20 since it represents proteins as the normalized occurrence frequencies of the 20 native amino acids. Pseudo amino acid composition (PseAAC) (Chou, 2001, 2009), which has become one of the most popular protein vector representations, expands AAC by retaining additional information embedded in protein sequences, such as the protein’s sequential order, using, among other modes, a series of rank-different correlation factors along a protein chain. Sixteen variants of PseAAC are described in Chou (2009), along with a historical account and practical guide for using them. In terms of DNA-BP identification, Rahman et al. (2018) recently used random forest to rank PseAAC features and then recursive feature elimination to extract an optimal set of features, which were used to train a support vector machine (SVM) with a linear kernel, and Liu et al. (2014) combined PseAAC features with a physicochemical distance transformation (Liu et al., 2015) using a reduced alphabet approach to improve prediction and decrease computation time.

Additional approaches based on the AAC discrete model include dipeptide (Lin and Ding, 2011; Nanni and Lumini, 2008; Waris et al., 2016), tripeptide (Ding et al., 2012) and tetrapeptide (Lin et al., 2013), where each protein is represented by a vector of length $20^n$ that includes the normalized occurrence frequencies of a given n-peptide. Nanni and Lumini (2009) developed a multi-classifier approach based on grouped weight and a genetic algorithm for selecting a set of reduced alphabets taken directly from the AAC. There are also protein representations that rely on physicochemical properties (Liu et al., 2015; Song et al., 2014; Xu et al., 2014).

One other important approach based on the AAC model incorporates the evolutionary information embedded in sequence profiles that are automatically generated by position-specific iterated BLAST (PSI-BLAST) (Altschul et al., 1997). Many studies have demonstrated the importance of including evolutionary features for improving DNA-BP prediction (Chowdhury et al., 2017; Liu et al., 2015; Wei et al., 2017; Xu et al., 2015), with Liu et al. (2015) integrating PseAAC with profile-based evolutionary information retrieved by PSI-BLAST, discovering that negative samples in the training model improve prediction.

Several powerful features have been derived from the position-specific scoring matrix (PSSM) (Gribskov et al., 1987). PSSM describes a protein starting from the evolutionary information contained in a PSI-BLAST similarity search; see Nanni et al. (2013) for a survey of research using protein descriptors extracted from PSSM. In Waris et al. (2016), for instance, DNA-BP prediction is improved using a combination of features extracted from dipeptide composition, split AAC and PSSM. Wang et al. (2017) show improved performance by combining a 200-dimension normalized Moreau-Broto autocorrelations feature vector (Feng and Zhang, 2000) with a 1040-dimension feature vector called PSSM-DWT (PSSM compressed by a discrete wavelet transform) and a 100-dimension PSSM-DCT (PSSM compressed by a discrete cosine transform). Nanni et al. (2013) combine multiple matrix representations into a high performing general protein ensemble starting from PSSM.

Finally, several studies have demonstrated the power of extracting texture descriptors from matrix representations of proteins that are treated as images (Kavianpour and Vasighi, 2017; Li et al., 2018; Nanni et al., 2010b, 2012). In Nanni et al. (2010b), for instance, the authors extract and combine sets of well-known texture descriptors (e.g. variants of local binary patterns and features based on the Radon feature transform and Haralick descriptors) from the protein backbone image, demonstrating a significant improvement in classification rates on datasets for protein fold recognition, DNA-BP recognition, biological processes and molecular function recognition, and in Nanni et al. (2012), the authors analyze and compare several feature extraction methods used in protein classification that are based on the calculation of texture descriptors starting from a wavelet representation of the protein.

The main objective of this study is to search for an ensemble of protein features that work well across different DNA-BP classification datasets. To accomplish this objective, we investigate several state-of-the-art protein descriptors: those based on the AAC model, different types of matrix representations of the protein (such as PSSM) and the 3D tertiary structure of the protein. We develop as the result of our investigation a powerful ensemble of representations for DNA-BP identification, obtaining state-of-the-art performances across the benchmark datasets.

Although all MATLAB source codes used in this article are provided (see abstract for the URL), it should also be noted that most of the protein feature extraction methods used in this article can also be generated using such user-friendly tools as Pse-in-One 2.0 (Liu et al., 2017), BioSeq-Analysis (Liu, 2017) and Pse-Analysis (Liu et al., 2017). These tools take a given benchmark dataset as input and construct an optimized predictor based on samples taken from the dataset. Of note is the web-server Pse-in-One 2.0, which provides a powerful series of feature analysis approaches. Some additional PseAAC web servers include PseAAC-Builder (Du et al., 2012), propy (Cao et al., 2013) and PseAAC-General (Du et al., 2014).

2 Materials and methods

2.1 Machine learning approach for protein classification

As noted in the introduction, much recent research has focused on finding a compact and effective representation of proteins, one that ideally produces a fixed-length descriptor so that the classification problem can be solved by a ML approach. In this work, we explore several solutions evaluated based on a general representation that can be used with an ensemble of general-purpose classifiers, such as an ensemble of SVMs.

From each of the protein representations (described below in Section 2.2), several different types of features (as detailed in Section 2.3) are extracted. Some feature extraction methods are applied multiple times, once for each of the physicochemical properties that are considered in the extraction process. The set of physicochemical properties is obtained from the amino acid index database (Kawashima and Kanehisa, 1999) available at http://www.genome.jp/dbget/aaindex.html. An amino acid index is a set of 20 numerical values representing a given physicochemical property of the 20 amino acids. We ignore properties where the amino acids have a value of 0 or 1. The amino acid index database currently contains 566 indices and 94 substitution matrices; but, as noted below, a reduced number of properties is sufficient for a protein classification task.

Once a descriptor is extracted from a protein representation, it is used to train an SVM as implemented in the LibSVM toolbox (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). In the experiments reported in this work, all features used for training an SVM are linearly normalized to [0, 1] based on the training data. In each dataset, the SVM is tuned considering only the training data using a grid search approach, which means, in effect, that the test is blind.
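The linear normalization to [0, 1] can be illustrated with a minimal Python sketch. The function names and toy data are hypothetical and this is not the authors' MATLAB code; it only shows the bounds being estimated on the training rows and then reused unchanged on the test rows, so that test values may fall outside [0, 1].

```python
def fit_min_max(train_rows):
    """Return per-feature (min, max) bounds computed on the training rows only."""
    columns = list(zip(*train_rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, bounds):
    """Linearly rescale each feature using the training-set bounds."""
    scaled = []
    for row in rows:
        out = []
        for value, (lo, hi) in zip(row, bounds):
            # A feature that is constant on the training set is mapped to 0.
            out.append(0.0 if hi == lo else (value - lo) / (hi - lo))
        scaled.append(out)
    return scaled


# Hypothetical two-feature descriptors.
train = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]
test = [[2.0, 40.0]]

bounds = fit_min_max(train)
train_scaled = apply_min_max(train, bounds)
test_scaled = apply_min_max(test, bounds)
```

Keeping the bounds fixed after fitting is what keeps the test data out of the normalization step.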

The ensemble approaches presented here are based on the fusion of different descriptors: the final decision is obtained by combining the pool of SVMs by sum rule or by weighted sum rule. In the standard sum rule, the scores of all methods are summed with equal weights; in the weighted sum rule, the score of each method is multiplied by its own weight before summing.
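A minimal Python sketch of the two fusion rules follows. It assumes that each SVM outputs one real-valued score per test sample and that scores are standardized to mean 0 and SD 1 before fusion; the function names, the toy scores and the use of the population SD are assumptions made for illustration.

```python
from statistics import mean, pstdev


def z_normalize(scores):
    """Rescale one classifier's scores to mean 0 and standard deviation 1."""
    mu = mean(scores)
    sd = pstdev(scores)  # population SD; the choice of estimator is an assumption
    return [0.0 if sd == 0 else (s - mu) / sd for s in scores]


def weighted_sum_rule(score_lists, weights=None):
    """Fuse per-classifier score lists; equal weights give the plain sum rule."""
    if weights is None:
        weights = [1.0] * len(score_lists)
    normalized = [z_normalize(scores) for scores in score_lists]
    n_samples = len(score_lists[0])
    return [
        sum(w * norm[i] for w, norm in zip(weights, normalized))
        for i in range(n_samples)
    ]


# Hypothetical SVM decision values for four test proteins.
ac_scores = [0.9, -0.2, 0.4, -1.1]
qrc_scores = [2.0, 0.5, -0.5, -2.0]

sum_rule = weighted_sum_rule([ac_scores, qrc_scores])
ac_plus_half_qrc = weighted_sum_rule([ac_scores, qrc_scores], [1.0, 0.5])
```

The second call corresponds to a fusion in which the first classifier has weight 1 and the second has weight 0.5.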

2.2 Protein representations

2.2.1 Amino acid sequence

As noted in the introduction, representations based on the sequential model are derived from AAS, which can be described as the linear sequence $P=(p_1, p_2, \ldots, p_N)$, where $p_i \in \mathcal{A}=[A, C, D, \ldots, Y]$, with $\mathcal{A}$ being the 20 native amino acid types. Many studies (see Kawashima and Kanehisa, 1999) have shown that AAS coupled with other information related to the physicochemical properties of amino acids produce many powerful descriptors.

2.2.2 Matrix representation: position-specific scoring matrix

PSSM, first proposed in Gribskov et al. (1987), is a matrix representation for proteins that is obtained from a group of sequences previously aligned by structural or sequence similarity. PSSM is calculated using PSI-BLAST, an application that compares PSSM profiles for detecting remotely related homologous proteins or DNA.

PSSM considers the following parameters:

  1. Position, which is the index of each amino acid residue in a sequence after multiple sequence alignments.

  2. Probe, which is a group of typical sequences of functionally related proteins that have already been aligned by structural similarity or sequence.

  3. Profile, which is a matrix of 20 columns that correspond to the 20 amino acids.

  4. Consensus, which is the sequence of amino acid residues that are most similar to all the alignment residues of probes at each position; the consensus sequence is calculated by selecting the highest score at each position in the profile.

The PSSM representation for a given protein of length N is a matrix with dimension N × 20 where each element $S_{i,j}$ represents the occurrence probability of amino acid j at position i of the protein sequence. The rows in the matrix represent the positions of the sequence, and the columns represent the 20 types of the original amino acids. The elements of PSSM(i,j) are calculated as $PSSM(i,j)=\sum_{k=1}^{20} w(i,k)\times Y(j,k)$, where $i=(1,\ldots,N)$; $j=(1,\ldots,20)$; w(i,k) is the ratio between the frequency of the kth amino acid at the position i of the probe and total number of probes and Y(j,k) is the value of Dayhoff’s mutation matrix between the jth and kth amino acids (in other words, Y(j,k) is a substitution matrix, i.e. a matrix that describes the rate at which one character in the protein changes into another over time).

PSSM scores are normally positive or negative integers. Small values of PSSM(i,j) indicate weakly conserved positions and large values indicate strongly conserved positions. Thus, the element of a PSSM profile can be used to approximate the occurrence probability of the corresponding amino acid at a specific position.

In this study, PSI-BLAST is called from MATLAB to create the PSSM scores for each protein sequence using the following command (where input.txt is the protein sequence, and output.txt contains the PSSM matrix):

  • system('blastpgp.exe -i input.txt -d swissprot -Q output.txt -j 3')

2.2.3 Matrix representation: substitution matrix representation

In Nanni et al. (2013), a variant of the substitution matrix representation (SMR) proposed by Yu et al. (2011) is presented where the SMR for a given protein $P=(p_1, p_2, \ldots, p_N)$ is an N×20 matrix obtained as $SMR(i,j)=M(p_i,j)$, where $i=(1,\ldots,N)$; $j=(1,\ldots,20)$; and M is a 20×20 substitution matrix, whose element $M_{i,j}$ represents the probability of amino acid i mutating to amino acid j during the evolution process. In the experiments reported below, 25 physicochemical properties are randomly selected to create an ensemble of SMR-based predictors.
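As a minimal sketch of this construction, the Python snippet below builds an SMR by copying, for each residue, the corresponding row of a substitution matrix. The alphabetical residue ordering and the toy identity matrix are assumptions for illustration only; a real substitution matrix would be used in practice.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabetical one-letter ordering


def smr_matrix(sequence, substitution):
    """SMR(i, j) = M(p_i, j): row i is the substitution row of residue p_i."""
    index = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
    return [list(substitution[index[residue]]) for residue in sequence]


# Toy 20x20 matrix (identity), NOT a real substitution matrix.
toy_matrix = [[1.0 if r == c else 0.0 for c in range(20)] for r in range(20)]

smr = smr_matrix("ACDA", toy_matrix)  # 4 x 20 matrix
```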

2.2.4 Matrix representation: physicochemical property response matrix

Property response (PR), proposed in Nanni et al. (2012), is a matrix representation that is based on a protein’s physicochemical properties. For a given protein $P=(p_1, p_2, \ldots, p_N)$, the physicochemical PR matrix $PRM_d \in \mathbb{R}^{N\times N}$ is first obtained by selecting a physicochemical property d. The value of the element $PRM_d(i,j)$ is set to the sum of the value of the physicochemical property d of the amino acid in position i of the protein and the value of the same property of the amino acid in position j, such that: $PRM_d(i,j)=index(p_i,d)+index(p_j,d)$, where $i=(1,\ldots,N)$; $j=(1,\ldots,N)$ and index(a, d) returns the value of the property d for the amino acid a.

The $PRM_d$ matrix is treated as an image that is resized to 250×250 elements via cubic interpolation (Keys, 1981) (if originally larger than this size) to obtain the final matrix $PR_d$. In the experimental section, 25 random physicochemical properties are selected to create an ensemble of $PR_d$-based (PR) predictors.
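The construction of the property response matrix before resizing can be sketched in a few lines of Python. The property values below are hypothetical placeholders rather than an entry of the amino acid index database, and the resizing step is omitted.

```python
def property_response_matrix(sequence, prop):
    """PRM_d(i, j) = index(p_i, d) + index(p_j, d) for all position pairs."""
    values = [prop[residue] for residue in sequence]
    return [[vi + vj for vj in values] for vi in values]


# Hypothetical property values for three amino acids.
toy_property = {"A": 1.0, "C": 2.0, "D": -3.0}

prm = property_response_matrix("ACD", toy_property)
# prm is a symmetric 3 x 3 matrix: [[2, 3, -2], [3, 4, -1], [-2, -1, -6]]
```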

2.2.5 Matrix representation: wavelet (WAVE)

Methods for extracting features from wavelets have been proposed by Li and Li (2008), Nanni et al. (2010b, 2012), Qiu et al. (2009), Shi et al. (2011) and Wen et al. (2005). Wavelet encoding requires a numerical representation, meaning that the protein sequence must first be numerically encoded by substituting each amino acid with a value corresponding to a given physicochemical property d. As in Li and Li (2008), the Meyer continuous wavelet is then applied to the wavelet transform coefficients ($WAVE_d$), and features are extracted considering 100 decomposition scales. Twenty-five physicochemical properties are randomly selected to create a WAVE ensemble composed of 25 $WAVE_d$-based predictors.

2.2.6 Matrix representation: 3D tertiary structure (DM)

DM is based on distance between atoms and residues in a PDB structure. It creates a heat map showing inter-residue distances. If the size of the map exceeds 250, it is resized to 250 × 250 to reduce the computation time of the feature extraction step. As is the case with the other protein matrix representations described in this article, DM is regarded as a gray-scale image that is used to extract texture descriptors.

2.2.7 Matrix representation: RC

RC is our label for the protein representation proposed by Zacharaki (2017). RC represents a protein structure in the form of two sets of feature maps: (i) the local distributions of two torsion angles (ϕ and ψ) per amino acid, which expresses the shape of the protein backbone, and (ii) the distance between the amino acid building blocks. The feature maps, described in more detail below, are then treated as sets of images from which different texture descriptors are extracted.

2.2.8 Protein structure: torsion angles density

The two torsion angles ϕ and ψ describe the rotations of the polypeptide backbone around the bonds between N–Cα and Cα–C, respectively. The amino acids in the protein are grouped according to their type, and the density of the two torsion angles ($\phi, \psi \in [-180, 180]$) is estimated for each group from the 2D sample histogram of the angles, often referred to as the Ramachandran diagram. The histograms use $h_A=19$ equally sized bins per angle and are not normalized. The resulting torsion angle density feature maps ($X_A$) have a dimensionality of $[h_A\times h_A\times m]$, with m the number of amino acids. The density function is smoothed by convolving the density maps with a 2D Gaussian kernel (σ=0.5).
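The histogram step can be sketched in Python as follows. The sketch assumes angles in degrees within [−180, 180] and 19 bins per angle, uses hypothetical function names and toy residues, and leaves out the Gaussian smoothing.

```python
def torsion_histogram(angle_pairs, bins=19, low=-180.0, high=180.0):
    """Unnormalized 2D count histogram of (phi, psi) pairs given in degrees."""
    width = (high - low) / bins
    hist = [[0] * bins for _ in range(bins)]
    for phi, psi in angle_pairs:
        # The upper edge (exactly +180) is clamped into the last bin.
        row = min(int((phi - low) / width), bins - 1)
        col = min(int((psi - low) / width), bins - 1)
        hist[row][col] += 1
    return hist


def torsion_density_maps(residues, bins=19):
    """Group (amino_acid, phi, psi) triples by type; one histogram per type."""
    grouped = {}
    for amino_acid, phi, psi in residues:
        grouped.setdefault(amino_acid, []).append((phi, psi))
    return {aa: torsion_histogram(pairs, bins) for aa, pairs in grouped.items()}


# Hypothetical residues: two alanines in a helical region, one glycine.
residues = [("A", -60.0, -45.0), ("A", -62.0, -41.0), ("G", 180.0, 180.0)]
maps = torsion_density_maps(residues)
```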

2.2.9 Protein structure: density of amino acid distances

For each amino acid $A_i$, where $i=(1,\ldots,m)$, of a given protein, the distances to amino acid $A_j$, where $j=(1,\ldots,m)$, are calculated based on the coordinates of the Cα atoms for their residues. The distances are stored as an array ($d_{ij}$), the length of which varies across proteins and is thus not comparable. To standardize the arrays, the sample histogram of $d_{ij}$ is extracted using equally sized bins and smoothed, as above, by convolving the histogram with a 1D Gaussian kernel (σ=0.5). Processing all pairs of amino acids produces feature maps $X_D$ of dimension $[m\times m\times h_D]$, where $h_D=8$ is the number of histogram bins.

In this article, we apply the two RC protein representations as follows: (i) since the torsion angle densities build a map of size m×19×19, we treat the map as 19 separate images of size m×19; and (ii) since the density of amino acid distances builds a map of size m×m×8, we treat the map as eight separate images of size m×m. For each image, different descriptors are extracted and used to train separate SVMs, which are combined by sum rule.

2.3 Protein feature extraction

In this section we describe the different approaches used to extract descriptors from the protein representations introduced in Section 2.2. The descriptors extracted from the primary AAS representation are mostly based on substituting the literal of an amino acid with its value of a fixed physicochemical property. To make the result independent of the selected property, 25 properties are selected and used to train an ensemble of SVM classifiers.

2.3.1 Primary representation: amino acid composition (AS)

AS is one of the simplest methods for extracting features since it merely counts the fraction of a given amino acid as $AS(i)=h(i)/N$, $i\in[1,\ldots,20]$, where h(i) counts the number of occurrences of a given amino acid in a protein sequence of length N.
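A minimal Python sketch of this descriptor is given below; the alphabetical ordering of the 20 amino acids and the toy sequence are assumptions for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabetical one-letter ordering


def amino_acid_composition(sequence):
    """AS(i) = h(i) / N: fraction of each of the 20 amino acids in the sequence."""
    n = len(sequence)
    return [sequence.count(aa) / n for aa in AMINO_ACIDS]


as_vector = amino_acid_composition("AACD")  # 20 values that sum to 1
```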

2.3.2 Primary representation: quasi residue couple

First proposed by Nanni and Lumini (2006) and inspired by Chou’s quasi-sequence-order model and Yuan’s Markov chain model (Guo et al., 2005), quasi residue couple (QRC) is a method for extracting features from the primary sequence of a protein (Nanni et al., 2010). The original residue couple model was designed to represent information contained in both the AAC and the order of the amino acids in the protein sequences. The QRC descriptor is obtained by selecting a physicochemical property d and by combining its values with each non-zero entry in the residue couple. A parameter m is called the order of the residue couple model, and values m ≤ 3 are considered sufficient for representing a sequence.

The QRC model for a physicochemical property d can be represented as:
$$QRC_m^d(k)=\frac{1}{N-m}\sum_{n=1}^{N-m}\left[H_{i,j}(n,n+m,d)+H_{j,i}(n+m,n,d)\right], \tag{1}$$
where $i, j \in [1,\ldots,20]$ index the 20 different amino acids; $k=j+20(i-1)$; N is the length of the protein; the function index(i,d) returns the value of the property d for the amino acid i; and the function $H_{i,j}(a,b,d)=index(i,d)$ if $p_a=i$ and $p_b=j$, otherwise $H_{i,j}(a,b,d)=0$.

In the experimental section, the $QRC_d$ features are extracted for m in the range of 1–3 and concatenated into a 1200-dimensional vector. Moreover, 25 physicochemical properties are randomly selected to create an ensemble of QRC descriptors.
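The QRC computation can be sketched in Python under the reading that every residue couple (i at position n, j at position n+m) adds index(i,d) + index(j,d) to entry k = j + 20(i − 1). The function names, the toy property and the toy sequence are hypothetical; a real protein would be much longer than the largest order m.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabetical one-letter ordering


def qrc(sequence, prop, m):
    """400-dimensional QRC vector of order m for one physicochemical property."""
    n = len(sequence)
    idx = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
    vec = [0.0] * 400
    for pos in range(n - m):
        a, b = sequence[pos], sequence[pos + m]
        k = idx[a] * 20 + idx[b]  # zero-based form of k = j + 20(i - 1)
        # H_{i,j} contributes index(i, d); H_{j,i} contributes index(j, d).
        vec[k] += prop[a] + prop[b]
    return [v / (n - m) for v in vec]


def qrc_descriptor(sequence, prop, orders=(1, 2, 3)):
    """Concatenate the QRC vectors for m = 1, 2, 3 (3 x 400 = 1200 values)."""
    features = []
    for m in orders:
        features.extend(qrc(sequence, prop, m))
    return features


toy_property = {"A": 1.0, "C": 2.0}  # hypothetical property values
first_order = qrc("ACAC", toy_property, 1)
descriptor = qrc_descriptor("ACAC", toy_property)
```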

2.3.3 Primary representation: autocovariance approach

Autocovariance (AC) (Zeng et al., 2009) is a sequence-based variant of Chou’s PseAAC. AC extracts a set of PseAAC-based features from a given protein that is the concatenation of the 20 standard AAC values along with m values reflecting the effect of sequence order (m is a parameter indicating the maximum distance between two considered amino acids i, j, which is set to 20 in the tests reported in the experimental section). Given a protein $P=(p_1, p_2, \ldots, p_N)$, and fixing physicochemical property d, the AC descriptor ($AC_d\in\mathbb{R}^{20+m}$) can be defined as:
$$AC_d(i)=\begin{cases}h(i)/N, & i\in[1,\ldots,20]\\[6pt] \dfrac{\sum_{k=1}^{N-i+20}\left(index(p_k,d)-\mu_d\right)\cdot\left(index(p_{k+i-20},d)-\mu_d\right)}{\sigma_d\cdot(N-i+20)}, & i\in[21,\ldots,20+m]\end{cases} \tag{2}$$
where the function index(i,d) returns the value of the property d for the amino acid i, and the function h(i) counts the number of occurrences of a given amino acid in a protein sequence. Both $\mu_d$ and $\sigma_d$ are normalization factors denoting the mean and the variance of d on the 20 amino acids:
$$\mu_d=\frac{1}{20}\sum_{i=1}^{20}index(i,d),\qquad \sigma_d=\frac{1}{20}\sum_{i=1}^{20}\left(index(i,d)-\mu_d\right)^2 \tag{3}$$

In the experimental section, 25 random physicochemical properties are selected to create an ensemble of 25 AC descriptors.
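A minimal Python sketch of one AC descriptor follows. It assumes that the first 20 entries are the amino acid composition and that entry 20 + lag is the autocovariance at distance lag, normalized by the variance of the property over the 20 amino acids; the property values and the sequence are hypothetical, and m = 1 is used only to keep the toy example short (the paper uses m = 20).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabetical one-letter ordering


def autocovariance_descriptor(sequence, prop, m=20):
    """Return the 20 composition values followed by m autocovariance values."""
    n = len(sequence)
    composition = [sequence.count(aa) / n for aa in AMINO_ACIDS]
    # Mean and variance of the property over the 20 amino acids.
    mu = sum(prop[aa] for aa in AMINO_ACIDS) / 20
    sigma = sum((prop[aa] - mu) ** 2 for aa in AMINO_ACIDS) / 20
    centered = [prop[residue] - mu for residue in sequence]
    lags = []
    for lag in range(1, m + 1):
        total = sum(centered[k] * centered[k + lag] for k in range(n - lag))
        lags.append(total / (sigma * (n - lag)))
    return composition + lags


# Hypothetical property: amino acids valued 0, 1, ..., 19 in alphabetical order.
toy_property = {aa: float(k) for k, aa in enumerate(AMINO_ACIDS)}
ac = autocovariance_descriptor("AYAY", toy_property, m=1)
```

The sequence must be longer than m, and the property must not be constant over the 20 amino acids, otherwise the normalization divides by zero.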

2.3.4 Matrix-based descriptor: pseudo PSSM

Pseudo PSSM (PP) is a widely used matrix descriptor for proteins (Fan and Li, 2011; Jeong et al., 2011) that is normally applied to the PSSM matrix representation. This protein descriptor is designed to retain information about the amino acid sequence by considering the PseAAC.

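Given an input matrix $Mat\in\mathbb{R}^{N\times 20}$, the pseudo PSSM descriptor is a vector $PP\in\mathbb{R}^{320}$ that is defined as:
$$PP(k)=\begin{cases}\dfrac{1}{N}\sum_{i=1}^{N}E(i,j), & k=j,\ j=1,\ldots,20\\[6pt] \dfrac{1}{N-lag}\sum_{i=1}^{N-lag}\left[E(i,j)-E(i+lag,j)\right]^2, & k=20+j+20\cdot(lag-1),\ j=1,\ldots,20,\ lag=1,\ldots,15\end{cases} \tag{4}$$
where k is a linear index used to scan the cells of Mat, lag denotes the distance between one residue and its neighbors, N is the length of the sequence and $E\in\mathbb{R}^{N\times 20}$ is the normalized version of Mat and is defined as:
$$E(i,j)=\frac{Mat(i,j)-\frac{1}{20}\sum_{v=1}^{20}Mat(i,v)}{\sqrt{\frac{1}{20}\sum_{u=1}^{20}\left(Mat(i,u)-\frac{1}{20}\sum_{v=1}^{20}Mat(i,v)\right)^2}}, \tag{5}$$
where $i=(1,\ldots,N)$ and $j=(1,\ldots,20)$.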
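The pseudo PSSM descriptor can be sketched in Python as follows. The sketch assumes that E standardizes each row of Mat to zero mean and unit standard deviation, that the first 20 features are column means of E, and that the remaining 300 are mean squared differences at lags 1 to 15; function names and the toy matrix are hypothetical.

```python
from math import sqrt


def normalize_rows(mat):
    """Standardize every row of an N x 20 matrix (zero mean, unit SD)."""
    normalized = []
    for row in mat:
        mu = sum(row) / 20
        sd = sqrt(sum((x - mu) ** 2 for x in row) / 20)
        normalized.append([0.0 if sd == 0 else (x - mu) / sd for x in row])
    return normalized


def pseudo_pssm(mat, max_lag=15):
    """Return 20 column means followed by 20 values for each lag 1..max_lag."""
    n = len(mat)
    e = normalize_rows(mat)
    features = [sum(e[i][j] for i in range(n)) / n for j in range(20)]
    for lag in range(1, max_lag + 1):  # k = 20 + j + 20 * (lag - 1)
        for j in range(20):
            total = sum((e[i][j] - e[i + lag][j]) ** 2 for i in range(n - lag))
            features.append(total / (n - lag))
    return features


# Toy matrix with 16 identical rows: all lag features are then exactly zero.
toy_mat = [[float(j) for j in range(20)] for _ in range(16)]
pp = pseudo_pssm(toy_mat)
```

The input must have more than 15 rows so that every lag term averages over at least one pair of positions.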

2.3.5 Matrix-based descriptor: N-gram features (NGR)

NGR is typically extracted from the primary protein sequence (described above in Section 2.2). However, Sharma et al. (2013) extract this descriptor directly from the PSSM matrix by accumulating the probabilities of each of the N-grams according to the probability information contained in PSSM.

Given an input matrix $Mat\in\mathbb{R}^{N\times 20}$ representing a given protein, the frequency of occurrence of transitions from the ith amino acid to the jth amino acid is calculated for 2-grams (BGR) as $BGR(l)=\sum_{z=1}^{N-1}Mat(z,i)\times Mat(z+1,j)$, where $i=(1,\ldots,20)$; $j=(1,\ldots,20)$; and $l=(i-1)\cdot 20+j$.

The frequency of occurrence of transitions from the ith amino acid to the jth amino acid and then to the kth amino acid is calculated for 3-grams (TGR) as $TGR(l)=\sum_{z=1}^{N-2}Mat(z,i)\times Mat(z+1,j)\times Mat(z+2,k)$, where $i=(1,\ldots,20)$; $j=(1,\ldots,20)$; $k=(1,\ldots,20)$; and $l=(i-1)\cdot 400+(j-1)\cdot 20+k$.
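A minimal Python sketch of the 2-gram and 3-gram features is given below. For illustration, a one-hot encoding of a toy sequence stands in for the PSSM (an assumption that makes the counts easy to check), and amino acid indices run over the 20 columns of the matrix.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabetical one-letter ordering


def bigram_features(mat):
    """BGR(l) with zero-based l = i * 20 + j (400 values)."""
    n = len(mat)
    return [
        sum(mat[z][i] * mat[z + 1][j] for z in range(n - 1))
        for i in range(20)
        for j in range(20)
    ]


def trigram_features(mat):
    """TGR(l) with zero-based l = i * 400 + j * 20 + k (8000 values)."""
    n = len(mat)
    return [
        sum(mat[z][i] * mat[z + 1][j] * mat[z + 2][k] for z in range(n - 2))
        for i in range(20)
        for j in range(20)
        for k in range(20)
    ]


def one_hot(sequence):
    """Toy stand-in for a PSSM: 1 in the column of the observed residue."""
    return [[1.0 if aa == res else 0.0 for aa in AMINO_ACIDS] for res in sequence]


toy_mat = one_hot("ACA")
bgr = bigram_features(toy_mat)   # A->C and C->A each occur once
tgr = trigram_features(toy_mat)  # A->C->A occurs once
```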

2.3.6 Matrix-based descriptor: texture descriptors

In this work, we combine many different texture descriptors as follows: for each texture descriptor, a different SVM is trained, and sets of SVMs are then combined by sum rule. The following descriptors are tested in this article:

  • LBP (Ojala et al., 2002): uniform LBP with two (radius, number of neighbors P) configurations: (1, 8) and (2, 16).

  • WLD (Chen et al., 2010): Weber law descriptor code computed within a 3 × 3 block with the following parameter configurations: BETA = 5, ALPHA = 3, and number of neighbors = 8.

  • CLBP (Guo et al., 2010): completed LBP with two configurations (R, P): (1, 8) and (2, 16).

  • RIC (Nosaka and Fukui, 2014): multiscale rotation invariant co-occurrence of adjacent LBP with R ∈ {1, 2, 4}.

  • MORPH (Strandmark et al., 2012): a set of morphological features that includes such measures as the aspect ratio, number of objects, area, perimeter, eccentricity and other measures extracted from a segmented version of the image.

  • HASH (San Biagio et al., 2013): default values of the heterogeneous auto-similarities of characteristics features.

3 Results and discussion

In the present study, we evaluate our approach across three benchmark datasets: PDB1075, PDB594 and PDB186. These DNA-BPs were selected from the Protein Databank located at http://www.rcsb.org/pdb/home/home.do. All protein sequences containing less than 50 amino acids or the character “X” were removed, as were all sequences having more than 25% similarity with any other sequence. The PDB1075 dataset (Liu et al., 2014) contains 525 DNA-BPs and 550 DNA-non-BPs. The PDB594 dataset (Lou et al., 2014) contains 297 DNA-BPs and 297 DNA-non-BPs. The PDB186 dataset is designed as an independent testing dataset derived from Lou et al. (2014); it contains 93 DNA-BPs and 93 DNA-non-BPs.

In accordance with Chou’s procedure (2011), we perform the following testing protocols for a fair comparison with the literature:

  • Jack: jackknife test on the PDB1075 dataset.

  • IND1: where training is on the PDB1075 dataset and testing on the independent PDB186 dataset.

  • IND2: where training is on the PDB594 dataset and testing on independent PDB186 dataset.

All results are reported using two performance indicators: (i) classification accuracy and (ii) area under the ROC curve (AUC). Accuracy is computed as the ratio between the number of samples correctly classified and the total number of samples. The ROC curve is a graphical plot of the sensitivity of a binary classifier against the false positive rate (1 − specificity) as its discrimination threshold varies. AUC (Fawcett, 2004) is a scalar measure representing the probability that the classifier will assign a higher score to a randomly picked positive pattern than to a randomly picked negative pattern. Before each fusion of different methods, their scores are normalized to mean 0 and SD 1.
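Both indicators can be sketched in a few lines of Python. AUC is computed here through its rank-based reading, i.e. the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as one half; the labels, scores and the 0.45 threshold are hypothetical.

```python
def accuracy(labels, predictions):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = DNA-BP) and classifier scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
predictions = [1 if s > 0.45 else 0 for s in scores]

acc_value = accuracy(labels, predictions)
auc_value = auc(labels, scores)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for datasets of the size used here.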

The aim of the first set of experiments is to compare all the descriptors detailed in Section 2.3. In each cell of the following tables, there are two values: the first is the accuracy and the second is the AUC. In Table 1 the performance of representations based on the sequence model and their fusions are reported. The method named “AC + K × QRC” denotes the weighted sum rule between AC (weight 1) and QRC (weight K).

Table 1. Performance of representations that are sequence-based

| Sequence based | Measure | AC | QRC | AC + QRC | AC + 0.5 × QRC |
| --- | --- | --- | --- | --- | --- |
| IND1 | Accuracy | 79.03% | 75.81% | 81.18% | 79.57% |
| IND1 | AUC | 84.76% | 96.55% | 89.92% | 88.14% |
| Jack | Accuracy | 76.74% | 72.47% | 76.28% | 77.02% |
| Jack | AUC | 84.92% | 82.32% | 85.55% | 85.25% |
| IND2 | Accuracy | 65.59% | 62.90% | 66.13% | 65.59% |
| IND2 | AUC | 69.30% | 68.94% | 70.99% | 70.48% |

Bold values are highest performance.

In Table 2, we report the performance obtained using two different matrix representations (PSSM and SMR) coupled with NGR protein descriptors. The third column is the fusion by sum rule between the classifiers trained with PSSM and SMR.

Table 2. Performance obtained using the NGR representation

| NGR | Measure | PSSM | SMR | FUS_NG |
| --- | --- | --- | --- | --- |
| IND1 | Accuracy | 80.11% | 78.49% | 80.65% |
| IND1 | AUC | 87.21% | 93.16% | 91.28% |
| Jack | Accuracy | 73.21% | 74.33% | 73.21% |
| Jack | AUC | 81.75% | 81.85% | 81.75% |
| IND2 | Accuracy | 67.74% | 63.98% | 66.13% |
| IND2 | AUC | 70.49% | 69.14% | 71.46% |

Bold values are highest performance.

In Table 3, we report the performance obtained using a set of texture (TXT) descriptors (see Table 1) for describing different matrix protein representations, as well as the performance of PP as a matrix descriptor. The column named FUS is a fusion by sum rule among the different approaches PSSM, SMR, WAVE, DM and RC. The column labeled FUS_noPDB reports the performance of the fusion of methods not based on PDB, i.e. PSSM, SMR and WAVE.

Table 3. TXT and PP descriptors for describing the different matrix protein representations

| Descriptor | Protocol | Measure | PSSM | SMR | PR | WAVE | DM | RC | FUS_noPDB | FUS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TXT | IND1 | Accuracy | 83.87% | 80.65% | **86.02%** | 82.26% | 73.66% | 80.11% | 82.80% | 82.80% |
| TXT | IND1 | AUC | 96.25% | 93.50% | 94.10% | 93.20% | 83.19% | 88.35% | **96.51%** | 94.84% |
| TXT | Jack | Accuracy | 72.93% | **77.21%** | 69.30% | 67.44% | 76.19% | 72.28% | 75.26% | 76.28% |
| TXT | Jack | AUC | 80.54% | 85.44% | 78.05% | 77.25% | 84.01% | 80.56% | 83.83% | **86.19%** |
| TXT | IND2 | Accuracy | 66.67% | 64.52% | 61.29% | 61.29% | **67.74%** | 60.75% | 65.59% | **67.74%** |
| TXT | IND2 | AUC | 68.46% | 70.01% | 67.65% | 66.61% | **72.33%** | 65.94% | 69.94% | 71.17% |
| PP | IND1 | Accuracy | 82.26% | 82.26% | 86.02% | 87.10% | 77.42% | 79.03% | 82.26% | **84.41%** |
| PP | IND1 | AUC | 92.37% | 92.67% | 93.16% | 93.41% | 85.92% | 90.87% | 96.67% | **96.90%** |
| PP | Jack | Accuracy | 76.93% | 75.53% | 65.67% | 65.77% | 71.16% | 73.95% | 78.88% | **80.74%** |
| PP | Jack | AUC | 84.89% | 83.71% | 70.98% | 72.05% | 80.86% | 82.64% | 88.41% | **89.67%** |
| PP | IND2 | Accuracy | 69.89% | 66.67% | 55.91% | 55.38% | 55.91% | 60.22% | 71.51% | **69.89%** |
| PP | IND2 | AUC | 76.64% | 74.05% | 57.13% | 58.12% | 62.64% | 66.66% | **79.49%** | 77.80% |

Bold values are highest performance.

Table 3.

TXT and PP descriptors for describing the different matrix protein representations

TXTPSSMSMRPRWAVEDMRCFUS_noPDBFUS
IND1 83.87% 80.65% 86.0282.26% 73.66% 80.11% 82.80% 82.80% 
 96.25% 93.50% 94.10% 93.20% 83.19% 88.35% 96.5194.84% 
Jack 72.93% 77.2169.30% 67.44% 76.19% 72.28% 75.26% 76.28% 
 80.54% 85.44% 78.05% 77.25% 84.01% 80.56% 83.83% 86.19
IND2 66.67% 64.52% 61.29% 61.29% 67.7460.75% 65.59% 67.74
 68.46% 70.01% 67.65% 66.61% 72.3365.94% 69.94% 71.17% 

 
PP PSSM SMR PR WAVE DM RC FUS_noPDB FUS 

 
IND1 82.26% 82.26% 86.02% 87.10% 77.42% 79.03% 82.26% 84.41
 92.37% 92.67% 93.16% 93.41% 85.92% 90.87% 96.67% 96.90
Jack 76.93% 75.53% 65.67% 65.77% 71.16% 73.95% 78.88% 80.74
 84.89% 83.71% 70.98% 72.05% 80.86% 82.64% 88.41% 89.67
IND2 69.89% 66.67% 55.91% 55.38% 55.91% 60.22% 71.51% 69.89
 76.64% 74.05% 57.13% 58.12% 62.64% 66.66% 79.4977.80% 
TXTPSSMSMRPRWAVEDMRCFUS_noPDBFUS
IND1 83.87% 80.65% 86.0282.26% 73.66% 80.11% 82.80% 82.80% 
 96.25% 93.50% 94.10% 93.20% 83.19% 88.35% 96.5194.84% 
Jack 72.93% 77.2169.30% 67.44% 76.19% 72.28% 75.26% 76.28% 
 80.54% 85.44% 78.05% 77.25% 84.01% 80.56% 83.83% 86.19
IND2 66.67% 64.52% 61.29% 61.29% 67.7460.75% 65.59% 67.74
 68.46% 70.01% 67.65% 66.61% 72.3365.94% 69.94% 71.17% 

 
PP PSSM SMR PR WAVE DM RC FUS_noPDB FUS 

 
IND1 82.26% 82.26% 86.02% 87.10% 77.42% 79.03% 82.26% 84.41
 92.37% 92.67% 93.16% 93.41% 85.92% 90.87% 96.67% 96.90
Jack 76.93% 75.53% 65.67% 65.77% 71.16% 73.95% 78.88% 80.74
 84.89% 83.71% 70.98% 72.05% 80.86% 82.64% 88.41% 89.67
IND2 69.89% 66.67% 55.91% 55.38% 55.91% 60.22% 71.51% 69.89
 76.64% 74.05% 57.13% 58.12% 62.64% 66.66% 79.4977.80% 

Bold values are highest performance.

From the results reported in Table 3, the following conclusions can be drawn:

  • PP obtains the best performance.

  • NGR and the sequence-based representations are clearly worse than PP and TXT.

  • The fusion of different feature descriptors extracted from the same matrix representation is useful.

  • The representations related to the PDB protein format boost performance, although not remarkably.

  • The fusion clearly outperforms all the stand-alone methods.

Table 4 reports the performance obtained by combining PP(FUS) and TXT(FUS) with the weighted sum rule. Unfortunately, the fusion only weakly boosts the performance of the base approaches. We did not run a parameter selection to determine the weights, nor did we overfit them: we simply tested reasonable values. A weight of 0.5 means that the approach is weighted by half with respect to the other method; a weight of 0.25 means that it is weighted by a quarter. The performance of PP(FUS)+0.5×TXT(FUS) and PP(FUS)+0.25×TXT(FUS) is very similar.
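The weighted sum rule can be illustrated with a short Python sketch. It is not the code used for the experiments: the scores and labels below are hypothetical, the helper names `weighted_sum` and `accuracy` are ours, and we assume that both ensembles produce signed scores on a comparable scale with 0 as the decision threshold.

```python
def weighted_sum(pp_scores, txt_scores, w):
    """Return PP(FUS) + w * TXT(FUS) for every sample."""
    if len(pp_scores) != len(txt_scores):
        raise ValueError("both ensembles must score the same samples")
    return [p + w * t for p, t in zip(pp_scores, txt_scores)]


def accuracy(scores, labels, threshold=0.0):
    """Fraction of samples whose thresholded score matches the label."""
    predicted = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


# Hypothetical fused scores of the two ensembles for four proteins.
pp_fus = [1.2, -0.4, 0.3, -0.9]
txt_fus = [0.6, 1.0, -0.2, -0.5]
y_true = [1, 0, 1, 0]

# Try a few fixed, "reasonable" weights instead of optimizing them.
results = {
    w: accuracy(weighted_sum(pp_fus, txt_fus, w), y_true)
    for w in (1.0, 0.5, 0.25)
}
```

Fixing a handful of weights in advance, as above, avoids tuning a parameter on the test data; the toy numbers are only meant to show the mechanics and say nothing about which weight is best on the real datasets.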

Table 4.

Weighted sum rule between PP(FUS2) and TXT(FUS2)

Dataset  Measure    TXT(FUS)    PP(FUS)     PP(FUS) + TXT(FUS)   PP(FUS) + 0.50 × TXT(FUS)   PP(FUS) + 0.25 × TXT(FUS)
IND1     Accuracy   82.80%      84.41%      83.87%               **84.95%**                  **84.95%**
         AUC        94.84%      **96.90%**  96.39%               96.76%                      **96.90%**
Jack     Accuracy   76.28%      **80.74%**  78.70%               79.44%                      80.19%
         AUC        86.19%      89.67%      89.28%               89.81%                      **89.90%**
IND2     Accuracy   67.74%      **69.89%**  70.43%               69.35%                      **69.89%**
         AUC        71.17%      **77.80%**  76.41%               77.13%                      77.59%

Bold values are highest performance.

Finally, in Tables 5 and 6, we compare our approach with the literature. Our approach obtains state-of-the-art performance in IND1 and Jack; in IND2 it obtains the second best performance. In Table 6, the performance is obtained using the whole dataset for selecting the features and retaining the number of features that maximizes the performance (the results of that testing protocol are not comparable with ours). In addition, Zhang and Liu (2017) report the AUC for IND1 (88.03%) and Jack (87.12%); our approach outperforms both results.
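Because the comparison relies on both accuracy and AUC, it may help to recall how the AUC can be computed from classifier scores. The sketch below is a minimal Python illustration under our own assumptions (hypothetical scores and labels, and a function name `auc` chosen for this example); it uses the standard pairwise formulation, in which the AUC is the fraction of positive/negative pairs ranked correctly, with ties counted as one half.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical scores for two DNA-binding (1) and two non-binding (0) proteins.
example_scores = [0.9, 0.4, 0.6, 0.1]
example_labels = [1, 1, 0, 0]
example_auc = auc(example_scores, example_labels)
```

Unlike accuracy, this measure does not depend on a decision threshold, which is one reason the two measures can rank methods differently.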

Table 5.

Comparison using accuracy with the literature

Method                                IND1         Jack
PP(FUS_noPDB)                         82.26%       78.88%
Here: PP(FUS)+0.25×TXT(FUS)           **84.95%**   80.19%
iDNA-Prot|dis (Liu et al., 2014)      72.0%        72.0%
PseDNA-Pro (Liu et al., 2015)         n/a          76.55%
iDNAPro-PseAAC (Liu et al., 2015)     71.5%        76.56%
Kmer1 + ACC (Dong et al., 2015)       71.0%        75.23%
Local-DPP (Wei et al., 2017)          79.0%        79.20%
(Wang et al., 2017)                   76.3%        86.23%*
(Chowdhury et al., 2017)              80.6%        90.18%*
PSFM-DBT (Zhang and Liu, 2017)        80.65%       81.02%

Bold values are highest performance.

Table 6.

Comparison with the literature, results from Lou et al. (2014)

IND2                                    Accuracy    AUC
PP(FUS_noPDB)                           71.5%       79.5%
Here: PP(FUS)+0.25×TXT(FUS)             69.9%       77.6%
iDNA-Prot (Liu et al., 2014)            67.2%       n/a
DNA-Prot (Kumar et al., 2009)           61.8%       n/a
DNAbinder (Kumar et al., 2007)          60.8%       60.7%
DNABIND (Szilágyi and Skolnick, 2006)   67.7%       69.4%
DBD-Threader (Gao and Skolnick, 2009)   59.7%       n/a
(Lou et al., 2014)                      **76.9%**   79.1%

Bold values are highest performance.

We also report in Tables 5 and 6 our best ensemble without PDB-related protein descriptors: PP(FUS_noPDB). It obtains the highest AUC in IND2. It should also be noted that many of the reported state-of-the-art approaches are based on features extracted from the PSSM and/or other sequence-related descriptors. In contrast, our best approach needs the 3D structure of a protein, and this structure is not always available. Nonetheless, the number of proteins in the PDB repository is increasing year by year (see https://www.rcsb.org/stats/growth/).

4 Conclusion

The purpose of the present study was to experimentally evaluate the performance of many powerful protein representations and feature extraction methods to determine which descriptors and their combinations are most useful for DNA-BP identification. Experiments show that representations based on PDB clearly boost classification performance, with our best ensemble obtaining state-of-the-art performance across the benchmark datasets. We also show that the texture descriptors extracted from matrix-based representations perform similarly to PP for representing a protein.

In the future, we plan on experimentally evaluating more descriptors, including those that can be extracted from convolutional neural networks. To further improve the performance of our methods, we also plan on testing additional classification approaches (AdaBoost and Rotation Forest).

Conflict of Interest: none declared.

References

Altschul S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

Cao D.S., et al. (2013) Propy: a tool to generate various modes of Chou’s PseAAC. Bioinformatics, 29, 960–962.

Chen J., et al. (2010) WLD: a robust local image descriptor. IEEE Trans. Pattern Anal. Mach. Intell., 32, 1705–1720.

Chou K.-C. (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins Struct. Funct. Genet., 43, 246–255.

Chou K.-C. (2009) Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Curr. Proteom., 6, 262–274.

Chou K.-C. (2011) Some remarks on protein attribute prediction and pseudo amino acid composition (50th anniversary year review). J. Theor. Biol., 273, 236–247.

Chowdhury S.Y., et al. (2017) iDNAProt-ES: identification of DNA-binding proteins using evolutionary and structural features. Sci. Rep., 7, 1–14.

Ding S., et al. (2012) A novel protein structural classes prediction method based on predicted secondary structure. Biochimie, 94, 1166–1171.

Dong Q., et al. (2015) Identification of DNA-binding proteins by auto-cross covariance transformation. In: IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Washington DC, pp. 470–475.

Du P., et al. (2012) PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou’s pseudo-amino acid compositions. Anal. Biochem., 425, 117–119.

Du P., et al. (2014) PseAAC-general: fast building various modes of general form of Chou’s pseudo-amino acid composition for large-scale protein datasets. Int. J. Mol. Sci., 15, 3495–3506.

Fan G.-L., Li Q.-Z. (2011) Predicting protein submitochondrion locations by combining different descriptors into the general form of Chou’s pseudo amino acid composition. Amino Acids, 43, 545–555.

Fawcett T. (2004) ROC Graphs: Notes and Practical Considerations for Researchers. HP Laboratories, Palo Alto, CA.

Feng Z.P., Zhang C.T. (2000) Prediction of membrane protein types based on the hydrophobic index of amino acids. J. Protein Chem., 19, 269–275.

Gao M., Skolnick J. (2009) A threading-based method for the prediction of DNA-binding proteins with application to the human genome. PLoS Comput. Biol., 5, e1000567.

Gribskov M., et al. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. USA, 84, 4355–4358.

Guo J., et al. (2005) A novel method for protein subcellular localization: combining residue-couple model and SVM. In: Proceedings of 3rd Asia-Pacific Bioinformatics Conference, Singapore, pp. 117–129.

Guo Z., et al. (2010) A completed modeling of local binary pattern operator for texture classification. IEEE Trans. Image Process., 19, 1657–1663.

Jeong J.C., et al. (2011) On position-specific scoring matrix for protein function prediction. IEEE/ACM Trans. Comput. Biol. Bioinform., 8, 308–315.

Kavianpour H., Vasighi M. (2017) Structural classification of proteins using texture descriptors extracted from the cellular automata image. Amino Acids, 49, 261–271.

Kawashima S., Kanehisa M. (1999) AAindex: amino acid index database. Nucleic Acids Res., 27, 368.

Keys R. (1981) Cubic convolution interpolation for digital image processing. IEEE Trans. Acoust. Speech Signal Process., 29, 1153–1160.

Kumar K.K., et al. (2009) DNA-Prot: identification of DNA binding proteins from protein sequence information using random forest. J. Biomol. Struct. Dyn., 26, 679–686.

Kumar M., et al. (2007) Identification of DNA-binding proteins using support vector machines and evolutionary profiles. BMC Bioinform., 8, 463.

Li C., et al. (2018) Protein sequence comparison and DNA-binding protein identification with generalized PseAAC and graphical representation. Combinat. Chem. High Throughput Screen., 21, 100–110.

Li F.M., Li Q.Z. (2008) Predicting protein subcellular location using Chou’s pseudo amino acid composition and improved hybrid approach. Protein Pept. Lett., 15, 612–616.

Lin H., Ding H. (2011) Predicting ion channels and their types by the dipeptide mode of pseudo amino acid composition. J. Theor. Biol., 269, 64–69.

Lin H., et al. (2013) Using over-represented tetrapeptides to predict protein submitochondria locations. Acta Biotheor., 61, 259–268.

Liu B. (2017) BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Brief. Bioinform., bbx165.

Liu B., et al. (2017) Pse-Analysis: a python package for DNA/RNA and protein/peptide sequence analysis based on pseudo components and kernel methods. Oncotarget, 8, 13338–13343.

Liu B., et al. (2014) iDNA-Prot|dis: identifying DNA-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition. PLoS One, 9, e106691.

Liu B., et al. (2015) DNA binding protein identification by combining pseudo amino acid composition and profile-based protein representation. Sci. Rep., 5, 15479.

Liu B., et al. (2015) PseDNA-Pro: DNA-binding protein identification by combining Chou’s PseAAC and physicochemical distance transformation. Mol. Inform., 34, 8–17.

Liu B., et al. (2017) Pse-in-one 2.0: an improved package of web servers for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nat. Sci., 67–91.

Lou W., et al. (2014) Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naive Bayes. PLoS One, 9, e86703.

Nakashima H., et al. (1986) The folding type of a protein is relevant to the amino acid composition. J. Biochem., 99, 153–162.

Nanni L., Lumini A. (2006) An ensemble of K-local hyperplane for predicting protein-protein interactions. Bioinformatics, 22, 1207–1210.

Nanni L., Lumini A. (2008) Combing ontologies and dipeptide composition for predicting DNA-binding proteins. Amino Acids, 34, 635–641.

Nanni L., Lumini A. (2009) An ensemble of reduced alphabets with protein encoding based on grouped weight for predicting DNA-binding proteins. Amino Acids, 36, 167–175.

Nanni L., et al. (2010a) High performance set of PseAAC descriptors extracted from the amino acid sequence for protein classification. J. Theor. Biol., 266, 1–10.

Nanni L., et al. (2010b) Protein classification using texture descriptors extracted from the protein backbone image. J. Theor. Biol., 264, 1024–1032.

Nanni L., et al. (2012) Wavelet images and Chou’s pseudo amino acid composition for protein classification. Amino Acids, 43, 657–665.

Nanni L., et al. (2013) An empirical study on the matrix-based protein representations and their combination with sequence-based approaches. Amino Acids, 44, 887–901.

Nimrod G., et al. (2010) iDBPs: a web server for the identification of DNA binding proteins. Bioinformatics, 26, 692–693.

Nosaka R., Fukui K. (2014) HEp-2 cell classification using rotation invariant co-occurrence among local binary patterns. Pattern Recogn. Bioinform., 47, 2428–2436.

Ojala T., et al. (2002) Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell., 24, 971–987.

Qiu J.D., et al. (2009) Prediction of G-protein-coupled receptor classes based on the concept of Chou’s pseudo amino acid composition: an approach from discrete wavelet transform. Anal. Biochem., 390, 68–73.

Rahman M.S., et al. (2018) DPP-PseAAC: a DNA-binding protein prediction model using Chou’s general PseAAC. J. Theor. Biol., 452, 22–34.

San Biagio M., et al. (2013) Heterogeneous auto-similarities of characteristics (HASC): exploiting relational information for classification. In: IEEE Computer Vision (ICCV13), Sydney, Australia, pp. 809–816.

Sharma A., et al. (2013) A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition. J. Theor. Biol., 320, 41–46.

Shi S.P., et al. (2011) Identify submitochondria and subchloroplast locations with pseudo amino acid composition: approach from the strategy of discrete wavelet transform feature extraction. Biochim. Biophys. Acta, 1813, 424–430.

Song L., et al. (2014) nDNA-prot: identification of DNA-binding proteins based on unbalanced classification. BMC Bioinform., 15, 298.

Strandmark P., et al. (2012) HEp-2 staining pattern classification. In: International Conference on Pattern Recognition (ICPR2012).

Szilágyi A., Skolnick J. (2006) Efficient prediction of nucleic acid binding function from low-resolution protein structures. J. Mol. Biol., 358, 922–933.

Wang Y., et al. (2017) Improved detection of DNA-binding proteins via compression technology on PSSM information. PLoS One, 12, e0185587.

Waris M., et al. (2016) Identification of DNA binding proteins using evolutionary profiles position specific scoring matrix. Neurocomputing, 199, 154–162.

Wei L., et al. (2017) Local-dpp: an improved DNA-binding protein prediction method by exploring local evolutionary information. Inform. Sci., 384, 135–144.

Wen Z.N., et al. (2005) Analyzing functional similarity of protein sequences with discrete wavelet transform. Comput. Biol. Chem., 29, 220–228.

Xiong Y., et al. (2018) Survey of computational approaches for prediction of DNA-binding residues on protein surfaces. In: Huang T. (ed.) Computational Systems Biology: Methods in Molecular Biology. Humana Press, New York.

Xu R., et al. (2014) enDNA-Prot: identification of DNA-binding proteins by applying ensemble learning. BioMed Res. Int. B, 1–10.

Xu R., et al. (2015) Identification of DNA-binding proteins by incorporating evolutionary information into pseudo amino acid composition via the top-n-gram approach. J. Biomol. Struct. Dyn., 33, 1720–1730.

Yu X., et al. (2011) Predicting subcellular location of apoptosis proteins with pseudo amino acid composition: approach from amino acid substitution matrix and auto covariance transformation. Amino Acids, 1619–1625.

Zacharaki E.I. (2017) Prediction of protein function using a deep convolutional neural network ensemble. PeerJ Computer Science, 3, e123.

Zeng Y.H., et al. (2009) Using the augmented Chou’s pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach. J. Theor. Biol., 259, 366–372.

Zhang J., Liu B. (2017) PSFM-DBT: identifying DNA-binding proteins by combing position specific frequency matrix and distance-bigram transformation. Int. J. Mol. Sci., 25, E1856.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: John Hancock