Abstract

Motivation

Lysine acetylation exists extensively in prokaryotes, and plays a vital role in function adjustment. Recent progresses in the identification of prokaryote acetylation substrates and sites provide a great opportunity to explore the difference of substrate site specificity between prokaryotic and eukaryotic acetylation. Motif analysis suggests that prokaryotic and eukaryotic acetylation sites have distinct location-specific difference, and it is necessary to develop a prokaryote-specific acetylation sites prediction tool.

Results

Therefore, we collected nine species of prokaryote lysine acetylation data from various databases and literature, and developed a novel online tool named ProAcePred for predicting prokaryote lysine acetylation sites. Optimization of feature vectors via elastic net could considerably improve the prediction performance. Feature analyses demonstrated that evolutionary information played significant roles in prediction model for prokaryote acetylation. Comparison between our method and other tools suggested that our species-specific prediction outperformed other existing works. We expect that the ProAcePred could provide more instructive help for further experimental investigation of prokaryotes acetylation.

Availability and implementation

http://computbiol.ncu.edu.cn/ProAcePred

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Lysine (K) acetylation is an essentially reversible and highly regulated post-translational modification (PTM) that has been shown to occur in many protein targets, including core histones, approximately 40 transcription factors and over 30 other proteins (Vidali et al., 1968; Yang, 2004). Of the known PTM, lysine acetylation has the capacity to destabilize the chromatin polymer through charge neutralization of the basic lysine residues, potentially harboring structural consequences for higher-order chromatin structures (Taverna et al., 2007; Verreault et al., 1998). This scenario is not only crucial in the nucleus but also important for regulating different cytoplasmic processes, including cytoskeleton dynamics, energy metabolism, endocytosis and even signaling from the plasma membrane (Yang and Seto, 2008). Recently, increasingly more experimental observations have suggested that amino acid variations could indirectly or directly influence the original acetylation sites. For example, Iwasaki et al. (2011) investigated the mutations of eight lysine residues which were known to be targets for acetylation in Homo sapiens histones H3 and H4. They found some of these amino acid substitutions altered the local protein-DNA interactions and the interactions between amino acid residues within the nucleosome, which could affect chromatin structure and function and consequently cause related disease. Thus, knowledge of lysine acetylation sites can be very helpful in drug design for various related diseases.

A lot of experimental approaches, such as mass spectrometry-based experiments, radioactive chemical method and chromatin immunoprecipitation, have been used to identify lysine acetylation sites (Medzihradszky, 2005; Umlauf et al., 2004; Welsch and Nelsestuen, 1988). However, the conventional experimental identification of lysine acetylation substrates is usually expensive and time consuming. Therefore, computational prediction is desirable and necessary. By combing experimental methods with clustering analysis, Basu et al. (2009) constructed a tool named as PredMod to predict acetylation based on the sequence characteristics. Li et al. (2012) developed the acetylation set enrichment based computer program to predict lysine (K)-acetyl-transferase-families-specific acetylation site. On the basis of logistic regression classifiers and different biological features, Hou et al. (2014) proposed a predictor called LAceP for lysine acetylation prediction. Li et al. (2014) developed a method termed as SSPKA for species-specific lysine acetylation prediction by using random forest classifiers. The majority of the rest of these methods, such as LysAcet (Li et al., 2009), N-Ace (Lee et al., 2010), EnsemblePail (Xu et al., 2010), Phosida (Gnad et al., 2010), PLMLA (Shi et al., 2012), PSKAcePred (Suo et al., 2012), BRABSB (Shao et al., 2012) and KA-predictor (Wuyun et al., 2016), have been carried out with support vector machines (SVM) classifiers based on different feature extraction. Taken together, most of the predictors were not species-specific models or were only designed for eukaryote acetylation. SSPKA and KA-predictor were developed for eukaryote and prokaryote acetylation sites which included two prokaryote species Escherichia coli (E. coli) and Salmonella typhimurium (S. typhimurium), and their results indicated that species-specific prediction can effectively promote the prediction performance.

Recent progresses in the identification of prokaryote acetylation substrates and sites provide a great opportunity to explore the difference of substrate site specificity between prokaryotic and eukaryotic acetylation. We collected 7288 non-redundant prokaryotic acetylation sites across nine different species from the databases and relevant literature (Section 2.1), and 29 930 non-redundant eukaryotic acetylation sites from 6077 Homo sapiens (H. sapiens) proteins and 3644 Mus musculus (M. musculus) proteins extracted from the PLMD database (Xu et al., 2017). The corresponding data can be downloaded from our website. Based on motif-x (Chou and Schwartz, 2011), we got 33 and 13 significant sequence motifs of eukaryotic and prokaryotic, respectively (detailed motif figures shown in Supplementary Fig. S1). We observed that the eukaryotic acetylation preferred to AEK, GKL, KXGK and SKP motifs and prokaryotic acetylation preferred to EK, DK, KXR and KY motifs (‘X’ denotes any amino acid). Moreover, the arginine (R) was abundant in downstream for prokaryotes acetylation sequence, but K preferred to locate in the downstream of eukaryotic acetylation sequences and upstream sequences have also some difference between prokaryotes and eukaryotes. These observations suggest that prokaryotic and eukaryotic acetylation sites have distinct location-specific difference, and it is necessary to develop a prokaryote specific acetylation prediction tool.

In this study, we specially developed a novel tool named as ProAcePred to predict nine species of prokaryote lysine acetylation sites based on the combination of sequence-based, physicochemical property and evolutionary information features via elastic net. In the first-stage, we trained three type features in nine species. Then we incorporated all of the features, and used elastic net method to reduce the dimensionality of feature vectors. Finally, we used optimal feature vectors and through 10-fold cross-validation to evaluate the performance of the model based on SVM. The results of independent test set showed that our models were significantly enhanced in comparison with other existing models. Taken together, ProAcePred could be a powerful tool for the computational identification and characterization of prokaryote protein acetylation on a large scale, and the online service is freely available at http://computbiol.ncu.edu.cn/ProAcePred.

2 Materials and methods

2.1 Data collection and pre-processing

We collected eight species of lysine acetylation from bacteria in prokaryotes: E. coli, S. typhimurium, Bacillus subtilis (B. subtilis), Vibrio parahemolvticus (V. parahemolvticus), Mycobacterium tuberculosis (M. tuberculosis), Corynebacterium glutamicum (C. glutamicum), Erwinia amylovora (E. amylovora) and Geobacillus kaustophilus (G. kaustophilus), respectively. In addition, we also collected some lysine acetylation data from Archaea, including Thermos themophilus (T. themophilus) and other species for Archaea, which Archaea appears as a category of prokaryotes. Therefore, we finally collected nine categories in prokaryotes. These datasets were extracted from multiple public resources, including UniProtKB/Swiss-Prot (Boutet et al., 2007), NCBI (https://www.ncbi.nlm.nih.gov/), CPLM database (Liu et al., 2014a, b) as well as the relevant literature (Kosono et al., 2015; Lee et al., 2013; Mizuno et al., 2016; Okanishi et al., 2013; Pan et al., 2014; Wu et al., 2013; Xie et al., 2015). After removing other mistaken sequences with modification sites among those databases, we finally collected 5316 experimental acetylated protein (shown in Supplementary Table S1). It contained 8787 lysine acetylation sites and 87 585 lysine non-acetylation sites.

Because the structural studies have shown that lysine acetyltransferase (KATs) domains coupled with peptide substrates typically do not exceed 14–20 amino acids in length (Marmorstein, 2001), we first chose maximum number of 20 residues that were upstream and downstream of the acetylation or non-acetylation site in such a way that the whole length of the peptide became 41. When sequence fragment length was insufficient, we added the letter O (O is defined as a virtual amino acid). Subsequently, we clustered the protein sequences with a threshold of 30% identity using the CD-HIT (Li and Godzik, 2006) and then 7288 lysine acetylation fragments with 41 638 lysine non-acetylation fragments were extracted for further pre-processing. The positive samples extracted from acetyl-lysine fragments were divided into training datasets and independent datasets in nine species. We used the same type of residue (lysine), excluding known acetylation sites, as the negative set (non-acetylation sites). Afterwards, we randomly selected 10% of non-homologous lysine acetylation fragments with non-homologous lysine non-acetylation fragments as independent test dataset in nine species, respectively. It was used as a benchmark for evaluating our model as well as for comparing it with other published methods. Finally, due to the number of negative samples more than positive samples for training dataset and independent test dataset (shown in Supplementary Table S2), we randomly selected negative samples with the same as positive ones to form a balance of the positive and negative samples in training sets and test sets. The corresponding data can be downloaded from our website.

For all of the positive and negative sets in nine species, we optimized the final window sizes through training K nearest neighbors (KNN) features. In nine species, the final window sizes were 13 for Archaea, 17 for V. parahemolvticus and 21 for others species (detailed information shown in Supplementary Text S1).

2.2 Features extraction

In this work, amino acid compositions (AAC), amino acids binary encoding (BE), position weight amino acid compositions (PWAA) and K-space amino acid pair composition (K-space) were used to describe sequence-based features. Average accessible surface area (AASA) and encoding based on grouped weight (EBGW) were adopted to extract physicochemical property features. KNN was utilized to represent evolutionary information features.

2.2.1 Sequence-based feature

AAC and BE are widely used in various prediction systems (Chen et al., 2013; Liu et al., 2014a, b; Suo et al., 2012), which can efficiently describe the specific state of a given acetylated and non-acetylated peptide fragments. AAC and BE are considered on the basis of a single amino acid residue sequence information, therefore we further took the K-space to extract correlation information between amino acids. Meanwhile we adopted PWAA (Shi et al., 2012) to extract the sequence-order information of amino acid residues around acetyl-lysine sites. The detailed information of AAC, BE, K-space and PWAA are shown in Supplementary Text S2.

2.2.2 Physicochemical properties

AASA:  Pang et al. (2007) investigated the structural environment of 8378 incidences of 44 types of PTM and found a side-chain of amino acid that undergoes PTM prefers to be accessible on the surface of a protein. In the previous literature, AASA feature has been showing better predictive performance (Suo et al., 2012; Xu et al., 2015). Therefore, the solvent accessibility of amino acid residues surrounding the acetylation sites may be adapted to distinguish between the acetylation sites and non-acetylation sites. Here, we demonstrated the effectiveness of AASA as features for acetylation site prediction.

EBGW: Acetylation of lysine neutralizes the positive charge of this amino acid, and changes the protein function in diverse ways (Yang, 2004). Thus, in the following, we adopted an encoding scheme of the amino acid sequence considering the hydrophobicity and charged character of amino acid residues. The EBGW divides the 20 amino acid residues into four different classes on the basis of their hydrophobicity and charged character:

The hydrophobic group C1 = {A, F, G, I, L, M, P, V, W}.

The polar group C2 = {C, N, Q, S, T, Y}.

The positively charged group C3 = {K, H, R}.

The negatively charged group C4 = {D, E}.

So we can partition amino acid residues into the following disjoint groups: C1 + C2 versus C3 + C4, or C1 + C3 versus C2 + C4 and C1 + C4 versus C2 + C3. For a protein sequence P=p1,p2,,pL, it is transformed into three binary sequences
H1(pj)={1 if pjC1+C20 if pjC3+C4H2(Pj)={1 if pjC1+C30 if pjC2+C4H3(pj)={1 if pjC1+C40 if pjC2+C3
(1)
We divide each binary sequence into J sub-sequences increasing in length. The feature value of the jth sub-sequence is defined as:
Xj=sum(j)int(L·j/J), j=1, 2,, J
(2)

Where the function sum(j) gives the number of 1 in the jth sub-sequence, int(L·j/J) denotes the length of the jth sub-sequence, the int() rounds a number to the nearest integer and L is the length of the sliding window sequence. The 3*J-dimension vector X=[X1, X2, X3] is defined as the EBGW string of protein sequence. Here, we made J = 5.

2.2.3 Evolutionary-derived information

KNN: In order to take advantage of cluster information of local sequence fragments for predicting acetylation sites, we took the local sequence around acetylation sites in a query protein and extracted features from its similar sequences in both positive and negative datasets by a KNN score algorithm (Gao et al., 2010; Shen et al., 2006; Tan, 2006).

The KNN score is based on the local sequence similarity (Wang et al., 2016). In details, for two local sequence fragments S1=(S11,S12,,S1L and S2=(S21,S22,,S2(L)), the distance Dist(S1,S2) between S1 and S2 is defined as:
DistS1,S2=1-i=1LSimS1i,S2iL
(3)
Sima,b=Ma,b-min(M)maxM-min(M)
(4)

Where, L represents the length of a protein sequence fragment; the Sim, the amino acid similarity matrix, is derived from the BLOSUM62 substitution matrix (Henikoff and Henikoff, 1992); a and b denote two amino acids, M is the substitution matrix and max/min{M} represent the largest/smallest number in the matrix, respectively.

To take advantage of different properties of neighbors with various similarities, different K values can be used to obtain multiple scores. In this work, because of different amount of datasets for nine species, we chose different K value. K was chosen to be 2, 4, 8, 16 and 32 and the 5 KNN scores were extracted as features from Archaea. K was chosen to be 4, 8, 16, 32 and 64 and the 5 KNN scores were extracted as features from E. amylovora, S. typhimurium, G. kaustophilus and V. parahemolvticus. K was chosen to be 4, 8, 16, 32, 64, 128 and 256 and the 7 KNN scores were extracted as features from E.coli, C. glutamicum, M. tuberculosis and B. subtilis.

2.3 Features optimization via elastic net

A promising technique called the lasso was proposed by Tibshirani (1996). The lasso is a penalized least squares method imposing an L1-penalty on the regression coefficients. Owing to the nature of the L1-penalty, the lasso does both continuous shrinkage and automatic variable selection simultaneously. Although the lasso has shown success in many situations, it has some limitations. For example, if there is a group of variables among which the pairwise correlations are very high, then the lasso tends to select only one variable from the group and does not care which one is selected; For usual n > p situations, if there are high correlations between predictors, it has been empirically observed that the prediction performance of the lasso is dominated by ridge regression (where n is observation, p is predictors). Therefore, Zou and Hastie (2005) based on lasso theory, proposed new method elastic net. Similar to the lasso, the elastic net simultaneously does automatic variable selection and continuous shrinkage and it can select groups of correlated variables. Simulation studies and real data examples show that the elastic net often outperforms the lasso in terms of prediction accuracy.

Suppose that the dataset has n observations with p predictors. Let y=(y1, y2,,yn)T be the response and X=(X1,,XP) be the model matrix, where Xj=(x1j,,xnj)T, j=1,, p, are the predictors. After a location and scale transformation, we can assume that the response is centered and predictors are standardized,
i=1nyi=0, i=1nxij=0 and i=1nxij2=1, for j=1, 2,, p.
(5)
Given dataset (y, X) and (λ1,λ2), define an artificial dataset (y*,X*) by
X(n+p)p*=(1+λ2)12(Xλ2I),y(n+p)*=(y0).
(6)
Let β*=1+λ2β. The naive elastic net solves a lasso-type problem:
β^*=argminβ*(y*-X*β*)2+λ11+λ2|β*|1.
(7)
The elastic net (corrected) estimates β^ are defined by:
β^elastic net=1+λ2β^*.
(8)
Recall that β^naive elastic net=11+λ2β^*; thus
β^elastic net=1+λ2β^naive elastic net
(9)
Or
β^elastic net=argminββT(XTX+λ21+λ2)β-2yTXβ+s|β|1.
(10)
Where I is unit matrix, and
|β|2=j=1pβj2,|β|1=j=1p|βj|,(1λ2S+λ2)|β|1+λ2S+λ2|β|2t
for some t.

By the proportional relationship between β^ and β^*, we used (λ2,s) to parameterize the elastic net and further selected the optimal feature vectors based on optimization parameter.

3 Results and discussion

3.1 The features optimization results

Seven features (AAC, BE, K-space, PWAA, AASA, EBGW and KNN feature) were encoded independently to predict nine species acetylation sites, which the performances were not fully satisfactory. However, the combination of different features, compared with single feature, would provide more powerful prediction performances (Suo et al., 2012; Wang et al., 2016). If we combined all seven features, the feature vectors have bigger dimensions. The dimensions of each kind of feature vectors and combined all feature vectors in different species are listed in Supplementary Table S3. The highly dimensional features will lead to a time-consuming classification of the training model, and some of feature vectors may be unwanted noise and lead to bias about the prediction performance. Therefore, it is generally necessary to reduce dimensionality of incorporative feature vectors, ultimately, retain some important feature vectors in models.

Here, we used elastic net to optimize the dimensionality of feature vectors. For example, when generating the predictive model of Archaea, first, we defined parameter λ2=0.1 and applied elastic net to evaluate all of the feature vectors and received parameter s value about each of feature vectors, so all feature vectors were ranked according to their s values from small to large. If the s value is smaller, the feature vector is more important for the model. Second, we trained feature vectors with s value less than 0.1, 0.2, 0.3, 0.4 and 0.5 by using 10-fold cross-validation respectively, and then selected optimization s value according to maximum accuracy (Acc) value of λ2=0.1. The detailed model optimization and evaluation method is shown in Supplementary Text S3. Third, we used similar method to process λ2=0.2, 0.3, 0.4, 0.5. Finally, we selected optimized feature vectors according to maximum Acc value. Here, the optimized parameter λ2=0.1, s=0.3 for Arhaea (other species optimized parameter shown in Supplementary Table S4), and the dimension of feature vectors was reduced from 2552 to 283. The elastic net scores (i.e. s value) about optimization feature for nine species can be downloaded from our website. We compared the prediction performance of all incorporative features with optimized feature for nine species (Table 1). We observed that the performance of optimized features generated by elastic net were better than those of all incorporative feature vectors by 10-fold cross-validation based on SVM, and the dimensions of feature vectors were dramatically reduced. For example, the value of area under receiver operating characteristic (AUC) of combined all seven feature sets is 0.721 in Archaea, but the AUC of the model with optimized feature vectors improved 0.176. The AUC of E. amylovora and S. typhimurium were non-entity for combined all the feature vectors, but the AUC of the models with optimized feature vectors were 0.988 and 0.778, respectively. These results may be explained by that fusion of all feature vectors can extract more information on the protein, but it still contains some redundancy or unimportant feature vectors which may have a negative influence on the performance of model. However, the elastic net can not only keep the group effect but also overcome the double shrinkage by the quadratic penalty, and it can select groups of correlated variables from all feature vectors (Zou and Hastie, 2005). Therefore, the optimal feature subsets generated by the elastic net retain best group effect, which may result in a significant improvement in the predictive performance of the model. Furthermore, for different species, we found that their performance have different levels of improvement. It suggested that different species may have different sequence or structure properties (Section 3.2), and it is necessary to distinguish different species.

Table 1.

The training results of combined all eigenvector and optimal feature vectors in nine species

AccSnSpMCCAUCdimension
ArchaeaAll0.7430.7720.7130.4880.7212552
Optimal0.90.90.90.8030.897283
V. parahemolvticusAll0.7240.7360.7190.4560.6972640
Optimal0.8020.8080.7960.6050.799303
E. coliAll0.6590.6910.6270.3190.6582730
Optimal0.690.6910.6890.3810.687367
C. glutamicumAll0.7260.750.7030.4550.7342730
Optimal0.80.8070.7940.6020.807325
M. tuberculosisAll0.7470.7490.7440.4940.7442730
Optimal0.8340.830.8380.6710.83347
B. subtilisAll0.740.7460.7350.4810.742730
Optimal0.7960.7990.7920.5920.8376
E. amylovoraAll0.550.8440.256NoNNoN2728
Optimal0.9830.9780.9890.9670.988155
S. typhimuriumAll0.5120.6060.418NoNNoN2728
Optimal0.8160.8950.7370.640.778274
G. kaustophilusAll0.5670.6440.4890.1340.5772728
Optimal0.8970.9170.8780.8010.902294
AccSnSpMCCAUCdimension
ArchaeaAll0.7430.7720.7130.4880.7212552
Optimal0.90.90.90.8030.897283
V. parahemolvticusAll0.7240.7360.7190.4560.6972640
Optimal0.8020.8080.7960.6050.799303
E. coliAll0.6590.6910.6270.3190.6582730
Optimal0.690.6910.6890.3810.687367
C. glutamicumAll0.7260.750.7030.4550.7342730
Optimal0.80.8070.7940.6020.807325
M. tuberculosisAll0.7470.7490.7440.4940.7442730
Optimal0.8340.830.8380.6710.83347
B. subtilisAll0.740.7460.7350.4810.742730
Optimal0.7960.7990.7920.5920.8376
E. amylovoraAll0.550.8440.256NoNNoN2728
Optimal0.9830.9780.9890.9670.988155
S. typhimuriumAll0.5120.6060.418NoNNoN2728
Optimal0.8160.8950.7370.640.778274
G. kaustophilusAll0.5670.6440.4890.1340.5772728
Optimal0.8970.9170.8780.8010.902294

Note: ‘All’ represents combined all eigenvector; ‘Optimal’ represents optimized feature vectors generated by elastic net; ‘dimension’ represents feature vectors dimension; ‘NoN’ represents non-entity. (Sn, sensitivity; Sp, specificity; Acc, accuracy; MCC, Matthew correlation coefficient).

Table 1.

The training results of combined all eigenvector and optimal feature vectors in nine species

AccSnSpMCCAUCdimension
ArchaeaAll0.7430.7720.7130.4880.7212552
Optimal0.90.90.90.8030.897283
V. parahemolvticusAll0.7240.7360.7190.4560.6972640
Optimal0.8020.8080.7960.6050.799303
E. coliAll0.6590.6910.6270.3190.6582730
Optimal0.690.6910.6890.3810.687367
C. glutamicumAll0.7260.750.7030.4550.7342730
Optimal0.80.8070.7940.6020.807325
M. tuberculosisAll0.7470.7490.7440.4940.7442730
Optimal0.8340.830.8380.6710.83347
B. subtilisAll0.740.7460.7350.4810.742730
Optimal0.7960.7990.7920.5920.8376
E. amylovoraAll0.550.8440.256NoNNoN2728
Optimal0.9830.9780.9890.9670.988155
S. typhimuriumAll0.5120.6060.418NoNNoN2728
Optimal0.8160.8950.7370.640.778274
G. kaustophilusAll0.5670.6440.4890.1340.5772728
Optimal0.8970.9170.8780.8010.902294
AccSnSpMCCAUCdimension
ArchaeaAll0.7430.7720.7130.4880.7212552
Optimal0.90.90.90.8030.897283
V. parahemolvticusAll0.7240.7360.7190.4560.6972640
Optimal0.8020.8080.7960.6050.799303
E. coliAll0.6590.6910.6270.3190.6582730
Optimal0.690.6910.6890.3810.687367
C. glutamicumAll0.7260.750.7030.4550.7342730
Optimal0.80.8070.7940.6020.807325
M. tuberculosisAll0.7470.7490.7440.4940.7442730
Optimal0.8340.830.8380.6710.83347
B. subtilisAll0.740.7460.7350.4810.742730
Optimal0.7960.7990.7920.5920.8376
E. amylovoraAll0.550.8440.256NoNNoN2728
Optimal0.9830.9780.9890.9670.988155
S. typhimuriumAll0.5120.6060.418NoNNoN2728
Optimal0.8160.8950.7370.640.778274
G. kaustophilusAll0.5670.6440.4890.1340.5772728
Optimal0.8970.9170.8780.8010.902294

Note: ‘All’ represents combined all eigenvector; ‘Optimal’ represents optimized feature vectors generated by elastic net; ‘dimension’ represents feature vectors dimension; ‘NoN’ represents non-entity. (Sn, sensitivity; Sp, specificity; Acc, accuracy; MCC, Matthew correlation coefficient).

3.2 Analysis of feature importance and contribution

As mentioned above, the seven different feature sets (AAC, BE, K-space, PWAA, EBGW, KNN and AASA) were used to predict acetylation sites. In the beginning, we do not know these seven features make how much contribution, and which features are more important than others. Therefore, we analyzed the relative importance and contribution of these different features by elastic net, and then evaluated which dimensions feature vectors have more valuable for prediction model. The ratios of selected dimension of feature vectors belonging to the seven feature groups are shown in Figure 1 (detailed feature vectors dimension of seven features in optimization model shown in Supplementary Table S5). We observed that the ratio of KNN feature is significantly higher than other six features. Especially for the Archaea, V. parahemolvticus, C. glutamicum and B. subtilis, KNN ratios reach up to 1 and the lowest KNN ratio is also higher than other features. From the elastic net score values, we found that the s values of KNN feature are lower than those of other features in nine species. Thus, it suggested that KNN feature played significant role in determining acetylation sites. Because substrate sites of same KATs or KATs family usually share similar patterns in local sequence fragments, local sequence clusters often exist around acetylation site (Weinert et al., 2011). As mentioned above, KNN just reflects cluster information of local sequence fragments for acetylation sites. We further compared the KNN values of acetylation sites and non-acetylation sites in nine species (shown in Fig. 2). A score greater than 0.5 means the query site is more similar to the acetylation samples; a score smaller than 0.5 means it is more similar to the non-acetylation samples. The larger the KNN score is, the more similar the site is to some known acetylation sites. For acetylation fragments, the average KNN scores with different sizes of nearest neighbors are within 0.53 to 0.85. Therefore, the local sequences surrounding known acetylation sites are more similar to their nearest neighbors in positive set (excluding self-matches) on average as expected. This result confirmed that acetylation related clusters exist in acetylation fragments. For non-acetylation fragments, the KNN scores are within 0.38 to 0.53, which means that the sequences in negative set are more similar to nearest neighbors in negative datasets. With the increasing of the value of K, the gap of KNN scores between acetylation and non-acetylation sites were getting smaller and smaller. Additionally, we found that KNN scores of different species were subtle differences, especially KNN scores of Archaea have relatively large difference and KNN scores of S. typhimurium have less difference among nine species. It suggested that contribution of KNN feature was different to each of species. In short, the KNN feature can capture evolutionary similarity information in the local sequence around acetylation sites and hence distinguish them from the background. Therefore, the KNN features are suitable to be used for predicting prokaryote acetylation site.

Fig. 1.

The ratios of selected dimension of feature vectors belonging to the seven feature groups. Abscissa represents each of feature set

Fig. 2.

Comparison of KNN scores between acetylation sites and non-acetylation sites for nine species. The vertical axis represents the average KNN scores. The horizontal axis represents numbers of nearest neighbors

In contrast, as shown in Figure 1, the ratios of most of the K-space and AAC feature vectors are low, which suggested that most of dimension features were unwanted and these two features were not important for the model. In order to detect statistically significant differences about the compositional biases of sequences between acetylated and non-acetylated sites, we used Two Sample Logo tool (Vacic et al., 2006) to generate sequence logos for the lysine acetylation sequence in nine species based on the curated datasets. From Figure 3, we found that different species have different sequence information. For instance, for lysine acetylation sequence fragments, K, aspartic acid (D), arginine (R) and glutamic acid (E) enrich in upstream for E. coli and V. parahemolvticus, but leucine (L) enriches in downstream for E.coli, histidine (H) and tyrosine (Y) enrich in downstream for V. parahemolvticus; K enriches at positions –10, –8, –5, –3, –2 and 1 for E. amylovora, while K enriches at positions –4, –3, –1, 2 and 4 for S. typhimurium. However, the different species also have some common feature for acetylation sequence; we observed that K enriches in upstream for nine species. Indeed, Gu et al. (2017) found that the charge property of K in E. coli pyridoxine 5’-phosphate oxidase (EcPNPOx) may be crucial for its enzyme activity and lysine acetylation regulates the activity of EcPNPOx. This shows that the acetylation substrates which bind to KATs are much different from non-acetylation. In addition, we observed that the characteristics of the residues in the positions at –10∼–2, +1 and +6∼+10 have significant difference between acetylated and non-acetylated samples of E. coli, the content of K in upstream and R in downstream of acetylation sites are higher than those of non-acetylation sites and so on. These observations suggested that acetyl-lysine and non-acetyl-lysine have distinct location-specific differences, and the PWAA feature is helpful to predict acetylation sites in prokaryotes.

Fig. 3.

Sequence logo illustration generated by Two Sample Logo for acetylation sites sequence information of nine species (P-value < 0.05; t-test)

In addition, from Figure 1, we observed that each of feature groups have different contribution to different species. For example, the EBGW feature has better performance for eight species, but the ratio of EBGW was 0 for the S. typhimurium, suggesting that EBGW feature was low contribution to S. typhimurium; the AASA feature has better contribution to E. coli and V. parahemolvticus among nine species. The EBGW feature is mainly based on the hydrophobicity and charge character of amino acid residues. Based on the curated datasets, the number of sub-sequences was selected to be 5. And the EBGW values were represented by 15 dimension values (shown in Supplementary Fig. S2). The results showed that the ratios of EBGW values of H2 between acetylation sequence and non-acetylation ones were higher than 0 (except of archaea), implying more positive charge existed around acetylation sites than that around the non-acetylation ones. This is in agreement with the results of above feature analysis. A protein that has a surface that is composed mainly of negatively charged amino acids will bind to a protein with mainly positively charged molecules (Neuwirth et al., 2007). Supplementary Figure S2 also showed that the difference between acetylation and non-acetylation for G. kaustophilus (gray) and other eight species were bigger in H3, but for E. amylovora, the curves (wathet) showed bigger difference in H1, H2 and H3. The AASA formed from nine species acetylation sites in the constructed datasets (except of center position) is shown in Supplementary Figure S3. Most of the acetylation or non-acetylation sites (0 position) were located in the highly AASA value. The average AASA of neighborhood residues are 35 to 55 for all species acetylation sites. The fluctuant range of AASA of residues surrounding acetylation sites was bigger than that of non-acetylation sites. For example, the AASA of the E. coli that surrounds the acetylation sites exceeded that around non-acetylation sites, especially in the –10∼–4 positions. This implied that the acetylation processing might have occurred where the structural surroundings are relatively large variation range, which was consistent with previous knowledge on lysine acetylation (Pang et al., 2007).

From above results, we discovered that the AASA, BE, EBGW, PWAA and KNN features have large proportion in optimal feature vectors and the AAC and K-space have relatively few ratios.

3.3 Prokaryote acetylation sites prediction of ProAcePred

Based on optimal feature vectors generated by elastic net, nine models for prokaryotes lysine acetylation prediction were constructed respectively. To evaluate the performance of ProAcePred, we performed a 10-fold cross-validation for each species. We plotted receiver operating characteristic (ROC) curves of training set and calculated the AUC values as shown in Figure 4 (detailed training results shown in Supplementary Table S6 and ROC curves of independent test set shown in Supplementary Fig. S4). For training datasets of nine species, seven models of C. glutamicum, M. tuberculosis, B. subtilis, E. amylovora, S. typhimurium, G. kaustophilus and Archaea achieved AUC values of 0.807, 0.83, 0.8, 0.988, 0.882, 0.902 and 0.897, respectively. The AUC values of E. coli and V. parahemolvtius were below in 0.8. Based on the independent datasets, AUC values of ProAcePred were 0.855, 0.855, 0877, 0.861, 0.863, 0.942, 0.896, 0.778 and 0.854 in Archaea, V. parahemolvticus, E. coli, C. glutamicum, M. tuberculosis, B. subtilis, E. amylovora, S. typhimurium and G. kaustophilus models, respectively. The results of most independent test were better than those of training set, which justifies that our predictor ProAcePred has stabilization and good prediction performance.

Fig. 4.

The ROC curves and AUC values of the training datasets for nine species

3.4 Comparison with other prediction tools

To further evaluate the performance of ProAcePred, we compared it with some widely existing lysine acetylation prediction tools. Although there have been reported many tools for predicting protein acetylation sites since 2009, and considering the web server tools whether do work or not, four general tools Ensemblepail (Xu et al., 2010), Phosida (Gnad et al., 2010), PLMLA (Shi et al., 2012), LAceP (Hou et al., 2014) and one species-specific predictor SSPKA (Li et al., 2014) were selected to compare lysine acetylation models, respectively. The PLMLA, Ensemblepail and LAceP incorporated prokaryotic and eukaryotic datasets into their training datasets and did not optimize the dimensionality of feature vectors. The tool Phosida was specifically designed for predicting H. sapiens acetylation. Therefore, we used these four tools to predict our nine species data, in order to investigate the difference of substrate specificity between prokaryote and eukaryote acetylation sites. The SSPKA divided the datasets into six types, H. sapiens, M. musculus, Saccharomyces cerevisiae (S. cerevisiae), Rattu norvegicus, E. coli and S. typhimurium, respectively. Thus, we used this tool to predict E. coli and S. typhimurium data. To avoid any bias, we submitted independent test data of each species to PLMLA, Ensemblepail, LAceP and Phosida tools and submitted independent test data of S. typhimurium and E. coli to SSPKA. The detailed comparison results are listed in Supplementary Table S7. From Figure 5 and Supplementary Table S7, the AUC and MCC values of ProAcePred were significantly higher than those of LAceP, PLMLA and Ensemblepail which were not species-specific predictors. For example, the AUC values of LAceP, PLMLA and Ensemblepail were 0.591, 0.677 and 0.539 for Archaea, respectively, and the AUC value of our model reached to 0.855; for B. subtilis, the AUC values of LAceP, PLMLA, Ensemblepail and our model were 0.509, 0.608, 0.522 and 0.942, respectively, and so on. This indicated that distinguishing different species could improve prediction performance. For S. typhimurium, the AUC values of LAceP, PLMLA, Ensemblepail, SSPKA and our model were 0.647, 0.55, 0.665, 0.893 and 0.778, respectively, and the prediction performance of SSPKA was higher than other tools. Because the SSPKA distinguished different species and used minimum-redundancy maximum-relevance approach to optimize the dimensionality of feature vectors, which once again showed that it was important to distinguish different species and optimize feature vectors. For S. typhimurium, although the prediction performance of our model has a bit lower than SSPKA, our method reached a high accuracy for other species. Meanwhile, from Figure 5, we also found that the prediction performance of Phosida in nine species were lower than our tool, which clearly showed the necessity of developing prokaryote-specific acetylation sites prediction tool.

Fig. 5.

Comparison of AUC values between our method and other tools

4 Conclusion

In this study, we proposed a new method named as ProAcePred to predict prokaryote-specific lysine acetylation. Especially, we applied a kind of mathematical method elastic net to optimize the dimensionality of feature vectors which significantly improved prediction performance. We performed an in-depth analysis of feature importance and contribution for different species, and found that KNN features played significant roles in predicting prokaryote acetylation sites. The optimization feature vectors were different in different species models, indicating that distinguishing different species was important for acetylation site prediction in order to improve prediction quality. Our prediction model achieved a promising performance and outperformed other prediction tools. Finally, we have developed an online service system of ProAcePred by combining software MATLAB R2010a with Eclipse (Java EE) software, which is freely available to academic researchers. We expect that the ProAcePred could provide more instructive help for further experimental investigation of prokaryote acetylation. Furthermore, recent studies (Li et al., 2015a, b; Wang et al., 2015a, b) showed that protein–protein interaction (PPI) and network-based features have also been proved useful in predicting protein phosphorylation. Therefore, with more experimental annotation data available, we will try to introduce PPI functional information and network-based features to further improve prokaryote acetylation prediction in future work.

Funding

This work was supported by the National Natural Science Foundation of China (21665016 and 21305062), and the Natural Science Foundation of Jiangxi Province (20151BAB203022).

Conflict of Interest: none declared.

References

Basu
 
A.
 et al.  (
2009
)
Proteome-wide prediction of acetylation substrates
.
Proc. Natl. Acad. Sci. USA
,
106
,
13785
13790
.

Boutet
 
E.
 et al.  (
2007
)
UniProtKB/Swiss-Prot: the manually annotated section of the UniProt KnowledgeBase
.
Methods Mol. Biol.
,
406
,
89
112
.

Chen
 
X.
 et al.  (
2013
)
Incorporating key position and amino acid residue features to identify general and species-specific Ubiquitin conjugation sites
.
Bioinformatics
,
29
,
1614
1622
.

Chou
 
M.F.
,
Schwartz
D.
(
2011
)
Biological sequence motif discovery using motif-x
.
Curr. Protoc. Bioinform
.,
35
, Unit 13.
15
24
.

Gao
 
J.
 et al.  (
2010
)
Musite, a tool for global prediction of general and kinase-specific phosphorylation sites
.
Mol. Cell. Proteomics
,
9
,
2586
2600
.

Gnad
 
F.
 et al.  (
2010
)
Predicting posttranslational lysine acetylation using support vector machines
.
Bioinformatics
,
26
,
1666
1668
.

Gu
 
J.
 et al.  (
2017
)
Lysine acetylation regulates the activity of Escherichia coli pyridoxine 5′-phosphate oxidase
.
Acta Biochim. Biophys. Sin
,
49
,
1044
1192
.

Henikoff
 
S.
,
Henikoff
J.G.
(
1992
)
Amino acid substitution matrices from protein blocks
.
Proc. Natl. Acad. Sci. USA
,
89
,
10915
10919
.

Hou
 
T.
 et al.  (
2014
)
LAceP: lysine acetylation site prediction using logistic regression classifiers
.
PLoS One
,
9
,
e89575
.

Iwasaki
 
W.
 et al.  (
2011
)
Comprehensive structural analysis of mutant nucleosomes containing lysine to glutamine (KQ) substitutions in the H3 and H4 histone-fold domains
.
Biochemistry-US
,
50
,
7822
7832
.

Kosono
 
S.
 et al.  (
2015
)
Changes in the acetylome and succinylome of Bacillus subtilis in response to carbon source
.
PLoS One
,
10
,
e0131169
.

Lee
 
D.W.
 et al.  (
2013
)
Proteomic analysis of acetylation in thermophilic Geobacillus kaustophilus
.
Proteomics
,
13
,
2278
2282
.

Lee
 
T.Y.
 et al.  (
2010
)
N‐Ace: using solvent accessibility and physicochemical properties to identify protein N-acetylation sites
.
J. Comput. Chem
.,
31
,
2759
2771
.

Li
 
H.
 et al.  (
2015a
)
Prediction of kinase-substrate relations based on heterogeneous networks
.
J. Bioinf. Comput. Biol
.,
13
,
1542003.

Li
 
Y.
 et al.  (
2015b
)
Accurate in silico identification of species-specific acetylation sites by integrating protein sequence-derived and functional features
.
Sci. Rep-UK
,
4
,
5765
.

Li
 
S.
 et al.  (
2009
)
Improved prediction of lysine acetylation by support vector machines
.
Protein Peptide Lett
.,
16
,
977
983
.

Li
 
T.
 et al.  (
2012
)
Characterization and prediction of lysine (k)-acetyl-transferase specific acetylation sites
.
Mol. Cell. Proteomics
,
11
,
M111.011080
.

Li
 
W.
,
Godzik
A.
(
2006
)
Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
.
Bioinformatics
,
22
,
1658
1659
.

Liu
 
B.
 et al.  (
2014a
)
Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection
.
Bioinformatics
,
30
,
472
479
.

Liu
 
Z.
 et al.  (
2014b
)
CPLM: a database of protein lysine modifications
.
Nucleic Acids Res
.,
42
,
D531
D536
.

Marmorstein
 
R.
(
2001
)
Structure and function of histone acetyltransferases
.
Cell. Mol. Life Sci
.,
58
,
693
703
.

Medzihradszky
 
K.F.
(
2005
)
Peptide sequence analysis
.
Method. Enzymol
.,
402
,
209
244
.

Mizuno
 
Y.
 et al.  (
2016
)
Altered acetylation and succinylation profiles in Corynebacterium glutamicum in response to conditions inducing glutamate overproduction
.
Microbiol. Open
,
5
,
152
173
.

Neuwirth
 
M.
 et al.  (
2007
)
Thermodynamic characterization of the protein-protein interaction in the heteromeric Bacillus subtilis pyridoxalphosphate synthase
.
Biochemistry
,
46
,
5131
5139
.

Okanishi
 
H.
 et al.  (
2013
)
Acetylome with structural mapping reveals the significance of lysine acetylation in Thermus thermophilus
.
J. Proteome Res
.,
12
,
3952
3968
.

Pan
 
J.
 et al.  (
2014
)
Systematic analysis of the lysine acetylome in Vibrio parahemolyticus
.
J. Proteome Res
.,
13
,
3294
3302
.

Pang
 
C.N.I.
 et al.  (
2007
)
Surface accessibility of protein post-translational modifications
.
J. Proteome Res
.,
6
,
1833
1845
.

Shao
 
J.
 et al.  (
2012
)
Systematic analysis of human lysine acetylation proteins and accurate prediction of human lysine acetylation through bi-relative adapted binomial score Bayes feature representation
.
Mol. Biosyst
.,
8
,
2964
2973
.

Shen
 
H.B.
 et al.  (
2006
)
Fuzzy KNN for predicting membrane protein types from pseudo-amino acid composition
.
J. Theor. Biol
.,
240
,
9
13
.

Shi
 
S.P.
 et al.  (
2012
)
PLMLA: prediction of lysine methylation and lysine acetylation by combining multiple features
.
Mol. Biosyst
.,
8
,
1520
1527
.

Suo
 
S.B.
 et al.  (
2012
)
Position-specific analysis and prediction for protein lysine acetylation based on multiple features
.
PLoS One
,
7
,
e49108.

Tan
 
S.
(
2006
)
An effective refinement strategy for KNN text classifier
.
Expert Syst. Appl
.,
30
,
290
298
.

Taverna
 
S.D.
 et al.  (
2007
)
How chromatin-binding modules interpret histone modifications: lessons from professional pocket pickers
.
Nat. Struct. Mol. Boil
.,
14
,
1025
1040
.

Tibshirani
 
R.
(
1996
)
Regression shrinkage and selection via the lasso
.
J. Roy. Stat. Soc. B
,
267
288
.

Umlauf
 
D.
 et al.  (
2004
)
Site-specific analysis of histone methylation and acetylation
.
Methods Mol. Biol.
,
287
,
99
120
.

Vacic
 
V.
 et al.  (
2006
)
Two Sample Logo: a graphical representation of the differences between two sets of sequence alignments
.
Bioinformatics
,
22
,
1536
1537
.

Verreault
 
A.
 et al.  (
1998
)
Nucleosomal DNA regulates the core-histone-binding subunit of the human Hat1 acetyltransferase
.
Curr. Biol
.,
8
,
96
108
.

Vidali
 
G.
 et al.  (
1968
)
Chemical studies of histone acetylation the distribution of ε-N-acetyllysine in calf thymus histones
.
J. Biol. Chem
.,
243
,
6361
6366
.

Wang
 
B.
 et al.  (
2015a
)
A novel network-based computational method to predict protein phosphorylation on tyrosine sites
.
J. Bioinf. Comput. Biol
.,
13
,
1542005
.

Wang
 
M.
 et al.  (
2015b
)
A novel method for predicting post-translational modifications on serine and threonine sites by using site-modification network profiles
.
Mol. Biosyst
.,
11
,
3092
3100
.

Wang
 
L.N.
 et al.  (
2016
)
Computational prediction of species-specific malonylation sites via enhanced characteristic strategy
.
Bioinformatics
,
33
,
1457
1463
.

Weinert
 
B.T.
 et al.  (
2011
)
Proteome-wide mapping of the Drosophila acetylome demonstrates a high degree of conservation of lysine acetylation
.
Sci. Signal
,
4
,
ra48
ra48
.

Welsch
 
D.J.
,
Nelsestuen
G.L.
(
1988
)
Amino-terminal alanine functions in a calcium-specific process essential for membrane binding by prothrombin fragment 1
.
Biochemistry-US
,
27
,
4939
4945
.

Wu
 
X.
 et al.  (
2013
)
Differential lysine acetylation profiles of Erwinia amylovora strains revealed by proteomics
.
J. Proteomics
,
79
,
60
71
.

Wuyun
 
Q.
 et al.  (
2016
)
Improved species-specific lysine acetylation site prediction based on a large variety of features set
.
PLoS One
,
11
,
e0155370.

Xie
 
L.
 et al.  (
2015
)
Proteome-wide lysine acetylation profiling of the human pathogen Mycobacterium tuberculosis
.
Int. J. Biochem. Cell B
,
59
,
193
202
.

Xu
 
H.D.
 et al.  (
2015
)
Systematic analysis of the genetic variability that impacts SUMO conjugation and their involvement in human diseases
.
Sci. Rep. UK
,
5
,
10900
.

Xu
 
H.D.
 et al.  (
2017
)
PLMD: an updated data resource of protein lysine modifications
.
J. Genet. Genomics
,
44
,
243
250
.

Xu
 
Y.
 et al.  (
2010
)
Lysine acetylation sites prediction using an ensemble of support vector machine classifiers
.
J. Theor. Boil
.,
264
,
130
135
.

Yang
 
X.J.
(
2004
)
Lysine acetylation and the bromodomain: a new partnership for signaling
.
Bioessays
,
26
,
1076
1087
.

Yang
 
X.J.
,
Seto
E.
(
2008
)
Lysine acetylation: codified crosstalk with other posttranslational modifications
.
Mol. Cell
,
31
,
449
461
.

Zou
 
H.
,
Hastie
T.
(
2005
)
Regularization and variable selection via the elastic-net
.
J. Roy. Stat. Soc. B
,
67
,
301
320
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: John Hancock
John Hancock
Associate Editor
Search for other works by this author on:

Supplementary data