Yongchun Zuo, Yuan Li, Yingli Chen, Guangpeng Li, Zhenhe Yan, Lei Yang, PseKRAAC: a flexible web server for generating pseudo K-tuple reduced amino acids composition, Bioinformatics, Volume 33, Issue 1, 1 January 2017, Pages 122–124, https://doi.org/10.1093/bioinformatics/btw564
Abstract
Reduced amino acid alphabets are a powerful means of both simplifying protein complexity and identifying functionally conserved regions. However, different protein problems may call for different clustering methods. Encouraged by the success of the pseudo-amino acid composition algorithm, we developed a freely available web server called PseKRAAC (pseudo K-tuple reduced amino acids composition). By implementing reduced amino acid alphabets, protein complexity can be significantly simplified, which decreases the chance of overfitting, lowers the computational burden and reduces information redundancy. PseKRAAC delivers more capability for protein research by incorporating three crucial parameters that describe protein composition. Users can easily generate many different modes of PseKRAAC tailored to their needs by selecting various reduced amino acid alphabets and other characteristic parameters. It is anticipated that the PseKRAAC web server will become a very useful tool in computational proteomics and protein sequence analysis.
Freely available on the web at http://bigdata.imu.edu.cn/psekraac
Supplementary data are available at Bioinformatics online.
1 Introduction
With the emergence of big data in the post-genomic age, an enormous amount of data has been generated, which requires efficient computational methods for rapid and effective identification of the biological features contained in sequences (Chen et al., 2015). This is even more true in proteomics, because protein sequences are more complex than nucleotide sequences: each position can hold one of 20 amino acids rather than one of 4 nucleic bases. The complexity and information content therefore expand exponentially as polypeptides are formed. In addition, protein sequences vary widely in length, which makes it harder to incorporate sequence-order information consistently in both dataset construction and algorithm formulation.
To overcome these obstacles, the pseudo-amino acid composition (PseAAC) algorithm was proposed in 2001 (Chou, 2001). This concept has been widely used in nearly all areas of computational proteomics, and was selected as one of the key topics in ‘Molecular Science for Drug Development and Biomedicine’ (Zhong and Zhou, 2014). Encouraged by the success of this idea, various approaches similar to PseAAC have been developed to deal with problems in protein and protein-related systems, including several powerful software programs for generating different modes of PseAAC: PseAAC-Builder (Du et al., 2012), propy (Cao et al., 2013), PseAAC-General (Du et al., 2014) and Pse-in-One (Liu et al., 2015a,b,c).
Extremely high-dimensional feature vectors can cause over-fitting or the high-dimension disaster (Wang et al., 2016), impose a heavy computational burden and increase information redundancy, all of which degrade prediction accuracy. To solve this problem, we present a convenient approach based on the idea of pseudo reduced amino acid composition (PseRAAC), and provide a flexible and user-friendly web server for pseudo K-tuple reduced amino acids composition (PseKRAAC) (http://bigdata.imu.edu.cn/psekraac), where users can easily generate many different modes of PseKRAAC tailored to their needs by selecting various reduced amino acid alphabets and other characteristic parameters.
2 Reduced amino acids alphabets
Based on physicochemical features or evolutionary relationships, amino acid residues can be clustered into groups because they serve similar structural or functional roles in proteins (Wang and Wang, 1999). Reduced amino acid alphabets not only simplify the complexity of the protein system, but also improve the ability to find structurally conserved regions and the structural similarity of entire proteins (Peterson et al., 2009). In recent years, alphabet reduction techniques have shown high potential for enhancing the power of protein sequence analysis (Supplementary Data; Feng et al., 2013; Liu et al., 2014, 2015a,b,c). Therefore, it is reasonable to use reduced amino acid alphabets to formulate PseKRAAC for protein sequences. Here, 16 types of reduced amino acid alphabets were proposed to generate various modes of PseRAAC (Table 1; Liu et al., 2015a,b,c).
Table 1. List of 16 types of reduced amino acid alphabets in protein
| Type | Method description | Clusters | Dimension |
|---|---|---|---|
| 1 | RedPSSM | 2–19 | RAAC^K |
| 2 | BLOSUM 62 matrix | 2–6,8,15 | RAAC^K |
| 3A | PAM matrix | 2–19 | RAAC^K |
| 3B | WAG matrix | 2–19 | RAAC^K |
| 4 | Protein Blocks | 5,8,9,11,13 | RAAC^K |
| 5 | BLOSUM50 matrix | 3,4,8,10,15 | RAAC^K |
| 6 | Multiple cluster | 4,5a,5b,5c | RAAC^K |
| 7 | Metric multi-dimensional scaling (MMDS) | 2–19 | RAAC^K |
| 8 | Grantham Distance Matrix (Saturation) | 2–19 | RAAC^K |
| 9 | Grantham Distance Matrix (Grantham) | 2–19 | RAAC^K |
| 10 | BLOSUM matrix for SWISS-PROT | 2–19 | RAAC^K |
| 11 | BLOSUM matrix for SWISS-PROT | 2–19 | RAAC^K |
| 12 | BLOSUM matrix for DAPS | 2–18 | RAAC^K |
| 13 | Coarse-graining substitution matrices | 4,12,17 | RAAC^K |
| 14 | Alphabet Simplifer | 2–19 | RAAC^K |
| 15 | MJ matrix | 2–16 | RAAC^K |
| 16 | BLOSUM50 matrix | 2–16 | RAAC^K |
RAAC^K: K-tuple of reduced amino acid cluster (RAAC). For example, for Type 1 with Cluster = 10 (RAAC = 10) and K-tuple = 2 (K = 2), Dimension = RAAC^K = 10^2 = 100.
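To make the dimension rule concrete, the following minimal Python sketch rewrites a sequence in a reduced alphabet and enumerates the K-tuple features. The five-cluster grouping, the function names and the example sequence are hypothetical choices for illustration only; the grouping is not one of the 16 alphabet types offered by the server.

```python
from itertools import product

# Hypothetical 5-cluster grouping, for illustration only.
# It is NOT one of the 16 alphabet types listed in the table above.
EXAMPLE_CLUSTERS = ["AVLIMC", "FWYH", "STNQ", "KRDE", "GP"]


def build_mapping(clusters):
    """Map every residue to the first letter of the cluster it belongs to."""
    return {aa: group[0] for group in clusters for aa in group}


def reduce_sequence(seq, mapping):
    """Rewrite a protein sequence in the reduced alphabet.

    Residues outside the 20 standard amino acids raise KeyError.
    """
    return "".join(mapping[aa] for aa in seq.upper())


def feature_names(clusters, k):
    """All K-tuples over the reduced alphabet: len(clusters) ** k of them."""
    letters = [group[0] for group in clusters]
    return ["".join(t) for t in product(letters, repeat=k)]


mapping = build_mapping(EXAMPLE_CLUSTERS)
reduced = reduce_sequence("MKVLAAGIW", mapping)
names = feature_names(EXAMPLE_CLUSTERS, 2)
print(reduced, len(names))  # AKAAAAGAF 25
```

With 5 clusters and K = 2 the vector has 5^2 = 25 components, compared with 20^2 = 400 for the full amino acid alphabet.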
3 Reduced amino acid composition
I. g-gap PseKRAAC
The g-gap PseKRAAC represents a protein sequence as a vector containing RAAC^K components, where g is the gap between K-tuple peptides (Liu et al., 2015a,b,c; Wang et al., 2016). A g-gap of n reflects the sequence-order information for all dipeptides whose starting residues are separated by n residues. Supplementary Figure 1A shows a schematic drawing of the g-gap definition for dipeptides (K = 2).
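One possible reading of the g-gap definition can be sketched in Python as follows. The sketch assumes that consecutive K-tuples start g + 1 positions apart (so that g residues lie between their starting residues) and that the components are normalized to frequencies; both are assumptions, and the server's exact convention is given in Supplementary Figure 1A. The function name, the reduced sequence and the five-letter alphabet are hypothetical.

```python
from collections import Counter
from itertools import product


def g_gap_composition(reduced_seq, alphabet, k=2, g=0):
    """Frequencies of K-tuples sampled along an already reduced sequence.

    Assumption: successive K-tuple start positions are g + 1 apart,
    so g = 0 gives the ordinary sliding window.
    """
    step = g + 1
    starts = range(0, len(reduced_seq) - k + 1, step)
    counts = Counter(reduced_seq[i:i + k] for i in starts)
    total = sum(counts.values())
    # One component per possible K-tuple: len(alphabet) ** k in total.
    keys = ["".join(t) for t in product(alphabet, repeat=k)]
    return {key: (counts[key] / total if total else 0.0) for key in keys}


reduced = "AKAAAAGAF"   # hypothetical sequence in a 5-letter reduced alphabet
alphabet = "AFSKG"      # hypothetical cluster representatives
vec_g0 = g_gap_composition(reduced, alphabet, k=2, g=0)
vec_g1 = g_gap_composition(reduced, alphabet, k=2, g=1)
print(len(vec_g0), vec_g0["AA"], vec_g1["AA"])  # 25 0.375 0.5
```

Whatever the value of g, the vector length stays at (number of clusters)^K; only the set of K-tuples that are counted changes.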
II. λ-Correlation PseKRAAC
The λ-correlation PseKRAAC, also called parallel correlation PseKRAAC, represents a protein sequence as a vector containing RAAC^K components, where λ is an integer that represents the correlation tier and is less than L – K, with L denoting the length of the protein sequence. The nth-tier correlation factor (λ = n) reflects the sequence-order correlation between the nth nearest residues. Supplementary Figure 1B shows a schematic drawing of the λ-correlation definition for dipeptides (K = 2).
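The λ-correlation idea can be sketched in the same way. The Python example below assumes that a K-tuple is built from residues spaced λ positions apart (positions i, i + λ, ..., i + (K − 1)λ), with λ = 1 meaning contiguous residues as in Chou's tier convention. This indexing is an assumption; the server may number the tiers differently, and Supplementary Figure 1B remains the authoritative definition. The function name and example inputs are hypothetical.

```python
from collections import Counter
from itertools import product


def lambda_correlation_composition(reduced_seq, alphabet, k=2, lam=1):
    """Frequencies of K-tuples whose residues are lam positions apart.

    Assumption: lam = 1 pairs each residue with its nearest neighbour,
    lam = 2 with its second-nearest neighbour, and so on.
    """
    if lam < 1:
        raise ValueError("lam must be a positive integer in this sketch")
    span = (k - 1) * lam  # distance from first to last residue of a tuple
    counts = Counter(
        "".join(reduced_seq[i + j * lam] for j in range(k))
        for i in range(len(reduced_seq) - span)
    )
    total = sum(counts.values())
    keys = ["".join(t) for t in product(alphabet, repeat=k)]
    return {key: (counts[key] / total if total else 0.0) for key in keys}


reduced = "AKAAAAGAF"   # hypothetical sequence in a 5-letter reduced alphabet
alphabet = "AFSKG"      # hypothetical cluster representatives
tier1 = lambda_correlation_composition(reduced, alphabet, k=2, lam=1)
tier2 = lambda_correlation_composition(reduced, alphabet, k=2, lam=2)
print(len(tier2), tier1["AA"], tier2["AA"])
```

Under these assumptions, larger λ values pair residues that are further apart in the sequence, while the vector length again stays at (number of clusters)^K.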
4 Server description
A step-by-step guide to using the PseKRAAC server is provided in the Supplementary Data. Compared with Chou’s original PseAAC server, the PseKRAAC server offers the following important improvements and advantages. First, by implementing the concept of reduced amino acid alphabets for amino acid clustering, the complexity of protein composition is significantly simplified, which decreases the chance of overfitting, lowers the computational burden and reduces information redundancy.
Second, PseKRAAC delivers more capability for protein research by incorporating three crucial parameters that describe protein composition: K-tuple peptides for K up to 3, λ-correlation PseKRAAC and g-gap PseKRAAC for protein characterization. Users can increase the λ value in the PseAAC web server to cover more global sequence-pattern effects, or increase the K value to capture more local sequence-pattern effects. Finally, PseKRAAC simplifies sequence input by accepting protein sequences in FASTA format, either entered directly into the input text box or uploaded as a FASTA file. The server can also output result files in LIBSVM, CSV and FASTA formats for further analysis. By uploading the FASTA output files to other PseAAC web servers, users can easily generate further modes of PseAAC (Chou, 2005).
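As a rough illustration of the input and output formats mentioned above, the following Python sketch parses FASTA text and writes one feature vector as a LIBSVM line and as a CSV row. The helper names, the class label and the column layout are assumptions made for the example; the exact layout of the server's result files is not described here.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_libsvm(label, values):
    """LIBSVM line: a label followed by 1-based index:value pairs."""
    pairs = " ".join(f"{i}:{v:g}" for i, v in enumerate(values, start=1))
    return f"{label} {pairs}"


def to_csv(name, values):
    """CSV row: sequence name followed by the feature values."""
    return ",".join([name] + [f"{v:g}" for v in values])


records = parse_fasta(">p1 demo\nMKV\nLAA\n>p2\nGIW\n")
print(records)
print(to_libsvm(1, [0.5, 0.0, 0.25]))   # 1 1:0.5 2:0 3:0.25
print(to_csv("p1", [0.5, 0.0, 0.25]))   # p1,0.5,0,0.25
```

LIBSVM files are often written sparsely, with zero-valued features omitted; the sketch keeps every index so that each line has the same length.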
Acknowledgements
The authors wish to thank the three anonymous reviewers for their constructive comments, which were helpful for strengthening the presentation of this study.
Funding
This work was supported by The National Nature Scientific Foundation of China (No: 61561036, 31501078) and the Specialized Research Fund for the Doctoral Program of Higher Education (20131501120009).
Conflict of Interest: none declared.
References
Author notes
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
