Abstract

Proteome-pI is an online database containing information about predicted isoelectric points for 5029 proteomes calculated using 18 methods. The isoelectric point, the pH at which a particular molecule carries no net electrical charge, is an important parameter for many analytical biochemistry and proteomics techniques, especially for 2D gel electrophoresis (2D-PAGE), capillary isoelectric focusing, liquid chromatography–mass spectrometry and X-ray protein crystallography. The database, available at http://isoelectricpointdb.org, allows the retrieval of virtual 2D-PAGE plots and the development of customised fractions of a proteome based on isoelectric point and molecular weight. Moreover, Proteome-pI facilitates statistical comparisons of the various prediction methods as well as biological investigation of protein isoelectric point space in all kingdoms of life. For instance, using Proteome-pI data, it is clear that Eukaryotes, which evolved tight control of homeostasis, encode proteins with pI values near the cell pH. In contrast, Archaea, which frequently live in extreme environments, can possess proteins with a wide range of isoelectric points. The database includes various statistics and tools for interactive browsing, searching and sorting. Apart from data for individual proteomes, datasets corresponding to major protein databases such as UniProtKB/TrEMBL and the NCBI non-redundant (nr) database have also been precalculated and made available in CSV format.

INTRODUCTION

The isoelectric point (pI) is the pH at which a particular molecule carries no net electrical charge. For polypeptide chains, pI depends primarily on the dissociation constants (pKa) of the ionisable groups of seven charged amino acids: glutamate, aspartate, cysteine, tyrosine, histidine, lysine and arginine. Moreover, the charge of the terminal groups (NH2 and COOH) can affect the pI of short peptides. It is also important to consider posttranslational modifications, the exposure of charged residues to solvent, the Born effect (dehydration), hydrogen bonds (charge-dipole interactions), and charge-charge interactions (1). pI is widely used in current biochemical and proteomic techniques. For example, during electrophoresis, the direction in which a protein migrates through the gel depends on its charge, so proteins can be separated in a gel according to their pI. Given the sequence, pI can be predicted computationally with the Henderson–Hasselbalch equation (2): the charge of the molecule is calculated at a given pH from the pKa values of the charged residues, and the pH at which the net charge is zero is taken as the pI. More than 600 different pKa values have so far been reported for the ionisable groups of amino acids (3). The predicted pI will most likely differ from the real one, because many proteins are chemically modified (e.g. amino acids can be phosphorylated, methylated or acetylated), which can influence their charge. Nevertheless, even an approximate isoelectric point is a highly valuable and frequently used parameter.
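The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind Proteome-pI: the nine pKa values are assumed for the example (any published set could be substituted), the function names are hypothetical, and the zero of the net-charge curve is located by simple bisection.

```python
# Illustrative pKa values (an assumption for this sketch; substitute any published set).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence: str, ph: float) -> float:
    """Net charge of a polypeptide at a given pH (Henderson-Hasselbalch)."""
    # Basic groups carry +1 when protonated: fraction = 1 / (1 + 10^(pH - pKa)).
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))
    # Acidic groups carry -1 when deprotonated: fraction = 1 / (1 + 10^(pKa - pH)).
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))
    for residue in sequence.upper():
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence: str, tol: float = 1e-4) -> float:
    """Estimate the pH of zero net charge by bisection on the interval [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        # The net charge decreases monotonically as the pH rises.
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0
```

With this sketch, a lysine-rich peptide such as "KKKK" yields a higher estimate than an aspartate-rich one such as "DDDD", as expected. Posttranslational modifications and the structural effects listed above are ignored entirely.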

In the past, much work has gone into creating databases of experimentally verified pI values for proteins, yet none of these databases contains more than five thousand proteins (4,5), which is very few compared with the protein sequence data currently available. The Proteome-pI database is an attempt to narrow this gap; hopefully, it will expand the body of knowledge regarding isoelectric points on a genome-wide scale.

MATERIALS AND METHODS

Sequences

Protein sequences of model organisms were obtained from UniProt as of April 2016, release 2016_04 (6). This includes 5029 complete proteomes (with splicing isoforms for eukaryotes) from the entire tree of life. In total, isoelectric points, molecular weights and other statistics were calculated for >21 million protein sequences (Table 1).
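As a rough illustration of how a molecular weight in kDa can be derived from a sequence, the sketch below sums average residue masses and adds one water molecule for the free termini. The masses are approximate values rounded to two decimals, the function name is hypothetical, and the source does not state which mass table or software was actually used.

```python
# Approximate average residue masses in Da (rounded; an assumption for this sketch).
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER_MASS = 18.02  # one water per chain accounts for the free N- and C-termini


def molecular_weight_kda(sequence: str) -> float:
    """Approximate average molecular weight of an unmodified polypeptide in kDa."""
    total = WATER_MASS
    for residue in sequence.upper():
        if residue not in RESIDUE_MASS:
            # Ambiguous or non-standard residues are rejected rather than guessed.
            raise ValueError(f"unsupported residue: {residue!r}")
        total += RESIDUE_MASS[residue]
    return total / 1000.0
```

For example, `molecular_weight_kda("GG")` gives about 0.132 kDa, close to the molecular weight of glycylglycine.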

Table 1. General statistics of the Proteome-pI database

Kingdom | Number of proteomes | Total number of proteins | Mean number of proteins (±SD) | Mean size of proteins (±SD) | Mean mw of proteins (±SD)
--- | --- | --- | --- | --- | ---
Viruses | 504 | 20 920 | 42 ± 89 | 297 ± 375 | 33 ± 42
Archaea | 135 | 318 388 | 2358 ± 920 | 283 ± 212 | 31 ± 23
Bacteria | 3776 | 12 082 903 | 3200 ± 2510 | 311 ± 240 | 34 ± 26
Eukaryote | 614 | 9 299 039 | 15 145 ± 11 830 | 438 ± 429 | 49 ± 48
Eukaryote (major) | 614 | 8 629 591 | 14 055 ± 9899 | 434 ± 416 | 48 ± 46
Eukaryote (minor) | 448 | 669 448 | 1494 ± 5130 | 495 ± 564 | 55 ± 63

mw—molecular weight in kDa; for more statistics, see Supplementary Table S1. ‘Major’ and ‘minor’ refer to splicing isoforms of proteins used for calculation of the statistics.

Predictions

To predict isoelectric points, Proteome-pI currently uses 18 different algorithms and programs, which can be divided into three categories. The first category consists of methods that predict the isoelectric point from the Henderson–Hasselbalch equation with different pKa values for the different charged groups (2). These methods usually use nine pKa values established empirically in separate experiments (seven pKa values for charged amino acids and two for the polypeptide chain termini). For example, the pKa values obtained by Thurlkill et al. were measured in 0.1 M KCl at 25°C using alanine pentapeptides with a charged residue in the centre and with blocked terminal groups (7). Nine-parameter models are used in the methods named after the lead author of the study or the source of the pKa values: EMBOSS (8), DTASelect (9), Solomons (10), Sillero (11), Rodwell (12), Wikipedia, Lehninger (13), Grimsley (3), Toseland (14), Thurlkill (7), Nozaki (15) and Dawson (16). Additionally, some algorithms use a different number of pKa values: Patrickios (17) uses only six, Bjellqvist (18) uses 17, and ProMoST (19) uses 72 pKa values, depending on the location of the amino acid with respect to the protein termini. The next category comprises the IPC_protein and IPC_peptide models, which use computationally optimised nine-parameter pKa sets (20). Finally, the consensus of all methods apart from Patrickios (a highly simplified model with only six parameters) is also reported.
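The consensus step can be illustrated with a short Python sketch. The text does not say how the consensus is computed, so the arithmetic mean used here is an assumption, as are the function name and the input format (a mapping from method name to predicted pI).

```python
from statistics import mean


def consensus_pi(predictions: dict[str, float],
                 excluded: tuple[str, ...] = ("Patrickios",)) -> float:
    """Average the per-method pI predictions, skipping the excluded methods.

    The arithmetic mean is an assumption of this sketch; the source only
    states that the consensus omits the Patrickios method.
    """
    kept = [value for method, value in predictions.items() if method not in excluded]
    if not kept:
        raise ValueError("no predictions left after exclusion")
    return mean(kept)
```

A call such as `consensus_pi({"EMBOSS": 6.0, "Bjellqvist": 7.0, "Patrickios": 2.0})`, with made-up values, averages only the first two entries.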

RESULTS

Database use

The Proteome-pI database incorporates multiple browsing and searching tools. First, it can be searched and browsed by organism name, average isoelectric point, molecular weight or amino acid frequencies (see also Table 2). Proteins with extreme pI values are also listed. For individual proteomes, users can retrieve proteins of interest for a chosen method within given isoelectric point and molecular weight ranges (this feature can be highly useful for limiting potential targets when analysing 2D-PAGE gels or before conducting mass spectrometry). Precalculated fractions of proteins according to isoelectric point are available as well. Finally, some general statistics (total number of proteins, amino acids, average sequence length, amino acid frequency) and links to other databases (UniProt, NCBI) can be found (see Figure 1 for an example).
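Retrieving a fraction of a proteome by isoelectric point and molecular weight amounts to a range filter. The Python sketch below shows the idea on in-memory records; the record fields, the function name and the example values are hypothetical and do not reflect the database's actual schema or download format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinRecord:
    accession: str  # hypothetical identifier
    pi: float       # predicted isoelectric point for one chosen method
    mw_kda: float   # molecular weight in kDa


def select_fraction(records, pi_range, mw_range):
    """Return the records whose pI and molecular weight fall in the given inclusive ranges."""
    pi_low, pi_high = pi_range
    mw_low, mw_high = mw_range
    return [
        record for record in records
        if pi_low <= record.pi <= pi_high and mw_low <= record.mw_kda <= mw_high
    ]


# Made-up records, for illustration only.
example = [
    ProteinRecord("P_A", 4.8, 22.0),
    ProteinRecord("P_B", 6.9, 55.0),
    ProteinRecord("P_C", 9.7, 12.5),
]
acidic_small = select_fraction(example, pi_range=(4.0, 5.5), mw_range=(10.0, 30.0))
```

Here `acidic_small` contains only the first record, which mimics cutting one region out of a virtual 2D-PAGE plot.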

Figure 1. Proteome-pI example report for Salmonella enterica. At the top, the average isoelectric point, precalculated fractions of proteins according to isoelectric point and virtual 2D-PAGE plot for the proteome are shown. In the next section, the user can retrieve a subset of proteins within specified isoelectric point and molecular weight ranges calculated using a particular method. Next, proteins with minimal and maximal isoelectric points are presented along with some general statistics.

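Table 2. Amino acid frequency for the kingdoms of life in the Proteome-pI database

Kingdom | Ala | Cys | Asp | Glu | Phe | Gly | His | Ile | Lys | Leu | Met | Asn | Pro | Gln | Arg | Ser | Thr | Val | Trp | Tyr | Total amino acids
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Viruses | 6.61 | 1.76 | 5.81 | 6.04 | 4.25 | 5.79 | 2.15 | 6.53 | 6.35 | 8.84 | 2.46 | 5.41 | 4.62 | 3.39 | 5.24 | 7.06 | 6.06 | 6.50 | 1.19 | 3.94 | 6 150 189
Archaea | 8.20 | 0.98 | 6.21 | 7.69 | 3.86 | 7.58 | 1.77 | 7.03 | 5.27 | 9.31 | 2.35 | 3.68 | 4.26 | 2.38 | 5.51 | 6.17 | 5.44 | 7.80 | 1.03 | 3.45 | 89 488 664
Bacteria | 10.06 | 0.94 | 5.59 | 6.15 | 3.89 | 7.76 | 2.06 | 5.89 | 4.68 | 10.09 | 2.38 | 3.58 | 4.61 | 3.58 | 5.88 | 5.85 | 5.52 | 7.27 | 1.27 | 2.94 | 3 716 982 916
Eukaryota | 7.63 | 1.76 | 5.40 | 6.42 | 3.87 | 6.33 | 2.44 | 5.10 | 5.64 | 9.29 | 2.25 | 4.28 | 5.41 | 4.21 | 5.71 | 8.34 | 5.56 | 6.20 | 1.24 | 2.87 | 3 743 221 293
All | 8.76 | 1.38 | 5.49 | 6.32 | 3.87 | 7.03 | 2.26 | 5.49 | 5.19 | 9.68 | 2.32 | 3.93 | 5.02 | 3.90 | 5.78 | 7.14 | 5.53 | 6.73 | 1.25 | 2.91 | 7 555 843 062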

*Similar statistics for all 5029 proteomes included in Proteome-pI are available online on individual subpages. For di-amino acid frequencies see Supplementary Table S2.
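Percentage frequencies like those in Table 2 can be obtained by counting residues over all sequences of a group. The Python sketch below is one possible way to do this. It assumes that only the 20 standard amino acids enter the denominator, which the source does not specify, and the function name is hypothetical.

```python
from collections import Counter

STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequencies(sequences):
    """Return (percentage per standard amino acid, total number of counted residues)."""
    counts = Counter()
    for sequence in sequences:
        # Non-standard or ambiguous symbols (e.g. X, B, Z) are ignored in this sketch.
        counts.update(ch for ch in sequence.upper() if ch in STANDARD_AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    frequencies = {aa: 100.0 * counts[aa] / total for aa in STANDARD_AMINO_ACIDS}
    return frequencies, total
```

The returned percentages sum to 100 over the 20 standard residues, and the second value corresponds to the "Total amino acids" column.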

Moreover, apart from the data for individual proteomes, one can also obtain precalculated isoelectric points from all major protein databases, including nr (21), UniProt, PDB (22) and SwissProt (23) (more details in Supplementary Data).

DISCUSSION

The main content of the Proteome-pI database is comprehensive isoelectric point prediction using numerous methods. The isoelectric point, the pH at which a particular molecule carries no net electrical charge, is an important parameter for many analytical biochemistry and proteomics techniques, such as 2D-PAGE (24,25), capillary isoelectric focusing (26), liquid chromatography–mass spectrometry (LC–MS) (27) and X-ray protein crystallography (28,29). Additional goals of the database include facilitating biological investigation of protein isoelectric point space. For instance, it is well known that the distribution of protein isoelectric points within proteomes is bimodal, with a low fraction of proteins having pI values close to the physiological pH of the cell (Supplementary Figure S1) (30). Interestingly, when proteomes are divided by kingdom of life, Eukaryota have the largest proteins, restricted to a narrow isoelectric point range. Archaea, on the other hand, usually possess small proteins, but the isoelectric points of their proteins can vary significantly (Figure 2). This is most likely due to adaptation to the extreme conditions in which many Archaea live (31). Finally, viruses form a completely separate group. The isoelectric points of their proteins are strongly correlated with the pI of their hosts' proteins and can therefore vary significantly. At the same time, the molecular weight of viral proteins is significantly lower than that of host proteins owing to the compactness of virions (significant evolutionary pressure to minimise the overall size) (32).

Figure 2. Isoelectric points and molecular weights across kingdoms of life. Data for the proteomes of 135 Archaea, 127 viruses (>50 proteins), 3775 bacteria and 614 eukaryotes.

It should be noted that there is at least one other similar database storing isoelectric points for some proteomes. The JVirGel website (33) contains pI data for 227 relatively small, prokaryotic proteomes, precalculated using only one method. In contrast, the Proteome-pI database aggregates predictions of isoelectric points calculated by 18 different methods and algorithms across >5000 proteomes from all kingdoms of life (over 21 million proteins).

Future work

The principal future goal is to include more isoelectric point algorithms and proteomes for further investigation. A further goal is to provide more tools for online analysis, e.g. tools for Gene Ontology searching (34). Another possible extension would be to add putative tryptic digestion products and their respective isoelectric points (35). We will be grateful for any contribution to the database from the community.
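To indicate what generating putative tryptic digestion products could involve, the Python sketch below performs a simple in silico digestion. It assumes the commonly used rule that trypsin cleaves after lysine or arginine unless the next residue is proline, and it does not model missed cleavages. The function name is hypothetical and nothing here describes an existing Proteome-pI feature.

```python
import re


def tryptic_peptides(sequence: str) -> list[str]:
    """Split a protein sequence into fully cleaved tryptic peptides.

    Assumed rule: cleave C-terminal to K or R, except when followed by P.
    """
    # The pattern is zero-width: it matches the position after K/R not followed by P.
    fragments = re.split(r"(?<=[KR])(?!P)", sequence.upper())
    # A cleavage site at the very end of the chain leaves an empty fragment; drop it.
    return [fragment for fragment in fragments if fragment]
```

Each resulting peptide could then be passed to any pI predictor to obtain the isoelectric points of the digestion products.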

AVAILABILITY

All data in the Proteome-pI database are available for download free of charge. Proteome-pI can be accessed at http://isoelectricpointdb.org. The database will remain available at this web address for at least ten years.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

The author of Proteome-pI acknowledges the authors of all previous works related to the different pKa sets and databases, especially the developers of the UniProt database.

FUNDING

Funding for open access charge: The open access publication charge for this paper has been waived by Oxford University Press—NAR Editorial Board members are entitled to one free paper per year in recognition of their work on behalf of the journal.

Conflict of interest statement. None declared.

REFERENCES

1. Pace C.N., Grimsley G.R., Scholtz J.M. Protein ionizable groups: pK values and their contribution to protein stability and solubility. J. Biol. Chem. 2009; 284:13285–13289.

2. Po H.N., Senozan N.M. The Henderson-Hasselbalch equation: its history and limitations. J. Chem. Educ. 2001; 78:1499.

3. Grimsley G.R., Scholtz J.M., Pace C.N. A summary of the measured pK values of the ionizable groups in folded proteins. Protein Sci. 2009; 18:247–251.

4. Hoogland C., Mostaguir K., Sanchez J.C., Hochstrasser D.F., Appel R.D. SWISS-2DPAGE, ten years later. Proteomics. 2004; 4:2352–2356.

5. Bunkute E., Cummins C., Crofts F.J., Bunce G., Nabney I.T., Flower D.R. PIP-DB: the protein isoelectric point database. Bioinformatics. 2015; 31:295–296.

6. The UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 2015; 43:D204–D212.

7. Thurlkill R.L., Grimsley G.R., Scholtz J.M., Pace C.N. pK values of the ionizable groups of proteins. Protein Sci. 2006; 15:1214–1218.

8. Rice P., Longden I., Bleasby A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000; 16:276–277.

9. Tabb D.L., McDonald W.H., Yates J.R. DTASelect and Contrast: tools for assembling and comparing protein identifications from shotgun proteomics. J. Proteome Res. 2002; 1:21–26.

10. Solomons T.G. Organic Chemistry. 1992; John Wiley & Sons.

11. Sillero A., Ribeiro J.M. Isoelectric points of proteins: theoretical determination. Anal. Biochem. 1989; 179:319–325.

12. Rodwell J.D. Heterogeneity of component bands in isoelectric focusing patterns. Anal. Biochem. 1982; 119:440–449.

13. Nelson D.L., Lehninger A.L., Cox M.M. Lehninger Principles of Biochemistry. 2008; Macmillan.

14. Toseland C.P., McSparron H., Davies M.N., Flower D.R. PPD v1.0 - an integrated, web-accessible database of experimentally determined protein pK(a) values. Nucleic Acids Res. 2006; 34:D199–D203.

15. Nozaki Y., Tanford C. The solubility of amino acids and two glycine peptides in aqueous ethanol and dioxane solutions: establishment of a hydrophobicity scale. J. Biol. Chem. 1971; 246:2211–2217.

16. Dawson R.M.C. Data for Biochemical Research. 1986; Oxford: Clarendon Press.

17. Patrickios C.S., Yamasaki E.N. Polypeptide amino acid composition and isoelectric point. II. Comparison between experiment and theory. Anal. Biochem. 1995; 231:82–91.

18. Bjellqvist B., Basse B., Olsen E., Celis J.E. Reference points for comparisons of two-dimensional maps of proteins from different human cell types defined in a pH scale where isoelectric points correlate with polypeptide compositions. Electrophoresis. 1994; 15:529–539.

19. Halligan B.D., Ruotti V., Jin W., Laffoon S., Twigger S.N., Dratz E.A. ProMoST (Protein Modification Screening Tool): a web-based tool for mapping protein modifications on two-dimensional gels. Nucleic Acids Res. 2004; 32:W638–W644.

20. Kozlowski L.P. IPC - Isoelectric Point Calculator. Biol. Direct. 2016; 11:55.

21. Pruitt K.D., Tatusova T., Klimke W., Maglott D.R. NCBI reference sequences: current status, policy and new initiatives. Nucleic Acids Res. 2009; 37:D32–D36.

22. Rose P.W., Prlić A., Bi C., Bluhm W.F., Christie C.H., Dutta S., Green R.K., Goodsell D.S., Westbrook J.D., Woo J. The RCSB protein data bank: views of structural biology for basic and applied research and education. Nucleic Acids Res. 2015; 43:D345–D356.

23. Boutet E., Lieberherr D., Tognolli M., Schneider M., Bansal P., Bridge A.J., Poux S., Bougueleret L., Xenarios I. UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: how to use the entry view. Plant Bioinformatics: Methods and Protocols. 2016; 23–54.

24. O'Farrell P.H. High resolution two-dimensional electrophoresis of proteins. J. Biol. Chem. 1975; 250:4007–4021.

25. Klose J. Protein mapping by combined isoelectric focusing and electrophoresis of mouse tissues. Humangenetik. 1975; 26:231–243.

26. Righetti P.G., Castagna A., Herbert B., Reymond F., Rossier J.S. Prefractionation techniques in proteome analysis. Proteomics. 2003; 3:1397–1407.

27. Heller M., Ye M., Michel P.E., Morier P., Stalder D., Jünger M.A., Aebersold R., Reymond F., Rossier J.S. Added value for tandem mass spectrometry shotgun proteomics data validation through isoelectric focusing of peptides. J. Proteome Res. 2005; 4:2273–2282.

28. Kirkwood J., Hargreaves D., O'Keefe S., Wilson J. Using isoelectric point to determine the pH for initial protein crystallization trials. Bioinformatics. 2015; 31:1444–1451.

29. Kantardjieff K.A., Rupp B. Protein isoelectric point as a predictor for increased crystallization screening efficiency. Bioinformatics. 2004; 20:2162–2168.

30. Kiraga J., Mackiewicz P., Mackiewicz D., Kowalczuk M., Biecek P., Polak N., Smolarczyk K., Dudek M.R., Cebrat S. The relationships between the isoelectric point and: length of proteins, taxonomy and ecology of organisms. BMC Genomics. 2007; 8:163.

31. Oren A. Microbial life at high salt concentrations: phylogenetic and metabolic diversity. Saline Syst. 2008; 4:13.

32. Grenfell B.T., Pybus O.G., Gog J.R., Wood J.L., Daly J.M., Mumford J.A., Holmes E.C. Unifying the epidemiological and evolutionary dynamics of pathogens. Science. 2004; 303:327–332.

33. Hiller K., Grote A., Maneck M., Münch R., Jahn D. JVirGel 2.0: computational prediction of proteomes separated via two-dimensional gel electrophoresis under consideration of membrane and secreted proteins. Bioinformatics. 2006; 22:2441–2443.

34. The Gene Ontology Consortium. Gene ontology consortium: going forward. Nucleic Acids Res. 2015; 43:D1049–D1056.

35. Shevchenko A., Tomas H., Havli J., Olsen J.V., Mann M. In-gel digestion for mass spectrometric characterization of proteins and proteomes. Nat. Protoc. 2006; 1:2856–2860.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
