Phospho.ELM is a manually curated database of eukaryotic phosphorylation sites. The resource includes data collected from published literature as well as high-throughput data sets.
The current release of Phospho.ELM (version 7.0, July 2007) contains 4078 phospho-protein sequences covering 12 025 phospho-serine, 2362 phospho-threonine and 2083 phospho-tyrosine sites. The entries provide information about the phosphorylated proteins and the exact position of known phosphorylated instances, the kinases responsible for the modification (where known) and links to bibliographic references. The database entries have hyperlinks to easily access further information from UniProt, PubMed, SMART, ELM, MSD as well as links to the protein interaction databases MINT and STRING.
A new BLAST search tool, complementary to retrieval by keyword and UniProt accession number, allows users to submit a protein query (by sequence or UniProt accession) to search against the curated data set of phosphorylated peptides.
Phospho.ELM is available online at: http://phospho.elm.eu.org
Protein phosphorylation is one of the most-studied post-translational modifications: it has been estimated that up to one-third of the proteins may be modified by protein kinases ( 1 ). This ubiquitous regulatory mechanism controls many biological processes, including cellular growth, differentiation and DNA repair ( 2 ).
Knowing the phosphorylated residues in proteins is central to understanding the various signaling events in which they partake; therefore much effort has been invested in trying to identify and characterize phosphorylation sites. Traditional methods for measuring protein phosphorylation, such as mutational analysis and Edman degradation chemistry on phosphopeptides, have the disadvantage of being relatively time consuming and laborious, and of requiring large amounts of purified protein. On the other hand, mass spectrometry (MS)-based methods have emerged as powerful tools for the analysis of post-translational modifications due to their higher sensitivity, selectivity and speed. Over the past few years MS, combined with enrichment strategies for phosphorylated proteins, e.g. isotope-coded affinity tags (ICAT) ( 3 ), stable isotope labeling by amino acids in cell culture (SILAC) ( 4 ) and isobaric reagent iTRAQ ( 5 ), has been increasingly employed to identify novel phosphorylation sites. One consequence of this change in phosphorylation research is that bioinformatics resources need to be adapted and expanded to accommodate the new data.
For the thousands of phosphorylation sites identified by phosphoproteomic MS, the information on which kinase phosphorylates them, and consequently the pathway in which they act, is still missing. To improve the link between experimentally identified phosphorylation sites and protein kinases, Linding and collaborators ( 6 ) have recently used the Phospho.ELM data set to develop and train a method, NetworKIN ( http://networkin.info/ ), that combines computational methods for predicting which group of kinases is likely to phosphorylate a given site with information about signaling pathways and protein interaction data.
The analysis of protein phosphorylation by MS will clearly prove to be an invaluable source of information for understanding cellular signaling. For this reason, we consider it increasingly important to create and maintain publicly available phospho-protein databases, where the exponentially increasing number of known phosphorylation sites ( 7–12 ) can be easily accessed by the research community.
MATERIALS AND METHODS
The Phospho.ELM database
The content and the format of Phospho.ELM have been previously described in Diella et al. ( 13 ). While the general format of the database has remained essentially unchanged, some additions have been implemented to improve the data retrieval and presentation. The updated version also contains a much larger number of phosphorylation sites (see Figure 1 ), a new search tool based on sequence comparison and a Web Services interface.
The user can query the database by protein name, UniProt accession number/identifier, kinase name or binding motif to get a list of all known phosphorylation sites (instances) in a specific protein. The main results page summarizes information about the substrate protein (e.g. a brief description of the protein, protein type, the UniProt protein identification number), the phosphorylation sites contained within it and their surrounding amino acids (±10). The annotations to each instance include (where available) the PubMed reference, the kinase(s) phosphorylating the given site, the phospho-peptide binding domain(s) and a link to the ELM server ( 14 ) to retrieve further information about the kinase. Where available, hyperlinks are also provided to protein structures containing phosphorylated residues ( 15 ). Recently, Zanzoni and collaborators ( 16 ) have developed Phospho3D, a database of three-dimensional structures of phosphorylation sites, which stores data derived from the Phospho.ELM database and is focused on the annotation of structural information at the residue level.
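As an illustration of the ±10 residue context shown for each instance, the following minimal Python sketch extracts the flanking window around a site. The function name `site_window`, the toy sequence and the handling of sites near the termini (the window is simply truncated) are assumptions made for this example and do not describe the actual Phospho.ELM code.

```python
def site_window(sequence: str, position: int, flank: int = 10) -> str:
    """Return the residue at 1-based `position` plus up to `flank` residues
    on either side. Windows are truncated at the sequence termini."""
    if not 1 <= position <= len(sequence):
        raise ValueError("position lies outside the sequence")
    idx = position - 1                      # convert to 0-based index
    start = max(0, idx - flank)             # clip at the N-terminus
    end = min(len(sequence), idx + flank + 1)  # clip at the C-terminus
    return sequence[start:end]


# Toy sequence (hypothetical, not a real protein); the serine is at position 16.
toy = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"
window = site_window(toy, 16)
print(window, len(window))
```

For a site far from both termini the window holds 21 residues with the phospho-acceptor in the middle; for a site close to a terminus it is shorter.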
Additional information for each protein kinase substrate includes the subcellular compartment [annotated with the Gene Ontology terms ( 17 )], the tissue distribution and a list of interaction partners derived from the MINT ( 18 ) and STRING databases ( 19 ). The STRING interactors are shown in a summary graphic (network) that opens in a pop-up window. The network views provide links to the STRING database, where the information relative to the interactors is described in detail.
The current release of the Phospho.ELM data set (version 7.0, July 2007) contains 4078 phospho-protein sequences covering 12 025 phospho-serine, 2362 phospho-threonine and 2083 phospho-tyrosine sites, for a total of 16 470 sites. The data set is currently limited to metazoan species. This is partly due to our annotation capacity and partly because the kinases and nomenclature are so different in other lineages that they should be placed in separate databases. Although no animal species is purposely excluded from the data, human (11 197 phospho-sites) and mouse (2073 phospho-sites) are currently the most represented species owing to the prevalence of their use as model organisms in biological research; for example, phosphoproteomic MS analyses have mainly been performed on human and mouse cell lines and tissues.
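The counts quoted above can be cross-checked with a few lines of Python. This is only a worked arithmetic sketch; the dictionary keys (`pS`, `pT`, `pY`) and the function name are labels chosen for the example.

```python
# Site counts per residue type as quoted for release 7.0.
SITE_COUNTS = {"pS": 12025, "pT": 2362, "pY": 2083}


def residue_shares(counts):
    """Return the total number of sites and the percentage per residue type."""
    total = sum(counts.values())
    return total, {k: 100.0 * v / total for k, v in counts.items()}


total, shares = residue_shares(SITE_COUNTS)
for residue, pct in shares.items():
    print(f"{residue}: {pct:.1f}% of {total} sites")
```

The three counts add up to the stated total of 16 470, with phospho-serine accounting for roughly three-quarters of all sites.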
For each phospho-site we report whether the phosphorylation evidence has been identified by small-scale analyses (low throughput; LTP), which typically focus on one or a few proteins at a time, or by large-scale experiments (high throughput; HTP), which mainly apply MS techniques. It is noteworthy that in our data set there is only a small overlap between instances identified by LTP and HTP experiments ( Figure 1 ). This implies that most of the human phosphoproteome remains to be discovered. Figure 1 also shows that the identification of additional phosphorylation sites on proteins has been increasing at a much faster rate than the identification of novel phosphoproteins (e.g. see the SRRM2 protein, UniProt accession Q9UQ35). While revealing that many more proteins are heavily phosphorylated than was previously known, it may be worth investigating whether the data also imply a strong bias in the proteins retrieved in the MS experiments.
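The LTP/HTP comparison amounts to a set intersection over sites keyed by protein and position. The sketch below shows one way this could be computed; the `Site` record, its field names and the toy accessions are hypothetical and do not reflect the Phospho.ELM schema.

```python
from typing import Iterable, NamedTuple


class Site(NamedTuple):
    accession: str   # protein identifier (placeholder values below)
    position: int    # 1-based residue position
    evidence: str    # "LTP" or "HTP"


def overlap_summary(sites: Iterable[Site]) -> dict:
    """Count sites seen only in LTP, only in HTP, or in both kinds of study."""
    sites = list(sites)
    ltp = {(s.accession, s.position) for s in sites if s.evidence == "LTP"}
    htp = {(s.accession, s.position) for s in sites if s.evidence == "HTP"}
    return {
        "ltp_only": len(ltp - htp),
        "htp_only": len(htp - ltp),
        "both": len(ltp & htp),
    }


# Toy records with made-up accessions, for illustration only.
toy_sites = [
    Site("X00001", 15, "LTP"),
    Site("X00001", 15, "HTP"),   # same site reported by both approaches
    Site("X00001", 88, "HTP"),
    Site("X00002", 7, "LTP"),
    Site("X00003", 301, "HTP"),
]
summary = overlap_summary(toy_sites)
print(summary)
```

A small value of `both` relative to `ltp_only` and `htp_only` corresponds to the limited overlap described above.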
The kinase responsible for the phosphorylation is known for ∼21% of the Phospho.ELM instances. Currently, more than 250 kinases are annotated in the database (for a detailed list of the kinases see the related information at the Phospho.ELM home page).
The PhosphoBLAST search tool
A BLAST search has been implemented which is complementary to the retrieval by keyword or UniProt accession/identifier. This tool identifies phospho-peptides contained in the query sequence that match those stored in Phospho.ELM ( Figure 2 ). It consists of a two-step process: a BLAST ( 20 ) search and a parsing of the BLAST output. The BLAST program performs a sequence-similarity search against the Phospho.ELM data set of peptides (16 471), which have been experimentally proven to contain phospho-residues. It returns a set of local gapped alignments between the query sequence peptides and the phospho-peptides. In the parsing stage, those matches that present more than 70% sequence similarity and that conserve the phospho-residue in the same position as the corresponding phospho-peptide are selected. The final output shows the list of chosen matches, with their alignments and links to database records.
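The parsing stage can be sketched as a filter over pairwise alignments. The example below is a simplified illustration under stated assumptions: percent identity over the aligned columns is used as a stand-in for the similarity measure (the exact scoring used by PhosphoBLAST is not specified here), the alignment is supplied as two equal-length gapped strings, and the function and parameter names are hypothetical.

```python
def passes_filter(query_aln: str, subject_aln: str, subject_start: int,
                  phospho_pos: int, min_identity: float = 70.0) -> bool:
    """Decide whether an alignment between a query and a phospho-peptide is kept.

    query_aln, subject_aln: aligned strings of equal length, '-' marks a gap.
    subject_start: 1-based peptide position of the first aligned subject residue.
    phospho_pos: 1-based position of the phospho-residue within the peptide.
    """
    if len(query_aln) != len(subject_aln) or not query_aln:
        raise ValueError("aligned strings must be non-empty and of equal length")

    # Assumption: identity over aligned columns approximates 'similarity'.
    matches = sum(q == s and q != "-" for q, s in zip(query_aln, subject_aln))
    identity = 100.0 * matches / len(query_aln)
    if identity <= min_identity:            # keep only matches above the threshold
        return False

    # Walk the alignment to find the column holding the phospho-residue.
    pos = subject_start - 1
    for q, s in zip(query_aln, subject_aln):
        if s != "-":
            pos += 1
            if pos == phospho_pos:
                return q == s               # residue must be conserved in the query
    return False                            # phospho-residue not covered by the alignment


# Toy peptide with the phospho-serine at position 4 (made-up sequences).
kept = passes_filter("RRASVDE", "RRASVDE", subject_start=1, phospho_pos=4)
lost_site = passes_filter("RRAAVDE", "RRASVDE", subject_start=1, phospho_pos=4)
print(kept, lost_site)
```

In the second call the alignment is well above the threshold, but the serine is replaced in the query, so the match is discarded.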
The PhosphoBLAST tool does not aim at predicting phosphorylation motifs in the query protein and is primarily useful for retrieving phosphorylation sites that are conserved in related proteins (whether orthologs or paralogs). Nevertheless, unrelated query proteins occasionally yield matching phosphorylation sites in Phospho.ELM that can be equally interesting: it will be up to the user to consider carefully the possible biological meaning (e.g. shared kinase and/or phospho-peptide-binding domain specificities) associated with these match(es).
In order to facilitate remote tool integration, a Web Service to access the Phospho.ELM database programmatically has been implemented and is available at: http://phospho.elm.eu.org/webservice/phosphoELMdb.wsdl .
The WSDL (Web Service Description Language) ( 21 ) file is WS-I compatible. The WS-Interoperability Basic Profile ( 22 ) proposes a set of rules to achieve interoperability of web services between different platforms. The WSDL file implements an XML wrapped document/literal style ( 23 ). The backend code is implemented in Java and runs on Axis2 ( 24 ) inside a Tomcat servlet container ( 25 ).
The functionality provided by the Web Service encompasses the current interface functionality with some additional filters. The extra options implemented in the Web Service are to search by PubMed ID and to retrieve all instances with a PDB entry assigned to them.
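The two additional filters can be illustrated on in-memory records. The sketch below does not call the Web Service and makes no claim about its operation names or message formats; the `Instance` record, its fields and the placeholder identifiers are assumptions introduced purely for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instance:
    accession: str                 # protein identifier (placeholder values below)
    position: int                  # 1-based position of the phospho-residue
    pubmed_id: str                 # literature reference for the instance
    pdb_id: Optional[str] = None   # structure identifier, if one is assigned


def by_pubmed(instances: List[Instance], pubmed_id: str) -> List[Instance]:
    """Return all instances annotated with the given PubMed ID."""
    return [i for i in instances if i.pubmed_id == pubmed_id]


def with_structure(instances: List[Instance]) -> List[Instance]:
    """Return all instances that have a PDB entry assigned."""
    return [i for i in instances if i.pdb_id]


# Made-up records; none of these identifiers refer to real entries.
records = [
    Instance("X00001", 15, "00000001", pdb_id="0XXX"),
    Instance("X00001", 88, "00000002"),
    Instance("X00002", 7, "00000001"),
]
print(by_pubmed(records, "00000001"))
print(with_structure(records))
```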
Phospho.ELM is developed and deployed with open source software ( 26 ). Software is developed in Python including some modules from the BioPython project ( 27 ) to retrieve information from UniProt and PubMed. The web interface software uses the CGImodel framework ( 28 ).
The data set is publicly available for academic users. Phospho.ELM can be accessed on the public Apache2 powered website at: http://phospho.elm.eu.org .
Since its inception in 2004, the Phospho.ELM data set has been adopted for numerous bioinformatics tools and pipelines, e.g. the protein kinase-specific prediction server GPS (group-based phosphorylation scoring method) ( 29 ), RLIMS-P, a rule-based text-mining program designed to extract information on phosphorylation sites from abstracts ( 30 ), PhosphoregDB, a database of tissue and sub-cellular distribution of mammalian protein kinases and phosphatases ( 31 ), and NetworKIN, a computational approach which combines consensus sequence motifs and contextual data to predict which kinases phosphorylate experimentally identified phosphorylation sites ( 6 ).
While anticipating that the size of the Phospho.ELM data set will constantly grow, we consider that the resource should be kept relatively lean in terms of the categories of data to be incorporated. On the other hand, links to external resources are under regular review and likely to be augmented from time to time. For example, resources such as KEGG ( 32 ) and Reactome ( 33 ) that annotate cell signaling networks are increasing their pathway coverage and it will clearly become essential to provide links to such resources. In the near future we intend to equip Phospho.ELM with links to the predicted kinase-substrate relations from the NetworKIN database (R. Linding et al., submitted for publication).
We would like to acknowledge all the Phospho.ELM users who, by reporting missing sites or sending us their data sets, have contributed to improving the database. We wish to thank the EU EMBRACE grant (LHSG-CT-2004-512092) for funding. Many thanks to Ivica Letunic and Arnaud Ceol for technical support. We are grateful to Lars Juhl-Jensen and Rune Linding for their insightful comments and suggestions. We are thankful to Niall Haslam for critical reading of the manuscript. Funding to pay the Open Access publication charges for this article was provided by EMBL.
Conflict of interest statement. None declared.