Protein kinases control cellular responses by phosphorylating specific substrates. Recent proteome-wide mapping of protein phosphorylation sites by mass spectrometry has discovered thousands of in vivo sites. Systematically assigning all 518 human kinases to all these sites is a challenging problem. The NetworKIN database (http://networkin.info) integrates consensus substrate motifs with context modelling for improved prediction of cellular kinase–substrate relations. Based on the latest human phosphoproteome from the Phospho.ELM and PhosphoSite databases, the resource offers insight into phosphorylation-modulated interaction networks. Here, we describe how NetworKIN can be used for both global and targeted molecular studies. Via the web interface users can query the database of precomputed kinase–substrate relations or obtain predictions on novel phosphoproteins. The database currently contains a predicted phosphorylation network with 20 224 site-specific interactions involving 3978 phosphoproteins and 73 human kinases from 20 families.
Dynamic protein phosphorylation governs many cell biological processes (1). Decades of targeted studies and recent progress in phosphoproteomics have resulted in a large body of protein phosphorylation data (2). Determining how these phosphorylation sites change through time, for example during the cell cycle or following exposure to extracellular stimuli, is now possible with techniques such as quantitative mass spectrometry (3). However, it remains difficult to determine which of the 518 human kinases is responsible for the phosphorylation of an observed site; a glance at the Phospho.ELM database reveals that only about a quarter of known in vivo phosphorylation sites have been assigned as substrates of a specific kinase, and this fraction is constantly decreasing (2).
This has motivated the development of numerous computational methods for predicting kinase–substrate relations, for example, Scansite (4), NetPhosK (5,6), Predikin (7), PredPhospho (8), GPS (9), PPSP (10) and KinasePhos (11). These methods all rely on consensus sequence motifs recognized by the active site of the enzymes, represented by position-specific scoring matrices (PSSMs), neural networks, support vector machines or other machine-learning representations. However, kinase specificity is known to also depend on other factors, such as auxiliary protein interactions, scaffolds, coexpression and colocalization (collectively referred to as ‘context’). We recently introduced a computational framework, NetworKIN, which uses a probabilistic protein association network [STRING (12)] to model the context of kinases and substrates; combined with consensus sequence motifs, this gave a 2.5-fold improvement in prediction accuracy over previous methods (13).
Here, we present a database of predicted kinase–substrate relations based on the latest human phosphoproteome and protein association network from the Phospho.ELM (2), PhosphoSite (14) and STRING (12) databases. This database is available via a web interface at http://networkin.info, which enables the user to query the database for any kinases or substrates of interest, to submit new substrates and to explore the evidence underlying a prediction.
The foundation of the NetworKIN algorithm is the fact that signalling proteins are modular in nature; that is, they consist of discrete functional modules, such as protein kinase domains and the linear peptide motifs they recognize and phosphorylate. This makes it possible to model the behaviour of such proteins by coupling the prediction of linear motifs to the identification of the corresponding binding module in a network context. Due to the improved capability of mass spectrometry to identify phosphorylation sites and other post-translational modifications, the scope of modelling these events has changed from predicting what could get phosphorylated to predicting which kinase phosphorylates which site.
The NetworKIN algorithm is designed to work from a set of experimentally identified in vivo phosphorylation sites (although the algorithm can also be used ab initio). The precomputed results in the database are based on the latest human phosphoproteome from the Phospho.ELM and PhosphoSite databases (2,14) (Figure 1). The database is released approximately every 3 months, reflecting the high throughput of mass-spectrometry-driven proteomics, and we intend to keep NetworKIN up-to-date with future releases of Phospho.ELM and PhosphoSite.
These data are processed through the NetworKIN algorithm, which is implemented in Python and C. The sites are classified by matching them to a motif collection (Figure 1) based on the position-specific scoring matrices from Scansite (4) (http://scansite.mit.edu) and the artificial neural networks from NetPhosK (5) (http://www.cbs.dtu.dk/services/NetPhosK). Each consensus sequence motif is considered to be a representative for a family of closely related kinases; for example, the NetPhosK Cdk5 predictor is used for predicting possible phosphorylation sites for all cyclin-dependent kinases. Within a proteome, kinases are identified and assigned to these families based on their best hit in a BLASTP (15) sequence similarity search against a set of 82 representative kinase domain sequences, which have been manually assigned to families. Only hits with an E-value better than 10⁻⁴⁰ and with at least 50% sequence identity are considered.
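The family assignment step described above can be sketched as follows. This is a minimal illustration rather than the NetworKIN implementation: the `BlastHit` record, the `assign_families` function and the `family_of` mapping are hypothetical names, and the 'best hit' is assumed here to be the passing hit with the lowest E-value.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class BlastHit:
    """One BLASTP hit (hypothetical record layout, for illustration only)."""
    query: str       # protein from the proteome being scanned
    subject: str     # one of the representative kinase domain sequences
    evalue: float    # BLAST E-value
    identity: float  # percent sequence identity, 0-100


MAX_EVALUE = 1e-40    # hits must have an E-value better (smaller) than this
MIN_IDENTITY = 50.0   # hits must have at least this percent identity


def assign_families(hits: Iterable[BlastHit],
                    family_of: Dict[str, str]) -> Dict[str, str]:
    """Map each query protein to the family of its best passing hit.

    Assumption: 'best' means lowest E-value among hits that pass both
    thresholds; proteins without a passing hit are not assigned.
    """
    best: Dict[str, BlastHit] = {}
    for hit in hits:
        if hit.evalue >= MAX_EVALUE or hit.identity < MIN_IDENTITY:
            continue  # fails the E-value or identity threshold
        if hit.subject not in family_of:
            continue  # subject is not one of the curated representatives
        current = best.get(hit.query)
        if current is None or hit.evalue < current.evalue:
            best[hit.query] = hit
    return {query: family_of[hit.subject] for query, hit in best.items()}
```

Proteins with no hit passing both thresholds are simply left out of the result, which corresponds to not treating them as kinases of any modelled family.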
To capture the biological context of a substrate, we use a probabilistic network of functional associations extracted from the STRING database (12) (http://string.embl.de; Figure 1). This network is based on four fundamentally different types of evidence: genomic context (gene fusion, gene neighbourhood and phylogenetic profiles), primary experimental evidence (physical protein interactions and gene co-expression), manually curated pathway databases, and automatic literature mining. We previously showed that the three latter evidence types are of comparable importance, whereas genomic context methods contribute very little towards the predictions made by NetworKIN (13). As the curated pathway databases generally contain few errors, a confidence score of 0.9 is assigned to this type of evidence. The best candidate kinases within the appropriate kinase families are identified from this network by calculating the proximity of every kinase to the substrate, defined as the probability of the most probable path connecting them (Floyd–Warshall algorithm). The context is thus used as a filter that eliminates many of the false-positive predictions obtained from the sequence motifs and hence improves the prediction accuracy. However, the current algorithm is unable to recover sites that are missed by the sequence motifs (i.e. false-negative predictions).
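The proximity calculation can be illustrated with a Floyd–Warshall variant that maximizes path probability instead of minimizing path length. The sketch below assumes that the probability of a path is the product of the confidence scores of its edges and that associations are undirected; both are assumptions made for illustration, and the function name `most_probable_paths` is hypothetical.

```python
from typing import Dict, Hashable, Iterable, Tuple

Node = Hashable


def most_probable_paths(
    nodes: Iterable[Node],
    edges: Dict[Tuple[Node, Node], float],
) -> Dict[Node, Dict[Node, float]]:
    """All-pairs probability of the most probable path (max-product).

    `edges` maps an undirected pair (u, v) to a confidence score in [0, 1].
    Assumption: a path's probability is the product of its edge scores.
    """
    nodes = list(nodes)
    # Start with 0 (no known path) everywhere and 1 on the diagonal.
    best = {u: {v: 0.0 for v in nodes} for u in nodes}
    for u in nodes:
        best[u][u] = 1.0
    for (u, v), p in edges.items():
        best[u][v] = max(best[u][v], p)
        best[v][u] = max(best[v][u], p)

    # Floyd-Warshall: allow each node k in turn as an intermediate step.
    for k in nodes:
        for i in nodes:
            through_k = best[i][k]
            if through_k == 0.0:
                continue  # i cannot reach k, so k cannot improve paths from i
            for j in nodes:
                candidate = through_k * best[k][j]
                if candidate > best[i][j]:
                    best[i][j] = candidate
    return best


# Toy network with made-up scores: the indirect path K-A-S (0.9 * 0.8)
# is more probable than the direct association K-S (0.5).
toy = most_probable_paths(
    ["K", "A", "S"],
    {("K", "A"): 0.9, ("A", "S"): 0.8, ("K", "S"): 0.5},
)
proximity = toy["K"]["S"]
```

Because all scores lie between 0 and 1, extending a path can never increase its probability, so the in-place update is safe in the same way that the standard algorithm is safe in the absence of negative cycles.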
The resulting predicted kinase–substrate relations are stored in a MySQL relational database. The database also contains cross-references to the Phospho.ELM (2), PhosphoSite (14) and STRING (12) databases. This database can be accessed via a web interface, which consists of a collection of CGI scripts that query the database backend and format the results as XHTML for display in a web browser.
The NetworKIN database can be accessed in several different ways. In the following, we will explain the various features of the web interface, using the tumour suppressor 53BP1 as an example. For large-scale analysis or visualization, most users will probably prefer to download the complete set of predictions for human phosphoproteins, which is available in tab-separated and Cytoscape format.
For all other users, the primary entry point to NetworKIN is its search interface shown in Figure 2A. The user can select a specific substrate and/or kinase to view the corresponding subset of predictions; in our example, we query for 53BP1 as the substrate and use the wildcard * to obtain predictions for all kinases. The web interface also offers an advanced search form, which enables the user to pose much more refined queries. In either case, the search results will be presented as a table in which each row shows a predicted relation between a kinase and a specific phosphorylation site in a substrate. In the case of 53BP1, we get a list of 78 predictions for 39 sites and 12 kinases; the first 10 of these predictions are shown in Figure 2B. For each prediction we list two scores, namely the context score and the motif score, both of which should preferably be high. It should be noted that the motif scores for different kinase families are not comparable; in particular, motif scores from NetPhosK should not be compared with motif scores from Scansite. For this reason, the predictions for a given phosphorylation site are sorted by their context score. As the results of a single query may be extensive, the results can also be downloaded in the formats mentioned previously.
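For users who work with downloaded predictions, the ordering described above can be reproduced with a few lines of Python. The sketch below is illustrative only: the `Prediction` fields are hypothetical and do not necessarily match the columns of the downloadable files, and the substrate, site and kinase names are placeholders rather than real predictions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Prediction(NamedTuple):
    """One predicted kinase-site relation (hypothetical field names)."""
    substrate: str
    site: str
    kinase: str
    context_score: float
    motif_score: float
    motif_source: str  # e.g. "Scansite" or "NetPhosK"


def group_and_sort(
    predictions: Iterable[Prediction],
) -> Dict[Tuple[str, str], List[Prediction]]:
    """Group predictions by (substrate, site) and sort each group by
    context score, highest first. Motif scores are deliberately not used
    for ranking, since they are not comparable across motif sources."""
    by_site: Dict[Tuple[str, str], List[Prediction]] = defaultdict(list)
    for prediction in predictions:
        by_site[(prediction.substrate, prediction.site)].append(prediction)
    return {
        site: sorted(group, key=lambda p: p.context_score, reverse=True)
        for site, group in by_site.items()
    }
```

Ranking on the context score alone avoids placing a Scansite-derived prediction above a NetPhosK-derived one merely because the two motif scores are on different scales.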
Furthermore, the user can investigate the predictions in greater detail via the web interface. For each substrate, we link to Phospho.ELM or PhosphoSite where the user can find manually curated information on in vivo phosphorylation sites including, when known, the kinase(s) involved (Figure 2C). To allow the user to investigate how a specific prediction was made by NetworKIN, we provide a link to the STRING network viewer, in which the most probable path connecting the kinase and the substrate will be highlighted (Figure 2D). Alternatively, the user can select multiple predictions and display the network context for all the proteins involved. From the network viewer, the evidence underlying each individual association can be inspected in further detail. This ability to thoroughly investigate individual predictions is particularly useful for interpreting non-obvious cases, which are often based on indirect links between the kinase and the substrate.
Although Phospho.ELM, PhosphoSite and hence NetworKIN are kept up-to-date with new published phosphorylation sites, many researchers will be interested in predictions for their own, unpublished sites. We thus allow users to submit protein sequences and a corresponding set of phosphorylation sites for analysis; although possible, we discourage submitting sequences without prior knowledge on phosphorylation. After uploading the data, the user will be presented with a confirmation page where potential data entry errors can be detected and fixed. The final predictions will be presented in a tabular format similar to the one used when querying the precomputed results in the database.
Many users are interested in specific kinases or substrates; however, others may want to get an overview of the complete phosphorylation network. To facilitate this, the resource offers a global map of all predictions currently in the database. All kinases and substrates are shown using a colour scale to signify their connectivity, namely the number of substrates for a given kinase or the number of kinases for a given substrate. By selecting one or more kinases, all corresponding substrates are highlighted and vice versa. Deselecting one kinase will deselect only the substrates specific for that kinase, keeping the other ones. We find this approach to be an intuitive way to gain insight into pleiotropic properties of kinases. Similar to the search interface, map selections can be visualized in their network context.
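The connectivity measure and the selection behaviour of the global map can be expressed compactly. The following sketch is an assumption about how such a map could be computed, not a description of the web interface code; the function names and the kinase and substrate identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def connectivity(
    pairs: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Build kinase -> substrates and substrate -> kinases lookups from
    predicted (kinase, substrate) pairs. The size of each set is the
    connectivity that a colour scale could encode."""
    substrates_of: Dict[str, Set[str]] = defaultdict(set)
    kinases_of: Dict[str, Set[str]] = defaultdict(set)
    for kinase, substrate in pairs:
        substrates_of[kinase].add(substrate)
        kinases_of[substrate].add(kinase)
    return substrates_of, kinases_of


def highlighted_substrates(
    selected_kinases: Iterable[str],
    substrates_of: Dict[str, Set[str]],
) -> Set[str]:
    """Union of the substrates of all currently selected kinases."""
    highlighted: Set[str] = set()
    for kinase in selected_kinases:
        highlighted |= substrates_of.get(kinase, set())
    return highlighted
```

Recomputing the union after a kinase is deselected gives the behaviour described above: a substrate shared with a kinase that is still selected remains highlighted, and only substrates specific to the deselected kinase drop out.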
In the future, we intend to keep NetworKIN up-to-date with the latest data on phosphorylation sites from Phospho.ELM and PhosphoSite, functional associations from STRING and consensus sequence motifs from Scansite and other sources. Furthermore, the algorithm will be extended to take into account docking motifs (e.g. for MAP kinases) and phosphorylation-dependent binding modules (e.g. SH2, PTB and BRCT domains), which is expected to both improve the prediction accuracy and facilitate more comprehensive modelling of signalling networks. We also intend to extend the method to include phosphatases as soon as data for this are available to us. Although NetworKIN is so far specifically aimed at protein phosphorylation, many other post-translational modifications are mediated by enzymes that recognize short linear motifs. We thus expect the same principle of specificity through context to apply. For example, the modifications of histone tails through acetylation, methylation and phosphorylation have been shown to be context dependent (16), and acetylated or methylated sites in turn bind interaction domains, such as bromo- or chromodomains. Extending the resource to cover other post-translational modifications is thus a long-term goal.
We thank Christian von Mering for developing the best-path viewer at our request and Sara Quirk, Claus Jørgensen and Ginny I. Chen for feedback on the online resource and comments on this manuscript. Thanks to Chris Tan Soon Hen for technical assistance. We are deeply grateful to Phospho.ELM and PhosphoSite for their continued hard and absolutely essential work on high-quality annotation of the phosphoproteomes.
This project was funded by Genome Canada through Ontario Genomics Institute, by the National Institutes of Health (U54-CA112967 and GM60594) as well as through the ADIT Integrated Project, contract number LSHB-CT-2005-511065, and through the BioSapiens Network of Excellence, contract number LSHG-CT-2003-503265, both funded by the European Commission FP6 Programme. R.L. is a Human Frontiers Science Research Fellow. Funding to pay the Open Access publication charges for this article was provided by Genome Canada.
Conflict of interest statement. None declared.