Knowledge of the various interactions between molecules in the cell is crucial for understanding cellular processes in health and disease. Currently available interaction databases, being largely complementary to each other, must be integrated to obtain a comprehensive global map of the different types of interactions. We have previously reported the development of an integrative interaction database called ConsensusPathDB (http://ConsensusPathDB.org) that aims to fulfill this task. In this update article, we report its significant progress in terms of interaction content and web interface tools. ConsensusPathDB has grown mainly due to the integration of 12 further databases; it now contains 215 541 unique interactions and 4601 pathways from a total of 30 databases. Binary protein interactions are scored with our confidence assessment tool, IntScore. The ConsensusPathDB web interface allows users to take advantage of these integrated interaction and pathway data in different contexts. Recent developments include pathway analysis of metabolite lists, visualization of functional gene/metabolite sets as overlap graphs, gene set analysis based on protein complexes and induced network modules analysis that connects a list of genes through various interaction types. To facilitate the interactive, visual interpretation of interaction and pathway data, we have re-implemented the graph visualization feature of ConsensusPathDB using the Cytoscape.js library.
A major goal of systems biology is to assemble an exhaustive global map of the functional relationships, or interactions, between physical entities in the cell (genes, proteins, metabolites, etc.) (1). Many methods have been developed to measure such interactions and have been applied to model organisms and to human. Thus, hundreds of thousands of interactions have already been detected, reported in the literature and assembled in interaction databases (2); however, these databases are often complementary and tend to focus on one or a few types of interactions while in reality all the different interaction types coexist in the cell. In order to obtain a global interaction map that reflects biology as completely as possible, subject to the currently available interaction knowledge, many available interaction resources have to be considered. The heterogeneity of databases in terms of interaction type, data model and data exchange format complicates their integration. To facilitate the exchange and integration of data from different resources, standard file formats such as PSI-MI (3) and BioPAX (4), and respective platforms for data exchange such as PSICQUIC (5) and Pathway Commons (6) have been developed. However, not all interaction resources have adopted standard formats, e.g. because they are not compatible with the data model of the respective resource. Despite these hurdles, we have developed a database called ConsensusPathDB that integrates different types of interactions from numerous resources into a seamless global network (7,8). In this network, physical entities (genes, proteins, metabolites, etc.) from different sources are matched depending on their accession numbers and interactions are matched depending on their participants to reduce data redundancy. 
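The matching step described above can be illustrated with a minimal sketch. It assumes that every participant has already been mapped to a common accession namespace and that an interaction is identified solely by its set of participants; the record format, the function name `merge_interactions` and the accession strings are hypothetical and do not reflect the actual ConsensusPathDB implementation.

```python
from collections import defaultdict


def merge_interactions(records):
    """Group interaction records that share an identical participant set.

    `records` is an iterable of (source_db, participants) pairs, where
    `participants` is an iterable of accession numbers (assumed format).
    Returns a dict mapping each frozenset of participants to the set of
    source databases that report this interaction.
    """
    merged = defaultdict(set)
    for source_db, participants in records:
        key = frozenset(participants)  # participant order does not matter
        merged[key].add(source_db)
    return dict(merged)


# Toy input with placeholder accessions (not real data).
records = [
    ("DB_A", ["ACC1", "ACC2"]),
    ("DB_B", ["ACC2", "ACC1"]),  # same pair as above, different order
    ("DB_B", ["ACC1", "ACC3"]),
]
merged = merge_interactions(records)
for participants, sources in merged.items():
    print(sorted(participants), sorted(sources))
```

Using the participant set as the key means that the same interaction reported by two resources is stored once, with both resources recorded as its sources.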
The web interface of ConsensusPathDB aims to serve as a one-stop shop for searching, visualizing and retrieving the integrated interaction data, as well as for tools that use these data for interaction- and pathway-centric analysis of genes, proteins and metabolites (resulting, e.g. from large-scale transcriptomics, proteomics or metabolomics experiments). In this database update article, we report the most significant recent advancements of ConsensusPathDB in terms of human interaction content and web interface functionalities. In addition to human data, ConsensusPathDB instances exist for interaction and pathway data from the model organisms, mouse and yeast.
DATABASE CONTENT UPDATE
Since our last report on ConsensusPathDB (8), the database has grown both in terms of the types of interactions supported and in terms of source databases (that is, databases whose interaction data are integrated in ConsensusPathDB). Newly integrated interaction types comprise genetic interactions and drug–target interactions in addition to the already supported types (protein–protein interactions, metabolic and signaling biochemical reactions, and gene regulatory interactions). Although human genetic interaction data are currently scarce and there are only 265 such interactions in the latest ConsensusPathDB version [originating from BioGRID (9)], their number is expected to increase in the future. In contrast, large amounts of drug–target interaction data have been extracted from the literature into several dedicated databases. There are currently 33 081 drug–target interactions in ConsensusPathDB that originate from four such databases.
The number of source databases integrated in ConsensusPathDB has grown over the last 2 years since our last report (8) from 18 to 30 databases. The newly integrated resources are BIND (protein–protein interactions) (10), DrugBank (drug–target interactions) (11), InnateDB (protein–protein, biochemical and gene regulatory interactions) (12), MatrixDB (protein–protein interactions) (13), PDZBase (protein–protein interactions) (14), PhosphoPOINT (protein–protein and biochemical interactions) (15), PhosphoSitePlus (biochemical interactions) (16), PINdb (protein–protein interactions) (17), SignaLink (biochemical pathways) (18), SMPDB (biochemical pathways) (19), TTD (drug–target interactions) (20) and WikiPathways (biochemical pathways) (21). Drug–target interactions have been additionally extracted from the previously integrated databases KEGG (22) and PharmGKB (23). Although we do not curate primary datasets, we have integrated a recently published, large-scale spliceosomal protein–protein interaction network obtained through yeast two-hybrid screening from our own research (24).
The number of unique interactions stored in ConsensusPathDB has grown in the last 2 years from 155 432 to 215 541 interactions, in part because of the integration of new databases and in part because the content of the previously included resources has grown. Analysis of the number of source databases per interaction in ConsensusPathDB shows that the distribution is right-skewed, with most of the interactions (161 396 interactions, 75%) originating from a single source database (Figure 1). These results indicate that currently available databases are highly complementary [in agreement with other reports in the literature, e.g. refs. (25) and (26)] and, importantly, that the integrated interaction map in ConsensusPathDB has not yet reached saturation. This underlines the importance of further interaction data integration. To rule out effects from missed interaction mappings due to technical issues (e.g. missing accession number annotation of interaction participants), we repeated the analysis considering only those interactions with unambiguously identifiable participants. This analysis showed very similar trends (data not shown).
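The kind of tally behind this analysis can be sketched as follows. The example assumes a hypothetical mapping from each unique interaction to the set of source databases reporting it; the identifiers and toy data are illustrative only. The last lines recompute the single-source percentage from the two counts quoted above.

```python
from collections import Counter


def sources_per_interaction(interaction_sources):
    """Histogram: number of source databases -> number of interactions."""
    return Counter(len(sources) for sources in interaction_sources.values())


def single_source_fraction(histogram):
    """Fraction of interactions that are reported by exactly one source."""
    total = sum(histogram.values())
    return histogram[1] / total if total else 0.0


# Toy data (hypothetical interaction identifiers and database names).
toy = {
    "int1": {"DB_A"},
    "int2": {"DB_A", "DB_B"},
    "int3": {"DB_C"},
    "int4": {"DB_B"},
}
hist = sources_per_interaction(toy)
toy_fraction = single_source_fraction(hist)

# Percentage implied by the counts reported in the text.
reported_percent = round(100 * 161396 / 215541)
print(hist, toy_fraction, reported_percent)
```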
Apart from extending the quantity of the ConsensusPathDB content, we have also taken measures for assessing its quality. Interactions stored in public resources are known to be of different confidence. Reportedly, considerable fractions of the available protein–protein interaction data may result from experimental or literature mining errors (26,27). Thus, we have assessed the confidence of binary protein–protein interactions stored in ConsensusPathDB. This was done using an integrative approach that exploits network-topological features and annotation features to derive confidence scores for each individual interaction. Among the network-topological methods integrated in the approach is a parameter-free, reference data-independent method for scoring large binary interaction networks called CAPPIC, which was developed by us (28). The integrative approach has been implemented as a web tool called IntScore (http://intscore.molgen.mpg.de) (29), which was applied to the ConsensusPathDB protein–protein interaction network (Supplementary Methods). Notably, the protein–protein interactions in ConsensusPathDB are only scored and not filtered. The available scores are displayed in the web interface and can be used as a filtering criterion by the users.
WEB INTERFACE FEATURES UPDATE
Pathway analysis of metabolite lists
Over the past decade, pathway over-representation/enrichment analysis of gene lists has proven a very useful tool for interpreting large-scale transcriptomics and proteomics data (30). Such analysis is able to pinpoint biochemical pathways that are dysregulated and may have a causative relationship to the phenotype under study or act as conductors of the biological signal leading to it. With state-of-the-art technologies making it possible to measure the cellular concentrations of a panel of metabolites, metabolite signatures for more and more phenotypes are being generated (31). Like abnormal gene expression, abnormal metabolite concentrations can also provide clues about potentially dysregulated metabolic or signaling pathways in the samples under study. To facilitate the analysis of metabolomics data on the pathway level, the web interface of ConsensusPathDB now provides pathway over-representation and enrichment analysis functionality for user-specified metabolite lists. It exploits the fact that most of the pathways stored in our database (3321 out of 4601 pathways, 72%) contain metabolites in addition to genes. To identify candidate phenotype-associated pathways, statistical tests analogous to those described previously in the context of gene set analysis (7) are performed with the user-specified metabolite input. Although several tools for pathway-based evaluation of metabolite lists are already available (32–34), ConsensusPathDB has the advantage of possessing a rich pathway repertoire collected from 12 resources for biochemical pathways. Moreover, if the user has both transcriptomics/proteomics and metabolomics data from a particular phenotype at hand, ConsensusPathDB can serve as a one-stop shop for analyzing these data based on the same set of pathways. This saves the user the time and effort needed to become familiar with two separate tools for the different data types, which would moreover typically be based on different sets of pathways.
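The exact statistical test is not specified in this section. As an illustration only, the sketch below assumes a one-sided hypergeometric test, a common choice for over-representation analysis; the function name and the parameter names `k`, `n`, `K` and `N` are hypothetical.

```python
from math import comb


def overrepresentation_pvalue(k, n, K, N):
    """Upper-tail hypergeometric P-value, P(X >= k).

    N: number of metabolites in the background
    K: number of background metabolites annotated to the pathway
    n: number of metabolites in the user-specified list
    k: number of list metabolites that belong to the pathway
    """
    upper = min(n, K)
    # math.comb returns 0 when the lower index exceeds the upper index,
    # so impossible configurations contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Toy example: 8 of 40 input metabolites fall into a 50-member pathway,
# with a background of 2000 metabolites.
p = overrepresentation_pvalue(k=8, n=40, K=50, N=2000)
print(p)
```

Only the standard library is used here so that the calculation stays explicit; in practice a statistics package would usually be preferred, and a multiple-testing correction would follow when many pathways are tested.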
Visualization of functional gene/metabolite sets as overlap graphs
The typical output of most tools for gene/metabolite set over-representation/enrichment analysis is a table where predefined functional gene/metabolite sets (e.g. pathways) are listed, ranked according to some statistical measure of association with the user-specified input (most often a P-value). However, the functional sets often overlap with each other to some extent; for example, they may stand in a hierarchical relationship to each other [like Reactome pathways (35) or Gene Ontology categories (36)] or may share key elements. Thus, to facilitate the visual interpretation of over-representation/enrichment analysis results, we have introduced in ConsensusPathDB the possibility to visualize the resulting functional gene/metabolite sets [pathways, neighborhood-based entity sets (NESTs) (8), Gene Ontology categories and protein complexes] as overlap graphs (Figure 2). In these graphs, each node represents a separate predefined functional set whose member list size (i.e. number of genes/metabolites contained) and P-value are encoded as node size and color, respectively. Two nodes are connected by an edge if the corresponding functional sets share members (genes/metabolites). The edge width reflects the relative overlap calculated with the Fowlkes–Mallows index (37) from the number of shared members and the sizes of the two gene/metabolite sets. The edge color encodes the number of shared members that are also found in the user’s input (denoted ‘shared candidates’). The user can click on the nodes and edges of the overlap graph to view a list of the pertinent genes/metabolites. The visual representation helps the user to quickly identify related biological processes that together show a changed activity, e.g. because they have the same key regulators. Moreover, it gives a quick overview of the relationships between the different types of functional sets (e.g. particular Gene Ontology biological process categories may be very similar to particular pathways contained in pathway databases). Last but not least, the color coding of edges can provide clues about potentially dysregulated crosstalk between different biological processes. The overlap graph visualization environment features a filter that can be applied to edges in order to highlight only the closest relationships between functional gene/metabolite sets.
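The two edge attributes described above can be sketched as follows. The sketch assumes that, for two sets A and B, the Fowlkes–Mallows index reduces to the number of shared members divided by the geometric mean of the two set sizes, |A ∩ B| / sqrt(|A| · |B|); this is one reading of the index for a pair of sets, and the function names are hypothetical.

```python
from math import sqrt


def relative_overlap(set_a, set_b):
    """Assumed edge width: shared members over the geometric mean of sizes."""
    if not set_a or not set_b:
        return 0.0
    shared = len(set_a & set_b)
    return shared / sqrt(len(set_a) * len(set_b))


def shared_candidates(set_a, set_b, user_input):
    """Assumed edge color basis: shared members also present in the input."""
    return set_a & set_b & user_input


# Toy functional sets and user input (placeholder member names).
pathway = {"g1", "g2", "g3", "g4"}
go_category = {"g3", "g4", "g5", "g6"}
user_list = {"g4", "g6", "g9"}

width = relative_overlap(pathway, go_category)
candidates = shared_candidates(pathway, go_category, user_list)
print(width, candidates)
```

Under this reading the value ranges from 0 (no shared members) to 1 (identical sets), so an edge filter can be expressed as a simple threshold on it.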
To exemplify how this overlap graph feature of ConsensusPathDB can be used for interpreting transcriptomics/proteomics data, Figure 2 displays results from a toxicogenomics context. Here, functional gene sets are shown that are significantly over-represented (P < 0.05) in an input list of 410 genes that appeared differentially expressed (P < 0.01) in an in vitro assay of human hepatocyte-like cells that were treated with the genotoxic chemical benzo[a]pyrene, compared with an untreated control (38). The functional gene sets include manually curated pathways, Gene Ontology categories, NESTs and protein complexes that overlap with each other to different extents. The largest module of overlapping functional sets visible in Figure 2 is formed by genotoxic stress response pathways related to p53, DNA damage, apoptosis and cancer signaling. The module also includes gene sets centered on several ubiquitin E3 ligases: COP1 [gene symbol: RFWD2, a negative regulator of p53 (39)], RBX1 (Gene Ontology annotation: DNA repair) and DDB2 complex [involved in DNA repair (40)]. The results are in line with the fact that benzo[a]pyrene is a highly carcinogenic compound due to its mutagenic nature. Consistent with this, the benzo[a]pyrene metabolism pathway from WikiPathways forms a separate module together with estrogen metabolism pathways from PharmGKB and WikiPathways (upper right part of Figure 2). A third module is formed by gene sets associated with the mitochondrial ribosome (upper left part of Figure 2).
Gene set analysis based on protein complexes
A further new feature of the ConsensusPathDB web interface is the over-representation/enrichment analysis of gene lists based on curated protein complexes [in addition to functional gene sets defined over curated pathways, Gene Ontology categories and NESTs, as reported previously (8)]. ConsensusPathDB currently contains 12 263 unique curated protein complexes originating from a total of 10 resources. In total, 4070 complexes have at least four distinct protein components and thus define functional sets whose size (i.e. number of member genes) is adequate for statistical over-representation/enrichment tests. These 4070 protein complex-based functional gene sets contain a total of 4645 unique genes. Notably, many of these gene sets do not correspond to any human-curated pathways or otherwise defined gene categories.
Induced network modules analysis with gene lists
In addition to the over-representation/enrichment analysis of predefined functional gene sets as detailed above, the web interface of ConsensusPathDB now provides another approach for the interaction- and pathway-centric analysis of lists of genes, called induced network modules analysis (41). Given a list of so-called seed genes (e.g. resulting from microarray experiments, which are unable to directly disclose the functional relationships between genes), it aims to interconnect those genes through different types of interactions (e.g. physical, biochemical, regulatory, etc.; selectable by the user) (Figure 3). This information on the pairwise functional/physical relationships between the genes can shed light on the biological reasons why they are identified together in the experiment. For example, if a group of genes found to be over-expressed in a microarray experiment are highly interconnected through physical interactions, this suggests that those genes may encode proteins which together form a protein complex that has a high concentration in the phenotype under study and thus may be relevant for this phenotype.
Notably, the induced network modules may optionally include genes that are not in the user-supplied seeds list, but associate two or more seed genes with each other and overall have significantly many connections within the induced network module (Figure 3, nodes with purple labels). These so-called intermediate genes could be associated with the phenotype under study, although they may not be regulated on the transcriptional level and thus do not appear in the input gene list. For example, if a group of seed genes are all connected through gene regulatory interactions to an intermediate node that represents a transcription factor, this suggests that the transcription factor may be dysfunctional (e.g. due to a mutation, which does not necessarily impact the transcription factor’s expression). Intermediate genes are ranked according to the significance of association with the seeds list given their overall connectivity in the background network. This is quantified by a z-score calculated for each intermediate node with the binomial proportions test as per Berger et al. (41). The z-score threshold can be controlled dynamically by the user in order to create sub-networks involving many intermediate and seed genes with a less stringent threshold or more compact sub-networks with a more stringent threshold.
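The formula for the z-score is not given in this section. The sketch below assumes the standard one-sample binomial proportions form, z = (a/b − c/d) / sqrt((c/d)(1 − c/d)/b), where a is the number of links from the intermediate node to seed genes, b is the total number of links of the intermediate node in the background network, c is the number of links in the induced sub-network and d is the number of links in the background network. These definitions and all names in the code are assumptions and should be checked against Berger et al. (41).

```python
from math import sqrt


def intermediate_z_score(links_to_seeds, node_links, subnet_links, background_links):
    """Binomial proportions z-score for one intermediate node (assumed form)."""
    observed = links_to_seeds / node_links      # a / b
    expected = subnet_links / background_links  # c / d
    std_err = sqrt(expected * (1.0 - expected) / node_links)
    if std_err == 0.0:
        raise ValueError("z-score undefined when the expected proportion is 0 or 1")
    return (observed - expected) / std_err


def filter_intermediates(z_scores, threshold):
    """Keep the intermediate nodes whose z-score reaches the chosen threshold."""
    return {node for node, z in z_scores.items() if z >= threshold}


# Toy numbers: 5 of the node's 10 links go to seeds, while the sub-network
# holds 100 of the 1000 links in the background network.
z = intermediate_z_score(5, 10, 100, 1000)
kept = filter_intermediates({"geneX": z, "geneY": 0.8}, threshold=2.0)
print(round(z, 3), kept)
```

Raising the threshold in `filter_intermediates` corresponds to the more stringent setting described above and yields a more compact sub-network.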
Berger et al. (41) originally suggested the induced network modules approach and implemented it as a web tool called Genes2Networks. Their tool is limited to physical protein–protein interactions, which furthermore originate from a much smaller number of sources than those in ConsensusPathDB. Nevertheless, Genes2Networks allows the user to replace the default background network by a custom one, if available. The induced network modules analysis of ConsensusPathDB additionally features the possibility to overlay user-specified numerical values (e.g. expression values or fold changes) on nodes (genes/proteins) of the visualized network. Upon upload of a two-column, tab-delimited file containing gene accession numbers in the first column and numerical values in the second column, the values are encoded in the node color (green denoting low, negative values and red denoting high, positive values) to facilitate their visual interpretation in the context of the network (Supplementary Figure S1). The values may even be artificially created to reflect groupings of genes/proteins, e.g. according to their sub-cellular localization.
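The upload format described above is simple enough to sketch. The parser below follows the stated layout (accession in the first column, numerical value in the second, tab-delimited); the color mapping is an assumption, namely a linear scale from green through white to red, and is not necessarily the scale used by the web interface. All function names are hypothetical.

```python
import csv
import io


def read_node_values(handle):
    """Parse a two-column, tab-delimited file into {accession: value}."""
    values = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or incomplete lines
        try:
            values[row[0].strip()] = float(row[1])
        except ValueError:
            continue  # skip header lines or non-numeric entries
    return values


def value_to_rgb(value, max_abs):
    """Map a value to an RGB color: green if negative, red if positive."""
    if max_abs <= 0:
        return (255, 255, 255)
    t = max(-1.0, min(1.0, value / max_abs))
    fade = int(round(255 * (1.0 - abs(t))))  # fades to white near zero
    return (255, fade, fade) if t > 0 else (fade, 255, fade)


# Toy upload with placeholder values.
upload = io.StringIO("gene\tfold_change\nEGR1\t-2.0\nTP53\t1.5\n")
node_values = read_node_values(upload)
scale = max(abs(v) for v in node_values.values())
colors = {gene: value_to_rgb(v, scale) for gene, v in node_values.items()}
print(colors)
```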
Figure 3 depicts a network module induced by a list of genes differentially expressed in metastatic prostate cancer compared with primary prostate carcinoma [data obtained from (42) and available as an example gene set on the ConsensusPathDB web site]. The module is held together by different types of interactions, comprising protein–protein, biochemical, gene regulatory and drug–target interactions. Many intermediate nodes (Figure 3, nodes with purple labels) are known cancer-associated genes although, per definition, they are not present in the input set of genes differentially expressed in metastatic prostate cancer. Examples include TP53, TAF1 (node name: ‘Transcription initiation factor TFIID 250 kDa subunit’), VHL and SNCG. Interestingly, the breast cancer drug letrozole is also present in the module and connects the seed genes EGR1, CYR61 and COLEC12 through drug–target interactions. Furthermore, the induced network modules analysis suggests metastatic prostate cancer association of RPAIN (node name: ‘rip_human’; Gene Ontology annotation: DNA repair), VPS35 (Gene Ontology annotation: cell death) and a few other genes that appear as intermediate nodes. Overall, the module constitutes an interaction network ‘cold-spot’, since most of its members are under-expressed (Supplementary Figure S1).
Graph visualization tool
BioPAX Level 3 export
Networks viewed with the interaction visualization tool can now be exported in BioPAX Level 3 (4) format. This format is more descriptive than previous levels and allows a more precise standard description of the sub-network of interest. For example, BioPAX Level 3 is able to represent genetic and gene regulatory interactions, which was not possible in BioPAX Levels 2 or 1.
Display of drug information for genes/proteins
The integration of drug–target interactions with physiological ones (biochemical reactions, physical interactions, etc.) mentioned above is advantageous when it comes to interaction graph-centric analysis of disease phenotypes. For example, we have previously described a new class of functional gene sets called NESTs (8). A NEST is a set of genes that are linked through different types of interactions (possibly originating from different interaction databases) to a certain gene, which is itself also included in the NEST. Given an interaction network of genes, each gene and its direct network neighbors define a separate NEST. We have shown that NEST analysis in the context of gene expression data can aid the identification of disease-causing genes (8). If available, drug information is now shown for every gene/protein in the web interface of ConsensusPathDB (including the visualization tool). Thus, ConsensusPathDB can now serve for identifying a potential target for pharmaceutical treatment and, at the same time, for retrieving information on available drugs for that target.
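The NEST definition given above (each gene together with its direct network neighbors) can be written down in a few lines. This is a minimal sketch that assumes the network is available as an undirected list of gene pairs; the function name `build_nests` and the toy edges are hypothetical.

```python
from collections import defaultdict


def build_nests(edges):
    """Return {center gene: NEST}, where a NEST contains the center gene
    itself plus all of its direct neighbors in the interaction network."""
    neighbors = defaultdict(set)
    for gene_a, gene_b in edges:
        neighbors[gene_a].add(gene_b)
        neighbors[gene_b].add(gene_a)
    return {gene: {gene} | nbrs for gene, nbrs in neighbors.items()}


# Toy network; edges may stem from different interaction types or databases.
edges = [("A", "B"), ("A", "C"), ("C", "D")]
nests = build_nests(edges)
print(nests["A"])
```

Each resulting set can then be treated like any other predefined functional gene set in over-representation or enrichment analysis.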
Improvements of the ConsensusPathDB web services
We have extended the functionality of the ConsensusPathDB web services by adding enrichment analysis functions for lists of genes or metabolites. The repertoire of predefined gene sets that can be analyzed through gene set over-representation or enrichment analysis has been extended to include NESTs, Gene Ontology categories and protein complexes in addition to curated pathways. Thus, the web services now cover completely the gene/metabolite set over-representation/enrichment analysis functionality of the web interface.
Through the integration of 30 public interaction/pathway resources, ConsensusPathDB assembles to our knowledge the most comprehensive available map of human interactions and pathways. With regular content updates and database releases every 3 months, it is ensured that this map stays up-to-date. New databases are integrated into ConsensusPathDB at the rate of 1–2 databases per release; furthermore, new interaction types are occasionally added. The recent extensions of the web interface functionality, most of which serve for the interaction- and pathway-based interpretation of sets of genes coming e.g. from transcriptomics/proteomics studies, sets of metabolites e.g. from metabolomics measurements, and the integration of drug data with physiological interactions, open further perspectives for ConsensusPathDB applications in systems biomedicine and translational research.
The web interface of ConsensusPathDB is freely available to academic users at http://ConsensusPathDB.org. Information on web service access is provided on the ConsensusPathDB web page. The protein interaction part of the database content is available for download in tab-delimited and PSI-MI 2.5 formats on the web site. The gene compositions of biochemical pathways contained in ConsensusPathDB are available for download on the web site in a gene identifier namespace selectable by the user. Custom networks constructed by the user through interaction searches are downloadable in BioPAX Level 3 format. ConsensusPathDB can also be used for evidence mining of user-specified protein–protein interactions (e.g. obtained from an interaction screen) through a Cytoscape plugin (43). Moreover, separate ConsensusPathDB instances exist for the model organisms mouse (http://ConsensusPathDB.org/MCPDB) and yeast (http://ConsensusPathDB.org/YCPDB).
Supplementary Data are available at NAR Online: Supplementary Figure S1 and Supplementary Methods.
The European Commission’s Seventh Framework Programme [DiXa, 283775]; German Ministry of Education and Research [MedSys PREDICT, 0315428A; NGFNp, NeuroNet-TP3, 01GS08171]; Max Planck Society. Funding for open access charge: European Commission.
Conflict of interest statement. None declared.
We are grateful to the developers of all source databases who have provided interaction data to the public domain. We would also like to thank the ConsensusPathDB users who have provided valuable feedback. ConsensusPathDB is developed exclusively with open-source software whose contributors are gratefully acknowledged.