Extracting reaction networks from databases–opening Pandora’s box

Large quantities of information describing the mechanisms of biological pathways continue to be collected in publicly available databases. At the same time, experiments have increased in scale, and biologists increasingly use pathways defined in online databases to interpret the results of experiments and generate hypotheses. Emerging computational techniques that exploit the rich biological information captured in reaction systems require formal standardized descriptions of pathways to extract these reaction networks and avoid the alternative: time-consuming and largely manual literature-based network reconstruction. Here, we systematically evaluate the effects of commonly used knowledge representations on the seemingly simple task of extracting a reaction network describing signal transduction from a pathway database. We show that this process is in fact surprisingly difficult, and that the pathway representations adopted by various knowledge bases have dramatic consequences for reaction network extraction, connectivity, the capture of pathway crosstalk and the modelling of cell–cell interactions. Researchers constructing computational models built from automatically extracted reaction networks must therefore consider the issues we outline in this review to maximize the value of existing pathway knowledge.


Kyoto Encyclopaedia of Genes and Genomes
Data from KEGG [5] was sourced from the KEGG PATHWAY database between March and September 2012. KEGG offers two methods of access to its underlying data: subscription-based access via FTP to the full dataset, or access to descriptions of individual pathways, either via download links on the graphical representation of each pathway or via the get method of its API. We have included examples from KEGG to illustrate the applicability of our points to multiple databases but, because we lacked access to the entire dataset, have not used KEGG as a primary resource.

Variation in Implementation
Data:
PANTHER Pathways: 'JAK STAT signalling pathway.owl'
Reactome: 'Homo sapiens.owl'
NCI-PID: 'NCI-Nature Curated.bp3.owl'
Note: Data entry and curation in the PANTHER Pathways database is handled using the CellDesigner software, which provides graphical editing of interaction networks. This data can then be exported from the software to the various levels of the BioPAX format [1]. As such, comments about the structure and format implementation of PANTHER's data are generalisable to all databases specified with this software.

Storing Data
One major problem from an analytical perspective is variation in where databases store information about the participants in the reactions they describe. This variation can occur even when the databases use identical formats to store their information.
Figure 1. Each database varies in how and where it stores the UniProt accession data (larger diamonds) for this protein. We display a tree-based representation of these records, where the root is the parent record, each internal node represents a field or sub-field in the database, and the leaf nodes represent values. a) NCI-PID representation. b) Reactome representation. Note that two UniProt accessions are stored, both containing the same accession details in different formats ('P23458' and 'UniProt:P23458 JAK1'). c) PANTHER Pathways representation, which records multiple UniProt accession details, including those for orthologs in other species and a number of alternative accessions. 'P23458' is one of many captured.
For example, most databases cover the post-translational modifications of Janus kinase 1 (JAK1; UniProt:P23458). We have selected three records describing JAK1 phosphoproteins with two features (non-identical due to a lack of overlap) from PANTHER Pathways, Reactome and NCI-PID to illustrate this variation. These are shown in diagrammatic form in Figure 1 and in a detailed tabular representation in the appendix to this document (Tables 4, 5 and 6).
In the worst case, this variation necessitates a complete search of every attribute linked to a given biochemical entity (complex, protein, small molecule, RNA or DNA) in a reaction system to find a given attribute. If these attributes need to be accessed frequently, as they are when experimental data labelled with UniProt [11] or ENSEMBL [3] identifiers is mapped to computational models, then the number of searches required becomes problematic.
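One way to avoid repeating the exhaustive search is to perform it once and cache the result as an index from accession to entity. The following is a minimal sketch under stated assumptions: the nested dictionary records, the entity ids and the accession 'Q00000' are hypothetical stand-ins for parsed database entries, and the regular expression is the accession pattern published in the UniProt documentation, used here only as a heuristic (any string with the same shape will match).

```python
import re
from collections import defaultdict

# Accession pattern from the UniProt documentation, wrapped in word boundaries
# so that it also matches inside strings such as 'UniProt:P23458 JAK1'.
ACCESSION = re.compile(
    r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
)


def walk_values(record):
    """Yield every leaf string in a nested dict/list record."""
    if isinstance(record, dict):
        for value in record.values():
            yield from walk_values(value)
    elif isinstance(record, (list, tuple, set)):
        for value in record:
            yield from walk_values(value)
    elif isinstance(record, str):
        yield record


def build_accession_index(entities):
    """Search every attribute of every entity once; map accession -> entity ids."""
    index = defaultdict(set)
    for entity_id, record in entities.items():
        for value in walk_values(record):
            for accession in ACCESSION.findall(value):
                index[accession].add(entity_id)
    return index


# Hypothetical records: the accession sits in a different place in each one.
entities = {
    "pid_protein": {"entityReference": {"xref": [{"db": "UniProt", "id": "P23458"}]}},
    "reactome_protein": {"name": ["UniProt:P23458 JAK1"], "xref": [{"id": "P23458"}]},
    "panther_protein": {"comment": ["P23458", "Q00000"]},
}
index = build_accession_index(entities)
print(sorted(index["P23458"]))
```

After the single pass, each lookup of an experimental identifier is a dictionary access rather than a new traversal of every attribute.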
Other variation arises from ambiguity within the specification itself. For example, the description of the implementation of Complexes in the BioPAX Level 3 specification (pg. 48 [1]) states the following:
"In general, complexes should not be defined recursively so that smaller complexes exist within larger complexes, i.e. a complex should not be a component of another complex (to avoid errors in interpretation - see comments on the component property below). Instead, the subunits should be a simple list."
This recommendation is not implemented by any of the major databases. The NCI-PID data contains 2,751 recursively defined complexes (of 9,016), Reactome 3,485 (of 6,040) and PANTHER 34 (of 913). This is problematic for a number of reasons. Firstly, the presence of recursively defined complexes implies an assembly order that is not necessarily biologically or mechanistically correct. Secondly, mapping components becomes difficult because of the need to recurse to the bottom of the 'Complex tree' that results from this type of structure. Finally, it makes checking whether a Complex is duplicated vastly more difficult (Figure 2).
Entries in a database should be unique: if the database describes adenosine-5'-triphosphate (ATP, CAS:56-65-5) in the cytosol of a cell, this should be represented by a single entry for cytosolic ATP. Duplication of entries leads to loss of network connectivity, because the participations of the molecule are fragmented and no longer connected.
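The recursion needed to handle nested complexes, and the duplicate check that depends on it, can be sketched as follows. This is an illustration rather than a method used by any of the databases: the `components` mapping and all ids are hypothetical, and flattening to a set deliberately ignores stoichiometry, cellular location and modifications, which a real comparison would also need to consider.

```python
from collections import defaultdict


def flatten_complex(complex_id, components, _path=None):
    """Return the frozenset of non-complex (leaf) members of a complex.

    components maps a complex id to the list of its direct component ids.
    Any id that is not a key of the mapping is treated as a leaf
    (protein, small molecule, RNA or DNA).
    """
    path = set() if _path is None else _path
    if complex_id in path:
        raise ValueError(f"cyclic complex definition at {complex_id}")
    path.add(complex_id)
    leaves = set()
    for member in components[complex_id]:
        if member in components:
            # Nested complex: recurse to the bottom of the 'Complex tree'.
            leaves |= flatten_complex(member, components, path)
        else:
            leaves.add(member)
    path.discard(complex_id)
    return frozenset(leaves)


def find_duplicate_complexes(components):
    """Group complexes that contain the same leaves once nesting is removed."""
    groups = defaultdict(list)
    for complex_id in components:
        groups[flatten_complex(complex_id, components)].append(complex_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical data: C1 and C3 nest differently but contain the same subunits as C5.
components = {
    "C1": ["A", "C2"],
    "C2": ["B", "C"],
    "C3": ["C4", "C"],
    "C4": ["A", "B"],
    "C5": ["A", "B", "C"],
}
print(find_duplicate_complexes(components))
```

In this toy data, C1, C3 and C5 describe the same set of subunits, yet only a full recursive expansion reveals that; with a simple list of subunits per complex the comparison would be a direct set equality.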

Uniqueness Criteria
Duplication of entries poses a major problem for modelling applications because of the importance of capturing crosstalk, that is, interactions between pathways. In current pathway databases, data is curated in a functional fashion: a team of curators begins with some functional process (for example, apoptosis) and conducts extensive searches of the literature to enumerate the set of interactions and reactions that have been determined to be relevant to that process [7,9,5]. Because of this, the main points at which crosstalk can occur lie at the intersections of the sets of participants in individual curated pathways. Duplication of entities in the network obscures this overlap and prevents the capture of signal flow.
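The effect of duplication on crosstalk can be shown with a small sketch in which candidate crosstalk points are computed as pairwise intersections of pathway participant sets. The pathway names, entry ids and the `canonical` mapping are hypothetical; how such a mapping would be obtained is exactly the de-duplication problem discussed in this section.

```python
from itertools import combinations


def crosstalk_candidates(pathways, canonical=None):
    """Return {(pathway_a, pathway_b): shared participants} for overlapping pairs.

    pathways maps a pathway name to a set of participant entry ids.
    canonical optionally maps duplicated entry ids to one representative id.
    """
    canonical = canonical or {}
    resolved = {
        name: {canonical.get(p, p) for p in participants}
        for name, participants in pathways.items()
    }
    shared = {}
    for a, b in combinations(sorted(resolved), 2):
        common = resolved[a] & resolved[b]
        if common:
            shared[(a, b)] = common
    return shared


# Hypothetical example: the same protein was entered twice under different ids.
pathways = {
    "pathway_A": {"JAK1_entry_1", "STAT3_entry_1"},
    "pathway_B": {"JAK1_entry_2", "EGFR_entry_1"},
}
hidden = crosstalk_candidates(pathways)
revealed = crosstalk_candidates(pathways, {"JAK1_entry_2": "JAK1_entry_1"})
print(hidden, revealed)
```

With the duplicated ids left as they are, the two pathways appear disconnected; once both ids resolve to one entry, the shared participant becomes visible as a potential crosstalk point.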
Examples of unexpected duplication are common in the modelling literature. For example, Dasika et al. [2] describe a modelling experiment in which nine signalling pathways (sourced from the PANTHER Pathways database) implicated in prostate cancer are combined for simulation using constraint-based techniques. They provide their data as supplementary material. This data contains a large number of duplications, with a significant effect on the topology of the interaction network (Figure 1 in the main paper; a table describing the duplication is provided as a supplementary file).
This problem is not confined to the PANTHER Pathways data. Precisely determining the level of duplication in a given database is difficult. A given entry in the database contains a set of names, a cellular location, a set of identifiers (internal and external), and a set of modifications and features (e.g. post-translational modifications in proteins). All of these properties (with the exception of the databases' internal identifiers, which are unique and set at data entry) need to be considered when comparing entries for identification purposes.
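A comparison over these properties might be sketched as below. The `Entry` structure, its field names and the strict all-fields-equal criterion are assumptions made for illustration; in practice name sets and external identifiers often overlap only partially, so an exact-match key like this one gives a lower bound on duplication rather than a precise count.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    internal_id: str          # unique per database, so excluded from the comparison
    names: frozenset
    location: str
    xrefs: frozenset          # external identifiers only
    modifications: frozenset


def identity_key(entry):
    """Build a comparison key from every property except the internal id."""
    return (
        frozenset(name.strip().lower() for name in entry.names),
        entry.location.strip().lower(),
        entry.xrefs,
        entry.modifications,
    )


def group_duplicates(entries):
    """Return lists of internal ids whose entries share an identity key."""
    groups = defaultdict(list)
    for entry in entries:
        groups[identity_key(entry)].append(entry.internal_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical entries: cytosolic ATP entered twice, plus one nuclear entry.
entries = [
    Entry("atp_1", frozenset({"ATP"}), "cytosol", frozenset({"CAS:56-65-5"}), frozenset()),
    Entry("atp_2", frozenset({"atp"}), "Cytosol", frozenset({"CAS:56-65-5"}), frozenset()),
    Entry("atp_3", frozenset({"ATP"}), "nucleoplasm", frozenset({"CAS:56-65-5"}), frozenset()),
]
print(group_duplicates(entries))
```

Including the location in the key keeps the nuclear entry separate, which matches the expectation that one molecule in two compartments is legitimately two entries.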

KEGG: FGF
KEGG currently offers subscription-only access to its underlying data. The publicly available data is presented as sets of pathway diagrams. One of these, the diagram describing the 'MAPK signalling pathway' (hsa04010), contains 46 bucketed entities and 13 unique generic events (such as 'Proliferation, differentiation').

Multicellular Systems and Cell-Cell Interaction
Many databases describe interaction systems that span multiple cells, such as paracrine signalling, synaptic signalling and host-pathogen interactions. Problems arise when these databases record only subcellular localisation. An example is the BioPAX Level 3 version of the Reactome pathway 'Latent infection of Homo sapiens by Mycobacterium tuberculosis'. This pathway describes the internalisation of Mycobacterium by macrophages and the countering of the innate immune response before the bacterium enters its persistent state [8,6]. The diagrammatic representation of the pathway shows two unlabelled cells interacting and is suggestive of a two-part interaction system. In contrast, the BioPAX version of the same pathway makes no distinction between the subcellular locations of the two cells; that is, the cytosol of the bacterium and the cytosol of the host cell are merged. This merging of bacterial and human reactions and interactions dramatically changes the interaction system and its capabilities, introducing significant error into models derived from these data.
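The consequence of recording only a subcellular location can be illustrated with a toy example. The record layout and the `cell` field are hypothetical and are not taken from any of the databases discussed; the sketch only shows how the pools that a model would treat as a single species change when the location is qualified by the cell it belongs to.

```python
from collections import defaultdict


def pool_entities(records, qualify_by_cell):
    """Group entity records into the pools a model would treat as one species."""
    pools = defaultdict(list)
    for record in records:
        key = (record["name"], record["compartment"])
        if qualify_by_cell:
            # Assumed extra field: which cell the compartment belongs to.
            key = (record["cell"],) + key
        pools[key].append(record["id"])
    return dict(pools)


# Hypothetical records for a host-pathogen system.
records = [
    {"id": "e1", "name": "ATP", "compartment": "cytosol", "cell": "macrophage"},
    {"id": "e2", "name": "ATP", "compartment": "cytosol", "cell": "Mycobacterium tuberculosis"},
]
merged = pool_entities(records, qualify_by_cell=False)
separate = pool_entities(records, qualify_by_cell=True)
print(len(merged), len(separate))
```

Without the cell qualifier, the host and bacterial pools collapse into one, so any reaction consuming the molecule in one cell would appear to draw on the other cell's supply.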