Reactome and the Gene Ontology: digital convergence of data resources

Abstract
Motivation: Gene Ontology Causal Activity Models (GO-CAMs) assemble individual associations of gene products with cellular components, molecular functions and biological processes into causally linked activity flow models. Pathway databases such as the Reactome Knowledgebase create detailed molecular process descriptions of reactions and assemble them, based on sharing of entities between individual reactions, into pathway descriptions.
Results: We have developed a software tool, Pathways2GO, that converts the entire set of normal human Reactome pathways into GO-CAMs. This conversion yields standard GO annotations from Reactome content and supports enhanced quality control for both Reactome and GO, providing a nearly seamless conversion between these two resources for the bioinformatics community.
Supplementary information: Supplementary data are available at Bioinformatics online.

and their outcomes. It depicts transitions of entities from one form to another as a result of different influences; thus, the temporal qualities of molecular events occurring in biochemical reactions are represented as in familiar drawings of metabolic pathways. As knowledge of a biological process increases, its PD representation increases rapidly in complexity. An AF representation avoids this complexity. Molecular details of processes are omitted and influences between activities are represented directly. Instead of displaying the details of biochemical reactions with process nodes and connecting arcs, AF diagrams show only influences such as 'stimulation' and 'inhibition' between the activities enabled by molecular entities. For example, a signal activity 'stimulates' a receptor activity, and the receptor activity in turn 'stimulates' the activity of an intracellular signaling adaptor protein.
To describe the action of a protein in an AF model of a reaction, it is necessary to identify an input entity that undergoes a transformation enabled by the activity of the protein. Successive transformations can be assembled into an AF model when the transformation in one reaction enables the activity required for a subsequent reaction. Enabling takes the form of generating a required input for that reaction. Reactions involving transformations of entities to novel chemical forms mediated by enzyme catalysts, or to novel subcellular locations mediated by transport proteins, readily fit this AF model: the Reactome PD controller maps to the GO-CAM AF active entity, and the PD reaction maps to the AF activity (Figure S1-2). This mapping is straightforward because small molecules, when covalently modified, are considered to be different molecules in both PD and AF: they are transformed into different entities. The causal relationship between sequential steps in a metabolic pathway is thus that the product of one reaction is the input of the next, or equivalently that the MF of the first reaction 'directly provides input for' the MF of the second reaction.
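The metabolic case can be illustrated with a minimal Python sketch. The `Reaction` class and `provides_input_edges` function are hypothetical names introduced here for illustration and are not part of Pathways2GO; the sketch only assumes that each reaction is reduced to named sets of inputs and outputs.

```python
from dataclasses import dataclass, field


@dataclass
class Reaction:
    """Hypothetical, simplified stand-in for a PD reaction."""
    name: str
    inputs: set = field(default_factory=set)
    outputs: set = field(default_factory=set)


def provides_input_edges(reactions):
    """Return (upstream, relation, downstream) triples whenever an
    output of one reaction is also an input of another reaction."""
    edges = []
    for up in reactions:
        for down in reactions:
            if up is not down and up.outputs & down.inputs:
                edges.append((up.name, "directly provides input for", down.name))
    return edges


# Two consecutive metabolic steps, heavily simplified for illustration.
hexokinase = Reaction(
    "hexokinase activity",
    inputs={"glucose", "ATP"},
    outputs={"glucose 6-phosphate", "ADP"},
)
isomerase = Reaction(
    "glucose-6-phosphate isomerase activity",
    inputs={"glucose 6-phosphate"},
    outputs={"fructose 6-phosphate"},
)
edges = provides_input_edges([hexokinase, isomerase])
```

Here the only edge produced links the first activity to the second, because glucose 6-phosphate is the sole entity shared between an output set and an input set.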

Figure S1-2
Reactions that influence the activity of a gene product, e.g. by activating a regulator or catalyst, are not straightforward to map, because of a fundamental difference in the PD and AF representations. In PD, a gene product entity is transformed into a different entity in the same way as a small molecule. This may be covalent (a protein is distinct from its phosphorylated form), spatial (a cytosolic protein is distinct from its nuclear form), or noncovalent (a protein complex is distinct from its constituent subunits). In AF, on the other hand, the focus is not transformations of a protein entity but rather its activity, and its modulation by an upstream activity. An upstream activity may also be covalent or noncovalent: a protein's activity can influence the activity of a second protein by covalently modifying it, transporting it, or binding it.
For a sequence of reactions that each have a controller (enzymatic or transport), the mapping between PD and AF is still fully specified. In this case, in PD the output of an upstream kinase reaction (e.g. a phosphoprotein) is the controller of a downstream reaction, which can be transformed into AF by asserting that the kinase activity of the upstream protein positively influences the activity of its downstream target (Figure S1-3).

Figure S1-3
There is no such general PD-AF mapping when the upstream reaction is a binding reaction, because it is not possible to identify which of the inputs to the binding activity is the causal agent with respect to influencing downstream activities. That is, an AF representation requires that one input somehow influences the activity of the other to determine the directionality of causal influence from a binding reaction. This direction-of-influence information is not present in the standard PD representation.
We have instead made hybrid models of such binding reactions, adopting the PD model for binding reactions but the AF model for controlled reactions (Figure S1-4). A hybrid model is not strictly AF and thus loses the advantages of this representation. It also violates the specific GO standards for representation of gene product function: an activity is to be associated with a specific gene product unless the activity is an emergent function that requires the products of multiple genes. For example, GO (and GO-CAM) treats the protein product of INSR as having "receptor activity," while in the hybrid model this activity is associated with a macromolecular complex composed of the INSR and INS gene products, and the activity is incorrectly modeled as enabled by a complex rather than a gene product. We therefore attempted to find additional information to allow us to infer causal directionality and define a fully causal model. An example of additional information provided by Reactome is the "active subunit" of a macromolecular complex. Currently, the active subunit is only specified for a subset of the reactions that are controlled by a macromolecular complex, and not for any binding reactions. If active subunit information were available for binding reactions, it would allow us to infer that one of the non-active inputs to the binding reaction that creates the complex must be the upstream causal influence on the active subunit. In most cases, it is clear that Reactome curators had such causal activity flows in mind when annotating these reactions, as demonstrated by the reactions' free text names. Figure S1-5 shows the first several steps of the WNT signaling pathway, and compares the literal interpretation of each Reactome reaction with the directionality implied by each reaction name. The reaction name clearly implies a causal directionality, e.g. "WNT binds FZD and LRP5/6" specifies WNT as the causal actor, but the PD model does not, stating simply that three proteins come together to form a complex. Reactome-derived GO-CAM models could in principle be edited manually to add this information post-import, but a better fix will be to annotate active unit information for Reactome binding reactions, to allow a fully specified GO-CAM model to be automatically constructed from the Reactome model.

Supplement 2 - detailed Pathways to GO conversion procedure
The current procedure to generate a GO-CAM from a Reactome pathway expressed in the BioPAX (Level 3) exchange format is described here. The conversion software is freely available at https://github.com/geneontology/pathways2GO

1) Generate or identify an OWL ontology that contains a class representation of each of the physical entities in the BioPAX file.
b) Capture the location of the entity using the BioPAX:cellularLocation annotation. Locations in Reactome-provided BioPAX pathways are instances of classes in the GO cellular component (CC) ontology. This is captured in OWL with an axiom indicating that the class is a subclass of entities located in the GO CC term. E.g.
c) Capture the taxon for the gene products and complexes using a subclass axiom, e.g. for human gene products (subclass of ('only in taxon' some OBO:NCBITaxon_9606)).
d) Capture the canonical record for each entity. The canonical record is an ontology term from the standard ontology collection used for GO-CAMs. It is either equivalent to or, more typically, a superclass of the entity term. This term is used when GO-CAMs that use this entity ontology are exported in forms such as GPAD that do not allow the use of imported ontologies to define their terms. This term will correspond to an entity at the deepest possible level of semantic granularity. These records are added as annotation property assertions on the class.
■ For classes representing sets, add a canonical record annotation for each member of the set. E.g., for "glucokinase and hexokinases", the canonical records are the UniProt ids for each of the proteins in the set, GCK, HK1, HK2, and HK3 (https://reactome.org/content/detail/R-HSA-450097) (Figure S2-3). When the canonical records are used for sets, for example to produce a GPAD export of a GO-CAM that uses the set, each canonical record is used in the same way it would be if there were only one entry. For example, if a model contained an assertion that the set named "glucokinase and hexokinases" enabled a molecular function MF1, then an export would contain the statements: GCK enables MF1, HK1 enables MF1, HK2 enables MF1, and HK3 enables MF1. In essence, the set construct is used by Reactome as a shorthand, and this is preserved in the GO-CAM conversion process, but for simpler formats it must be expanded.
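The set expansion described above can be sketched in a few lines of Python. The function name and the dictionary layout are assumptions made for this illustration, and gene symbols are used in place of the UniProt identifiers that the real canonical records carry.

```python
def expand_set_statements(entity, relation, target, canonical_records):
    """Expand one statement about a set entity into one statement per
    canonical record. Entities without records are passed through as-is."""
    members = canonical_records.get(entity, [entity])
    return [(member, relation, target) for member in members]


# Hypothetical lookup table: set entity label -> canonical records.
canonical_records = {
    "glucokinase and hexokinases": ["GCK", "HK1", "HK2", "HK3"],
}

statements = expand_set_statements(
    "glucokinase and hexokinases", "enables", "MF1", canonical_records
)
```

Each resulting tuple corresponds to one row of a flat export, so a single assertion about the set becomes four assertions about its members.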

2) Construct the OWL:individuals for the model with default types:
a) Create a biological process node with a GO:Biological Process type for each BioPAX:Pathway.
b) Create a molecular activity node with a Molecular Event (parent of Molecular Function) type for each BioPAX:BiochemicalReaction.

3) Use provided BioPAX:RelationshipXrefs to add deeper classifications:
a) BioPAX:Pathway nodes are directly annotated with GO Biological Process terms. For example, the pathway node in the Reactome pathway "Signaling By BMP" provides a RelationshipXref link to the GO Class "BMP signaling pathway" (GO:0030509). The converter would thus assign the corresponding individual rdf:type GO:0030509.
b) BioPAX:BiochemicalReactions that have controllers such as catalysts are linked to a BioPAX:Control node that may in turn have RelationshipXref annotations linking it to a GO Molecular Function term. Where these are available, they are used to provide a deeper type for the associated GO:Molecular Function individual in the GO-CAM. For example, the reaction "alpha-D-glucose 6-phosphate ⇔ D-fructose 6-phosphate" has an associated BioPAX:Catalysis event that is annotated with a RelationshipXref to GO:0004347 (glucose-6-phosphate isomerase activity). Hence the GO-CAM would have a function node for this reaction with type GO:0004347 (Figure S2-4).
■ When the catalyst is a complex, Reactome sometimes provides an indication of which protein is the active unit within the complex. When this information is present, the active protein is added to the GO-CAM as the enabler of the molecular function. When there is no active unit annotation, a node corresponding to the complex is created and linked to the molecular function node with the 'enabled by' relation.
• The active-unit information cannot be represented in the BioPAX model. For this project, Reactome added references to the active protein within comments on the associated BioPAX:Control nodes. E.g. rdfs:comment activeUnit: #Protein436 links a Control node with a protein complex as its Controller to the specific protein within that complex that is the active unit.
b) All reactions have substrates and products, indicated in BioPAX using the properties 'left' and 'right' along with a conversionDirection annotation, e.g. 'LEFT-TO-RIGHT'. These physical entities are added to the GO-CAM using the "has input" (RO:0002233) and "has output" (RO:0002234) properties. They may be physical entities of any kind: complexes, proteins, small molecules, RNA, or DNA.
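Two of the steps above lend themselves to a short illustration: reading the active unit out of a comment, and turning 'left'/'right' participants into inputs and outputs. The following Python sketch works on plain strings and lists rather than a parsed BioPAX model; the function names are hypothetical, and treating any direction other than 'RIGHT-TO-LEFT' as left-to-right is an assumption of this sketch, not documented converter behaviour.

```python
import re

# Matches comments of the form "activeUnit: #Protein436".
ACTIVE_UNIT = re.compile(r"activeUnit:\s*#(\S+)")


def active_unit_from_comments(comments):
    """Return the local id named by an 'activeUnit: #...' comment,
    or None when no comment carries that annotation."""
    for comment in comments:
        match = ACTIVE_UNIT.search(comment)
        if match:
            return match.group(1)
    return None


def inputs_and_outputs(left, right, direction="LEFT-TO-RIGHT"):
    """Map BioPAX 'left'/'right' participants to (has input, has output).
    Assumption: anything other than RIGHT-TO-LEFT is read left to right."""
    if direction == "RIGHT-TO-LEFT":
        return list(right), list(left)
    return list(left), list(right)


unit = active_unit_from_comments(["free text note", "activeUnit: #Protein436"])
has_input, has_output = inputs_and_outputs(
    ["alpha-D-glucose 6-phosphate"], ["D-fructose 6-phosphate"]
)
```

When `active_unit_from_comments` returns `None`, the text above implies that the complex itself would be linked to the function node with 'enabled by'.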

6) Infer location information assertions for all activity nodes
a) The locations of the physical entities in a reaction are used to establish where the reaction occurs.
■ If all the entities are in the same location (Figure S2-4),
• Then the function node is linked to that location using the "occurs in" (BFO:0000066) relation from the Basic Formal Ontology (a super property imported by the Relation Ontology). Reactome makes use of the cellular component branch of the GO to represent location.
■ If a reaction has an enabler, the reaction location is assigned to be that of the enabler, regardless of the locations of participating physical entities.
• These additional locations are preserved as attributes of the physical entities.
■ If the input entities are the same as the output entities, the input entities have different locations from the output entities, and the type of the molecular function term associated with the activity is a subclass of 'transporter activity',
• Then the function node is linked to the locations via the 'has target end location' (RO:0002339) and 'has target start location' (RO:0002338) relations.
• In addition, the physical entity that is transported is linked to the activity with the 'transports or maintains localization of' (RO:0002313) relation.
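The location rules above can be summarized in a small Python sketch. The function, its arguments, and in particular the order in which the three rules are tried are assumptions of this illustration; the text does not state a precedence, and the converter itself operates on OWL individuals rather than dictionaries.

```python
def location_assertions(inputs, outputs, enabler_location=None, is_transporter=False):
    """Return (relation, value) pairs for one activity node.

    inputs / outputs: dicts mapping entity name -> location label.
    Rule order (transport, then enabler, then shared location) is an
    assumption made for this sketch.
    """
    in_locs, out_locs = set(inputs.values()), set(outputs.values())
    same_entities = set(inputs) == set(outputs)

    # Transport: same entities, different locations, transporter-typed function.
    if is_transporter and same_entities and in_locs.isdisjoint(out_locs):
        result = [("has target start location", loc) for loc in sorted(in_locs)]
        result += [("has target end location", loc) for loc in sorted(out_locs)]
        result += [("transports or maintains localization of", e) for e in sorted(inputs)]
        return result

    # An enabler's location overrides the participants' locations.
    if enabler_location is not None:
        return [("occurs in", enabler_location)]

    # All participants share a single location.
    all_locs = in_locs | out_locs
    if len(all_locs) == 1:
        return [("occurs in", next(iter(all_locs)))]
    return []


same_place = location_assertions(
    {"glucose": "cytosol", "ATP": "cytosol"}, {"glucose 6-phosphate": "cytosol"}
)
transport = location_assertions(
    {"glucose": "extracellular region"}, {"glucose": "cytosol"}, is_transporter=True
)
```

Returning an empty list when no rule applies leaves the activity node without a location assertion, which is one possible reading of the rules rather than a documented outcome.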

7) Infer causal relations between molecular function nodes
BioPAX captures connections between reactions using BioPAX:PathwaySteps. In contrast to the GO-CAM activity flow model, which identifies different types of relationships between activities, BioPAX only captures the order of events. Information about how one reaction affects another must be inferred based on the inputs and outputs of the connected reactions. The first step in the conversion process is to translate all of the 'nextStep' relations between reactions into RO 'causally upstream of' (RO:0002411) assertions linking activity nodes in the GO-CAM. Note that this relationship does not imply positive or negative influence, only that the activities are related to one another in a causal chain. Once this basic network of activities is established, several rules are applied to further specify the nature of the causal relationships between activities. Note that these rules are applied to the GO-CAM model itself, after it has been generated from the BioPAX and, as such, could be applied to any GO-CAM model.
More than 30% of the causal relationships between reactions in Reactome occur between reactions in different pathways. The conversion framework produces one GO-CAM for each pathway and currently has no ability to handle links from one node of a GO-CAM to another node in a different GO-CAM. In order to address this limitation, when a causal relationship exists that crosses a pathway boundary, a copy of the upstream converted reaction is added to the pathway containing the downstream reaction. This introduces a redundancy, as the system as a whole now contains two activity units (one per pathway) for the same Reactome reaction. However, this redundancy is reduced through the use of the same internal identifiers for the reaction in both converted pathways. Through this unique identifier approach, when the whole knowledgebase is queried via a SPARQL endpoint, the reaction nodes in the different pathways can be merged. This allows the full causal graph to be captured and queried over in a unified way. It also provides us with an automated flag for examination of reactions that should be added to more than one pathway. Future work in the GO-CAM software framework should better enable this kind of activity node linking and re-use.
■ If Reaction 1 has an output that is a controller (that is not a catalyst) of Reaction 2
■ And Reaction 1 is causally upstream of Reaction 2
• Then create new activity node B1 with type Binding
• Assert that B1 is part of the main Biological Process node representing the pathway
• Add the controlling entity as an input to the new Binding node
• Add Reaction 1 directly provides input for B1
• Add B1 directly (positively or negatively) regulates Reaction 2
Figure S2-7. Inference of binding activity.
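The binding-inference rule can be sketched over a plain set of triples. This is an illustration under stated assumptions, not the Pathways2GO implementation: the function name, the dictionary inputs, and the way the new node is named are all hypothetical, and relations are written as labels rather than RO identifiers.

```python
def infer_binding_nodes(triples, outputs, regulators, pathway):
    """Apply the binding-inference rule to a set of triples.

    triples:    set of (subject, relation, object) between activity nodes.
    outputs:    reaction -> set of output entities.
    regulators: reaction -> {entity: "positively" | "negatively"} for
                controllers that are not catalysts.
    pathway:    label of the main Biological Process node.
    """
    result = set(triples)
    for up, relation, down in sorted(triples):
        if relation != "causally upstream of":
            continue
        down_regulators = regulators.get(down, {})
        shared = outputs.get(up, set()) & set(down_regulators)
        for entity in sorted(shared):
            # Hypothetical naming scheme for the new Binding node (B1).
            binding = f"binding of {entity} ({up} -> {down})"
            sign = down_regulators[entity]
            result |= {
                (binding, "rdf:type", "GO:0005488"),  # GO 'binding'
                (binding, "part of", pathway),
                (binding, "has input", entity),
                (up, "directly provides input for", binding),
                (binding, f"directly {sign} regulates", down),
            }
    return result


model = infer_binding_nodes(
    triples={("R1", "causally upstream of", "R2")},
    outputs={"R1": {"X"}},
    regulators={"R2": {"X": "positively"}},
    pathway="pathway P",
)
```

Because the function only reads and writes triples, it reflects the point made above that such rules act on the GO-CAM model itself and could in principle be applied to any GO-CAM.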

8) Convert non-catalytic entity regulators to binding activity nodes.
In contrast to the GO-CAM model, which strictly focuses on the activities of gene products, Reactome models the regulatory effects of other kinds of molecules on reactions. For example, the Glycolysis pathway in Reactome asserts that ADP positively regulates the reaction D-fructose 6-phosphate + ATP → D-fructose 1,6-bisphosphate + ADP. To put this in the GO-CAM framework, which does not model direct regulation by physical entities, it is necessary to generate an activity node to capture the information and link that node to the reaction being regulated. The rules for representing regulation by a physical entity other than a gene product follow, and the results are presented in Figure S2-2.
■ If a reaction is positively or negatively regulated by a physical entity and that entity is not a catalyst, then:
• Create a new binding node
• Add the regulator as an input
• Add a positive or negative regulatory relationship from the new binding node to the reaction node, as indicated in the BioPAX Controller element.
• Make the Binding node part of a root Biological Process node.
• Make the Biological Process node regulate the pathway node that the reaction is a part of.
• If the original reaction is enabled by something,
• Then add the enabler as an input to the new Binding node.
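As a rough illustration of these rules, the following Python sketch builds the triples for one regulated reaction. The function name, node labels, relation labels, and the `example_enabler` value are placeholders invented for this sketch; the exact relations used by the converter are not specified in the text above.

```python
def regulator_to_binding(reaction, regulator, sign, pathway, enabler=None):
    """Build triples that represent regulation of `reaction` by a
    non-catalyst physical entity as a new binding activity node.

    sign is "positively" or "negatively"; all labels are illustrative.
    """
    binding = f"binding:{regulator}->{reaction}"
    process = f"process:regulation of {pathway}"
    triples = [
        (binding, "rdf:type", "GO:0005488"),   # GO 'binding'
        (binding, "has input", regulator),
        (binding, f"{sign} regulates", reaction),
        (process, "rdf:type", "GO:0008150"),   # GO 'biological_process' root
        (binding, "part of", process),
        (process, "regulates", pathway),
    ]
    # If the regulated reaction has an enabler, it is a further input.
    if enabler is not None:
        triples.append((binding, "has input", enabler))
    return triples


triples = regulator_to_binding(
    reaction="D-fructose 6-phosphate + ATP -> D-fructose 1,6-bisphosphate + ADP",
    regulator="ADP",
    sign="positively",
    pathway="Glycolysis",
    enabler="example_enabler",  # placeholder, not a Reactome entity name
)
```

Keeping the regulator and the enabler as two inputs of the same binding node mirrors the last rule in the list, which treats the regulatory event as a binding between the two.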