Fixing molecular complexes in BioPAX standards to enrich interactions and detect redundancies using semantic web technologies

Abstract Motivation Molecular complexes play a major role in the regulation of biological pathways. The Biological Pathway Exchange format (BioPAX) facilitates the integration of data sources describing interactions some of which involving complexes. The BioPAX specification explicitly prevents complexes to have any component that is another complex (unless this component is a black-box complex whose composition is unknown). However, we observed that the well-curated Reactome pathway database contains such recursive complexes of complexes. We propose reproductible and semantically rich SPARQL queries for identifying and fixing invalid complexes in BioPAX databases, and evaluate the consequences of fixing these nonconformities in the Reactome database. Results For the Homo sapiens version of Reactome, we identify 5833 recursively defined complexes out of the 14 987 complexes (39%). This situation is not specific to the Human dataset, as all tested species of Reactome exhibit between 30% (Plasmodium falciparum) and 40% (Sus scrofa, Bos taurus, Canis familiaris, and Gallus gallus) of recursive complexes. As an additional consequence, the procedure also allows the detection of complex redundancies. Overall, this method improves the conformity and the automated analysis of the graph by repairing the topology of the complexes in the graph. This will allow to apply further reasoning methods on better consistent data. Availability and implementation We provide a Jupyter notebook detailing the analysis https://github.com/cjuigne/non_conformities_detection_biopax.


Molecular complexes and biological interactions in system biology
Understanding how biological systems adapt to their environment requires to better capture, describe, and model the interactions between their constitutive entities. With the accumulating knowledge on biological entities and their interactions, the need of a general framework to understand this information at a system level has led to a formal description of these entities and interactions. Within the context of biological pathways, several formats have been proposed such as SBML (Hucka et al. 2004(Hucka et al. , 2019 and BioPAX (Demir et al. 2010).
Biological systems typically involve an intricate network of interactions between numerous participants (Fearnley et al. 2014). Among these participants, complexes are a major class of physical entities that results from the chemical assembly of several molecules (nucleic acids, proteins, and other molecules) that bind each other at the same time and place, and form single multimolecular machines. Biologically, they play an important role in transcription, RNA splicing and polyadenylation machinery, protein export, and transport (Spirin and Mirny 2003;Zahiri et al. 2020). From the data analysis perspective, complexes cause some indirection between molecules and interactions, as molecules can participate directly to an interaction but also be a component of a complex that participates to an interaction. This introduces an additional node (the complex) between two entities that are no longer directly connected by a link in the cascade of events that triggers cell behavior.

Description of complexes in BioPAX
The Biological Pathway Exchange format (BioPAX; http://www.bio pax.org/release/biopax-level3-documentation.pdf) is a well-established 1 This specification about the flat representation of complexes was introduced when moving from the BioPAX1.0 to the BioPAX2.0 standard (between April and December 2005), and the rationale given for the introduction of this constraint was that the use of a tree structure could be interpreted by some users as an order in macromolecular assembly: The reason for keeping complexes flat is to signify that there is no information stored in the way complexes are nested, such as assembly order. Otherwise, the complex assembly order may be implicitly encoded and interpreted by some users, while others created hierarchical complexes randomly, which could lead to data loss.

The Reactome use-case
BioPAX is based on Semantic Web technologies, with RDF facilitating integration, SPARQL facilitating querying, and OWL facilitating knowledge-based reasoning. All the major pathway databases are available in BioPAX. Among them, Reactome (https://reactome.org/) is a free, open-source, curated, and peer-reviewed pathway database (Gillespie et al. 2022). It is widely used in genome analysis, modeling, systems biology, clinical research and education, and biological pathways can be explored to shed light on interconnected proteins (Kazemzadeh et al. 2013). Despite the BioPAX specifications, we noticed the presence of recursive complexes, i.e. complexes composed of other complexes that are not black-box complexes. Importantly, these non-conform complexes were not detected by the BioPAX validator (https://biopax.baderlab.org/; Rodchenkov et al. 2013). As this pattern eludes validation, their presence in Reactome and possibly in other databases, may preclude further analyses aiming to provide robust information about mechanisms and phenotypes.

Motivations and results
We hypothesized that identifying and correcting recursive complexes would be valuable to enrich the biological database into interactions. This may allow a better analysis of the participants, direction, and stoichiometry of the biological interactions, either when browsing the data, or when data are processed automatically. We believe that these nonconformities in the description of the complexes are an important obstacle to the development of analysis methods that would work directly from the BioPAX format. Using SPARQL queries, we show that the well-curated Reactome database includes a large fraction of recursively defined complexes, i.e. whose components contain at least one complex (that is not a black box one). We showed that these nonconform complexes averaged one-third of the total number of complexes. Using other SPARQL queries, we corrected these recursive decompositions of complexes. Then, we showed that these corrections led to the detection of implicitly redundant complexes, whose redundancy was previously hidden by the different recursive definitions of complexes.

Definition of invalid recursive complexes
Recursive complexes are complexes whose component contains at least one complex. These recursive complexes are invalid if the inner complex is not a black-box one, i.e. the inner complex itself contains at least one component. The different categories of (in)valid complexes in BioPAX are illustrated in Fig. 1A.

Invalid recursive complexes in interactions
Recursive complexes can cause false negatives when identifying the interactions in which a physical entity can participate. For example, if A, B, and C are physical entities, A can directly participate in several interactions, but also indirectly when associated to B as a complex, or to B and C as another complex. If (A, B) and (A, B, C) are valid complexes, the two situations can be correctly processed by identifying the interactions matching the criterion "having a participant that is a complex composed of A." However, if (A, B, C) complex is composed of the complex (A, B) and of C, all the interactions in which (A, B, C) participates would fail to meet the aforementioned criterion, because their participants are not "a complex composed of A" but "a complex composed of a complex composed of A." We will see that in practice, such nested composition can occur over multiple levels. The approach we used to identify and fix these invalid recursive complexes, while respecting the stoichiometry where available, is illustrated in Fig. 1B.

Redundancies
As an additional consequence, fixing invalid recursive complexes can also result in the identification of redundancies between complexes, previously hidden by different recursive decompositions of complexes. For example, as illustrated in Fig , C] could be distinct (invalid) complexes with their own identifier and at first glance, having different participants. However, they could be all fixed as a single complex having the same components A, B, and C.

Materials and methods
We developed semantically rich SPARQL queries for identifying and fixing invalid recursive complexes, and detecting the resulting redundancies. We applied this method on the Reactome pathways database as a use-case study [version 81 (13 June 2022)]. For the sake of reproducibility, we provide a Jupyter notebook detailing the analysis (https://github.com/cjuigne/non_conformities_detection_ biopax).

Identifying invalid complexes
We first analyzed all the configurations of BioPAX complexes according to the nature of their components (1A). The SPARQL query presented in Fig. 2 allows to identify the invalid recursive complexes. Other SPARQL queries to identify non-recursive complexes and valid recursive complexes, are presented in the Jupyter notebook.

Fixing the invalid complexes
Fixing an invalid recursive complex can be decomposed with a four steps methodology: (i) collapsing as direct components all its direct or indirect atomic components (i.e. those that are not complexes or black-box complexes); they correspond to the leaves in the tree of components, (ii) deleting all the other components, (iii) setting the correct values for the stoichiometric coefficients, and (iv) preserving all other attributes of the complex. The whole procedure consisting of python scripts and SPARQL queries is available in the Jupyter notebook.
To collapse the direct components of the complex in steps (i) and (ii), the original BioPAX relation component was replaced by a component relation between the root complex and its leaves in the tree of components. We kept all the other relations of the complex in step (iv).
To compute the stoichiometric coefficients in step (iii), we traced the stoichiometric coefficients from each leaf up to the root complex. We also considered the fact that a physical entity can be a component of several parts of the recursive complex. Tracing stoichiometry is illustrated in Fig. 1B where complex Y was composed of c Z, and X was itself composed of a Y. This resulted in a Â c Z in X (via Y). If, in addition to being a component of Y, Z was also a direct component of X with b as stoichiometric coefficient value, this resulted in a Â c þ b occurrences of Z in X.
We noted S y (z) the global stoichiometric coefficient value of Z at Y, i.e. the number of occurrences of Z in Y, and C(y) the set of the direct components of Y. Formula 1 recursively computes the stoichiometric coefficient value of any physical entity Z. (1)

Identifying redundant complexes
We considered as redundant the complexes that have exactly the same components with the same stoichiometric values and the same cellular location. Figure 1C illustrates how invalid recursive complexes can be the cause of redundancy due to the order by which the components are nested in the complexes. Fixing the invalid complexes made possible the detection of redundancy. For that, we developed a SPARQL query that identifies the pairs of complexes that have the same components and properties but different identifiers. This query is also available in the Jupyter notebook.   On the opposite, we identified 5833 complexes that have at least one component that is a complex with components. Altogether, these 5833 invalid recursive complexes represent 39% of the 14 987 complexes in Reactome. None of them have been detected by the BioPAX-validator tool .
These invalid recursive complex participate to 7149 out of the 22 237 interactions identified in Reactome (i.e. 32%).

Fixing invalid complexes increases the average number of components participating to complexes in Reactome
All the 5833 invalid complexes were fixed thanks to a python script and SPARQL queries (3.2). After fixing recursive complex description, all components of the complexes are described by a flat representation in accordance with the BioPAX specification.
As expected, with this flat representation, the number of direct components implicated in a complex increases (paired t-test, P < .0001). Indeed, in the initial Reactome dataset, the average number of direct components in a complex is 2.2 ðr ¼ 2:6Þ and the complexes with the largest number of components are R-HSA-5626171 and R-HSA-72069, each having a maximum of 65 components. After fixing the invalid complexes, the average number of direct components in a complex is 4.3 ðr ¼ 8:7Þ and the largest complex is R-HSA-156656 with 151 components. The most drastic changes concern complexes R-HSA-927767 and R-HSA-927890 which both move from 3 to 103 direct components. Figure 3 illustrates the number of direct components identified before and after fixing the invalid complexes. The distribution of the gain in the number of direct components is available in Supplementary Fig. S1.

Fixing invalid complexes reduces the path length from a complex to each of its components
We studied the composition depth of the complexes, before and after fixing the invalid ones. For a given complex, its composition depth was measured as the maximal length path between the root complex to its leaves. As illustrated in Fig. 3A, the tree-like definition of invalid complexes artificially increases the depth of the complex composition.
In Fig. 3B, the depth of complex composition is represented before the fixing procedure. Valid complexes have a depth of either one, or zero when they are black-box complexes. The red part of the figure represents invalid complexes having a depth greater than one. The maximum depth is 10, which leads to an artificial extension of the path length from the root complex to the majority of its leaves (example of R-HSA-68 466 is given in Supplementary Fig. S2).
As expected, the correction of invalid complexes repairs the topology by reducing the path length between a complex and its components to a maximal length of 1. The depth of all complexes is 0 (black-box complexes) or 1 (complexes with components).

Fixing invalid complexes improves the detection of redundant complexes
Redundancies are detected between entity pairs but can also occur between more than two entities, as illustrated in Fig. 1C with three equivalent complexes. In this example, three pairs of redundancies (X 1 , X 2 ), (X 2 , X 3 ), and (X 1 , X 3 ) are thus detected, corresponding to the size of a maximal clique with three vertices.
Among the 14 987 complexes from the original Reactome database, the SPARQL query identifies 137 pairs of redundant complexes involving 249 distinct complexes. They constitute 121 maximal cliques of equivalent complexes (clique size ranging from 2 to 6 complexes), corresponding to 128 complexes in excess. In other words, we highlighted 121 groups of 2 to 6 redundant complexes identified by searching for pairs of redundant complexes. These redundancies are explicit, as they are not masked by a tree-like definition of complex components.
After fixing the invalid complexes, we identified 217 pairs of redundant complexes involving 347 distinct complexes. They constitute 164 maximal cliques of equivalent complexes (clique size ranging from 2 to 6), corresponding to 183 complexes in excess. The fixing procedure, by replacing the tree-like definition of complexes by a flat description of complexes, thus allows to detect more redundancies. These redundancies are implicit, as they are masked by a tree-like definition of complex components. They become explicit with the flat description of complex components. Figure 4 illustrates how fixing structurally different complexes reveals this redundancy.
Reactome contains cross-references to ComplexPortal (Meldal et al. 2021) to annotate complexes. We sought to verify the possibility of identifying complex redundancies in Reactome without our SPARQL query-based procedure, by simply and unambiguously identifying complexes via their identifiers in a specialized database dedicated to complexes such as ComplexPortal, even if ComplexPortal incompletely supports stoichiometry, which may be a severe limitation. Because of the very modest size of the ComplexPortal Database (1429 complexes for Homo sapiens), ComplexPortal IDs do not allow to detect any redundancies identified with the SPARQL query: among the 347 complexes that constitute 217 pairs of redundant complexes identified, only 3 out of 347 have a ComplexPortal ID.
This study is available in a Jupyter notebook on the GitHub repository. Supplementary Table S1 lists all symetric hasSame CompositionAs relations between cliques of redundant complexes. The corresponding.ttl files are available on the GitHub repository (https://github.com/cjuigne/non_conformities_detection_biopax).

Application to nonhuman organisms in the Reactome database
The complete procedure was then applied to all other organisms available in Reactome, to determine whether the large fraction of nonconform complexes detected in 4.1 (39%) are specific or not to the well-explored Human dataset. The results are presented in Table 1. This shows that the large fraction of invalid complexes is not restricted to the Human dataset but also accounts for all of the 13 tested species. Datasets from all species exhibit between 30%  Table 1. For each organism in the Reactome database, we counted the total number of complexes, then we identified the number of invalid complexes and evaluated the average number of direct components and the number of redundant complexes (pairs and cliques), before and after fixing the invalid complexes.

Organism
Total number of complexes For all these species, repairing the topology of the complexes in the same way as for H.sapiens significantly increases the average number of direct components of the complexes. With the flat representation of complexes, we also observed complex redundancies in complexes for each organism, although not in the same proportion as in the Human dataset (from 2 cliques for P.falciparum to 16 cliques for Mus musculus).

Discussion and perspectives
In this study, we show that nonconform recursive complexes affect a large proportion of Reactome database exported in the BioPAX format. Indeed, they constitute 30%-40% of the complexes for all organisms, and participate to about one-third of the interactions. The fact that this phenomenon occurs in all Reactome organisms may be explained in part by the fact that some interactions and complexes in nonhuman species are inferred if a large fraction (at least 75%) of the proteins involved in the interactions or complexes have an ortholog for the species considered in PANTHER. Thus, taking into account corrections for H.sapiens prior to PANTHER inference process will probably result in a diminution of the phenomenon for other species. Due to this pervasiveness, any solution to overcome or repair recursive complexes would be valuable for automated graph analysis whatever the biological questions to be addressed. In addition to being widely present in the datasets, we also show that invalid complexes composition reaches up to 10 levels in the tree of components of the complex. In these situations, navigating in the BioPAX file from some components of the complete complex to the complete complex is both painstaking and detrimental to computational performance.
Fixing recursive complexes consists in adding all the indirect components of a complex as new direct components. This leads to two main outcomes. First, it helps to reduce the path length from a complex to its components, since the maximal path length with the conform topology is now 1 (which solves the navigation limitation identified previously). Second, this increases the average number of components participating to complexes in Reactome. In the human dataset, this correction doubled the average number of direct components of a complex, from 2.2 to 4.3 components. This result is not specific to the human dataset: for all other organisms, the number of direct components has doubled.
As a side effect, the procedure reveals redundancies between complexes. These redundancies would have been would have been difficult to identify without the procedure, because they are masked by the recursive definition of complexes. This is particularly relevant for the Human dataset, since the redundant complexes increase from 137 to 217 redundant pairs (þ58%) before and after the fixing procedure. For the identification of redundant complexes, the verification that the cellular location is the same is crucial. Indeed, without considering cellular location, 243 pairs of redundant complexes can be identified whereas taking into account cellular location reduces this number to 217 pairs of redundant complexes. This difference is caused by the fact that physical entities can exist several times in BioPAX as long as they refer to physical entities in different states (including modifications, cellular location, etc.).
The BioPAX ontology, as defined in Demir et al. (2010), is a very powerful format to represent and unify all the subtile levels of interactions occurring in biological pathways. It exploits the ontology formalism to conciliate genericity (with the high-level entity classes Physical Entity and Interactions) and a precise description of the processes (with low-level classes and various properties). However, this is also a quite complicated format description, and induces a computational complexity when reasoning on the data structured in this format. Strö mbä ck and Lambrix (2005) anticipated this, stating that: This makes it possible to benefit from reasoning and conclusions based on the semantics given by OWL and the ontology, but the cost is a higher computational complexity for reasoning and integration of data.
We have showed that more recent computational advances such as SPARQL, which is designed to handle ontologies, are better adapted to this task and can be also scaled up.
Taking advantage of SPARQL queries, the procedures developed herein to (i) fix recursively defined complexes and (ii) identify redundant complexes, allow to correct the noncompliance with BioPAX specifications. As these procedures are applied at the level of the BioPAX files, they can be applied to all other major BioPAX pathway databases, including KEGG (Kanehisa and Goto 2000), MetaCYC (Caspi et al. 2020), PathwayCommons (Cerami et al. 2010), and WikiPathways (Martens et al. 2021), to assess the importance of invalid recursive complexes in these databases. Our strategy to modify directly the BioPAX files also ensures that further analyses of biological networks can be processed without any need to modify any standard queries or scripts based on BioPAX libraries such as Paxtools  or PyBioPAX (Gyori and Hoyt 2022).