NICEpath: Finding metabolic pathways in large networks through atom-conserving substrate–product pairs

Abstract Motivation Finding biosynthetic pathways is essential for metabolic engineering of organisms to produce chemicals, biodegradation prediction of pollutants and drugs, and for the elucidation of bioproduction pathways of secondary metabolites. A key step in biosynthetic pathway design is the extraction of novel metabolic pathways from big networks that integrate known biological, as well as novel, predicted biotransformations. However, the efficient analysis and the navigation of big biochemical networks remain a challenge. Results Here, we propose the construction of searchable graph representations of metabolic networks. Each reaction is decomposed into pairs of reactants and products, and each pair is assigned a weight, which is calculated from the number of conserved atoms between the reactant and the product molecule. We test our method on a biochemical network that spans 6546 known enzymatic reactions to show how our approach elegantly extracts biologically relevant metabolic pathways from biochemical networks, and how the proposed network structure enables the application of efficient graph search algorithms that improve navigation and pathway identification in big metabolic networks. The weighted reactant–product pairs of an example network and the corresponding graph search algorithm are available online. The proposed method extracts metabolic pathways fast and reliably from big biochemical networks, which is inherently important for all applications involving the engineering of metabolic networks. Availability and implementation https://github.com/EPFL-LCSB/nicepath. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Extracting meaningful metabolic pathways from large metabolic networks is essential for the computational design of bioproduction pathways, for the elucidation of biosynthesis of natural products, and for the fundamental understanding of metabolism. Metabolic pathways describe the transformation from one or several source molecules over consecutive reaction steps into one or several target molecules and provide the roadmap guiding these various applications (Cravens et al., 2019;Lin et al., 2019;Nielsen and Keasling, 2016). Traditionally, metabolic pathways were drawn by hand after directly inferring biochemical transformations from experimental evidence. However, the advent of the omics era and the dramatic increase of computational data resources has drastically changed the way we study biochemistry. Biochemical knowledge is now collected in continuously growing databases, providing new opportunities for fundamental research and metabolic engineering. These resources can be harnessed to design non-canonical pathways that do not exist in nature. While many non-natural pathways have been historically designed by intuition using paper and pencil, it is likely that alternative and more efficient solutions will be missed given the wealth of available biochemical data, thus making systematic and automated pathway extraction indispensable. To address this challenge, computational pathway search tools have been developed to extract metabolic pathways from biochemical databases (Hadadi and Hatzimanikatis, 2015;Wang et al., 2017).
Pathway search tools aim to provide meaningful, easily interpretable metabolic pathways as they are shown in textbooks, but through an automated approach. A typical, linear pathway starts with a precursor molecule that is chemically modified by subsequent enzymatic steps until a target molecule is obtained. The atoms of the precursor compounds that are conserved throughout the pathway can be defined as core atoms. Cofactors and co-metabolites are considered boundary compounds and their metabolic provenance and fate are not further detailed in the linear pathway. In reality however, pathways are not always linear. Anabolic pathways may involve the concatenation of two or more metabolites to form a bigger compound [e.g. (S)-norcoclaurine synthase], and catabolic pathways may break down a compound into two or more smaller molecules. However, every branched pathway can also be described as a combination of linear pathways, and we therefore only consider linear pathways in this work. Here, we define biologically relevant, linear pathways as biochemical routes that fulfill the following criteria: (i) Core atoms are conserved throughout the pathway, (ii) loops are not allowed, meaning that no metabolite appears twice and (iii) other metabolites that contribute to the main biotransformation route in a lesser degree are considered as cofactors or co-metabolites.
Pathway search methods are applied to biochemical networks that define the search space. There are two main approaches to mathematically describe a biochemical network: A stoichiometric matrix or a mathematical graph (Wang et al., 2017). Stoichiometric reaction matrices can be searched for pathways by optimizing the production of a target molecule within a metabolic network (Kumar et al., 2018). However, this technique is not applicable to biochemical networks the size of biochemical databases with tens to hundreds of thousands of reactions and is therefore not further discussed here. Graph-based methods on the other hand are suitable for large-scale applications due to their computational efficiency. They represent metabolic networks as mathematical graph structures, and then find paths within the graph from a given source to a target metabolite. Several approaches have been explored to bias a biochemically blind graph search algorithm toward biologically meaningful pathways, such as the exclusion of cofactors from the network to avoid shortcuts through hub metabolites (e.g. CO 2 , ATP, H 2 O) (Ma and Zeng, 2003), defining substrate-product pairs through the chemical similarity of reactants (Pertusi et al., 2015), precomputing recurring subpaths (Kim et al., 2020) and atom or substructure conservation throughout the pathway (Sankar et al., 2017). Atom conservation in general, and carbon conservation in particular, is a valuable criterion for finding biologically meaningful pathways (Arita, 2004;Blum and Kohlbacher, 2008;Heath et al., 2010;Huang et al., 2017;Kumar et al., 2018;Tervo and Reed, 2016).
Several solutions to pathway discovery employ the concept of atom conservation. Initially, the tracking of single atoms was used by Arita et al. to calculate network properties of the metabolism of Escherichia coli (Arita, 2004). Later, atom tracking was used to improve the quality of pathway search tools by ensuring that one or several atoms were conserved throughout the pathway (Fooshee et al., 2013;Heath et al., 2010;Huang et al., 2017;Latendresse et al., 2012;2014;Pey et al., 2013). Atom-tracking methods have been shown to find biologically relevant pathways, although the high quality came with an increased computational cost. An alternative strategy has been pursued by the Kyoto Encyclopedia of Genes and Genomes (KEGG). Their reactions are annotated with chemical structure alignments, also called substrate-product pairs or reactant pairs (short RPAIRs) (Shimizu et al., 2008). The KEGG RPAIR database consists of manually curated, atom-mapped, substructureconserving substrate-product pairs. KEGG differentiates between five types of RPAIRS: 'main', 'cofac', 'trans', 'ligase' and 'leave'. The four latter ones describe cofactor pairs, small groups transferred by transferases, nucleotide triphosphate consumption by ligases, and the addition or removal of small inorganic compounds by lyases and hydrolases, respectively. The first type, 'main', describes the main biotransformation in a given reaction.
KEGG's pathway prediction server, named PathPred, uses the 'main' reactant pairs to create a searchable graph of biologically meaningful biotransformations (Moriya et al., 2010). Instead of tracking atoms individually, PathPred approximates the atom conservation by defining moiety-conserving reactant pairs, which decreases the complexity of the path search problem. However, their classification system is based on a combination of manual curation and automatic annotation, a strategy that is not easily applicable to large biochemical networks, such as the ATLAS of Biochemistry with its more than 140 000 predicted reactions (Hadadi et al., 2016;Hafner et al., 2020), or its successor database ATLASx with its more than 5 million predicted reactions (Mohammadi-Peyhani et al., 2021). Large biochemical databases, especially those including hypothetical reactions, require reliable and computationally efficient algorithms to extract biologically relevant biochemical pathways.
Here, we address the challenge of efficiently searching and analyzing big biochemical networks. We propose a new method, named NICEpath, that biases the graph search toward atom-conserving pathways. To achieve this, we calculate weighted reactant-product pairs that reflect the atom conservation in each reaction, and we use the atom-conserving pairs to represent biochemical reaction networks as weighted graphs that are compatible with efficient search algorithms. The pathways found by NICEpath therefore fit our definition of 'biologically meaningful' in the sense that they fulfill the three criteria mentioned earlier. The algorithm finds atom-conserving pathways first and returns a pathway list ranked by overall atom conservation. NICEpath can be readily employed to extract and compare metabolic pathways from biochemical database (e.g. KEGG) or from metabolic networks specific to an organism (e.g. genome-scale models). The method can be further applied to efficiently search large biochemical networks, as they are generated by reaction prediction tool such as BNICE.ch (Hatzimanikatis et al., 2005).

Materials and methods
Our approach can be divided into four steps ( Fig. 1): (i) The first step consists of acquiring an atom-level representation of each reaction. The atom maps can come from databases, atom-mapping algorithms, or, in our case, enzymatic reaction rules as implemented in BNICE.ch (Hatzimanikatis et al., 2005). (ii) In a second step, each atom-mapped reaction is decomposed into all the possible reactant-product pairs. For each pair, we calculate the Conserved Atom Ratio (CAR) from the number of conserved atoms between reactant and product and the size of the molecules in terms of number of atoms. (iii) The atom-weighted substrate-product pairs are used to construct a weighted undirected graph, where the distance between reactants and products are inversely proportional to the CAR. (iv) Once the graph of weighted substrateproduct pairs is constructed, we can apply well-established graph search methods to find the shortest paths, which will inherently find the pathways that conserve the highest number of atoms. NICEpath uses the Yen's k-shortest loop-less path (Yen, 1971) algorithm, a standard method to find a predefined number (k) of shortest paths in weighted graph, avoiding the repetition of nodes.

Biochemically correct atom mapping with BNICE.ch
Atom-mapped reactions are the prerequisite for calculating weighted reactant-product pairs. Here, we use the computational tool BNICE.ch, developed to predict hypothetical biochemical networks, to calculate biochemically correct atom mappings of enzymatic reactions. The core of BNICE.ch consists of 442 bidirectional, generalized biochemical reaction rules that describe the biochemical reaction mechanisms of enzymatic reactions. The reaction rules are applied to a molecular structure to (i) reconstruct atom-mapped, known biochemical reactions; and (ii) to predict all possible biochemical transformations that a given compound can undergo along with the product compounds generated in the process. Here, BNICE.ch calculates atom maps for metabolic reactions using the mechanistic knowledge stored in the reaction rules, as described by Hadadi et al. (2017). In this step, other tools for the automatic atom mapping of reactions may also be applied to generate atom maps (Chen et al., 2013;Fooshee et al., 2013;Latendresse et al., 2012).

Calculation of weighted reactant-product pairs
The following steps are applied to each reaction in the network to generate atom-weighted reactant-product pairs: (i) Each reaction is split into all possible reactant-product pairs. (ii) For each pair of reactant and product, the number of common atoms (n c ) between reactant and product is calculated along with the total number of atoms in the reactant (n r ) and the total number of atoms in the product (n p ). Hydrogen atoms are omitted from the calculation. (iii) For each pair, the ratio of conserved atoms (in the following, called Conserved Atom Ratio, or CAR) is calculated with respect to the reactant (CAR r ) and with respect to the product (CAR p ).
To calculate a bidirectional CAR, the mean CAR is multiplied with a correction factor that decreases with the difference between the number of common atoms and the total number of atoms in the molecule.
The only exception to this approach is made for reactions involving the cofactor Coenzyme A (CoA). In a molecule, CoA is treated as a single atom when it occurs in both the reactant and in the product, mainly because the high number of conserved atoms between the comparably big CoA leads to high CARs, thus masking the biochemically more interesting connections between the smaller metabolites that are attached to and detached from CoA during metabolic transformations. The final CAR value is used to weight reactantproduct pairs in the network.

Assigning mechanisms to biochemical reactions from the KEGG reference network
We used KEGG as a reference database for enzymatic reactions, from which we extracted all reactions that have an associated mechanism in BNICE.ch. If a given reaction from KEGG could be reconstructed with BNICE.ch, it was assigned a reaction mechanism that allowed us to retrieve the number of conserved atoms between each reactant-product pair. The set of KEGG reactions with assigned reaction mechanisms and pre-calculated CAR values was used for further validation and as an example network for network analysis and pathway search. The set of BNICE.ch curated KEGG reactions is available from the GitHub repository at https://github.com/EPFL-LCSB/nicepath.

Graph representation of biochemical networks
For a given reaction network, NICEpath loads all the reactant-product pairs to generate a weighted, undirected graph, where metabolites are nodes connected by edges, representing the reactantproduct relationship. Edges are assigned a weight that defines the relation between two connected nodes. To use state-of-the-art shortestpath graph search algorithms, highly atom-conserving reactants should be close to each other, and pairs that only share a few atoms should be further away. Hence, we convert the CAR into a distance: NICEpath accepts two alternative ways to calculate the distance, which can be used to modulate the influence of the atom conservation on the weight of the reactant-product pair.
The type of transformation can be changed to square root or exponential depending on the nature of the pathway search problem, i.e. the structures of source and target molecules as well as the estimated number of biotransformation used to convert one into the other. The distance measure is used to reconstruct a directed graph whose edge weights represent the atomic distance between reactants and products. The different transformations as a function of the CAR are visualized in Supplementary Figure S1. For longer pathways, we recommend using the exponential transformation because it increases the penalty for pairs with low CARs, which makes the search more conservative in terms of atoms. For this study, we grouped duplicate KEGG compounds into one node. Duplicates were identified based on the first fourteen letters of the InChIKey that describe the atom connectivity of a compound, but that do not contain additional information (i.e. charge, stereochemistry, isotopes) In practice, this means that different stereoisomers of the same molecular structure were merged into one node.  1. The workflow of the pathway search is divided into two parts. The first two steps (left) describe the atom weighting of the network from atom-mapped reactions. In this study, steps 1 and 2 are performed by BNICE.ch. Steps 3 and 4 (right), implemented in NICEpath, take the atom-weighted network as an input to create a searchable graph structure and finally apply a Yen's k-shortest pathway search 2.5 Finding metabolic pathways with graph search NICEpath applies a Yen's k-shortest loop-less path search (Yen, 1971) to extract the shortest pathways from the weighted network of reactant-product pairs using the python package NetworkX. As inputs, the pathway search algorithm takes a weighted graph, a source compound, a target compound and the maximum number of shortest paths (k) to be found. As soon as this number k is reached, the algorithm stops and returns all the k-shortest paths in terms of summed edge weights.
The run time of NICEpath depends on the structure of the network, the distance between the source and target compound in the graph, the number of pathways to be found and the maximum pathway length allowed. As an example, to find 10 000 pathways of maximum length 100, the algorithm runs for about 15 min on a standard desktop computer using a single core. If there are several source compounds given as input, NICEpath can run path searches in parallel for different source compounds using all available cores.

Network analysis in NICEpath
NICEpath first calculates standard network statistics, such as the number of nodes and edges, and then extracts an undirected, unweighted network from the original network by only considering edges with a CAR higher than a given threshold. For this new network, the number of components, or disjoint graphs, is extracted, and the biggest component is further analyzed regarding its size relative to the previous network as well as its diameter. Since searching for pathways between two compounds belonging to different disconnected graphs will not yield any good pathways, NICEpath will warn the user in this case.

Software
The NICEpath code can be executed with any python version up to 3.7. The NetworkX python library (https://networkx.github.io/) was used to implement and search the reaction graph. An extensive list of libraries used can be found in the specification file on GitHub.

Weighted substrate-product pairs capture the main biotransformations
To validate the biochemical relevance of weighted substrate-product pairs, we compared them to the KEGG RPAIR database. KEGG RPAIR distinguishes 'main' substrate-product pair of the reaction from secondary types of pairs (e.g. 'cofac', 'leave'). To take the alcohol dehydrogenase reaction as an example, the main pair would be the transformation of the primary alcohol to the aldehyde, and the conversion of the cofactor NADþ to NADH would be of type 'cofac' (Fig. 2). 'Main' pairs are used to draw the KEGG metabolic pathway maps. Therefore, a method that accurately predicts KEGG RPAIRS of type 'main' can be used to reconstruct biologically relevant metabolic pathways. It should be noted that KEGG discontinued the manual definition and curation of RPAIRS in 2016, and replaced the concept of RPAIRS with an automatically calculated alternative, RCLASS.
We validated the NICEpath method by predicting KEGG RPAIRS of type 'main' using the concept of the Conserved Atom Ratio. We used BNICE.ch to calculate CAR values for a test set of 6546 KEGG reactions for which the exact reaction mechanism is known, and which are, therefore, reconstructed by BNICE.ch (Supplementary Table S1). From these 6546 reactions, we determined 10 747 substrate-product pairs with a non-zero CAR, meaning that at least one non-hydrogen atom is conserved between the substrate and the product (Supplementary Table  S2). Out of these 10 747 pairs, 5148 were found to be KEGG RPAIRS of type 'main'. Since RPAIRs are defined based on the conservation of structural moieties within a reaction, we hypothesized that the higher the CAR value, the more atoms conserved between a substrate and a product, and hence the higher the probability that the pair would be a KEGG RPAIR of type 'main'. We should therefore be able to predict the membership of a pair to the set of 'main' KEGG RPAIRS by using a given CAR threshold as a classifier. To test our hypothesis that the CAR is a good predictor for a reactant-product pair to be of KEGG RPAIR type 'main', we performed a Receiver-Operator Characteristic (ROC) analysis (Fig. 3). The reference for true pairs were the 5148 'main' RPAIRs (true positives), and the remaining 5599 pairs were true negatives.
For 100 CAR cutoff values between zero and one we calculated the number of good predictions (i.e. number of pairs with a CAR above the cutoff and of type 'main', or true positives) and bad predictions (i.e. number of pairs with a CAR above the cutoff and not of type 'main', or false positives). By drawing true positives versus false positives, we found an Area Under Curve (AUC) of 0.88. An AUC above 0.8 is considered an 'excellent discrimination' (Hosmer and Lemeshow, 2000). We further show the tradeoff between sensitivity and specificity, as well as the Youden's index (i.e. sensitivity þ specificity -1) to characterize this tradeoff (Youden, 1950) and to determine an optimal CAR cutoff. We found that the Youden's index is maximal at a CAR equal to 0.34, which suggests that this is the optimal CAR cutoff to tell whether a given substrate-product pair conserves enough atoms to be considered a 'main' pair. This analysis shows that we can reliably use the CAR to predict KEGG RPAIRS of type 'main'. The network of weighted KEGG reactant pairs for 6546 KEGG reactions is included in the NICEpath software and used as a reaction database in the default search.

Graph-theoretical analysis of metabolic networks
Characterizing biochemical networks from a graph-theoretical point of view can be used to evaluate the quality and connectivity of the represented network, and also bring new insights into the overall organization of metabolism. Furthermore, knowing the graph-theoretical properties of a biochemical network can be crucial for anticipating potential problems in the pathway search. NICEpath provides basic network statistics that allow us to assess the quality of the data. Here, the weighted graph of the KEGG network used for validation initially contained 5578 compounds, or nodes, and 20 911 directed edges representing reactant-product pairs. Certain graph properties are not defined for weighted directed graphs, such as the number of components or the network diameter. For (B) Decarboxylation reaction: the atoms of the reactant are distributed between a leaving CO 2 molecule with a low CAR value and a product molecule with a higher CAR value corresponding to the 'main' RPAIR calculating these properties, a simple, non-directed graph was generated by removing reactant-product pairs with a CAR lower than the previously calculated optimal threshold of 0.34 and by removing the weights on the remaining reactant-product pairs. The unweighted graph contains 5518 nodes and 5541 edges, which are distributed over 813 smaller disjoint graphs, or so-called components. The biggest component contains 2663 nodes (48%) and 3422 edges (62%), and it has a network diameter of 40. In other words, the longest shortest pathway connecting two compounds counts 40 biotransformation steps in the main component of the KEGG network. This means that our KEGG network is dominated by one big component, or 'island', that includes half of the metabolites in KEGG and represents the core metabolism plus connected secondary metabolism. The remaining metabolites are organized in small, disconnected subnetworks, which we hypothesize to be mostly secondary metabolites without defined biosynthesis pathways.

Finding biologically relevant pathways with NICEpath
To illustrate the output of NICEpath, we discuss two example pathway searches. In the first example, we tried to biochemically connect tyrosine to caffeate, and we allowed a maximum number of ten pathways to be found. The pathway search resulted in ten pathways with lengths ranging from two to six consecutive reaction steps ( Table 1). The quality of the pathway can be estimated from the pathway score and the average CAR. The pathway score sums the distances for each reactant-product pair in the pathway. The score reflects both the length of the pathway as well as the quality of atom conservation within the pathway, and it is eventually used by NICEpath to rank the paths. The average CAR estimates the quality of the pathway by averaging the atom conservation over each reaction step. Out of these ten best pathways, the pathways ranked first, second and fifth were chosen for visual inspection (Fig. 4). The first pathway had a very low score of 2.24 combined with a high average CAR (0.89) and a length of two, which indicatesthat the pathway is of good quality because it only requires a small number of steps, all of them showing high atom conservation. Indeed, KEGG proposes this pathway in its phenylpropanoid biosynthesis map, meaning that it is biologically relevant. The second pathway, although longer, has a similarly high average CAR of 0.93, a length of four steps, and it can also be found in KEGG. To contrast these two good pathway examples with a poor example, the pathway ranked fifth shows a slightly lower average CAR of 0.81, which is due to the attachment and subsequent detachment of a one-carbon unit. In this pathway, out of five reaction steps, the last step is redundant with the first pathway, while the first four steps describe a detour from tyrosine to coumarate (C00811). This last, suboptimal pathway cannot be found in the KEGG map for phenylpropanoid biosynthesis.
In a second example, we searched for pathways connecting the compounds tyrosine and syringin. The number of pathways to be found was restricted to five, and we used three different transformations to calculate the distance between reactant-product pairs: The default transformation 1/CAR, the square root transformation, and the exponential transformation. Using the default option, NICEpath first listed three short pathways with a low average CAR ($0.5), followed by two longer pathways with high average CAR ($0.8) ( Table 2). The square root option yielded only short pathways with a low average CAR, while the exponential option only resulted in longer pathways of high average CAR. Interestingly, all the long pathways with high CAR were identified as known metabolic pathways in KEGG, indicating that increasing the influence of the CAR on the distance by choosing an exponential transformation operator is helpful to reliably extract longer pathways.
Two pathways were chosen to understand in detail the influence of the type of transformation used for calculating the distances between reactants and products: one was short with a low CAR (A) and one was long with a high CAR (D*) (Fig. 5). Pathway A connected tyrosine to syringin in four reaction steps, with a relatively low average CAR of 0.51. As already indicated by the low CAR, the pathway turned out to be a shortcut through pathway was ranked first in the default and the square root transformation types, but, interestingly, ranked 1114th in the exponential case when re-running the search with an upper limit of 2000 pathways. The exponential transformation increases the penalty of atom loss in biotransformation, which leads to a higher pathway score assigned to the shortcut pathway. Pathway D* connected tyrosine to syringin in eight reaction steps, with a high average CAR of 0.86. It was ranked first using an exponential transformation, ranked fourth using the default distance calculation, and ranked 43rd for the square root case. This second pathway kept the molecular core structure of tyrosine and modified it to produce syringin, conserving a maximum number of atoms. Remarkably, this pathway is part of the KEGG pathway map for phenylpropanoid biosynthesis, and it 6.27 0.81 6.41 0.94 6.44 0.93 6.50 0.93 6.50 0.93 6.60 0.91 Note: KEGG identifiers are used to specify compounds and reactions. The maximum number of pathways (k) was set to 10, and only one reaction alternative was printed when several reactions could do the same biotransformation.  Table 1 connecting tyrosine and caffeate with index numbers 1, 2 and 5 are visualized in detail for comparison. For each biotransformation, the CAR value as well as the default distance (d) are indicated can therefore be called a confirmed, biologically meaningful pathway. These two examples of pathway search problems illustrate the capacity of NICEpath to efficiently extract biologically relevant pathways from large biochemical networks. The algorithm robustly handled searches for long pathways of eight and more biotransformation steps, as they are usually present in secondary metabolism.
To demonstrate the universality of our approach, we performed a systematic validation on 50 pathways collected from KEGG consisting of 40 biosynthesis and 10 biodegradation routes and involving between 3 and 15 reactions steps (Supplementary Table S3, Supplementary Fig. S2). We evaluated the NICEpath performance using the three proposed transformation operators and compared it to a two-sided breadth-first search within (i) an unweighted network of substrate-product pairs with edges where the CAR exceeds the identified threshold value of 0.34, (ii) an unweighted network of 'main' KEGG RPAIRs (as used in the KEGG PathPred server) and (iii) an unweighted network of all possible KEGG RPAIRs without cofactors. The first and second networks are expected to yield similar results, because a CAR threshold of 0.34 has been shown to well predict KEGG RPAIRs of type 'main'. The third network represents the common approach of removing cofactors from a network of all possible substrate-product edges to avoid extracted pathways to shortcut through cofactors acting as hub metabolites.
For each of the reference pathways, we performed a k-shortest path search within each network between the source and the end compound of the pathway. To evaluate the performance, we retrieved the rank of the reference pathway within the top 100 pathways produced by the algorithm, and we measured the runtime of the search algorithm (Supplementary Table S4). We could show that within the weighted networks, the exponential transformation resulted in longer search time, but it had the highest probability of finding the reference pathway first, while the square root transformation yielded the opposite result (Supplementary Figs S3 and S4.). We further found that the runtimes within the unweighted networks were significantly lower than within the weighted networks, which is due to the fact that the unweighted network allows for a two-sided search starting simultaneously from the source and the target compound. Within the unweighted networks, searches within the network without cofactors were found to be the slowest and the less accurate. The RPAIR 'main' network and the CAR > 0.34 network showed a similar performance with respect to runtime, but the ranking performance was better in the CAR > 0.34 network, which is due to the fact that not all reactions within the reference pathway had RPAIRs assigned in KEGG.
Interestingly, the ranking performance of the exponentially transformed network and the unweighted CAR > 0.34 were very similar. This indicates that when the algorithm runtime becomes a limiting factor, the search can be performed within an unweighted graph (CAR > 0.34) to reduce the runtime, allowing a two-sided search starting simultaneously from the source and the target compound and thus speeding up the search.

Limitations and future challenges
There are cases in which NICEpath will not find satisfactory solutions. Possible reasons for suboptimal results are (i) the network does not contain the necessary reactions to connect the starting compound to the target compound, and (ii) the source and the target compound initially have only a few atoms in common. The first issue can be solved by adding the missing reactant-product pairs to the network. Missing steps can be hypothesized manually or predicted using reaction prediction tools such as BNICE.ch. The second issue is more complex, since it depends on the molecular structure of the source and target compound, as well as on the real number of biochemical transformations needed to transform one into the other. Possible solutions to improve the output include breaking down the search into several sub-searches by identifying intermediates and increasing the penalty on atom loss by using an exponential transformation of the CAR into the distance between  Note: Three different CAR transformations were used to calculate the distance between pairs of reactants and products: The first one (1/CAR) is the default option in NICEpath, the second is a square root transformation that decreases the effect of atom conservation, and the third one is an exponential transformation that increases the effect of conserved atoms in reactant-product pairs. Pathways are mapped across distance transformations with letters indicated in the column labeled 'Mapping'. *Pathways marked with an asterisk correspond to known metabolic pathways that fulfill the criteria of biologically relevant pathways.
reactants and products. While our algorithm successfully circumvents the recurrent problem of shortcuts through small hub metabolites, it does not satisfactorily avoid shortcuts through big hub metabolites such as Coenzyme A (CoA). In fact, reactant pairs involving CoA structures on both sides have a lot of atoms in common, and therefore a high CAR value. For this reason, NICEpath excludes CoA by default from the reactant pair network.

Conclusion
We introduce a new pathway search method based on weighted reactant-product pairs. To our best knowledge, this is the first to use automatically generated atom-weighted reactant-product pairs in combination with a k-shortest graph search approach. We benchmark our method for reactant-pair weighting against the KEGG RPAIR database, and we evaluated the performance of the proposed pathway search on 50 pathways obtained from KEGG. The strong point of NICEpath is that it is suitable for big biochemical networks, spanning more than hundreds of thousands biochemical reactions, such as hypothetical reaction networks generated by retrobiosynthesis tools and predictive biochemistry (Hadadi et al., 2016;Hafner et al., 2020). Pathway search constitutes the first step, and hence the foundation, of the overall pathway design pipeline. Downstream pathway analyses include the evaluation of the stoichiometric and thermodynamic feasibility of a pathway within a host organism via Flux Balance Analysis and Thermodynamic Flux Analysis, respectively, the estimation of the kinetic properties of the pathway, and the assessment of enzyme availability (e.g. from databases or from enzyme prediction tools) (Hadadi and Hatzimanikatis, 2015). As all these tools require substantial effort and computational resources, it is key that the pathways proposed initially by a pathway search tool only deliver biologically feasible biotransformation routes. We estimate that the future development of reaction prediction tools, based on biochemical reaction rules or machine learning methods, will yield big hypothetical reaction networks that require optimized search tools to efficiently extract biochemical pathways. Furthermore, the presented method to translate metabolic networks into a graph structure can be used in the future to analyze the global characteristics of biochemical networks, such as the diameter of a network or its connectivity, and finally to detect and map knowledge gaps in metabolic databases.
Finally, the herein proposed framework will lay the foundation for further developments. Other types of weights, such as kinetic and thermodynamic considerations, can be integrated into the weighting of substrate-product pairs to steer the pathway search toward biochemically feasible pathways, and a set of user-defined parameters will make it easy to fine-tune the pathway search. Additionaly, metrics other than the here proposed CAR could be utilized to quantify atom conservation within substrate-product pairs, such as the Jaccard index (Jaccard, 1908). The NICEpath code is available on GitHub (https://github.com/EPFL-LCSB/nicepath), and it comes with a collection of 5434 known metabolic reactions with pre-calculated atom-weighted reactant pairs.