MCPNet: a parallel maximum capacity-based genome-scale gene network construction framework

Abstract Motivation Gene network reconstruction from gene expression profiles is a compute- and data-intensive problem. Numerous methods based on diverse approaches including mutual information, random forests, Bayesian networks, correlation measures, as well as their transforms and filters such as data processing inequality, have been proposed. However, an effective gene network reconstruction method that performs well in all three aspects of computational efficiency, data size scalability, and output quality remains elusive. Simple techniques such as Pearson correlation are fast to compute but ignore indirect interactions, while more robust methods such as Bayesian networks are prohibitively time consuming to apply to tens of thousands of genes. Results We developed maximum capacity path (MCP) score, a novel maximum-capacity-path-based metric to quantify the relative strengths of direct and indirect gene–gene interactions. We further present MCPNet, an efficient, parallelized gene network reconstruction software based on MCP score, to reverse engineer networks in unsupervised and ensemble manners. Using synthetic and real Saccharomyces cervisiae datasets as well as real Arabidopsis thaliana datasets, we demonstrate that MCPNet produces better quality networks as measured by AUPRC, is significantly faster than all other gene network reconstruction software, and also scales well to tens of thousands of genes and hundreds of CPU cores. Thus, MCPNet represents a new gene network reconstruction tool that simultaneously achieves quality, performance, and scalability requirements. Availability and implementation Source code freely available for download at https://doi.org/10.5281/zenodo.6499747 and https://github.com/AluruLab/MCPNet, implemented in C++ and supported on Linux.


Introduction
Gene network inference from high-throughput gene expression data is a compute intensive task. Although several different inference algorithms based on Bayesian networks (Hartemink 2005), mutual information (MI) theory (Faith et al. 2007, Aluru et al. 2013) Pearson correlation (Langfelder and Horvath 2008), regression (Bonneau et al. 2006, Huynh-Thu et al. 2010, and random forest (Aibar et al. 2017) have been developed over the past two decades, scalability remains a critical challenge when working with tens of thousands of genes and/or observations. Aluru et al. (2022) found that 9 of the 15 methods analysed failed to infer networks from a dataset with $18 000 genes. Furthermore, of the six that succeeded, four required between 4 and 49 days to complete. To construct large networks with tens of thousands of genes in reasonable time, scalable algorithms and efficient parallel implementations are necessary. Arboreto (Moerman et al. 2019) recently introduced distributed computing capability to construct random forest-based transcription factor (TF)-target gene regulatory networks (GRNs); however, this is still not scalable for large networks. TINGe (Zola et al. 2010), using a parallel MI-based approach can construct large genome-scale gene networks in a significantly shorter amount of time.
Past surveys conducted with established benchmark data suggest that Bayesian and MI-based methods are among the best performing (Marbach et al. 2012, Lachmann et al. 2016, Aluru et al. 2022) with respect to the quality and accuracy of inferred interactions. However, unlike Bayesian networks, MI-based methods are more amenable to large scale network reconstruction (Chockalingam et al. 2017). One key caveat though is that these methods require post-processing such as Stouffer Transform in CLR (Faith et al. 2007) or MI value filtering based on P-values and data processing inequality (DPI) measures in ARACNe-adaptive partitioning (AP) (Lachmann et al. 2016) and TINGe. Such filtering complicates network evaluation by standard metrics such as the area under the precision-recall curve (AUPRC) as it introduces discontinuities in the value range. Additionally, DPI requires a usersupplied tolerance parameter that is challenging to determine a priori. As DPI reflects the information transmission capacity, it is a special case of the maximum capacity path (MCP) problem in graph theory, which has previously found applications in network routing (Pollack 1960), image compositing (Fernandez et al. 1998), and metabolic pathway analysis (Ullah et al. 2009). Formulated as such, the tolerance parameter is no longer required.
In this paper, we present a novel gene network reconstruction approach based on MCP to characterize and compare indirect gene interactions to identify significant gene-gene interactions or TF-target relationships without thresholding or user-specified parameters. Our unsupervised approach examines all indirect paths between two genes to compute the maximum indirect interaction capacity and allows efficient investigation of paths of arbitrary lengths. Using the same framework, we also established an ensemble method that combines interactions from multiple path lengths using optimized weights identified with partial groundtruth. We further created an efficient parallel implementation of the algorithm, called MCPNet, that can scale to hundreds of cores. We evaluated performance of our unsupervised and ensemble approaches using MI as the direct interaction measure and paths with length up to four for both gene-gene and TF-target interactions and demonstrate that our method delivers higher network quality and superior computational performance when compared to other state-of-the-art gene network reconstruction software.

Materials and methods
A gene expression dataset with samples S is formulated as a matrix P with jVj rows and jSj columns, each row corresponding to expression levels for a gene, and each column corresponding to the gene expression profile of a sample in S. Here, we reconstruct and examine two different types of gene networks-gene co-expression networks (GCN) and GRN. GCN is defined as an undirected weighted graph Vg is the set of edges representing interacting genes, and W is a matrix of size N Â N, where the entry W ij is an edge weight that quantifies the interaction strength of v i with v j . GRN is directed in contrast to a GCN, and can be defined as G ¼ hU; V; E; Wi, where U ¼ fu 1 . . . u M g represent the regulators, V is the set of N target genes. The edge set is E ¼ fðu i ; v j Þju i 2 U; v j 2 Vg, and W ij contains the weights corresponding to the edges in E. Since a GRN is directed, W is rectangular.
Different simplifying mathematical models have been adopted to compute the edge weights. Correlation measures such as Pearson computes W ij from the expression profiles of genes v i and v j . Correlation measures are commutative, W ij ¼ W ji , thus the square weight matrix of GCN is symmetric but may contain negative elements. In the Arboreto model (Huynh-Thu et al. 2010), W ij indicates the degree to which the expression of gene v i can predict v j 's gene expression, thus GCN W is non-negative but may not be symmetric. An alternative to correlation is MI, where the expression profiles for genes v i ; v j 2 V, rows i and j in matrix P are modeled as random variables X i and X j , from which MI is estimated. Since MI is non-negative and commutative, GCN W is symmetric and positive semi-definite. As these measures compute pairwise interaction strengths, the weight matrix W of a GRN can be computed element-wise in the same way as W for a GCN.
Our proposed method transforms the input edge weight matrix W to produce a new MCP score that accounts for indirect interaction strengths and can leverage different interaction strength measures while preserving the directionality of the measures. In this paper, MI is used to demonstrate the performance of our proposed method, though our method is agnostic of the semantics of the W matrix or its symmetry and thus is suited for both undirected GCN and directed GRN reconstruction. For W with negative values, our proposed method can use the magnitude, i.e. absolute value, of the interaction strength instead.

MCP score
We define an indirect interaction between two genes as one that is mediated through one or more intermediary genes and can be modeled as a length-L path ðv s ; . . . ; v ir ; . . .
where v s and v t are source and target genes, v ir is an intermediary gene in V indexed by i r 2 f1 . . . jVjg, and r 2 f1 . . . L À 1g indicates position along the path. Indirect biological interactions are well-known in eukaryotic organisms-e.g. the gene networks of biochemical, cellular, and signal transduction pathways where a change in the structure or function of a gene/protein can have indirect effects on pathway genes that are not adjacent to each other through regulatory positive or negative feedback (Harris and Levine 2005, Lu et al. 2007, Mitrophanov and Groisman 2008, Itzhack et al. 2013, Vermeirssen et al. 2014, Rittschof and Robinson 2016, Cowen et al. 2017. Reconstructing the network from gene expression profiles is guided by two goals: identifying strong direct interactions between gene pairs, and simplifying the network by removing direct interactions made redundant by strong indirect interactions. Our proposed method seeks to accomplish both by first identifying the strongest indirect interactions between two genes, then compare it to the direct interaction strength.
The maximum indirect interaction strength is determined by solving the MCP problem. Figure 1 illustrates different length-2 and length-3 paths between genes v s and v t as dashed lines and their direct interactions as solid lines. For each path, the minimum edge weight is its maximum capacity. Among all the possible length-L paths between v s and v t in G, the MCP is the path with the highest minimum edge weight. We use g L st to denote the capacity of such a path between v s and v t , and refer to it as L-path Capacity or Path Capacity. Formally, Figure 1. Computing g 2 st and g 3 st in a network requires exploring different paths with one and two intermediate vertices, respectively, as shown above.
The maximum operation finds a combination of indices i 1 ; . . . ; i LÀ1 with the maximum path capacity, with each index having range f1 . . . jVjg. Multiple indices may refer to the same gene, thus forming a cycle in the L-path, representing, e.g. feedback loops for gene regulation. In subsequent sections, the range f1 . . . jVjg for the indices is implied in the maximum operation.
Once computed, g L st is compared to the direct interaction strength W st . We define the L-path MCP score between the genes v s and v t , denoted as q l st , as the following ratio: We use both the terms L-path MCP Score and MCP Score to refer to q L st . The output of our proposed method is denoted R L , a matrix of all the q L st 's for every pair of s and t. As a ratio of the edge weight to the path capacity, we expect q L st to be an indicator of the interaction strength of genes v s and v t while factoring in the effect of other interactions in the L-neighborhood of v s and v t . Strong direct interactions relative to indirect ones suggest that the edge should be kept in the network. We note that the DPI-based approach used by ARACNe-AP and TINGe can be modeled as a specialization of the L-path MCP score, as shown in Supplementary Section S1.1. Whereas ARACNe-AP and TINGe uses a DPI threshold to filter the MI matrix thus removing edges representing weak direct interactions compared to their indirect counterparts, our approach quantifies the relative strengths of the direct and indirect interactions and allows for post-processing according to application needs, e.g. thresholding to retain relatively strong direct interactions by examining the precision-recall curve.

MCP score algorithm
The 2-path capacity g 2 st ¼ max i ðminðW si ; W it ÞÞ can be computed explicitly by enumerating all possible length-2 paths between v s and v t . Algorithm 1 shows the pseudo-code for this approach. The 2-path MCP score defined in Section 2.1, q 2 st , can be calculated for the complete network directly by first computing g 2 st for a pair of genes via Algorithm 1, iterating over the two gene expression profiles in OðjVjÞ time. The algorithm has a run-time complexity of OðjVj 3 Þ as it iterates over all elements of a jVj Â jVj matrix (Algorithm 2).
Algorithm 2 applies for L-path MCP scores for any arbitrary L. Naive extension of Algorithm 1 to longer range interactions is exponential in computational complexity, however, as an increase of L by 1 increases computation by jVjfold. For an indirect interaction of length L, the naive g L st computation complexity is OðjVj LÀ1 Þ, leading to OðjVj lþ1 Þ to compute the scores for all gene-gene pairs. Pollack (1960) showed that the maximum capacity g L st of all length-L paths between two vertices in a graph can be computed efficiently via recursive path bisection. Formally, where h ¼ bL=2c. The length-2 realization, g 2 st , forms the base case for the recursion. This partitioning leverages the associative properties of the maximum and minimum operations. Let (1) can then be rewritten as An L-path does not require unique genes on the path. This implies that c si h does not depend on i hþ1 . . . i LÀ1 , and c si h is effectively a constant with respect to the max i hþ1 ;...;iLÀ1 ðÞ operation. Since the maximum and minimum operators are distributive over each other with respect to a constant, i.e. maxðminðz; xÞ; minðz; yÞÞ ¼ minðz; maxðx; yÞÞ, Applying the identity again leads to Equation (3).The recursive path bisection allows L-length MCP scores to be computed in OðjVj log 2 LÞ for a single gene-gene pair, and the L-path capacity for all gene pairs to be computed in OðjVj 3 log 2 LÞ time. A realization of Equation (3) is shown in Algorithm 3. For a gene pair (v s , v t ), the g values for half-length indirect interactions between v s and v t with all possible intermediary genes v r 2 V are computed. The g value for (v s , v t ) is then computed following the MCP problem solution as the twopath capacity case.

Parallel implementation
MCPNet pipeline consists of a sequence of functions. We first compute the MI between pairs of genes, using the gene expression profile matrix as input. The resulting MI matrix was then transformed to MCP score via the algorithm described in First, we note that Algorithm 2 has data access pattern identical to matrix-matrix multiplication, with Algorithm 1 replacing vector dot product as the computational kernel. Indeed, fast matrix multiply algorithms have been applied to the MCP problem (Vassilevska et al. 2007, Duan andPettie 2009) to achieve sub-cubic run-time. For simplicity and ease of parallelization, we compute each g directly in OðjVjÞ time. In contrast to the dominant product algorithm by Vassilevska et al. (2007) that requires random memory access, our algorithm incurs only sequential memory access and therefore is cache-friendly and can leverage hardware memory prefetching.
Since each element can be computed independently from the other elements of the output matrix, a simple parallelization scheme is sufficient. The output matrix is partitioned into 8 Â 8 tiles and tiles are assigned to the compute nodes and CPUs. Use of tiles improves cache reuse as the eight rows and eight columns of the input matrices needed to compute one tile are likely to remain in cache memory during computation.
To ensure that the required input matrix row and column to compute one g 2 st value are in local memory, the MI matrix is replicated on each compute node, and shared between cores of the same node. The MI computation is similarly organized across all nodes, with the gene expression profile matrix replicated and the MI matrix tiles partitioned across cores. One round of all-to-all personalized communication via message passing interface reconstructs the MI or MCP score matrices on all compute nodes. The input replication reduces communication during computation. The overall parallelization approach is shown in Figure 2.
Finally, the computation to produce g 2 st is readily vectorized using element-wise minimum and maximum operators. For reconstruction where the input weight matrix W is symmetric, the MCP score matrix is symmetric as well thus half of the computation can be avoided.

Ensemble multipath MCP scores
MCP scores as defined in Section 2.1 are for a fixed path length. To account for indirect interactions of multiple lengths, we consider a linear combination of the g values in computing an ensemble DPI score: where a 2 , a 3 , . . ., a L are coefficients for the different indirect interaction path lengths, with the standard constraint that the weights add to 1.0. The optimal ensemble parameters are generally challenging to determine. We propose to use available partial groundtruth to help optimize the ensemble parameters. For large gene networks, there are known and experimentally validated genegene and TF-target interactions. Using the known interactions as partial groundtruth for optimizing the ensemble parameters, the proposed ensemble approach promises to improve predictions of unknown interactions in the remainder of the network.
For the ensemble multi-path MCP score, we compute a weighted average of the 2-, 3-, and 4-path capacities by iterating over combinations of coefficients at a default interval of 0.1 for each coefficient. The AUPRC of the resulting networks are computed based on known interactions. The coefficient combination with the maximum AUPRC is used as the optimal combination for the ensemble. The ensemble approach allows the network inferring process to adjust to the dataset available, the organism of interest, and the available partial groundtruth.

Evaluation methodology
We evaluate performance of MCPNet using both simulated and real datasets, and compare it with seven other popular gene network reconstruction software-four MI-based methods [ARACNe-AP (Lachmann et al. 2016), CLR (Faith et al. 2007), MRNET (Meyer et al. 2008), TINGe (Aluru et al. 2013)], a Pearson correlation-based method (WGCNA; Langfelder and Horvath 2008), a random forest-based method (Arboreto; Aibar et al. 2017), and a regression-based method (Inferelator; Bonneau et al. 2006). Arboreto and Inferelator are ensemble methods, in contrast to the other methods referenced above which are standalone. Each method evaluated is executed with its default parameters, or published parameters for the target dataset (Tchourine et al. 2018). We applied the standard statistical measure AUPRC Figure 2. Paralellization scheme for the MCP score algorithm. The input W matrix is replicated to all nodes and shared between cores of a node. Each core (colored) computes some tiles (light shading in the MCP matrix) in the MCP matrix thus the output matrix is distributed across the compute nodes. For each MCP tile being computed (solid square with thick border in the MCP matrix), the required rows and columns are shaded in the W matrix according to the core color. Pan et al.
for assessing the inferred networks. AUPRC is the area under the precision-recall curve, and is recommended in cases where there is an imbalance of positive and negative edges in the underlying network (Davis and Goadrich 2006). Using AUPRC for both ensemble optimization and final network evaluation leads to the ensemble network having the maximal AUPRC for the given partial groundtruth. However, when different groundtruth are used to parameterize the AUPRC function, the generalization performance of the MCPNet ensemble method can be assessed as demonstrated in Section 4.4. For in silico evaluations, we used NetBenchmark (Bellot et al. 2015), an R Bioconductor package that internally employs the GeneNetWeaver simulator (Schaffter et al. 2011) to generate synthetic yeast datasets. NetBenchmark uses 5196 real interactions from the Yeast Transcriptional Regulatory Network (Balaji et al. 2006) and employs non-linear ordinary differential expression simulation to produce 2000 observations for 2000 genes. Noise is then injected at five different levels in ten replicas to create 51 total datasets. The 5196 real interactions are used as true positives in the groundtruth, while the remaining 1 993 804 gene-pairs are considered as true negatives. The groundtruth for the simulated, as well as real Yeast and Arabidopsis thaliana are majority TF-target interactions and are used for both GCN and GRN evaluations. For synthetic TF-target network reconstruction, 187 TFs from Inferelator source repository (Castro et al. 2019) were included.
To determine accuracy and scalability of gene network reconstruction with real-world data, we used gene expression data from two different organisms-S. cerevisiae and A. thaliana. The S. cerevisiae (yeast) dataset is a compilation of multiple yeast RNA-seq expression studies, and contains 2577 observations and 5716 genes (Tchourine et al. 2018). To determine the quality of the real-world yeast networks, we adopted the groundtruth from Castro et al.  (Teixeira et al. 2006) and SGD (Costanzo et al. 2014) databases, and TF-target interactions derived from chromatin accessibility data (Boyle et al. 2008, Buenrostro et al. 2013) and TF binding sites (Weirauch et al. 2014) by associating TFs with the closest downstream genes. For real yeast TF-target network reconstruction, the 563 TFs from the Inferelator source repository (Castro et al. 2019) were included.
The A. thaliana microarray data was downloaded from public repositories (Aluru et al. 2022), and processed according to (Chockalingam et al. 2016). Data from six different tissues/conditions with varying sizes of gene expression datasets (Table 1) were used for gene network reconstruction. Microarray data was used as it was readily available from our previous studies, and it demonstrated the applicability of the MCPNet methods to both RNAseq and microarray data. Arabidopsis thaliana network(s) were evaluated using interactions from the following two known networks as groundtruth: (i) Arabidopsis Transcriptional Regulatory Map (ATRM) constructed from a priori biological knowledge by Jin et al. (2015), which includes 1359 TF-target interactions; (ii) 295 highest confidence TF-target interactions with scores greater than 9.5 selected from the 6863N-response dynamic factor graph (DFG) network edges identified in Brooks et al.
(2019, Supplementary File 10). These 6863 edges are themselves considered high confidence by Brooks et al., as they were selected from the network generated by DFG with timeseries nitrogen-treatment response data (Marbach et al. 2012) based on the 71 836 validated TF-target interactions from TARGET assays. This approach of incorporating highconfidence computed interactions have been used previously to validate gene networks (Vermeirssen et al. 2014, Castro et al. 2019. The interactions from these two given networks can be considered as positives, i.e. interactions that are expected to be highly weighted edges in any predicted network. For true negatives, we used a set of 4347 interaction pairs between chloroplast-encoded and mitochondriaencoded genes (Aluru et al. 2022) given the unlikelihood of a direct transcriptional interaction occurring between such genes in A. thaliana tissues (Woodson and Chory 2008). For A. thaliana TF-target network reconstruction, 2015 TFs from Chen et al. (2017) were included.
All software were run on a computing cluster with each node having two 2.7 GHz 12-Core Intel Xeon Gold 6226 Processor processors and 256 GB of main memory and running RedHat Enterprise Linux (RHEL) 7.0 operating system. For the simulated and Saccharomyces cervisiae RNA-seq datasets, all 24 cores of a node were used for software that could be run with multiple cores. For example, Arboreto, WGCNA, TINGE, Inferelator, and MCPNet are capable of using all the cores in a node. For A. thaliana datasets, we used up to eight nodes and all of the 192 cores distributed across these eight nodes for Arboreto, TINGE and MCPNet, which are capable of using multiple distributed cores. While Inferelator is also capable of utilizing multiple shared cores, it was not used in this case due to extremely long run-times. Scripts used for the evaluations presented here can be accessed at https://doi.org/10.5281/zenodo.6499756.

Results
The proposed MCPNet method takes as input correlation measures, which for the purpose of evaluation presented here are the MI values between pairs of gene expression profiles. We evaluated three common methods for estimating MI values from observed data and chose adaptive partitioning (AP) with rank transform for its accuracy and speed (see Supplementary Section S1.2). We use our implementation of the AP MI estimation algorithm based on the hybrid multithread, multi-node parallelization approach in Section 2.3. The existing tools utilized their respective built-in MI algorithms where applicable.
The unsupervised MCP score has a single parameter, path length L. Empirical testing with simulated Yeast data showed that the network quality reached maximum at L ¼ 4 and MCPNet improvements diminished beyond 4 (Supplementary Section S1.2). Subsequent evaluations of MCPNet, including the ensemble multipath scores, were conducted with L of 2, 3, and 4. The ensemble method optimizes the multipath score via a global coefficient search at a regular interval, which is a useradjustable parameter. We used 0.1 as the search interval as it represents a good balance between search quality and computation time.

Evaluation using simulated yeast expression data
We begin by evaluating the effect of noise in the gene expression profile data on GCN quality. Simulated yeast datasets were generated with varying noise levels using the NetBenchmark package.
For each of the datasets MCPNet generates three different sets of networks: (i) L-path MCP score networks (i.e. MCP score matrices R 2 ; R 3 , and R 4 from Section 2.1); (ii) ensemble networks constructed using ensemble method discussed in Section 2.4 [i.e. M 4 matrix in Equation (4)] constructed as the weighted combination of R 2 ; R 3 , and R 4 matrices; and (iii) Stouffer-transformed ensemble network, where Stouffer's Z-transform is applied to each of the M 4 matrices and the AUPRC for the best combination is reported as M 4 Z . This is similar to post-MI processing in CLR (Faith et al. 2007) to reduce background contribution to the score. Table 2 shows the average AUPRC for networks constructed by MCPNet and other network construction methods. Results for all of the simulated data instances are given in Supplementary Table S2. The rows R 2 ; R 3 , and R 4 are networks constructed using the path scores [Equation (2)] with path lengths of 2, 3, and 4, respectively. Note that for M 4 and M 4 Z networks, the average AUPRCs for the best combination is reported.
As expected, the AUPRC values, and hence the network performance, decreases with increasing noise levels. Our results also show that ensemble MCP Score achieves the best performance for low and moderately noisy datasets when compared to all the other gene network reconstruction methods. At noise levels higher than 50%, MCPNet is the second best performing method. The selection of best coefficients for each of the L-path capacities (i.e. R 2 ; R 3 , and R 4 ) results in an improvement over the individual MCP score networks. By exploiting known interaction information from user supplied partial groundtruth, the ensemble approach is able to identify the best combination for each of the noise levels. Thus, MCPNet's ensemble method is able to adjust to the unique data and noise characteristics of each dataset through fine tuning of its coefficients.

Evaluation with real gene expression datasets
We next evaluated performance of MCPNet using real-world data from two different organisms-yeast (S. cervisiae) and A. thaliana. For GCNs inferred from the real yeast data, our results show that Arboreto is the highest performing method followed by M 4 Z (Table 3). For this dataset, all methods showed low AUPRC scores, and the absolute AUPRC differences between Arboreto and M 4 Z can be considered marginal. The improvement of M 4 Z over M 4 and the relatively high AUPRC for CLR suggest that Stouffer transformation as a post-processing step is beneficial for this particular dataset. While Arboreto achieved the best network quality, it does so in $29 h and 7 GB of memory using 24 CPU cores whereas MCPNet completed the computation in 38 s and less than half of the memory. The quality and run-time balance compared to existing tools strongly favors MCPNet for its ability to reverse-engineer a good-quality network in reasonable time. An interesting observation is the uniformly low AUPRC values when compared to those in Section 4.1, and A. thaliana. The same observation can be seen in Castro et al. (2019) which suggest this maybe inherent in the expression profile data or the groundtruth.
Arabidopsis thaliana networks were constructed using data from the Development category (Table 1) containing the largest number of microarray datasets. Data were processed both on one node and eight distributed nodes for all network inference methods, with the exception of Inferelator which does not support multi-node execution and whose run-time for the development dataset is expected to exceed the cluster's job f Arboreto and Inferelator are expected to exceed run-time limit for single-core run.  a Numbers indicate average AUPRC values over ten sampling runs. The best performing networks are highlighted in bold, and the second best is underlined. Coefficients for the best performing M 4 and M 4 Z networks vary for each randomly sampled network. scheduler time limit. Table 4 shows that for the A. thaliana dataset, R 4 and M 4 produced the best results when compared to all other methods. Importantly, all the MCPNet algorithm variants (R 2 ; R 3 , and R 4 ) and ensemble methods (M 4 ; M 4 z ) outperformed all existing network inference methods in terms of AUPRC. With the large A. thaliana data, MCPNet's performance advantage over existing tools is even more evident. The two tools with the next best AUPRC values, CLR and Arboreto, required 18.5 h using 1 CPU core and 44.1 h and 437 GB of memory using eight nodes, respectively, while MCPNet completed the computation in <10 min on one node and 84 s on eight nodes with moderate per node memory footprints.
Finally, we evaluate the proposed methods using six A. thaliana datasets of different tissues and experimental conditions, each with differing number of samples and genes (see Table 1). Table 5 shows the AUPRC and run-times for the multipath ensemble MCP score network (M 4 ) and the next best results amongst ARACNe-AP, CLR, MRNET, TINGe, WGCNA, and Arboreto. Complete Results for all the methods are given in Supplementary Table S3. For all of the datasets analysed, MCPNet M 4 shows higher or equivalent AUPRC values while the next best AUPRC is achieved by CLR, MRNET, or Arboreto depending on the datasets, while being consistently faster by up to three orders of magnitude.
MCPNet consistently produces networks that are amongst the highest in quality as assessed by AUPRC and fastest in performance by orders of magnitude (see Supplementary Section S1.3 for a more in-depth computational performance discussion). Our results show that dataset characteristics can significantly influence the quality of the reconstructed network. The coefficients of the weighted average of L-path capacities for the best performing M 4 and M 4 z networks also varied by dataset as noted in Tables 3 and 4. The optimization approach therefore is well-suited for the ensemble methods as it adjusts automatically to data characteristics. In addition, since MCPNet is capable of efficiently generating multiple networks in one run (R 2 ; R 3 ; R 4 ; M 4 , and M 4 z ), the user can efficiently choose and further evaluate the networks with an appropriately chosen, biologically relevant downstream metric.

GRNs with synthetic and real yeast data sets
In this section, we evaluate MCPNet for the task of reconstructing GRNs, namely to compute the maximum path capacities between TFs and their target genes. By the definition of indirect interactions in Section 2.1, indirect TF-target interaction path may be modeled as a combinatorial sequence of TF-TF, TF-gene, and gene-gene interactions. For this evaluation, we focus on the mathematical path model consisting of a sequence of TF-TF interactions followed by one TF-target interactions as a simplification, similar to ARACNe-AP's DPI modeling for length 2 indirect interactions. We note that an alternative simplification, TF-target gene. . .gene-gene, is trivially contained in the GCN output matrix. GRNs were reconstructed MCPNet, ARACNe-AP, Arboreto, and Inferelator using real yeast and noise-free simulated datasets. CLR, MRNET, TINGe, and WGCNA were not included as they do not support GRN inferencing. Arabidopsis thaliana is not used for the evaluation as its true negative groundtruth do not involve the genomic TFs, thus precluding evaluation by AUPRC. Table 6 shows the qualities of the reconstructed GCNs and GRNs. All of the reconstructed TF-target GRNs have significantly higher AUPRC than their GCN counterparts. MCPNet methods, M 4 and M 4 Z particularly, outperformed existing methods for both GCN and GRN reconstructions using simulated datasets. For the real yeast dataset, MCPNet's M 4 Z method is second only to Arboreto in network quality, reflecting the same trend as for the GCN reconstruction. The high AUPRC values suggest that the mathematical formulation and algorithm implementation of MCPNet is well suited for both GCN and GRN inference.

Ensemble optimization with partial ground truths
The MCPNet ensemble method optimizes the coefficients of g 2 s;t ; g 3 s;t ; g 4 s;t to maximize the AUPRC given available partial groundtruth and gene expression data as input. The M 4 results in Table 2 were generated by optimizing the ensemble network AUPRC using the full groundtruth. For real-world a The software that produced the next best AUPRC are noted as footnotes. The first two columns report the input and output matrix sizes as millions of elements. b CLR. c MRNET. d Arboreto. data, often only partial groundtruth is available, and the size and membership of the partial groundtruth affect network quality.
To assess the impact of the partial groundtruth, we divide the full yeast groundtruth randomly into training and testing sets, and use the training set to optimize for an ensemble network from the noise-free simulated yeast gene expression data. The ensemble network is then evaluated using the test set and the full groundtruth. As the AUPRC is parameterized with different groundtruth, effectively the AUPRC functions for optimization and final network evaluation are distinct and independent. This process is repeated 100 times to characterize the distribution of the AUPRC values and the robustness of the ensemble algorithm with varying groundtruth. We evaluated training-testing split proportions of 0.5%, 1%, 2%, 5%, and 10% of the full groundtruth. The yeast groundtruth contain 92 732 gene-gene interactions, representing 0.57% of all possible gene-gene interactions between 5716 yeast genes. Figure 3 shows the distributions of the AUPRC values for each training-testing splits and full groundtruth. As expected, smaller training sets produced greater AUPRC value dispersion, as the ensemble network optimization relies on fewer groundtruth elements, and AUPRC computation is more influenced by variations in the gene-gene or TF-target interactions.
In all GCN cases, the AUPRC values are tight in dispersion, with standard deviations for the test AUPRCs being 0.0118, 0.0045, 0.0040, 0.0023, and 0.0026 for the 0.5-10% training-testing splits, respectively. Even for networks generated from only 0.5% of the full set, the AUPRC distribution lies completely above the ARACNe-AP's AUPRC value of 0.3503 and nearly all above R 4 's AUPRC of 0.3938, and the majority of the networks have AUPRC more than 0.4. The means, 0.4135, 0.4169, 0.4173, 0.4184, and 0.4201, are remarkably close to the 0.4198 AUPRC of the network optimized with the full groundtruth, despite the networks being generated using small training sets. The similarity of the AUPRC distributions between test sets and full groundtruth reflects the fact that the test sets are 90-99.5% identical.
For the GRN cases, the AUPRC values are significantly higher than their GCN counterparts, likely due to better curated groundtruth. The AUPRC distribution standard deviation tightens (0.0088, 0.0063, 0.0050, 0.0036, 0.0036) as the size of training set increases. Similarly, the mean AUPRCs (0.4979, 0.4999, 0.5025, 0.5054, 0.5070) for the trainingtesting splits are higher than that of R 4 , 0.4988, and converges to the AUPRC of the ensemble network optimized using the full TF-only groundtruth, 0.5069, even when training set represents a small fraction of the already small set of TF- Figure 3. The distributions of AUPRC values for the ensemble GCNs (left) and GRNs (right) produced from training sets randomly selected from the full and TF-only groundtruth, respectively. The horizontal axes show the percentages of the full or TF-only groundtruth used as training set. Each violin plot shows the AUPRCs evaluated against the test set (left half of violin), and the full groundtruth (right half of violin). The solid green, dashed blue, and dotted red lines show the AUPRC values of the ensemble networks generated from the full or TF-only groundtruth, the R 4 network, and ARACNe-AP, respectively. only groundtruth. Regardless of the training set split, however, MCPNet's ensemble network outperforms ARACNe-AP significantly. These results suggest that even when the ensemble network is optimized using a small partial groundtruth set, with high probability it will be of higher quality than the R 4 network. Furthermore, a small partial groundtruth, even at 1% of full groundtruth, will on average produce a network with similar quality as one optimized using the full groundtruth. Finally, the ensemble method is robust to varying make-up of the training set, and thus the choice of ensemble weights.

Discussion
The efficient algorithm and parallel implementation of MCPNet creates opportunities for data exploration and biological hypothesis testing. Its scalability to hundreds of CPU cores enables analyses of data with sample and gene counts that are previously prohibitive, and exploration of parameter spaces and multiple methods at the same time, e.g. computing length-2, -3, and -4 MCP scores, our ensemble network and its Stouffer transform in one pass. This is useful where the individual methods are suspected to behave in a data-dependent manner, e.g. with Stouffer transform in CLR and M 4 Z for the real yeast data, and inspecting multiple networks is desirable. In addition, as the MCP score formulation differs from existing algorithms, MCPNet provides complementary network that can contribute to multi-method ensemble approaches, i.e. combining networks generated by the tools evaluated in this work.
There are multiple aspects of the MCPNet algorithm and implementation that deserve further exploration. An important consideration for MCPNet development is the usability of the software. While the evaluations in this paper utilized nodes in a high performance cluster, the real yeast and the A. thaliana experiments showed that commodity personal computers are well suited for smaller datasets, while professional workstations should easily exceed the requirements for processing large datasets using MCPNet. At the same time, user friendly integration into common, existing bioinformatic programming environments and pipelines will ease downstream processing and biological interpretation of the reconstructed network and promote user adoption. Memory footprint reduction and language binding for Python and R/Bioconductor represent two future software engineering efforts for MCPNet. The MCP mathematical model underlying MCPNet is agnostic of the edge weight formulation. While we focused on MI values as edge weights, other measures such as Boolean values and Pearson correlations may allow different network modeling and incorporate information such as up-and downregulations. Our evaluations showed that MCPNet performed well with MI of RNAseq and microarray data as averaging improves signal to noise ratio, but such may not be the case for single cell transcriptomic data as MI estimation with highly sparse data may present significant challenges for MCPNet and gene network reconstruction in general, and information regarding groundtruth for individual cell types is still limited for validation. Single cell transcriptomic data, promises to allow cell-type specific GCNs and GRNs, thus MCPNet applicability to such data is of significant interest.

Conclusion
In this paper, we present a new method MCPNet for GCN and GRN reconstruction that utilizes a novel edge scoring metric, the MCP score, based on the MCP problem in network and graph analysis. MCP score characterizes indirect gene-gene and regulator-target interactions to identify significant direct interactions in an unsupervised manner and with minimal user-specified parameters. We further present an ensemble approach that uses partial network groundtruth to optimize the weighted average of MCP scores from multiple Lpaths scores, with the objective of inferring higher quality novel interactions. The MCP score methods have been shown to generate networks with competitive or superior AUPRC scores on real-world S. cerevisiae and A. athaliana datasets, while reducing run-time by several orders of magnitude relative to existing best-in-class methods.
While we have applied MCPNet for GRN and GCN reconstruction, it operated on gene expression profile from a single time point, thus cannot infer causality, similar to the other existing tools evaluated in this work. As a consequence, the predicted gene co-expressions and regulatory interactions must be considered as candidates for further validation. With this understanding, the combination of network quality and computational performance suggests MCPNet to be a viable first-choice tool for gene network reconstruction and candidate gene and TF-target interactions.