fastBMA: scalable network inference and transitive reduction

Abstract Inferring genetic networks from genome-wide expression data is extremely demanding computationally. We have developed fastBMA, a distributed, parallel, and scalable implementation of Bayesian model averaging (BMA) for this purpose. fastBMA also includes a computationally efficient module for eliminating redundant indirect edges in the network by mapping the transitive reduction to an easily solved shortest-path problem. We evaluated the performance of fastBMA on synthetic data and experimental genome-wide time series yeast and human datasets. When using a single CPU core, fastBMA is up to 100 times faster than the next fastest method, LASSO, with increased accuracy. It is a memory-efficient, parallel, and distributed application that scales to human genome-wide expression data. A 10 000-gene regulation network can be obtained in a matter of hours using a 32-core cloud cluster (2 nodes of 16 cores). fastBMA is a significant improvement over its predecessor ScanBMA. It is more accurate and orders of magnitude faster than other fast network inference methods such as the one based on LASSO. The improved scalability allows it to calculate networks from genome-scale data in a reasonable time frame. The transitive reduction method can improve accuracy in denser networks. fastBMA is available as code (M.I.T. license) from GitHub (https://github.com/lhhunghimself/fastBMA), as part of the updated networkBMA Bioconductor package (https://www.bioconductor.org/packages/release/bioc/html/networkBMA.html) and as ready-to-deploy Docker images (https://hub.docker.com/r/biodepot/fastbma/).


Our Contributions
We have previously described ScanBMA [14], an implementation of Bayesian model averaging (BMA) [20] for inferring regulatory networks. ScanBMA is available from the "networkBMA" Bioconductor package [21], written in R and C++. It has been shown that ScanBMA generates compact accurate networks that can incorporate prior knowledge.
In this paper, we present fastBMA, which is completely written in C++, and uses more efficient and scalable regression and hashing methods. The algorithmic improvements increase the speed by a factor of 30 on smaller sets, with greater increases observed on larger sets due to improved scalability.
fastBMA is parallelized using both OpenMP and MPI allowing for further increases in speed when using multiple cores and processors. Although fastBMA uses the same core methodology as ScanBMA, the increased scalability allows for more thorough sampling of the search space to increase accuracy. The new probabilistic hashing procedure used by fastBMA is faster and utilizes 100,000 times less memory when analyzing large numbers of variables. This allows fastBMA to operate on genome scale datasets without limiting the possible regulators of a given gene to a smaller subset.
A new feature of fastBMA is the implementation of a novel method for eliminating redundant indirect edges in the network. The post-processing method can also be used separately to eliminate redundant edges from networks inferred by other methods. The code is open-source (M.I.T. license). fastBMA is available from GitHub (https://github.com/lhhunghimself/fastBMA), in R as part of the networkBMA package (https://www.bioconductor.org/packages/release/bioc/html/networkBMA.html) and as Docker images (https://hub.docker.com/r/biodepot/fastbma/). The Docker containers include all the supporting dependencies necessary for MPI and make it much easier to run fastBMA on a local or cloud cluster.

Bayesian model averaging (BMA)
We can formulate gene network inference as a variable selection problem where the dependent variable (target gene expression) is modeled as a function of a set of predictor variables (regulatory gene expression). Time series data can also be modeled by using the expression at the previous time point to predict the next time point. Different candidate models can be constructed from different sets of regulator genes. Models can be evaluated based upon a measure of their goodness of fit, such as the sum of residuals. However, in genetic analyses, the number of genes often exceeds the number of samples, and many different models can fit the data reasonably well. The core idea behind the BMA methods is to find the posterior probability for each model and make a consensus prediction giving proportionately more weight to the higher-confidence models. In terms of gene regulation, the posterior probability that gene j is a regulator of gene i is the sum of the posterior probabilities of all candidate models that include gene j in the set of regulators of i. This posterior probability becomes the weight of the edge drawn from gene j to gene i in the gene network.
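The edge-weighting rule above can be sketched in a few lines of C++ (an illustrative toy, not fastBMA's actual data structures): the weight of the edge from regulator j to target i is simply the sum of the posteriors of the candidate models for target i that include regulator j.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative sketch: each candidate model for a target gene has a set of
// regulator indices and a normalized posterior probability.
struct Model {
    std::vector<int> regulators;   // indices of regulator genes in the model
    double posterior;              // normalized posterior probability
};

// Weight of the edge regulator -> target: the sum of the posteriors of all
// candidate models (for that target) that include the regulator.
double edgeWeight(const std::vector<Model>& models, int regulator) {
    double w = 0.0;
    for (const auto& m : models)
        for (int r : m.regulators)
            if (r == regulator) { w += m.posterior; break; }
    return w;
}
```

For example, if regulator 1 appears in two models with posteriors 0.5 and 0.3, the edge from gene 1 to the target receives weight 0.8.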

Estimating model posterior probabilities
Estimation of the posterior probabilities of the models can be accomplished by a variety of methods, some of which are very computationally intensive [12]. The original BMA [20] and iterative BMA (iBMA) methods [22] use the Bayesian Information Criterion (BIC) [23] which is simple to calculate and penalizes larger models which are easier to fit. However, BIC is an asymptotic approximation that is most accurate for large sample sizes. As an alternative, ScanBMA provided the option of using Zellner's g prior [24] to compute the posterior probabilities. The optimal g prior parameter is obtained by finding the value that maximizes the total posterior probability of the models. Adjusting the range of possible values for the g prior allows us to tune the method for smaller sample sizes and produce better networks.
fastBMA exclusively uses the g prior to estimate the posterior probabilities and uses faster C++ code for the expectation maximization (EM) optimization of the g parameter.
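The BIC route used by the original BMA can be sketched as follows (a standard textbook form for Gaussian linear regression, not fastBMA's g-prior code): BIC penalizes each extra parameter by log(n), and model posteriors are then approximated as proportional to exp(-BIC/2), normalized over the candidate set.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Bayesian Information Criterion for a linear model with k predictors,
// n samples, and residual sum of squares rss (illustrative sketch).
// Larger models pay a log(n) penalty per parameter.
double bic(double rss, int n, int k) {
    return n * std::log(rss / n) + k * std::log(static_cast<double>(n));
}

// Approximate posterior model probabilities from BIC scores:
// p_i proportional to exp(-BIC_i / 2), normalized over the candidate set.
// Subtracting the best (smallest) BIC first avoids numerical underflow.
std::vector<double> bicPosteriors(const std::vector<double>& bics) {
    double best = bics[0];
    for (double b : bics) if (b < best) best = b;
    double z = 0.0;
    std::vector<double> p(bics.size());
    for (std::size_t i = 0; i < bics.size(); ++i) {
        p[i] = std::exp(-(bics[i] - best) / 2.0);
        z += p[i];
    }
    for (double& v : p) v /= z;
    return p;
}
```

This makes the penalty concrete: with identical fit, a model with one more predictor loses a factor of sqrt(n) in approximate posterior probability.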

Sampling candidate models
The number of possible candidate models grows exponentially with the number of possible regulators, necessitating an efficient methodology to find a subset of reasonable models. In the original implementation of BMA, the leaps and bounds algorithm [25] is used to identify the n best models for a given number of variables. Occam's window [26] is then used to discard models with much lower posterior probabilities than the best model. The leaps and bounds algorithm scales poorly and is limited in practice to fewer than 50 variables. Iterative BMA (iBMA) uses a pre-processing step to rank all variables (genes), iteratively applies the original BMA to the top w variables (w=30 by default), and discards predictor variables with low posterior inclusion probabilities [13]. In the iterative step, new variables from the ranked list are added to replace the discarded variables. This procedure of repeatedly applying BMA and variable swaps is continued until the w top ranked variables have been processed. In contrast to iBMA, ScanBMA removes the restriction of the search space to an initial list of variables [14]. ScanBMA keeps a list of the best current linear regression models found so far and adds or removes a variable from these models to search for better models. The process is repeated until no new models are added or removed from the best set of models. ScanBMA's greedy approach and the implementation of its core routines in C++ enable it to typically run faster than iBMA. In this paper, we present fastBMA that uses the ScanBMA approach but exploits the fact that new models are based upon existing models. In particular, new models are fitted using the results from the existing models which increases the speed and scalability of the search.
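Occam's window, as described above, can be sketched as a simple filter (illustrative C++; the structure names are ours, not fastBMA's): models whose posterior odds against the best model exceed the chosen odds ratio are discarded from the candidate set.

```cpp
#include <cassert>
#include <vector>

// Illustrative model record; the regulator set is omitted for brevity.
struct ScoredModel {
    double posterior;   // (unnormalized) posterior probability
};

// Occam's window: keep only models whose posterior is within a factor of
// 1/oddsRatio of the best model's posterior.
std::vector<ScoredModel> occamsWindow(const std::vector<ScoredModel>& models,
                                      double oddsRatio) {
    double best = 0.0;
    for (const auto& m : models)
        if (m.posterior > best) best = m.posterior;
    std::vector<ScoredModel> kept;
    for (const auto& m : models)
        if (m.posterior * oddsRatio >= best)   // within the window
            kept.push_back(m);
    return kept;
}
```

With an odds ratio of 100, a model whose posterior is less than 1% of the best model's posterior falls outside the window and is discarded.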

Post-processing graphs by transitive reduction
BMA and other methods for reconstructing biological networks can generate edges between genes that are the result of indirect regulation through one or more intermediate genes. While a graph whose edges represent either direct or indirect interactions is perfectly acceptable, biological networks are usually represented by edges that represent direct interactions. Such networks allow for more straightforward identification of potential driver genes. For genetic networks, it is therefore desirable to remove edges between nodes where the regulation is indirect (transitive reduction). This can be done through post-processing of the inferred network. One intuitive approach is based on eliminating direct edges between two nodes when there is a better indirect path [27]. For example, Bosnacki recently proposed comparing p-values of the best edge in an indirect path with that of the direct path [28]. fastBMA introduces a similar approach that reduces transitive reduction to a shortest-path problem, which can be solved more efficiently for the sparse graphs typically found in gene regulatory networks. Table 1 summarizes the key differences between the different BMA implementations. Figure 1 shows an outline of fastBMA. In this section, we report our algorithmic and implementation contributions in fastBMA and our evaluation procedure. Pseudocode for the entire implementation is provided in supplemental materials.

Algorithmic outline of fastBMA
The core approach for fastBMA is similar to that used by ScanBMA. The best models are found using ScanBMA's search strategy with a starting value of g in the interval [1… NumberOfSamples]. Brent minimization [29] is then used to find the value of g in the interval that gives rise to the set of models with the highest marginal probabilities. A graph is constructed by drawing edges between genes with an edge weight equal to the average posterior probability of the regulator over the set of reasonable models.
Transitive reduction is applied to this graph to remove edges that can be adequately explained by a better indirect path. A final graph is constructed by retaining edges with weights greater than a given cutoff.
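The reduction to a shortest-path problem can be sketched as follows (illustrative C++ assuming edge weights strictly between 0 and 1; the function and graph layout are ours, not fastBMA's actual code). Taking -log of each edge probability turns the product of probabilities along a path into an additive path length, so a direct edge can be removed when a bounded search finds a strictly shorter indirect path.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// graph[u] = list of (target, probability) pairs for edges leaving node u.
// Probabilities are assumed strictly less than 1, so every hop adds a
// positive length and the bounded search always terminates.
using Graph = std::vector<std::vector<std::pair<int, double>>>;

// Returns true if some indirect path u -> ... -> v is strictly shorter
// (i.e., more probable) than the direct edge with probability directProb,
// in which case the direct edge is a candidate for removal.
bool hasShorterIndirectPath(const Graph& g, int u, int v, double directProb) {
    double bound = -std::log(directProb);       // length of the direct edge
    // Depth-first search bounded by the direct distance: once a partial
    // path is at least as long as the direct edge, it cannot explain it.
    std::vector<std::pair<int, double>> stack;  // (node, accumulated length)
    for (const auto& e : g[u])
        if (e.first != v)                       // skip the direct edge itself
            stack.push_back({e.first, -std::log(e.second)});
    while (!stack.empty()) {
        auto [node, dist] = stack.back();
        stack.pop_back();
        if (dist >= bound) continue;            // bounded: prune this branch
        if (node == v) return true;             // better indirect path found
        for (const auto& e : g[node])
            stack.push_back({e.first, dist - std::log(e.second)});
    }
    return false;
}
```

For instance, with edges A→B and B→C of probability 0.9 each, an A→C edge of probability 0.7 is explained by the indirect path (0.9 × 0.9 = 0.81 > 0.7) and would be removed, while an A→C edge of probability 0.95 would be kept.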
There are 4 major algorithmic improvements that increase the speed, scalability and accuracy of fastBMA:
1. Parallel and distributed implementation
2. Faster regression by updating previous solutions
3. Probabilistic hashing
4. Transitive reduction by mapping to a shortest-path problem

Parallel and distributed implementation
Parallelization can be accomplished by using a shared memory system, such as OpenMP (http://openmp.org/wp/), which is designed for assigning work to different threads in a single CPU with multiple cores. In contrast, MPI (Message Passing Interface) (https://www.mpich.org/) launches multiple processes on one or more CPUs and passes messages between processes to coordinate the distribution of work. Both of these approaches have their respective advantages and disadvantages. OpenMP is applicable only to CPUs on a single machine and is slightly slower for fastBMA. MPI is usable on a single machine or a cluster but requires some work to set up. fastBMA implements both approaches, allowing the user to choose the preferred methodology based on their requirements.
Inferring the entire regulatory network involves finding the regulators for every gene in the set. Since each of these determinations is carried out separately, each thread or process can be assigned the task of finding the regulators for a subset of genes in the set. When OpenMP is used, it provides a scheduler that dynamically assigns the regression calculations for a given gene to each thread. Threads work simultaneously on their tasks and receive a new task when they finish the previous task. All threads share access to memory and the same input data for the regression is available to all the threads. The parallel code only extends to the regression loop; the final transitive reduction post-processing and output is done by a single thread.
When MPI is used, we initially split the tasks evenly among the available CPUs. In the case of MPI processes, memory is not shared. Instead the input data is read by a master process and distributed to all the participating processes using MPI's broadcast command. All processes then work on their tasks simultaneously in parallel and send messages to all the other processes so that all processes know which tasks are being worked upon. The length of time required for each calculation varies considerably and as a result, some processes will finish before others. A process that finishes early then works on tasks initially assigned to other processes that have not yet been started. When all the regulators for all the genes have been found, a master process gathers the predictions, performs transitive reduction post-processing and outputs the final complete network. OpenMP can also be used in conjunction with MPI to further subdivide the tasks among threads available to a CPU.
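The dynamic assignment of per-gene tasks described above can be sketched with std::thread and an atomic task counter standing in for OpenMP's dynamic scheduler (a simplified illustration, not fastBMA's actual code; the per-gene regression is replaced by a trivial placeholder computation).

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each worker repeatedly claims the next unprocessed gene index and "fits"
// its regulators; because each index is claimed exactly once, every result
// slot is written by exactly one thread and no locking is needed.
std::vector<double> inferAllGenes(int nGenes, int nThreads) {
    std::vector<double> results(nGenes, 0.0);
    std::atomic<int> next{0};
    auto worker = [&]() {
        for (;;) {
            int gene = next.fetch_add(1);   // claim the next task
            if (gene >= nGenes) break;      // no tasks left
            // Placeholder for the per-gene regression / model search.
            results[gene] = gene * 2.0;
        }
    };
    std::vector<std::thread> pool;
    for (int t = 0; t < nThreads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return results;
}
```

Because tasks are claimed one at a time, a thread that draws several quick genes simply claims more work, which mirrors how an early-finishing MPI process takes over tasks initially assigned to others.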

Faster regression by updating previous solutions
Even with the above parallel implementation, each individual calculation of regulators is still accomplished by a single process. If the regression procedure is too slow, this step can be rate-limiting. fastBMA therefore fits new models by updating the results from previously evaluated models rather than solving each regression from scratch, which increases the speed and scalability of the search.

Probabilistic hashing
It is vital that the ScanBMA algorithm does not sample a model more than once to ensure that the method will converge and terminate. However, the methodology is quite tolerant of falsely excluding models that have not been sampled. ScanBMA only explores a small sample of the possible models; the vast majority of models are normally excluded. Furthermore, in the BMA approach, many models are averaged to obtain the final edges. Variables that are important appear in many models. In the rare case where a good model is falsely excluded, the impact is minimized because the key regulators in the falsely excluded model will be found in other models. When such false negatives are tolerated, an alternative to using a hash table is to ignore the collisions. This saves both time and space by removing the dependence on m for both time and space complexity. An example of a noisy or probabilistic hashing approach is the Bloom filter [30], which has been used for bioinformatics applications [31] due to its fast computation and low memory requirements.
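A collision-tolerant "seen before" filter in this spirit can be sketched as follows (a minimal illustration; the hash function, key encoding, and table size are assumptions of this sketch, not fastBMA's actual choices). A hash collision may falsely mark an unseen model as already visited, which the BMA search tolerates, but a model is never reported unseen twice.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// One-bit-per-slot filter: a model is encoded as a string of its sorted
// variable indices; collisions are simply ignored (possible false
// positives, never false negatives for a repeated model).
class ModelFilter {
public:
    explicit ModelFilter(std::size_t nBits) : bits_(nBits, false) {}

    // Returns true if the model was (probably) seen before,
    // and marks it as seen either way.
    bool testAndSet(const std::string& modelKey) {
        std::size_t idx = std::hash<std::string>{}(modelKey) % bits_.size();
        bool seen = bits_[idx];
        bits_[idx] = true;
        return seen;
    }

private:
    std::vector<bool> bits_;   // one bit per slot; no stored keys
};
```

Memory use is one bit per slot regardless of how many models are recorded, which is what removes the dependence on the number of stored models from the space complexity.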
fastBMA includes an optimized implementation of a probabilistic hash to detect previously evaluated models (see Figure 2).

For transitive reduction, fastBMA uses the selective criterion of comparing the best indirect path with the direct edge. The search is also bounded: once a path's distance exceeds the direct distance, there is no need to explore that path further. In addition, fastBMA produces graphs with few high-weight edges, and in practice the algorithm is much faster than the worst case as most searches are quickly terminated.

Datasets used for testing
We have previously benchmarked ScanBMA [14] against other network inference methods (MRNET [5], CLR [32], ARACNE [4], DBN [8], and LASSO [11,33]) on smaller test sets. In this study we compare fastBMA only to ScanBMA and LASSO, which were the two most accurate methods in these benchmarks and are the only two methods that could infer networks from the larger datasets in a reasonable time.
We used the following 3 datasets for testing.
1. Simulated 10-gene and 100-gene time-series data (5 sets of each) and the corresponding reference networks from DREAM4 [34-38]. As these datasets are simulated, the true regulatory relationships are known and are used to evaluate the accuracy of the predicted networks.
2. Yeast time-series expression data (ArrayExpress E-MTAB-412) consisting of 3556 genes over 6 time points [39]. Being actual data, there is no absolute ground truth. Instead, we compared the regulatory predictions with the literature-curated regulatory relationships from the YEASTRACT database [40].
3. Human single-cell time-series RNA-Seq data GSE52529 (9776 genes) from GEO [41]. As no satisfactory gold standard was available, we only used this to demonstrate that fastBMA could scale to noisy human genome-wide expression data.

Assessment metrics
We define a true positive (TP) as an edge in the inferred network that is also present in the ground truth network. We primarily use the area under the precision-recall curve (AUPR) and the area under the receiver operating characteristic curve (AUROC) for the assessment, as these metrics measure the overall performance of the methods. In practice, however, predicting some edges accurately, even if only for the most confident predictions, is still valuable for narrowing down a set of potential interactions to be further explored. Hence, we also plot the precision-recall graph to assess where the differences in accuracy are occurring.
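The precision of the top predictions can be computed as sketched below (illustrative C++; the edge representation and function name are ours): edges are ranked by confidence, and precision over the k most confident predictions is the fraction of those edges present in the ground-truth set.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;   // (regulator, target)

// Precision at k: sort predicted edges by descending confidence and count
// how many of the top k appear in the ground-truth edge set.
double precisionAtK(std::vector<std::pair<double, Edge>> scored,
                    const std::set<Edge>& truth, std::size_t k) {
    std::sort(scored.begin(), scored.end(),
              [](const auto& a, const auto& b) { return a.first > b.first; });
    if (k > scored.size()) k = scored.size();
    std::size_t tp = 0;
    for (std::size_t i = 0; i < k; ++i)
        if (truth.count(scored[i].second)) ++tp;   // true positive
    return k ? static_cast<double>(tp) / k : 0.0;
}
```

Sweeping k from 1 to the number of predictions traces out the precision-recall behavior at increasing recall.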

RESULTS
We applied our fastBMA algorithm to both simulated and real time series gene expression data. We had previously tested several methods on these datasets [14] and found that ScanBMA and LASSO were the fastest and most accurate methods. Therefore we compared the fastBMA results to ScanBMA and LASSO [33,42]. LASSO is a non-Bayesian linear regression method that uses a penalty term to prevent overfitting to models with many variables. It is written in Fortran and is one of the fastest network inference methods available. When running times are plotted on a logarithmic x-axis, fastBMA is orders of magnitude faster than ScanBMA using the same parameters. Alternatively, one can use a larger odds ratio with fastBMA and obtain a more accurate result in the same time it would take to run ScanBMA with a smaller odds ratio. On the same datasets, fastBMA is also more accurate and faster than LASSO, with the degree and nature of the improvement depending on whether the user chooses to emphasize speed or accuracy through the choice of the odds-ratio parameter.
One of the main advantages of the BMA methods is that they are able to incorporate prior information to improve inference. This was not possible for the DREAM4 dataset as it is a synthetic dataset; in this case, an uninformative uniform prior probability is used. However, for the yeast dataset we had access to priors from external data sources [12]. This also allowed us to triage the variables to be explored to the 100 variables with the highest prior probabilities, saving considerable computational resources. As expected, using informative priors increases the accuracy and decreases the running time relative to LASSO. In addition, we ran fastBMA without informative priors and without restricting the number of variables (i.e. using all 3556). This is beyond the capabilities of ScanBMA when using wider search windows. Even on this computationally demanding task, inferring the yeast network without informative priors, fastBMA is faster than LASSO with increased accuracy as assessed by AUROC and AUPR.
A common use for computational network inference is to identify a small set of potential regulators that could be verified with further experiments. For this use case, an improvement in the precision of the most confident predictions is more important than a small improvement in the overall performance of the method. As some of the differences in AUC for the yeast dataset are relatively small, we plotted the precision-recall curves in Figure 5. We see that the precision of the most confident predictions (i.e. lowest recall) is increased. The advantage of using informative priors when available is very clear. However, even when prior knowledge is not available, the fastBMA algorithm is superior, which is especially evident in the case of the DREAM4 dataset.
The effect of post-processing is more limited. In Figure 5, the precision-recall curves for the DREAM4 dataset are almost identical for fastBMA and LASSO with and without post-processing. The same result was observed for fastBMA on the yeast dataset and, for clarity, we did not plot the overlapping precision-recall curves for the post-processed fastBMA networks. However, we do see that post-processing has an effect on LASSO for the yeast dataset.
We also tested fastBMA on a human single cell RNA-Seq dataset with 9776 variables. Using a 32 core cluster on Microsoft Azure, fastBMA was able to obtain a network in 13 hours without using informative priors. Neither ScanBMA, nor LASSO is able to return results for this dataset. We do not have a gold standard for this test -the purpose was to demonstrate that fastBMA could handle a very large and noisy genomic sized dataset and return a network within a reasonable time even in the worst case scenario where the data is noisy and there is no prior information.

DISCUSSION AND CONCLUSIONS
We have described fastBMA, a parallel, scalable and accurate method for inferring networks from genome-wide data. We have shown that fastBMA can produce networks of increased accuracy orders of magnitude faster than other fast methods, even when using a single thread. Further speed increases are possible by using more threads or processes. fastBMA is scalable, and we have shown that it can be used to analyze human genome-wide expression data even in the most computationally demanding situation of noisy data, no informative priors, and considering all genes as possible regulators. The transitive reduction post-processing may prove useful as an adjunct to methods and datasets that give rise to denser networks and are more prone to over-predicting edges.
Although we have focused on biological time series data, fastBMA can be applied to rapidly infer relationships from other high-dimensional analytics data. The fastBMA methodology can also be extended for even more demanding applications; for example, multiple bit filters (i.e. a Bloom filter) could be used to hash larger search spaces. We anticipate that the efficiency of fastBMA will be especially useful for such applications.

Figure caption: We take the negative log of the probabilities (middle panel) to transform the multiplication into distances. The indirect path A→B→C is shorter than the direct path A→C, which is equivalent to the probability of A regulating C through B being greater than the probability of A directly regulating C. As a result, the edge between A and C is removed.