Identifying strengths and weaknesses of methods for computational network inference from single-cell RNA-seq data

Abstract Single-cell RNA-sequencing (scRNA-seq) offers unparalleled insight into the transcriptional programs of different cellular states by measuring the transcriptome of thousands of individual cells. An emerging problem in the analysis of scRNA-seq data is the inference of transcriptional gene regulatory networks, and a number of methods with different learning frameworks have been developed to address this problem. Here, we present an expanded benchmarking study of eleven recent network inference methods on seven published scRNA-seq datasets in human, mouse, and yeast, considering different types of gold standard networks and evaluation metrics. We evaluate methods based on their computing requirements as well as on their ability to recover the network structure. We find that, while most methods have modest recovery of experimentally derived interactions based on global metrics such as the Area Under the Precision Recall curve, methods are able to capture targets of regulators that are relevant to the system under study. Among the top performing methods that use only expression were SCENIC, PIDC, MERLIN, and Correlation. Addition of prior biological knowledge and estimation of transcription factor activities resulted in the best overall performance, with the Inferelator and MERLIN methods that use prior knowledge outperforming methods that use expression alone. We found that imputation did not improve network inference accuracy and could be detrimental. Comparisons of networks inferred for comparable bulk conditions showed that networks inferred from scRNA-seq datasets are often better than or on par with networks inferred from bulk datasets. Our analysis should be beneficial in selecting methods for network inference. At the same time, it highlights the need for improved methods and better gold standards for regulatory network inference from scRNA-seq datasets.


Introduction
Inference of genome-scale regulatory networks from global mRNA profiles has been a long-standing problem in systems biology and gene regulation. These networks are important for understanding the molecular mechanisms of diverse biological processes, including response to environmental signals, cell type specification and disease. The availability of single-cell omic data opens up new opportunities to reverse engineer transcriptional regulatory networks, enabling us to study the regulatory network from multiple cell types. Single-cell RNA-seq (scRNA-seq) measures genome-wide expression profiles of tens of thousands of cells, which greatly reduces the cost of generating datasets with the large sample sizes needed for computational network inference. Furthermore, it offers the potential to infer cell-type-specific regulatory networks for both known and novel cell populations in the same experiment, offering an efficient way to dissect heterogeneous populations.
The availability of scRNA-seq datasets has fueled the development of a number of methods for network inference from these data, which use different types of models including Gaussian graphical models, information theoretic methods (Chan et al. 2017;Dijk et al. 2018;Qiu et al. 2018), random forests (Aibar et al. 2017), ordinary differential equations (Matsumoto et al. 2017), and Boolean networks (Lim et al. 2016) (see Fiers et al. 2018;Nguyen et al. 2021 for comprehensive reviews). Methods also vary in their inclusion of pseudotime (Matsumoto et al. 2017;Specht and Li 2017;Qiu et al. 2018) or imputed, de-noised signals (Dijk et al. 2018). Some of these methods specifically model the statistical properties of scRNA-seq data (Intosalmi et al. 2018;McDavid et al. 2019), while others are adaptations of existing methods for bulk data (Aibar et al. 2017). In parallel with method development, there have been two benchmarking studies (Chen and Mar 2018;Pratapa et al. 2020) comparing existing network inference algorithms. One study showed that existing methods perform poorly on both simulated and real data (Chen and Mar 2018), while the second showed that methods that perform well on simulated data can be relatively poor on real data (Pratapa et al. 2020). While these studies have been useful, a number of open questions need to be addressed to gain a comprehensive understanding of the strengths and weaknesses of existing network inference algorithms for scRNA-seq datasets. In particular, existing benchmarking efforts have studied relatively small networks from real data (<2,000 genes), while genome-scale regulatory networks can have between 5,000 and 10,000 genes. Another unknown is the extent to which different gold standards (e.g. ChIP-chip/seq versus regulator perturbation) can affect measured performance, and the extent to which methods agree with each other on novel predictions and recovery of the gold standard networks. In addition, imputation of sparse scRNA-seq count matrices has been suggested to improve downstream analyses, including the identification of functional relationships among genes (Dijk et al. 2018); however, it is not clear whether this benefits network inference methods more generally. Finally, experiments on bulk RNA-seq datasets have shown that integrating auxiliary sources of data to inform the graph structure and to estimate transcription factor activity (TFA) levels is useful, but the utility of prior knowledge has not been assessed for scRNA-seq network inference.
To address these gaps, we first compared 11 network inference methods on seven published scRNA-seq datasets from human, mouse, and yeast (Fig. 1). Benchmarking the algorithms for time and memory consumption on datasets of different sizes identified several algorithms that are unlikely to scale to genome-scale gene regulatory networks. We compared algorithm performance using global network topology metrics, such as the Area Under the Precision Recall (AUPR) curve and F-score, as well as local metrics, such as the number of regulators whose targets can be predicted accurately. Based on global metrics, network inference methods had modest performance compared to random; however, the local metrics show that methods can predict targets of several relevant regulators. Generally, the top performing algorithms tend to agree with each other on predicted edges, although this depends upon the depth of the dataset. Imputation of scRNA-seq datasets, while beneficial for a handful of method and dataset combinations, did not benefit the majority of the network inference approaches and often reduced performance. To our surprise, simple correlation-based methods performed as well as or better than many network inference algorithms on these metrics. We compared the performance of these algorithms on single-cell datasets to that on bulk datasets for matched conditions and found comparable performance for both classes of algorithms. We also explored the effect of dataset size (i.e. the number of cells in a dataset) on network inference and observed that, when controlling for the number of cells, sequencing depth had a greater effect on network inference. Finally, methods that incorporated priors and Transcription Factor Activity (TFA) had the best overall performance across different metrics, outperforming methods that use expression alone. Overall, our study presents an extended comparison of recent algorithms for network inference from scRNA-seq datasets and identifies common strengths and weaknesses on different types of gold standards. Our results should be beneficial to the community for selecting an algorithm for network inference as well as for developing new network inference algorithms for scRNA-seq datasets.

Dataset preprocessing
We obtained a total of seven scRNA-seq datasets from three different species: human, mouse, and yeast (see Supplementary Table S2 for details). Each dataset was filtered in a two-step process. First, we filtered out genes that were expressed (count >0) in fewer than 50 cells and cells with fewer than 2,000 total UMIs. Next, we added back regulatory proteins such as transcription factors (TFs) and signaling proteins for each dataset in which they were expressed. Each dataset was depth normalized and the counts were square-root transformed. Finally, we removed genes that were not measured in any of the gold standards associated with the dataset. Pre-processing scripts are available at https://github.com/Roy-lab/scRNAseq_NetInference/tree/master/scripts/data_preprocessing. For the memory and time profiling experiments, we used the Han human pluripotent stem cell dataset (Han et al. 2018), which has a total of 5,520 cells, and selected gene subsets of size n ∈ {10, 25, 50, 100, 250, 500, 1000, 2000, 5000, 8000}. Each algorithm was applied to each of these subsets and profiled for memory requirements and execution time. Runtime and memory were captured using the /usr/bin/time -v command in a Linux environment.
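As an illustration, the filtering and normalization thresholds described above can be applied as in the following Python sketch (the file name and the exact normalization scale factor are placeholders; the published pre-processing scripts in the repository above are the authoritative version):

```python
import numpy as np
import pandas as pd

# Hypothetical input: genes x cells UMI count matrix (rows = genes, columns = cells).
counts = pd.read_csv("counts.tsv", sep="\t", index_col=0)

# Keep genes expressed (count > 0) in at least 50 cells.
counts = counts.loc[(counts > 0).sum(axis=1) >= 50]

# Keep cells with at least 2,000 total UMIs.
counts = counts.loc[:, counts.sum(axis=0) >= 2000]

# Depth-normalize each cell (here to the median library size) and square-root transform.
depth = counts.sum(axis=0)
expr = np.sqrt(counts / depth * depth.median())

# Remaining steps (adding back expressed regulators, restricting to genes present in the
# gold standards) are omitted for brevity.
```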

Regulatory network inference algorithms for single-cell RNA-seq datasets
To assess algorithms for regulatory network inference, we considered recently published methods that span a variety of modeling formalisms including ordinary differential equations, probabilistic models, and information theoretic methods. We considered algorithms developed for both bulk and scRNA-seq datasets for network inference. Here we provide a brief summary of each algorithm included in our analyses (see Supplementary Table S1 for more details about the algorithm implementation).

BTR
BTR is based on a Boolean model representation of a gene regulatory network (Lim et al. 2016). Here, a regulatory interaction is inferred by assuming each gene takes a value of either 0 or 1, and the state of a target gene is specified by a Boolean function of the states of its regulators.

HurdleNormal
This approach is based on a multi-variate Hurdle model, a probabilistic graphical model (McDavid et al. 2019) suited for handling zero-inflated datasets. The graph structure, which represents the gene regulatory network, is estimated by first defining the model in terms of a set of conditional distributions and then solving a set of penalized neighborhood selection problems.

SCHiRM
Single-cell hierarchical regression model (SCHiRM) is a probabilistic model-based framework suited for handling sparse datasets such as those from scRNA-seq experiments (Intosalmi et al. 2018). The expression level of a gene is modeled using a Poisson-lognormal distribution with overdispersion. A regulatory network is estimated by a regression function that links the expected expression level of a gene to the expected expression levels of its target genes using a hierarchical framework.

kNN-DREMI
kNN-DREMI recovers gene-gene relationships by adapting the DREMI algorithm (Krishnaswamy et al. 2014) to scRNA-seq data (Dijk et al. 2018). DREMI estimates mutual information between pairs of genes from a conditional density, using a heat-diffusion-based kernel density estimator to estimate the joint density followed by conditional density estimation. kNN-DREMI replaces the heat-diffusion kernel with a k-nearest neighbor (kNN) graph-based density estimator to handle the high dimensionality and sparsity of scRNA-seq data.

LEAP
LEAP (Specht and Li 2017) requires cells to be ordered along a pseudotime trajectory, and computes pairwise correlation between genes i and j at various lags along this trajectory. The algorithm chooses a window size s, and for all possible start times t and lags l, computes Pearson's correlation coefficient between the s observations of i beginning at time t and the s observations of j beginning at time t + l. The score for a regulatory relationship from i to j is the maximum ρij over all computed correlation coefficients.
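A minimal sketch of this scoring scheme (our simplification for illustration, not the authors' implementation) is shown below; it assumes expression vectors that are already ordered by pseudotime.

```python
import numpy as np

def leap_like_score(x, y, window, max_lag):
    """Simplified LEAP-style score: the maximum Pearson correlation between windows of
    regulator x and lagged windows of target y along a pseudotime-ordered axis."""
    n = len(x)
    scores = []
    for lag in range(max_lag + 1):
        for start in range(n - window - lag + 1):
            xi = x[start:start + window]
            yj = y[start + lag:start + lag + window]
            scores.append(np.corrcoef(xi, yj)[0, 1])
    return np.nanmax(scores)
```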

PIDC
PIDC (Chan et al. 2017) is based on an information-theoretic framework and estimates a specific type of multi-variate information called Partial Information Decomposition (PID) among three random variables, each representing a gene. PID is defined using the "unique," "synergistic," and "redundant" information between two variables, X and Y, and a third variable Z. The "unique" information is the contribution to the multi-information coming specifically from one of the two variables, "redundant" is the contribution from either X or Y, while "synergistic" is the contribution from both X and Y to Z. In the context of network inference, PIDC makes use of the ratio of the unique information to the mutual information between two variables and establishes an edge based on the magnitude of this ratio.

SCENIC
SCENIC is based on the GENIE3 (Huynh-Thu et al. 2010) network inference algorithm, which infers a gene regulatory network by solving a set of per-gene regression problems, each using an ensemble of regression trees. SCENIC also offers two additional steps: RcisTarget for identifying enriched motifs in the target set of a TF, and AUCell to determine the activity of a TF's target set in each cell. For our experiments, we only used the GENIE3 outputs from SCENIC.
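Since we only use the GENIE3 component, its core idea can be sketched as a per-target tree-ensemble regression whose feature importances score candidate regulators (a simplified scikit-learn illustration, not the SCENIC implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def genie3_like_scores(expr, tf_cols, target_col, n_trees=1000, seed=0):
    """expr: cells x genes array; tf_cols: column indices of candidate regulators;
    target_col: column index of the target gene. Returns TF -> importance scores."""
    predictors = [c for c in tf_cols if c != target_col]
    model = RandomForestRegressor(n_estimators=n_trees, max_features="sqrt",
                                  random_state=seed)
    model.fit(expr[:, predictors], expr[:, target_col])
    # Feature importances are interpreted as confidences of TF -> target edges.
    return dict(zip(predictors, model.feature_importances_))
```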

SCODE
SCODE is based on an ordinary differential equation formulation of a gene regulatory network and requires pseudotime as input in addition to the expression matrix (Matsumoto et al. 2017). SCODE makes use of a low-dimensional representation, z, of the major expression dynamics of regulators and a linear transformation matrix, W, to estimate the observed expression dynamics of all genes from z. The gene regulatory network is represented as a function of W and estimated by solving a set of linear regressions.
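Schematically (our notation, following the description above), SCODE assumes linear dynamics in a low-dimensional space and recovers the regulatory matrix through the estimated linear map:

```latex
\frac{d\mathbf{z}}{dt} = B\,\mathbf{z}, \qquad \mathbf{x} \approx W\,\mathbf{z}
\;\;\Longrightarrow\;\;
\frac{d\mathbf{x}}{dt} \approx W B W^{+}\,\mathbf{x}, \qquad A = W B W^{+}
```

where x is the observed expression vector, z its low-dimensional representation, W the estimated linear transformation (with pseudoinverse W+), and A is interpreted as the gene regulatory network.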

Scribe
Scribe (Qiu et al. 2018) is based on an information theoretic framework that uses pseudotime as an additional input for regulatory network inference. Scribe uses a specific type of information measure called Restricted Directed Information, which measures the mutual information between the current state of a target gene and a previous state of a regulator, conditioned on the previous state of the target gene. Scribe additionally uses a bias correction scheme over the samples such that cells capturing transitioning states are weighted more heavily when establishing regulatory relationships.
Fig. 1. Overview of the network inference benchmarking study: 11 algorithms developed for scRNA-seq datasets and two algorithms for bulk expression, applied to seven scRNA-seq datasets from three species (yeast, mouse, and human).

SILGGM
SILGGM makes use of a Gaussian graphical model representation of a gene regulatory network and offers several recent and efficient implementations for estimating conditional dependence between random variables (genes): bivariate nodewise scaled Lasso, de-sparsified nodewise scaled Lasso, de-sparsified graphical Lasso, and GGM estimation with false discovery rate (FDR) control using Lasso or scaled Lasso. For our experiments, we applied SILGGM with the de-sparsified nodewise scaled Lasso implementation of neighborhood selection, which is the default algorithm for SILGGM.

Bulk dataset network inference methods
In addition to the above methods, which are specifically geared towards scRNA-seq datasets, we also considered methods developed for bulk RNA-seq datasets: MERLIN (Roy et al. 2013) and Inferelator (Greenfield et al. 2013). MERLIN is based on a dependency network learning framework and employs a probabilistic prior to impose modularity on the inferred network. We ran MERLIN with stability selection and default settings for the sparsity p = −5, modularity r = 4, and clustering h = 0.6 parameters. Inferelator is also based on a dependency network learning framework and uses regularized regression for network inference. We used the BBSR algorithm within Inferelator v0.3.0, downloaded from https://github.com/simonsfoundation/inferelator_ng, and used Inferelator's internal stability selection to obtain edge confidences.

Pearson and random networks
Our analyses included two inferred networks as baselines. The "Pearson" network was an undirected fully connected network, with edges weighted by the Pearson correlation between each pair of genes over all cells. The "random" network was an undirected fully connected network, with edge weights drawn from a uniform distribution Unif(0, 1).
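Both baselines are simple to construct; a minimal sketch (assuming a cells × genes expression DataFrame) is:

```python
import numpy as np
import pandas as pd

def baseline_networks(expr, seed=0):
    """expr: cells x genes DataFrame. Returns an edge list with Pearson and random scores
    for every gene pair (edges are later ranked by the absolute correlation)."""
    corr = expr.corr(method="pearson")  # gene x gene Pearson correlation over all cells
    rng = np.random.default_rng(seed)
    genes = list(corr.columns)
    rows = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            rows.append((g1, g2, corr.loc[g1, g2], rng.uniform(0, 1)))
    return pd.DataFrame(rows, columns=["gene1", "gene2", "pearson", "random"])
```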

Comparison to network inference on bulk expression data
Methods used to infer networks on bulk data included GENIE3 (Huynh-Thu et al. 2010), TIGRESS (Haury et al. 2012), and MERLIN (Roy et al. 2013). To run GENIE3 we used K = sqrt and nb trees = 1,000; to run TIGRESS we used R = 1,000, α = 0.3, and L = 30; and to run MERLIN we used the default configuration of sparsity p = −5, modularity r = 4, and clustering cut-off h = 0.6. MERLIN was run in a stability selection framework, where 100 subsets were created by randomly selecting half the samples in the input expression matrix, a network was inferred on each subset, and a consensus network was made by calculating the frequency of observing each edge in the 100 inferred networks. As in the single-cell case, we created a network where edges were ranked by the absolute value of the Pearson correlation of the expression profiles of genes. PGG (Per Gene Greedy) is MERLIN without the modularity prior and was run using the same parameters and in a stability selection framework similar to MERLIN. To run LARS-EN we used lambda2 = 1 × 10⁻⁶ and the L1 penalty was set with ten-fold cross validation. We ran LARS-EN in a stability selection procedure similar to MERLIN. We used the MATLAB implementation of LARS-EN from the Imm3897 package downloaded from http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=3897.
We performed these comparisons on three settings with comparable bulk and single-cell conditions. In yeast, we used the environmental stress response data from Gasch et al. (2000) as the bulk dataset to match our scRNA-seq dataset (Gasch), also generated by Gasch and colleagues (Gasch et al. 2017). For human and mouse, we curated publicly available datasets from the Gene Expression Omnibus (Supplementary Table S4) for the human and mouse embryonic stem cell (ESC) states (A.F.S. in preparation). We compared the performance of the bulk human dataset to that of the Han dataset (Han et al. 2018) and the performance of the bulk mouse dataset to that of the Tran (A2S) dataset (Tran et al. 2019). Because the gene sets of the bulk and single-cell data were different, we filtered the gold standards to include TFs and targets that were present in both datasets to make the results comparable.

Assessment of the inclusion of priors and Transcription Factor Activity (TFA) on network inference
We tested the addition of priors and TFA on network inference performance for the MERLIN, SCENIC, and Inferelator algorithms. Specifically, for both MERLIN and Inferelator we tested the addition of priors, and priors with TFA. For Inferelator, we used the TFA estimated using the network component analysis (NCA) algorithm (Liao et al. 2003) (referred to as NTFA) as well as its inbuilt TFA estimation. For SCENIC, we tested the addition of NTFA. We selected these algorithms because they were either among the top performing algorithms and/or had the ability to incorporate this additional information. We refer to these extended versions of each algorithm with "+P" for addition of a prior (e.g. MERLIN+P is MERLIN with prior knowledge), "+TFA" or "+NTFA" for methods using TFA (e.g. SCENIC+NTFA), and "+P+TFA" if the algorithm used both prior and TFA (e.g. Inferelator+P+TFA). MERLIN+P and MERLIN+P+NTFA were applied in a stability selection framework using the same 100 subsets utilized for the algorithm without the additional information. SCENIC and Inferelator perform internal bootstrapping to extract confidence estimates on edges. We applied the algorithms to all of the datasets utilized in this paper with the exception of Zhao, due to the size of the dataset and time constraints.

For yeast, we used the prior network created in Siahpirani and Roy (2017). Briefly, this approach scanned the promoter of each gene (defined as 1,000 bp upstream of the first ATG) using TestMotif (Barash et al. 2005) with position weight matrices (PWMs) from Gordân et al. (2011) and YeTFaSCo (de Boer and Hughes 2012). The edges in the prior network were weighted using percentile ranking, where the lowest p-value is assigned the highest rank. For mammalian prior networks, we used PIQ (Sherwood et al. 2014), which scores motif instances using DNase I footprints. Motif instances were mapped to ±10 kbp of the transcription start site (TSS). For the mouse ESC (mESC) and dendritic (mDC) cell lines, PWMs from the CIS-BP database (Weirauch et al. 2014) were used, while for the human ESC (hESC) cell line, PWMs from the ENCODE (Kheradpour and Kellis 2013) and JASPAR (Khan et al. 2017) databases were used in addition to CIS-BP. For estimation of footprints, DNase I assays from ENCODE were used (ENCODE Project Consortium 2012). To add an edge between a TF and a gene, we required that the TF's accessible motif occur in the gene's promoter. The edge weight between a TF and a target was the p-value of the motif instance overlapping the promoter; if there were multiple instances, the most significant one was used as the score. A threshold was applied to the motif instance scores to create a network with similar edge density as the yeast motif prior network, followed by percentile ranking on the edge scores to create the final network.

Inferelator has the option to estimate TFA internally using the input prior network. We additionally applied the original NCA algorithm (Liao et al. 2003), which uses an iterative approach to estimate TFA. Briefly, given a gene expression matrix E and an adjacency matrix A, NCA estimates the TFA matrix P such that ‖E − AP‖₂² is minimized. The iterative approach works by estimating the P matrix from the given A matrix, and then updating the A matrix while fixing the P matrix. We evaluated the inferred networks using F-score, AUPR, and predictable TFs, and compared these algorithms to the baseline versions that did not use priors and/or TFA, as well as to correlation.
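A minimal sketch of this alternating estimation (our simplified version for illustration, not the original NCA code) is shown below; it preserves the sparsity pattern of the prior adjacency matrix while alternating least-squares updates of P and A.

```python
import numpy as np

def nca_like_tfa(E, A, n_iter=20):
    """E: genes x samples expression matrix; A: genes x TFs prior adjacency matrix
    (zero entries indicate no prior edge). Returns the TFA matrix P (TFs x samples)
    chosen to reduce ||E - A P||_2^2 while keeping the sparsity pattern of A."""
    A = A.astype(float).copy()
    support = A != 0
    for _ in range(n_iter):
        # Update P given the current A (ordinary least squares).
        P, *_ = np.linalg.lstsq(A, E, rcond=None)
        # Update the nonzero prior weights of each gene given the current P.
        for g in range(A.shape[0]):
            tfs = np.flatnonzero(support[g])
            if tfs.size:
                coef, *_ = np.linalg.lstsq(P[tfs].T, E[g], rcond=None)
                A[g, tfs] = coef
    return P
```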

Application of network inference algorithms
Execution and post-processing
Each algorithm was applied to the processed expression matrices on a local high-performance computing cluster using HTCondor (https://chtc.cs.wisc.edu). Where possible, we deployed each algorithm in a stability selection mode. Specifically, we selected 50% of the cells from each expression matrix at random, applied the algorithm to this subset, and repeated this process 100 times, averaging the interaction scores over the 100 subset networks to obtain a final inferred network. As some algorithms perform a similar downsampling internally (SCENIC, Inferelator), we did not run these algorithms within our external stability selection framework. Other algorithms could not be run 100 times due to excessive runtimes (PIDC, Scribe). After completing the runs of each algorithm, we converted the variety of network formats to a single three-column file (TF, target, score). The score varied depending upon the algorithm and could be a pairwise correlation (LEAP, Correlation), an information theoretic dependence (knnDREMI, Scribe, PIDC), a regression weight (SILGGM, SCODE), or a confidence estimate (SCENIC, Inferelator). The magnitude of the score was interpreted as the strength of the interaction. We restricted all networks to include only edges where the "regulator" was a known TF from one of our experimentally derived gold standard networks. The final networks were used in the evaluations of network accuracy and similarity (detailed below) and are available in Supplementary File S3. The scripts for post-processing the results are available at https://github.com/Roy-lab/scRNAseq_NetInference.
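The external stability selection wrapper can be sketched as follows (the `run_method` callable is a hypothetical stand-in for any of the algorithms; the actual runs were managed through HTCondor):

```python
import numpy as np
import pandas as pd

def stability_selection(expr, run_method, n_runs=100, frac=0.5, seed=0):
    """expr: cells x genes DataFrame; run_method(subset) must return a DataFrame with
    columns [TF, target, score]. Returns edge scores averaged over subsampled runs."""
    rng = np.random.default_rng(seed)
    total = None
    for _ in range(n_runs):
        cells = rng.choice(expr.index, size=int(frac * len(expr)), replace=False)
        net = run_method(expr.loc[cells]).set_index(["TF", "target"])["score"]
        total = net if total is None else total.add(net, fill_value=0)
    # Edges are subsequently restricted to regulators present in the gold standards.
    return (total / n_runs).rename("score").reset_index()
```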

Estimation of pseudo time
Some algorithms (LEAP, SCODE, Scribe) incorporate knowledge of cellular trajectory and pseudotime, which was learned using Monocle 2 (Qiu et al. 2017). To avoid biasing the trajectories with a choice of marker genes, we followed the best practices in the Monocle documentation to learn trajectories in an unsupervised manner. In brief, we clustered the cells using Monocle's density-peak clustering algorithm after dimensionality reduction with t-SNE. We next tested for genes differentially expressed across these clusters, and selected the top 1,000 most significantly differential genes (as measured by q-value) to use as marker genes. Finally, we learned a DDRTree embedding of the expression in each cell and applied Monocle's cell ordering function to infer the position of each cell on a trajectory and its corresponding pseudotime assignment. We used Monocle trajectories for all datasets except the yeast Jackson dataset, for which we used Diffusion Pseudo Time (DPT) (Haghverdi et al. 2016) due to the high runtime of the Monocle software.

Description of gold standards
We curated multiple experimentally derived networks of regulatory interactions from published databases and the literature (Supplementary File S3) to serve as gold standards for our network inference algorithms. These experiments are typically based on ChIP-chip, ChIP-seq, or regulator perturbation followed by global transcriptome profiling. We obtained multiple networks based on ChIP and TF perturbation experiments for each organism and cell type. When multiple ChIP or perturbation interaction sets were available, we took the union of the networks. We refer to the ChIP-derived gold standard as "ChIP" and the perturbation-derived gold standard as "Perturb". Finally, we took the intersection of these two unions as the third primary gold standard network for our evaluations of network accuracy (Perturb+ChIP).
In addition, we created a fourth, ESC-specific gold standard network from the primary literature (Zhou et al. 2007;Kim et al. 2008;Young 2011;Buganim et al. 2012;Dunn et al. 2014;Xu et al. 2014;Malleshaiah et al. 2016) by conducting a literature survey of gene regulatory networks for the ESC state. Regulatory edges were manually extracted from network figures in these publications and further curated by a stem cell biologist (see Acknowledgments). The specific publications and figures used to create our curated gold standard are in Supplementary Table S5.

Evaluation metrics
To evaluate the accuracy of the inferred networks, we used three metrics (AUPR, F-score, and predictable TFs) computed between each network and each relevant gold standard network. Prior to computing each accuracy measure, we filtered both the inferred network and the gold standard to include only edges containing TFs and target genes from the intersection of the gold standard's gene set with the original expression matrix's gene set. This eliminated any penalty for missing edges in the gold standard that could not be inferred by the algorithm due to lack of expression in the original data.

Area Under the Precision Recall curve (AUPR)
We used AUPR to measure the accuracy of the global network structure. Specifically, we sorted the inferred edges in descending order of confidence, and computed the precision and recall with respect to a gold standard as edges were sequentially included. We report the area under this curve as AUPR. We used the script auc.java from https://github.com/Arun-George-Zachariah/AUC (Davis and Goadrich 2006).
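The same computation can be expressed directly on a ranked edge list; a small sketch (using scikit-learn rather than the cited Java implementation) is:

```python
from sklearn.metrics import auc, precision_recall_curve

def aupr(edge_scores, gold_edges):
    """edge_scores: dict {(TF, target): confidence}; gold_edges: set of (TF, target).
    Area under the precision-recall curve over all scored edges."""
    edges = list(edge_scores)
    y_true = [1 if e in gold_edges else 0 for e in edges]
    y_score = [edge_scores[e] for e in edges]
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)
```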

F-score
We used F-score of the top x edges, x ∈ {100, 300, 500, 1000, 3000, 5000, 10000, 30000, 50000} to measure the accuracy of the most confident edges in an inferred network. Specifically, the F-score is computed as the harmonic mean of the precision and recall of the top x edges with respect to a gold standard.
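For a fixed cutoff x this amounts to a few lines (sketch):

```python
def f_score_top_x(ranked_edges, gold_edges, x=5000):
    """ranked_edges: list of (TF, target) sorted by descending confidence;
    gold_edges: set of (TF, target). F-score of the top-x predicted edges."""
    top = set(ranked_edges[:x])
    tp = len(top & gold_edges)
    if tp == 0:
        return 0.0
    precision = tp / len(top)
    recall = tp / len(gold_edges)
    return 2 * precision * recall / (precision + recall)
```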

Predictable TFs
We used predictable TFs to obtain a granular measure of network accuracy at the level of individual TFs' target sets. Specifically, for each TF in the inferred network, we used the hypergeometric test to assess whether the overlap between the predicted targets and the targets in the gold standard was significantly higher than expected at random. The P-values were corrected for multiple hypothesis testing using FDR. We considered an FDR-corrected P-value <0.05 as a significant overlap. Each algorithm was ranked based on its number of predictable TFs.
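A sketch of the per-TF test (hypergeometric enrichment with Benjamini-Hochberg FDR correction; a simplified illustration of the procedure):

```python
from scipy.stats import hypergeom
from statsmodels.stats.multitest import multipletests

def predictable_tfs(pred_targets, gold_targets, n_genes, alpha=0.05):
    """pred_targets, gold_targets: dicts mapping TF -> set of target genes;
    n_genes: size of the gene universe. Returns TFs with significant overlap."""
    tfs = [tf for tf in pred_targets if tf in gold_targets]
    pvals = []
    for tf in tfs:
        overlap = len(pred_targets[tf] & gold_targets[tf])
        # P(X >= overlap) where X ~ Hypergeometric(population=n_genes,
        # successes=|gold targets|, draws=|predicted targets|).
        pvals.append(hypergeom.sf(overlap - 1, n_genes,
                                  len(gold_targets[tf]), len(pred_targets[tf])))
    reject, _, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return [tf for tf, sig in zip(tfs, reject) if sig]
```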

Computing requirements of regulatory network inference algorithms for single-cell RNA-seq data
We started with 11 algorithms specifically designed to infer regulatory networks from scRNA-seq data, along with two methods for inference from bulk data (Fig. 1, Supplementary Table S1). In addition to these published algorithms, we used Pearson correlation to infer a network of gene pairs with edges weighted by the experiment-wide correlation of expression. We first benchmarked these algorithms for their computing requirements, namely memory consumption and runtime, using gene sets ranging from 10 to 8,000 genes (Methods, Fig. 2). Of the compared methods, SCHiRM and BTR did not complete in a reasonable amount of time and were excluded from further analysis. Scribe and HurdleNormal could be run for up to 2,000 genes but took much longer for larger gene sets. Most algorithms that could be run for up to 8,000 genes consumed up to 15 GB of RAM. Exceptions were SCHiRM, Inferelator, and PIDC, which required more memory. SCHiRM's memory consumption grew exponentially and was considered infeasible for larger gene sets. For the subsequent analysis, we excluded SCHiRM, BTR, and HurdleNormal.

Assessing scRNA-seq regulatory network inference algorithms based on global metrics
We next compared the performance of algorithms against gold standard networks from three species: yeast, mouse, and human (Fig. 1). These datasets included two yeast stress response datasets, four mouse datasets (one for dendritic cells and three for cellular reprogramming to pluripotent stem cells on different growth media), and one human dataset for ESC differentiation (Supplementary Table S2). The datasets had a wide range of cell numbers, ranging from 163 cells in Gasch et al. (2017) to 36,199 cells in one of the mouse reprogramming studies (Zhao et al. 2018). For each dataset, we had three types of gold standards: those derived from knock down or knock out of the regulator followed by global mRNA profiling (Perturb, Fig. 3), those derived from ChIP-chip or ChIP-seq experiments (ChIP, Fig. 3), and those derived from the intersection of these two gold standards (Perturb+ChIP, Fig. 3). We assessed the performance of the algorithms using standard AUPR as well as F-score (Fig. 3, Supplementary Fig. S1), applying each algorithm to each of the datasets (Methods). While AUPR considers all edges identified by a network inference algorithm, F-score is computed using the top edges ranked by confidence. To select the number of edges for F-score computation, we considered different numbers of edges and computed the stability of the metric as a function of the number of edges (Supplementary Fig. S2). We selected 5,000 edges for F-score computation as this resulted in the most stable results across algorithms.
We first examined the performance of each algorithm on individual datasets from the different species and ranked each method based on its F-score and AUPR (Fig. 3a, Supplementary Fig. S1). The ranking of methods was consistent based on both F-score and AUPR. There was more variation in ranking due to different gold standards (e.g. ChIP versus Perturb) than across datasets. Across datasets, no single method consistently had the highest scores; however, the ranking of methods across datasets remained largely consistent, that is, methods that ranked high in one dataset ranked high in another dataset. A few exceptions were SILGGM and knnDREMI, which had relatively lower performance on most datasets with the exception of the Tran A2S/FBS Perturb comparisons, where they ranked highly. LEAP performed poorly on the Shalek dataset, but was generally among the middle ranking methods. Among the different methods, SCENIC, MERLIN, and Pearson correlation were most consistently high performing across datasets based on F-score and AUPR. The size of the dataset (number of cells) was not a strong determinant of prediction performance (Supplementary Text S1).
We next examined the performance of algorithms with respect to the different types of gold standards. Both F-score and AUPR were generally better for the Perturb gold standards compared to the ChIP gold standards (Fig. 3b, Supplementary Fig. S1). The exception was SCODE, which performed better for ChIP than for Perturb. This difference in performance could be attributed to the coverage of regulator-target relationships in each dataset (Supplementary Table S3) and to expression-based methods potentially capturing the effects of perturbation experiments better than ChIP experiments. Differences in performance due to different gold standards were also previously observed by Pratapa et al. (2020), who found poorer performance when the density of edges in a given gold standard was higher.
Fig. 3. Algorithm performance across different global metrics of network structure recovery: a) algorithm ranks in each of the 21 comparisons (7 datasets × 3 gold standard networks) based on AUPR and F-score (gray cells denote experiments missing due to algorithm scalability on large datasets); b) distribution of performance (AUPR or F-score) relative to a random baseline across datasets for each type of gold standard (ChIP, Perturb, and Perturb+ChIP); c) median rank of algorithms per gold standard type; d) distribution of AUPR and F-score fold improvement over random; e) median rank of algorithms across all 21 comparisons, ordered by the mean of the F-score and AUPR ranks.
We aggregated the metrics across all datasets for each gold standard based on their median rank (Fig. 3c). Based on median F-score on the Perturb gold standard, the top three methods were Pearson, with a median rank of 3, and PIDC and SCENIC, with a median rank of 4. When using ChIP, the top three methods were MERLIN, SCENIC, and PIDC. Finally, based on Perturb+ChIP, Pearson and PIDC were the best with a median rank of 3, followed by SCENIC and MERLIN with a median rank of 4. AUPR-based median rank varied more with the gold standard, which is reflected in the overall lower correlation between gold standard ranks when using AUPR compared to F-score (Supplementary Fig. S3). Using Perturb, the top three methods were SILGGM, Pearson, and LEAP; using ChIP, the top three methods were SCENIC, SCODE, and Pearson; and based on Perturb+ChIP, they were Pearson, SCENIC, and PIDC. Finally, we collated the metrics across gold standards and datasets (Fig. 3d) and ranked each method based on its overall median rank across both types of gold standards (Fig. 3e). Based on F-score, Pearson, MERLIN, and PIDC had the best median rank of 3, followed by SCENIC (rank of 4) and LEAP (rank of 5.5). Based on AUPR, Pearson, SCENIC, and MERLIN were the top three ranking methods.

Assessing scRNA-seq regulatory network inference algorithms based on local network metrics
The AUPR and F-score quantify the global agreement of a predicted regulatory network with gold standard networks by comparing one edge at a time. However, from a biological application point of view, it may be beneficial to examine whether some smaller components of the regulatory network are predicted better than others. Therefore, we next focused on a more granular view of the inferred networks by measuring the ability to predict the targets of individual TFs based on the fold enrichment of true targets of a TF in the predicted set, as described by Siahpirani and Roy (2017) (Methods). We quantified the overlap using an FDR-corrected hypergeometric test P-value and called a TF "predictable" if it had a significant P-value (FDR < 0.05). We used the number of significantly predictable TFs as another quantification of network performance (Fig. 4a, Supplementary Fig. S1). Unlike AUPR and F-score, where the performance of the algorithms was often close to random, the number of predictable TFs for the random baseline was no more than 1, and typically 0. Across datasets, the methods ranked consistently, with the exception of LEAP performing much worse on the Shalek and Tran (FBS) datasets, and SCODE performing much better on the Gasch and Han datasets based on the ChIP gold standard. Between the Perturb and ChIP gold standards, the rankings of the algorithms were more correlated than for AUPR or F-score (Pearson correlation of 0.55 for predictable TFs vs 0.41 for F-score and 0.07 for AUPR; Supplementary Fig. S3), but recapitulated the same tendency of methods to perform better on Perturb than on ChIP gold standards. We aggregated the metrics per gold standard type across datasets by computing the median rank (Fig. 4b). Based on the Perturb gold standard, the top three methods were Pearson, PIDC, and MERLIN, while based on ChIP, Pearson and MERLIN were best, followed by SCENIC and SCODE. Inspection of the number of predictable TFs showed substantial variation across datasets and gold standards (Fig. 4c). Finally, as in the AUPR and F-score comparison, we ranked the methods based on the predictable TFs from the three types of gold standards. Based on this ranking, the top performing methods were Pearson, SCENIC, MERLIN, and PIDC (Fig. 4d), which was consistent with the rankings based on AUPR and F-score.
We next examined for each dataset which specific TFs were predicted by the different methods (Fig. 5). This information can help determine if there are TFs that can be predicted by a specific category of methods or by all methods. For the Han dataset, which profiled transcriptional programs during lineage specification from the ESC state, we found several ESC-specific regulators such as POU5F1, SOX2, and NANOG (Boyer et al. 2005) among the predictable TFs of multiple methods when using the ChIP gold standard (Fig. 5a). We also found TFs like CDX2 (Strumpf et al. 2005) and TBX3 (Esmailpour and Huang 2012), which serve as major regulators for different lineages, predicted by a number of methods. When using the Perturb gold standard, we found similar regulators as with the ChIP gold standard, but additionally regulators such as OTX2 and GATA3 for the retinal (Yamamoto et al. 2020) and hematopoietic lineages (Ku et al. 2012), respectively. SCODE had a number of predictable TFs in the Han dataset; however, many of them are general regulators such as SP1, YY1, POLR2A, and ATF2.
For the mouse cellular reprogramming datasets (Tran (FBS), Tran (A2S), Zhao), we observed a similar behavior, with many of the methods identifying several developmental regulators such as Pou5f1, Esrrb, and Sox2 in both the ChIP and Perturb gold standards (Fig. 5b, Supplementary Figs. S4, S5). Importantly, in the dendritic cell dataset (Shalek), we found a consistent over-representation of a different set of regulators across methods. This included Rel, Nfkb, Stat1, and Stat3, which are associated with immune response and were identified by several methods including SCENIC, MERLIN, PIDC, Scribe, Inferelator, and Pearson (Supplementary Fig. S6). We observed a similar behavior for the yeast datasets (Fig. 5c, Supplementary Fig. S7), where the methods were uniformly able to recover key regulators associated with stress response such as HAP4 (oxidative stress) and GCN4 (amino acid starvation; Gasch et al. 2000). The overall fold enrichment for predictable TFs was higher for the yeast datasets compared to the mammalian datasets. Taken together, these results show that although the performance of methods based on global metrics such as F-score and AUPR is modest, methods are able to consistently recover relevant regulators for a particular system.
Finally, as large-scale gold standards may themselves be noisy, we curated a small gold standard network from regulatory interactions reported in the literature for the ESC state (Zhou et al. 2007;Kim et al. 2008;Young 2011;Buganim et al. 2012;Dunn et al. 2014;Xu et al. 2014;Malleshaiah et al. 2016) (Methods). This regulatory network has 35 regulators and 90 target genes (some of which are also regulators) connected by 267 edges (Supplementary File S2, gold_standard_datasets.zip). We used the top 5,000 edges from the networks inferred on the Tran (A2S) dataset and depicted true positives (found by a method) and false negatives (missed by a method) among the 199 of the 267 interactions that remained after removing edges due to missing expression (Methods). Most of the methods recovered the cross-regulatory interactions between Nanog, Sox2, and Pou5f1, which are master regulators for establishing the ESC state (Fig. 5d). Methods that recovered the fewest true positives included SCODE, knnDREMI, and PIDC. Interestingly, Scribe, which like PIDC is based on information theory, was able to infer many more true positives (14) than PIDC (9). Overall, performance based on the recovery of true positive edges was consistent with our global and predictable TF metrics, but also highlighted additional properties of the network inference methods in terms of recovery of curated regulatory interactions.

Similarity of inferred networks from different methods depends upon the dataset
We next asked to what extent the algorithms agree in their predictions by measuring the Jaccard similarity and F-score between the top ranked edges of each pair of inferred networks (Fig. 6). The magnitude of the similarity increased as more edges were considered, but the overall trend of pairwise similarity remained the same (Supplementary Figs. S8, S9). We focused on the Jaccard scores obtained on the top 5,000 edges for consistency with our other results; similar behavior was observed for F-score as well (Supplementary Fig. S10). To enable comparison of methods, we grouped the methods in two ways. First, we ordered the methods in the same way in each dataset based on the median similarity of inferred networks and visualized the pairwise similarity for each dataset (Fig. 6, Common ordering). This allowed us to examine the change in similarity for a pair of methods as a function of the dataset. Scribe's similarity to other methods depended upon the dataset; however, it was most similar to PIDC, which is also based on information theoretic metrics. LEAP was most similar to Correlation, but this also depended upon the dataset, e.g. high Jaccard scores in Gasch, Jackson, and Zhao and low scores in other datasets. Some datasets resulted in more similar networks than others. In particular, methods were most similar (high Jaccard coefficient) in the Han and Jackson datasets and least similar for the Shalek and Gasch datasets. These were also the two smallest datasets, which could explain the lower similarity. Methods that tended to have the highest similarity with other methods were Pearson, SCENIC, Inferelator, and PIDC.
We next ordered the methods based on the Jaccard similarity per dataset, producing dataset-specific orderings (Fig. 6, Per Dataset Ordering). SILGGM and knnDREMI generally learned different networks compared to other methods, including those based on the same class of models, e.g. information theory for knnDREMI (PIDC, Scribe) and probabilistic models for SILGGM (MERLIN, Inferelator). The methods that exhibited the highest similarity across datasets were LEAP, PIDC, and Pearson, with the exception of Tran (FBS), Tran (A2S), and Shalek, where LEAP learned more divergent networks, and Gasch, where PIDC differed from Pearson and LEAP. These were followed by SCENIC, MERLIN, and Inferelator, which also tended to group together in multiple datasets. In summary, these comparisons revealed that the similarities between methods depended upon the dataset; however, several of the methods, such as PIDC and Correlation, tended to infer similar networks across datasets. Furthermore, networks inferred by algorithms from the same family were not necessarily more similar than networks inferred by algorithms from different families.
Fig. 6. Pairwise network similarity between inferred networks from each method: Jaccard similarity between the top 5,000 edges in each pair of networks inferred on each dataset, with rows and columns ordered either by a hierarchical clustering of the median similarity across all datasets (common ordering) or separately for each dataset (per-dataset ordering).
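The pairwise similarity used here is simply the Jaccard coefficient of the two top-edge sets (sketch):

```python
def jaccard_top_edges(edges_a, edges_b, k=5000):
    """edges_a, edges_b: lists of (TF, target) edges sorted by descending confidence.
    Jaccard similarity of the top-k edge sets of two inferred networks."""
    a, b = set(edges_a[:k]), set(edges_b[:k])
    return len(a & b) / len(a | b)
```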

Imputation of scRNA-seq datasets does not improve network inference
Single-cell expression data are characterized by a large number of zero expression counts, which can arise for technical (low depth) or biological (cell type-specific expression) reasons. To address this problem, several methods have been developed to impute the values of missing expression counts (Hou et al. 2020). We applied the MAGIC algorithm (Dijk et al. 2018) to each dataset; MAGIC was one of the top imputation methods in a recent benchmarking study (Hou et al. 2020). MAGIC computes pairwise similarities between cells and creates a Markov transition graph based on these similarities. It then "diffuses" expression counts among similar cells based on this Markov transition graph.
We inferred networks using the imputed data from each experiment and compared their AUPR, F-score, and predictable TF metrics with those of the networks inferred from the original sparse datasets (Fig. 7). Based on F-score (Fig. 7a), most methods did not benefit from imputation across datasets. Exceptions were the Shalek and Gasch datasets, which, as noted before, are among the smaller datasets. We examined the ratio of F-score after and before imputation across different gold standards and datasets and found SCODE and kNN-DREMI to improve when using imputed data (Fig. 7b). However, imputation generally did not benefit the network inference procedure, with most algorithms exhibiting a ratio <1 across datasets and gold standards. We repeated these comparisons based on AUPR (Fig. 7c,d) and predictable TFs (Fig. 7e,f). AUPR did not change much with imputation; however, we noted a similar boost in performance for the Shalek dataset. Finally, based on predictable TFs, performance generally did not improve, with the exception of the Tran (A2S), Tran (FBS), and Shalek datasets, where algorithms such as SCODE and knnDREMI showed a benefit (Fig. 7e,f). Our results are consistent with previous work (Jackson et al. 2020), which showed reduced AUPR for network inference with imputation, further supporting the limited utility of imputation for gene network inference. Taken together, our experiments across different datasets and gold standards showed that imputation benefited a handful of algorithm and dataset combinations, but in the majority of cases performance either stayed the same or decreased.
Fig. 7. Effect of imputation on inferred networks: F-score (a, b), AUPR (c, d), and number of predictable TFs (e, f) before versus after imputation for each dataset and gold standard, with box plots summarizing the relative performance after versus before imputation across the 21 comparisons (7 datasets × 3 gold standard networks).

Comparison of network inference from single cell versus bulk expression datasets
Before the availability of scRNA-seq datasets, expression-based network inference was performed with bulk expression datasets, which require large sample sizes. Such datasets are available from a single study in yeast (Gasch et al. 2000) or can be created by collecting studies from the public Gene Expression Omnibus. Among the conditions (species and cell states) examined in this study, we had matching bulk expression datasets for yeast (stress conditions), mESC, and hESC. We therefore applied network inference to these bulk expression datasets using five state-of-the-art algorithms and compared the performance of the methods based on the same gold standards and the same metrics of F-score, AUPR, and predictable TFs (Fig. 8, Methods). For the scRNA-seq comparison, we used the inferred networks from the Gasch scRNA-seq dataset for yeast, the Tran (A2S) dataset for mESC, and the Han dataset for hESC.
Based on F-score, scRNA-seq-based networks exhibited a moderate improvement on the Han (hESC) dataset and the yeast dataset when using the Perturb+ChIP and Perturb gold standards, respectively. The results were largely consistent between AUPR and F-score. However, when comparing predictable TFs, we observed a marked decrease in the number of predictable TFs for the mESC and hESC scRNA-seq datasets when using the ChIP and Perturb+ChIP gold standards. Overall, this suggests that individual scRNA-seq datasets capture meaningful variation that offers comparable performance to bulk expression datasets collected from a large number of experiments.
Fig. 8. Comparison of network inference on bulk versus single-cell RNA-seq data: relative F-score and AUPR compared to random and the number of predictable TFs on comparable bulk and scRNA-seq datasets (yeast: bulk data from Gasch et al. 2000 versus scRNA-seq data from Gasch et al. 2017; human and mouse: bulk datasets curated from publicly available data).

Addition of priors and transcription factor activities can improve performance
Our experiments so far utilized gene expression alone for network inference. Bulk network inference methods that incorporate additional genomic data, such as sequence-specific motifs of TFs, have shown improved performance (Greenfield et al. 2013;Siahpirani and Roy 2017). Such motif-based TF-target relationships have been used as priors to constrain the graph structure (Greenfield et al. 2013;Siahpirani and Roy 2017) and to estimate TFA (Liao et al. 2003;Miraldi et al. 2019;Ma and Brent 2021), both of which have led to improvements in network recovery. We therefore examined whether incorporation of prior knowledge is beneficial for network inference from scRNA-seq datasets. We created graphs encoding prior knowledge of regulator-target relationships as described in Siahpirani and Roy (2017) (Methods). We considered MERLIN, Inferelator, and SCENIC, because only MERLIN and Inferelator incorporate priors, and SCENIC was among the top performing methods with efficient run times. We estimated the TFA of the regulators based on the structure of the prior network using NCA (Liao et al. 2003). Subsequently, the estimated TFA of these regulators, along with their gene-level expression, was used to find their target genes. For Inferelator, we estimated the performance using NCA-based TFA and its own TFA estimation, while MERLIN and SCENIC used the NCA-based TFA (referred to as NTFA). We add "+P" after an algorithm's name if it used prior information and "+TFA" or "+NTFA" if it used TFA. We applied these algorithmic configurations to the Gasch, Jackson, Tran (A2S), Tran (FBS), Shalek, and Han datasets and assessed performance using the ChIP, Perturb, and Perturb+ChIP gold standards. We did not include the Zhao dataset due to its computational burden. MERLIN and Inferelator were evaluated for their ability to incorporate prior knowledge and TFA, while SCENIC was evaluated for its ability to leverage TFA. We compared these algorithms to their versions that did not incorporate TFA or priors and to the baseline Correlation (Fig. 9, Supplementary Fig. S11).
Based on F-score, MERLIN with prior (MERLIN+P) outperformed MERLIN in 8 out of 18 dataset and gold standard combinations (Fig. 9a, Supplementary Fig. S11), and incorporation of TFA (MERLIN+P+NTFA) also outperformed MERLIN alone in 8 out of 18 combinations. Although MERLIN+P and MERLIN+P+NTFA outperformed MERLIN in fewer than 50% of the comparisons, they had better fold enrichments (Fig. 9b) and overall better ranks (Fig. 9c). SCENIC with TFA (SCENIC+NTFA) outperformed SCENIC alone in 9 out of 18 dataset and gold standard combinations. The biggest gain from using priors was observed with Inferelator, with Inferelator+P outperforming Inferelator in 11 out of 18 dataset and gold standard combinations. Incorporation of TFA in Inferelator+P+TFA led to higher performance in 7 out of 18 cases. The addition of NCA-based TFA together with the prior (Inferelator+P+NTFA) was most beneficial, outperforming Inferelator alone in 15 out of 18 combinations. Finally, when examining F-score ranks, methods that used priors had a higher overall rank than Pearson correlation (Fig. 9b,c). Based on AUPR, we found a greater benefit from using priors or TFA. In particular, MERLIN+P and MERLIN+P+NTFA outperformed MERLIN in 17 and 15 out of 18 dataset and gold standard comparisons, respectively. SCENIC+NTFA outperformed SCENIC in 11 out of 18 comparisons. Inferelator+P, Inferelator+P+TFA, and Inferelator+P+NTFA consistently outperformed Inferelator in the majority of the comparisons. Nearly all methods that used priors or TFA outranked Pearson correlation (Fig. 9c).
Finally, based on the number of predictable TFs, MERLIN+P and MERLIN+P+NTFA outperformed MERLIN in 12 and 13 out of 18 cases, respectively. SCENIC+NTFA outperformed SCENIC in 11 out of 18 cases, and Inferelator+P, Inferelator+P+TFA, and Inferelator+P+NTFA outperformed Inferelator in 12, 6, and 11 out of 18 cases, respectively. As with AUPR and F-score, Pearson correlation was outperformed in overall rank by methods that used priors (MERLIN+P, MERLIN+P+NTFA) or TFA (SCENIC+NTFA, MERLIN+P+NTFA). Taken together, these results show that inclusion of prior information to constrain the graph structure or to estimate TFA improves the quality of inferred networks for scRNA-seq datasets as well.
Fig. 9. Assessing the utility of incorporating prior knowledge and TFA in network inference from scRNA-seq data: a) ranks of MERLIN, Inferelator, and SCENIC and their prior (+P) and TFA (+TFA, +NTFA) extensions based on F-score, AUPR, and predictable TFs for six datasets and three gold standards, with Pearson correlation and a random network included for comparison; b) distribution of fold enrichment of F-score or AUPR over random, or of the number of predictable TFs, across datasets; c) aggregated ranks of the methods across datasets and gold standards.

Discussion
The rapid growth of scRNA-seq datasets has opened up remarkable opportunities in the field of expression-based gene regulatory network inference. Accordingly, substantial effort has been invested to develop and apply regulatory network inference algorithms to such datasets. Here we benchmarked the computing requirements and overall performance of methods in network structure recovery across a large number of datasets spanning different species and model cell lines. Our work expands on previous efforts by considering larger scRNA-seq datasets across multiple model systems, additional classes of methods, different types of metrics, and gold standards that can provide biological insight into the predictions made by the different methods.
We evaluated methods based on their computing requirements as well as performance on different gold standard networks. Due to excessive computing requirements, we could not rank several methods that were specifically geared towards handling the statistical properties of scRNA-seq datasets, such as SCHiRM, HurdleNormal, and BTR. As scRNA-seq datasets grow, efficient implementation of algorithms will be important.
Our evaluation metrics included both global metrics, such as AUPR and F-score, and local metrics, such as the number of regulators whose targets could be predicted significantly. Based on AUPR and F-score, the overall performance of methods remains only modestly better than random; however, we find predictable TFs to be a more sensitive metric that highlights the strengths of network inference
Importantly, different methods were able to recapitulate relevant regulators for the system of interest, for example, key stem cell regulators in the developmental datasets and immune response regulators in the dendritic cell dataset. No single method was a winner across all datasets and gold standards; however, based on overall performance and computing requirements, PIDC, SCENIC, MERLIN, and Pearson correlation were the top methods. Beyond using expression alone, we assessed the benefit of using prior networks both for constraining the network structure (Greenfield et al. 2013; Siahpirani and Roy 2017) and for incorporating TFA. We found that incorporating sequence motifs as priors to constrain the network structure, or using them to estimate TFA during network inference, improves the recovery of networks from scRNA-seq data. Notably, methods that used priors or TFA had better overall performance. Our results suggest that future efforts for transcriptional network inference from scRNA-seq datasets should leverage available prior information obtained from complementary data sources such as sequence-specific motifs. The availability of relevant single-cell or bulk accessibility datasets could greatly benefit the generation of high-quality network priors and, ultimately, the gene regulatory networks that integrate these priors with expression.
One challenge with scRNA-seq datasets is the high proportion of zeros, which can arise for both biological and technical reasons. Because imputation has been proposed as an approach to handle sparse datasets such as scRNA-seq, we studied the impact of imputation on the quality of inferred networks. Imputation did not improve the performance of most methods; however, the datasets where it did provide a benefit tended to have relatively few cells. One caveat of our analysis is that we considered only one imputation method, MAGIC, which has been shown to be among the top-performing imputation methods. A direction of future work is to consider additional imputation methods to examine the impact of imputation on network inference.
Related to the issue of depth, we briefly explored the effect of differences in cell counts across datasets on network inference. We found that a method's performance was not necessarily similar on two distinct datasets simply because they had the same number of cells. In particular, downsampling Jackson (17,396 cells) to match the Gasch scRNA-seq dataset (163 cells), or Zhao (36,199 cells) to match Tran (FBS) (3,324 cells), did not have the same effect: performance stayed largely unchanged for Zhao but decreased for Jackson. One contributing factor could be differences in the sequencing depths of the datasets. We therefore compared the same-sized datasets based on (a) the number of distinct genes detected per cell and (b) the total number of transcripts across all genes per cell. We found that the number of distinct genes detected per cell was a good indicator of how the methods performed on a dataset (see Supplementary Text S1). Both SCENIC and MERLIN performed consistently better with Gasch than with the Jackson subsamples, and the number of distinct genes detected per cell is indeed higher in the Gasch dataset. For Tran (FBS) versus the Zhao subsamples, the numbers of distinct genes detected per cell are nearly the same, consistent with the observation that no method consistently performed better on either of these datasets.
On the other hand, the total number of transcripts across all genes per cell was not a good indicator of performance: Tran (FBS) has more transcripts per cell than Zhao, yet the methods did not consistently perform better on the Tran (FBS) dataset. These results suggest that depth, as measured by the number of distinct genes detected per cell, is a better indicator of the performance of scRNA-seq network inference methods than the number of cells or the total number of transcripts per cell. Our observations also showed that SCENIC was less susceptible to changes in dataset size than MERLIN, which would indicate that the non-linear, non-parametric random forest regression model is favorable for modeling scRNA-seq profiles. One approach to handling low-depth data might be to use the pseudobulk expression of small clusters of cells, e.g. "metacells" (Baran et al. 2019), which has been shown to be advantageous for a number of analytical tasks on scRNA-seq datasets (Bilous et al. 2022). A direction of future work would be to examine the effect of metacells on the scalability and accuracy of network inference algorithms.
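For concreteness, the sketch below computes the two per-cell depth summaries discussed above and a crude metacell-style pseudobulk aggregation from a cells x genes count matrix. It is a minimal illustration assuming k-means on log-normalized profiles; dedicated metacell tools use graph-based partitioning rather than k-means, and all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def depth_metrics(counts):
    """Per-cell depth summaries from a cells x genes count matrix:
    (a) distinct genes detected per cell and (b) total transcripts per cell."""
    genes_per_cell = (counts > 0).sum(axis=1)
    transcripts_per_cell = counts.sum(axis=1)
    return genes_per_cell, transcripts_per_cell

def metacell_pseudobulk(counts, n_metacells=50, random_state=0):
    """Aggregate cells into metacell-style pseudobulk profiles by clustering
    log-normalized profiles with k-means and summing raw counts per cluster."""
    totals = counts.sum(axis=1, keepdims=True)
    lognorm = np.log1p(counts / np.maximum(totals, 1) * 1e4)  # guard empty cells
    labels = KMeans(n_clusters=n_metacells, n_init=10,
                    random_state=random_state).fit_predict(lognorm)
    return np.vstack([counts[labels == k].sum(axis=0) for k in range(n_metacells)])
```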
Single-cell transcriptomic datasets have the advantage that a single experiment can produce a number of samples comparable to or larger than the bulk datasets that have typically been used for network inference. We therefore compared the quality of networks inferred from bulk and scRNA-seq datasets for yeast, mouse, and human using methods designed for both bulk and single-cell data. We find that scRNA-seq datasets, despite their sparsity, capture sufficient meaningful biological variation and perform on par with bulk RNA-seq datasets. Our comparison is imperfect, however, because the bulk and single-cell datasets were collected from different sources. Generating controlled datasets that capture bulk and single-cell profiles for the same system could provide additional insight into the relative advantage of scRNA-seq datasets for inferring gene regulatory networks.
The gold standards and datasets that we have collected should be beneficial for future studies. Another direction of future work is to leverage the inherent heterogeneity and population structure of scRNA-seq datasets; methods based on multi-task learning are a promising framework for modeling population and network heterogeneity (Pierson et al. 2015; Castro et al. 2019). The relatively good performance of simple Pearson correlation when using expression alone was surprising. This could be an artifact of our gold standards, which are admittedly imperfect. Correlation could remain a useful baseline for network inference method development; however, it does not permit the incorporation of prior knowledge, which we found to be crucial for obtaining the best performance in network inference. Generation of improved gold standards, especially for mammalian systems, based on newer high-throughput perturbation techniques such as Perturb-seq (Dixit et al. 2016) could significantly improve our ability to infer genome-scale gene regulatory networks.

Data availability
Datasets used in our experiments and analyses have been deposited with DOI 10.5281/zenodo.5909090, along with the following groups of files, which were too large to include in the manuscript as supplementary materials:
• Gold standard networks used as ground truth to measure the accuracy of the inferred networks.
• Networks generated by the network inference methods.
All scripts for executing and evaluating the algorithms are available at https://github.com/Roy-lab/scRNAseq_NetInference.