Single-cell gene regulatory network inference at scale: The Inferelator 3.0

Gene regulatory networks define regulatory relationships between transcription factors and target genes within a biological system, and reconstructing them is essential for understanding cellular growth and function. In this work, we present the Inferelator 3.0, which has been significantly updated to integrate data from distinct cell types to learn context-specific regulatory networks and aggregate them into a shared regulatory network, while retaining the functionality of the previous versions. The Inferelator 3.0 reliably learns informative networks from the model organisms Bacillus subtilis and Saccharomyces cerevisiae. We demonstrate its capabilities by learning networks for multiple distinct neuronal and glial cell types in the developing Mus musculus brain at E18 from a large (1.3 million cells) single-cell gene expression data set with paired single-cell chromatin accessibility data.

1 Background

Gene expression is tightly regulated at multiple levels in order to control growth, development, and response to environmental conditions (Figure 1A). Transcriptional regulation is principally controlled by Transcription Factors (TFs) that bind to DNA and effect chromatin remodeling [1] or directly modulate the output of RNA polymerases [2]. Three percent of Saccharomyces cerevisiae genes are TFs [3], and more than six percent of human genes are believed to be TFs or cofactors [4]. Connections between TFs and genes combine to form a transcriptional Gene Regulatory Network (GRN) that can be represented as a directed graph (Figure 1B). Learning the true regulatory network that connects regulatory TFs to target genes is a key problem in biology [5,6]. Determining the valid GRN is necessary to explain how mutations that cause gene dysregulation lead to complex disease states [7], how variation at the genetic level leads to selectable phenotypic variation [8,9], and how to re-engineer organisms to efficiently produce industrial chemicals and enzymes [10].

Learning genome-scale networks relies on genome-wide expression measurements, initially captured with microarray technology [11] but today typically measured by RNA-sequencing (RNA-seq) [12,13]. A major difficulty is that biological systems have large numbers of both regulators and targets; network identifiability is poor because many plausible networks can explain the observed expression data and the regulation of gene expression in an organism [14]. Designing experiments to produce data that increases network identifiability is possible [15], but most data is collected for specific projects and repurposed for network inference because of the cost of data collection. Large-scale experiments in which a perturbation is made and dynamic data is collected over time are exceptionally useful for learning GRNs, but systematic studies that collect this data are rare [16].
Measuring the expression of single cells using single-cell RNA-sequencing (scRNA-seq) is an emerging and highly scalable technology. Microfluidic-based single-cell techniques [17,18,19] allow for thousands of measurements in a single experiment. Split-pool barcoding techniques [20] are poised to increase single-cell throughput by an order of magnitude. These techniques have been successfully applied to generate multiplexed gene expression data from pools of barcoded cell lines with loss-of-function TF mutants [21,22], enhancer perturbations [23], and disease-causing oncogene variants [24]. Individual cell measurements are sparser and noisier than measurements generated using traditional RNA-seq, although in aggregate the gene expression profiles of single-cell data match RNA-seq data well [25], and techniques to denoise single-cell data have been developed [26,27].
The seurat [28] and scanpy [29] bioinformatics toolkits are established tools for single-cell data analysis, but pipelines for inferring GRNs from single-cell data are still nascent. The SCENIC [30] GRN inference pipeline is based around the GENIE3 method, which uses ensemble regression trees [31] to estimate the importance of TFs in explaining gene expression profiles. CellOracle [32] has recently been proposed as a pipeline that integrates single-cell ATAC and expression data using a motif-based search for potential regulators, followed by Bayesian ridge regression to enforce sparsity in the output GRN. SCODE [33] infers GRNs by solving linear ordinary differential equations using time-course single-cell data. GRN inference is computationally challenging, and the most scalable of these GRN pipelines has learned GRNs from 50,000 cells of gene expression data [30].
Here we describe the Inferelator 3.0 pipeline for single-cell GRN inference, based on regularized linear regression [34]. The Inferelator 2.0 [35] has performed well inferring networks from Bacillus subtilis [36], human Th17 cells [37,38], mouse lymphocytes [39], Saccharomyces cerevisiae [40], and Oryza sativa [41]. We have reimplemented the Inferelator 3.0 with new functionality in python to learn GRNs from single-cell gene expression data. Specifically, this new package provides scalability, allowing millions of cells to be analyzed together, as well as integrated support for multi-task GRN inference, while retaining the ability to utilize bulk gene expression data. As a demonstration of these capabilities, we learn GRNs from several model organisms, and generate a mouse neuronal GRN from a publicly available data set containing 1.3 million cells.

The Inferelator 3.0
In the 11 years since the last major release of the Inferelator [35], the scale of data collection in biology has accelerated enormously. We have therefore rewritten the Inferelator as a python package to take advantage of the concurrent advances in data processing. For inference from small-scale gene expression data sets (<10⁴ observations), the Inferelator 3.0 uses native python multiprocessing to run on individual computers. For inference from extremely large-scale gene expression data sets (10⁴-10⁷ observations) that are increasingly available from scRNA-seq experiments, the Inferelator 3.0 takes advantage of the Dask analytic engine [42] for deployment to high-performance clusters (Figure 1C), or for deployment as a kubernetes image to the Google cloud computing infrastructure.

Network Inference using Bulk RNA-Seq Expression Data
We have incorporated several network inference model selection methods into the Inferelator (Figure 2A). In order to evaluate the network inference performance of these methods, we test on the prokaryotic model Bacillus subtilis and the eukaryotic model Saccharomyces cerevisiae. Both B. subtilis [36,43] and S. cerevisiae [40,16] have large bulk RNA-seq and microarray gene expression data sets, in addition to a relatively large number of experimentally determined TF-target gene interactions that can be used as a gold standard for assessing network inference.
Using two independent data sets for each organism, we find that the model selection methods Bayesian Best Subset Regression (BBSR) [44] and Stability Approach to Regularization Selection for Least Absolute Shrinkage and Selection Operator (StARS-LASSO) [38] perform equivalently (Figure 2B). The data sets separate on the first principal component (Supplemental Figure 1A), indicating that there are substantial batch-specific effects between these independent data sets. These data set-specific batch effects make combining this data for network inference difficult; conceptually, each data set is in a separate space and must be mapped into a shared space if they are to be combined. We take a different approach to addressing the batch effects between data sets by treating them as separate learning tasks [45] and then combining network information into a unified GRN. This results in a considerable improvement in network inference performance over either data set individually (Figure 2C). The best performance is obtained with Adaptive Multiple Sparse Regression (AMuSR) [45], a multi-task learning method that shares information between tasks during regression. The GRN learned with AMuSR explains the variance in the expression data better than learning networks from each data set individually with BBSR or StARS-LASSO and then combining them (Supplemental Figure 1B). There is a high overlap in the number of GRN edges learned from each data set, showing that there is a common network across different tasks (Supplemental Figure 1C). Where a large set of experimentally determined interactions is not available, using chromatin accessibility data in combination with the known DNA-binding preferences for TFs to identify putative target genes is a viable alternative [38].

Generating Prior Networks from Chromatin Data and Transcription Factor Motifs
To generate these prior networks, we have developed the inferelator-prior accessory package that uses TF motif position-weight matrices to score TF binding within gene regulatory regions and build sparse prior networks (Figure 3A). These gene regulatory regions can be identified by ATAC-seq, by existing knowledge from TF Chromatin Immunoprecipitation (ChIP) experiments, or from known databases (e.g. ENCODE [46]). Here, we compare the inferelator-prior tool to the CellOracle package [32], which also constructs motif-based networks that can be constrained to regulatory regions, in Saccharomyces cerevisiae, using sequences 200bp upstream and 50bp downstream of each gene TSS as the gene regulatory region. The inferelator-prior and CellOracle methods produce networks that are similar when measured by Jaccard index but are dissimilar to the YEASTRACT literature-derived network (Figure 3B). These motif-derived prior networks from both the inferelator-prior and CellOracle methods perform well as prior knowledge for GRN inference using the Inferelator pipeline (Figure 3C). The source of the motif library has a significant effect on network output, as can be seen with the well-characterized TF GAL4. GAL4 has a canonical CGG(N11)CGG binding site; different motif libraries have different annotated binding sites (Supplemental Figure 2A) and yield different motif-derived networks with the inferelator-prior pipeline (Supplemental Figure 2B-C).

Network Inference using Single-Cell Expression Data
Single-cell data is undersampled and noisy, but large numbers of observations are collected in parallel, and count data derived from unique molecular identifiers have some intrinsic advantages. In order to quantitatively evaluate network inference performance, we apply the Inferelator to Saccharomyces cerevisiae single-cell expression data [22,47], and score the model performance based on a previously-defined yeast gold standard [40]. This data is split into 15 separate tasks, based on labels that correspond to experimental conditions from the original works (Figure 4A). A network is learned for each task separately using the YEASTRACT literature-derived prior, and aggregated into a final network for scoring on held-out genes from the gold standard. We test a combination of several preprocessing options with three network inference model selection methods (Figure 4B-D).
We find that network inference is generally sensitive to the preprocessing options chosen, and that these differences due to preprocessing generally outweigh the differences between model selection methods (Figure 4B-D). A standard Freeman-Tukey or log2 pseudocount transformation on raw count data yields the best performance, with notable decreases in recovery of the gold standard when count data is depth-normalized (such that each cell has the same total transcript counts). The performance of the randomly generated Noise control (N) is higher than the performance of the shuffled (S) control when counts per cell are not normalized, suggesting that total counts per cell provide additional information during inference.
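Both count transforms are simple to state; as a brief sketch (NumPy-based illustration, not the package's internal code), the Freeman-Tukey transform is sqrt(x) + sqrt(x + 1) and the log2 pseudocount transform is log2(x + 1):

```python
import numpy as np

def freeman_tukey(counts):
    """Freeman-Tukey variance-stabilizing transform: sqrt(x) + sqrt(x + 1)."""
    counts = np.asarray(counts, dtype=float)
    return np.sqrt(counts) + np.sqrt(counts + 1.0)

def log2_pseudocount(counts):
    """log2(x + 1) transform on raw counts."""
    return np.log2(np.asarray(counts, dtype=float) + 1.0)

raw = np.array([0, 1, 4, 99])
print(freeman_tukey(raw))     # [1.  2.4142...  4.2360...  19.9499...]
print(log2_pseudocount(raw))  # [0.  1.  2.3219...  6.6438...]
```

Both operate directly on raw counts, which is consistent with the observation above that depth normalization before transformation degrades performance.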
Different model performance metrics, like AUPR, Matthews Correlation Coefficient (MCC), and F1 score, correlate very well and identify the same optimal hyperparameters (Supplemental Figure 3). We apply StARS-LASSO to data that has been Freeman-Tukey transformed to generate a final network without holding out genes for cross-validation (Figure 4E). While we use AUPR as a metric for evaluating model performance, selecting a threshold for including edges in a GRN by precision or recall requires a target precision or recall to be chosen arbitrarily.
Alternatively, MCC and F1 score allow a threshold to be determined that maximizes similarity to the known prior or gold standard GRN. Choosing the Inferelator confidence score threshold that maximizes MCC is a simple heuristic for selecting the size of a learned network: it maximizes the overlap with the gold standard while minimizing links not in the gold standard (Figure 4F and section 5.6). Using the maximum F1 score is also an option, but gives a less conservative GRN, as true negatives are not considered and will not diminish the score. Both metrics balance similarity to the gold standard with overall network size, and therefore represent straightforward heuristics that do not rely on arbitrary thresholds.
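A minimal sketch of this MCC-maximizing threshold selection, assuming edges are already ranked by descending confidence score (the function and variable names here are illustrative, not the package API):

```python
from math import sqrt

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def mcc_maximizing_cutoff(ranked_edges, gold_edges, n_possible_edges):
    """Walk down a confidence-ranked edge list and return the
    (best_mcc, confidence_cutoff) pair that maximizes MCC against the
    gold standard. ranked_edges: [(edge, confidence), ...] descending."""
    best_mcc_seen, best_cutoff = 0.0, None
    tp = fp = 0
    for edge, conf in ranked_edges:
        if edge in gold_edges:
            tp += 1
        else:
            fp += 1
        fn = len(gold_edges) - tp
        tn = n_possible_edges - tp - fp - fn
        m = mcc(tp, fp, tn, fn)
        if m > best_mcc_seen:
            best_mcc_seen, best_cutoff = m, conf
    return best_mcc_seen, best_cutoff

# Toy example: 3 gold-standard edges in a universe of 10 possible edges.
gold = {("A", "g1"), ("A", "g2"), ("B", "g3")}
ranked = [(("A", "g1"), 0.9), (("A", "g2"), 0.8),
          (("C", "g4"), 0.7), (("B", "g3"), 0.6)]
print(mcc_maximizing_cutoff(ranked, gold, n_possible_edges=10))
```

Note that MCC requires a count of true negatives, so the total number of possible TF-gene edges (TFs × genes) must be supplied; F1, by contrast, ignores true negatives entirely, which is why it yields less conservative networks.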

Large-scale Single-Cell Mouse Neuron Network Inference
To show scalability, we apply the Inferelator to a large-scale (1.3 million cells of scRNA-seq expression data) publicly available data set of mouse brain cells (10x Genomics), which is accompanied by 15,000 single-cell ATAC (scATAC) measurements.
By using Dask to parallelize network inference, we are able to distribute work across multiple computational nodes, allowing networks to be rapidly learned from >10⁵ cells (Figure 4A). We separate the expression and scATAC data into broad categories: Excitatory neurons, Interneurons, Glial cells, and Vascular cells (Figure 5A-E). scRNA-seq data is further clustered within broad categories into clusters (Figure 5B) that are assigned to specific cell types based on marker expression (Figure 5C). scATAC data is aggregated into chromatin accessibility profiles for Excitatory neurons, Interneurons, and Glial cells (Figure 5D) based on accessibility profiles (Figure 5E), which are then used with the TRANSFAC mouse motif position-weight matrices to construct prior knowledge networks with the inferelator-prior pipeline. After initial quality control, filtering, and cell type assignment, 766,402 scRNA-seq and 7,751 scATAC observations remain (Figure 5F, Supplemental Figure 4B-D). Most scRNA-seq cell type clusters have thousands of cells; however, a few clusters of rare cell types have as few as 42 (Figure 5G).

After processing scRNA-seq data into 36 cell type clusters and scATAC data into 3 broad (Excitatory neuron, Interneuron, and Glial) priors, we used the Inferelator to learn an aggregate mouse brain GRN. Each of the 36 clusters was assigned the most appropriate of the three prior networks and learned as a separate task using the AMuSR model selection framework. The resulting aggregate network contains 20,991 TF-gene regulatory edges, selected from the highest confidence predictions to maximize MCC (Figure 6A-B). 1,909 of these network edges are present in every task-specific network, implying that they are a common regulatory core (Figure 6C). Task-specific networks from similar cell types tend to be highly similar, as measured by Jaccard index (Figure 6D). We learn very similar GRNs from each excitatory neuron task, and very similar GRNs from each interneuron task, although each of these broad categories yields different regulatory networks. There are also notable examples where glial and vascular tasks produce GRNs that are distinctively different from other glial and vascular GRNs. Finally, we can examine specific TFs and compare networks between cell type categories (Supplemental Figure 5). The TFs Egr1 and Atf4 are expressed in all cell types, and Egr1 is known to have an active role at embryonic day 18 (E18) [48]. In our learned network, Egr1 targets 103 genes, of which 20 are other TFs (Figure 6E-G). Roughly half of these targets (49) are common to both neurons and glial cells, while 38 target genes are specific to neuronal GRNs and 16 target genes are specific to glial GRNs. We identify 14 targets for Atf4 (Figure 6H), the majority of which (8) are common to both neurons and glial cells, with only 1 target gene specific to neuronal GRNs and 5 targets specific to glial GRNs.

Discussion
We have developed the Inferelator v3.0 software package to address several key needs in single-cell gene regulatory network inference that have remained difficult to meet with existing solutions. First, this package is well-documented and straightforward to install and run on an individual computer, in the cloud, or on a large cluster. The Inferelator workflow can be scaled to match the size of the network inference problem and has no organism-specific requirements that preclude easy application to non-standard organisms. Second, different model selection methods can be compared with identical pre- and post-processing methods, including arbitrary methods implemented through the common scikit-learn estimator framework.
Model baselines can be easily established by setting flags to shuffle labels or generate noised data sets, and cross-validation and scoring on holdout genes are built directly into the pipeline. We believe this is particularly important, as many of the performance differences between gene regulatory network inference methods are not due to clever methods for model selection, but are instead the result of differences in data cleaning and preprocessing. Third, we have suggested a principled method for selecting regulatory edges to retain in a GRN. Many GRNs have been inferred by applying a collection of arbitrary heuristics to potential regulatory edges; here we propose ranking regulatory edges by the amount of target gene variance that they explain, and then selecting a threshold for inclusion that maximizes the MCC when scored against a known prior or gold standard network. Finally, we have evaluated network inference performance on several model organisms that have been well studied, and for which a reasonable gold standard ground truth GRN can be created from the experimental literature. Complex eukaryotes (e.g. mice or humans) lack a gold standard ground truth that can be used to determine real-world network inference performance. Many GRN inference methods instead benchmark on simulated or toy data, with limited experimental validation of a carefully selected tiny subset of an inferred real-world GRN, making method comparisons difficult.
Multi-task modeling is also a key advantage for single-cell GRN inference. Unlike traditional RNA-seq, which effectively measures the average gene expression of large numbers of cells, scRNA-seq can yield individual measurements for many different cell types that are implementing distinct regulatory programs. Learning GRNs from each of these cell types as a separate learning task in a multi-task framework allows cell type differences to be retained, while still taking advantage of the common regulatory programs. We demonstrate the use of this multi-task approach to simultaneously learn regulatory GRNs for a variety of mouse neuronal cell types from a very large (10⁶ cells) single-cell data set. This includes learning GRNs for rare cell types; by sharing information between cell types during regression, we are able to learn a core regulatory network while also retaining cell type-specific interactions. As an example, the TFs Egr1 and Atf4 are broadly expressed and have multiple functions, from memory formation and postsynaptic development in neurons to cell migration and genome methylation in many cell types [48,49]. We find a number of target genes regulated by Egr1 across all neuronal and glial cell types, like the RNA-binding protein Nova2 that regulates alternative splicing and axonal development [50] (Figure 6F). The stress response TF Atf4 [51] is known to regulate neuronal GABA-B receptor trafficking [52], and we identify it as a regulator of Rnf166, a RING-finger protein that promotes apoptotic cell death in neurons [53]. We also determine that Atf4 regulates the highly conserved cytosolic sulfotransferase Sult4a1, which modulates neuronal branching complexity and dendritic spine formation, and has been linked to neurodevelopmental disorders [54]. As the GRNs that have been learned for each cell type are sparse and consist of the highest-confidence regulatory edges, they are very amenable to exploration and experimental validation.
A number of limitations remain that impact our ability to accurately predict gene expression and cell states. Most important is a disconnect between the linear modeling that we use to learn GRNs and the non-linear biophysical models that incorporate both transcription and RNA decay. Modeling strategies that more accurately reflect the underlying biology will improve GRN inference directly, and will also allow prediction of useful latent parameters (e.g. RNA half-life) that are experimentally difficult to access. It is also difficult to determine whether regulators are activating or repressing specific genes [32], complicated further by biological complexity that allows TFs to switch between activation and repression [55]. Improving prediction of the directionality of network edges, and of whether that directionality is stable in different contexts, would also be a major advance. Many TFs bind cooperatively as protein complexes, or antagonistically via competitive binding, and explicit modeling of these TF-TF interactions would also improve GRN inference and make novel biological predictions. Finally, we note that core regulatory networks are likely to be conserved between related species, and further work to develop multi-species inference techniques can leverage evolutionary relationships to improve GRN inference [56]. The modular Inferelator 3.0 framework will allow us to further explore these open problems in regulatory network inference without having to repeatedly reinvent and reimplement existing work.

Conclusion
The Inferelator 3.0 is a state-of-the-art, easily deployable, and highly scalable network inference tool that is generally applicable to learning GRNs from both single-cell and traditional RNA-seq experiments in any organism of interest. With its accessory software packages, genome-wide expression data of any type can be integrated with chromatin accessibility data to construct and explore cell type-specific GRNs. We have established the reliability of this tool by benchmarking on real-world data in the model organisms Bacillus subtilis and Saccharomyces cerevisiae with known gold standard GRNs, and demonstrated how it can be applied to large-scale network inference on many different cell types in the developing mouse brain. We expect this to be a valuable tool to build biologically-relevant GRNs for experimental follow-up, as well as a baseline for further development of computational methods in the network inference field.

TF Motif-Based Connectivity Matrix (inferelator-prior)
A prior knowledge matrix consists of a signed or unsigned connectivity matrix between regulatory transcription factors (TFs) and target genes. This matrix can be obtained experimentally or by mining regulatory databases. Scanning genomic sequence near promoter regions for TF motifs allows for the construction of motif-derived priors, which can be further constrained experimentally by incorporating information about chromatin accessibility [38].
We have further refined the generation of prior knowledge matrices with the python inferelator-prior package, which takes as input a gene annotation GTF file, a genomic FASTA file, and a TF motif file, and generates an unsigned connectivity matrix. It has dependencies on the common scientific computing packages NumPy [57], SciPy [58], and scikit-learn [59]. In addition, it uses the BEDTools kit [60] and the associated python interface pybedtools [61]. The inferelator-prior package (v0.3.0 was used to generate the networks in this manuscript) is available on github (https://github.com/flatironinstitute/inferelator-prior) and can be installed through the python package manager pip.

Genomic regions of interest are identified by locating annotated Transcription Start Sites (TSS) and opening a window that is appropriate for the organism. For microbial species with a compact genome (e.g. yeast), regions of interest are defined as 1000bp upstream and 100bp downstream of the TSS. For complex eukaryotes with large intergenic regions (e.g. mammals), regions of interest are defined as 50000bp upstream and 2500bp downstream of the TSS. This is further constrained by intersecting the genomic regions of interest with a user-provided BED file, which can be derived from a chromatin accessibility experiment (ATAC-seq) or any other method of identifying chromatin of interest. Within these regions of interest, motif locations are identified using the Find Original Motif Occurrences (FIMO) [67] tool from the MEME suite [68], called in parallel on motif chunks to speed up processing. Each motif hit identified by FIMO is then scored for information content (IC) [69]. IC_i, ranging between 0 and 2 bits, is calculated for each base i in the binding site, where p_{b,i} is the probability of base b at position i of the motif and p_{b,bg} is the background probability of base b in the genome (Equation 1):

IC_i = Σ_{b ∈ {A,C,G,T}} p_{b,i} log2(p_{b,i} / p_{b,bg})    (1)

Effective information content (EIC) is the sum over all motif positions i of IC_i, penalized with the 2-norm of the difference between the IC of the hit at each position i and the IC of the consensus motif base at that position, IC_{i,consensus} (Equation 2):

EIC = Σ_i IC_i − ‖IC_hit − IC_consensus‖_2    (2)

where IC_hit and IC_consensus are the vectors of per-position IC values for the hit and consensus sequences.
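The per-position information content of Equation 1 can be illustrated with a short sketch (illustrative code, not the inferelator-prior implementation); under a uniform background, a fully conserved base yields the maximum of 2 bits and an uninformative position yields 0:

```python
from math import log2

def position_ic(p_col, p_bg):
    """Information content (bits) of one motif position:
    sum over bases b of p(b, i) * log2(p(b, i) / p(b, background))."""
    return sum(p * log2(p / p_bg[b]) for b, p in p_col.items() if p > 0)

uniform_bg = {b: 0.25 for b in "ACGT"}
print(position_ic({"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}, uniform_bg))  # 2.0
print(position_ic({b: 0.25 for b in "ACGT"}, uniform_bg))                 # 0.0
```

Summing `position_ic` over all positions of a motif gives the total IC that the EIC penalty is then applied to.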

Connectivity Matrix
A TF-gene binding score is calculated separately for each TF and gene. Each motif hit for a TF within the region of interest around the gene is identified. Overlapping

CellOracle Connectivity Matrix
CellOracle [32] was cloned from github (v0.6.5; https://github.com/morris-lab/CellOracle; a0da790). CellOracle was provided a BED file with promoter locations for each gene (200bp upstream of the transcription start site to 50bp downstream of the transcription start site) and the appropriate MEME file for each motif database.
Connectivity matrices were predicted using a false positive rate of 0.02 and a motif score threshold of 6. The inferelator-prior pipeline was run using the same promoter locations and MEME files so that the resulting networks are directly comparable, and the Jaccard index between each network and the YEASTRACT network was calculated. Each motif-based network was used as a prior for Inferelator network inference on Saccharomyces cerevisiae, with the same 2577 genome-wide expression microarray measurements [40]. 20% of the genes were held out of the prior networks and used for scoring the resulting network inference. The motif-based network files have been included in Supplemental Data 1.
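The Jaccard index used to compare these networks is simply the size of the intersection of the two edge sets divided by the size of their union; a small illustrative sketch:

```python
def jaccard_index(edges_a, edges_b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two sets of network edges."""
    a, b = set(edges_a), set(edges_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Two toy networks sharing 2 of 4 distinct edges.
net_a = {("TF1", "g1"), ("TF1", "g2"), ("TF2", "g3")}
net_b = {("TF1", "g2"), ("TF2", "g3"), ("TF2", "g4")}
print(jaccard_index(net_a, net_b))  # 0.5
```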

Network Inference (The Inferelator)
The Inferelator's modeling of gene regulatory networks relies on three main assumptions. First, because many transcription factors (TFs) are post-transcriptionally controlled and their expression level may not reflect their underlying biological activity, we assume that the activity of a TF can be estimated using the expression levels of known targets from prior interaction data [36,71]. Second, we assume that gene expression can be modeled as a weighted sum of the activities of TFs [34,45]. Finally, we assume that each gene is regulated by a small subset of TFs, and we regularize the linear model to enforce sparsity.
The Inferelator was initially developed and distributed as an R package [34,44,72,73]. We have rewritten it as a python package with dependencies on the common scientific computing packages NumPy [57], SciPy [58], pandas [74], AnnData [29], and scikit-learn [59]. Scaling is implemented either locally through python or as a distributed computation with the Dask [42] parallelization library. The inferelator package (v0.5.4 was used to generate the networks in this manuscript) is available on github (https://github.com/flatironinstitute/inferelator) and can be installed through the python package manager pip. The Inferelator takes as input gene expression data and prior information on network structure, and outputs ranked regulatory hypotheses of the relative strength and direction of each interaction, with an associated confidence score.

Transcription Factor Activity
The expression level of a TF is often not suitable to describe its activity [75].
Transcription factor activity (TFA) is an estimate of the latent activity of a TF that is inducing or repressing transcription of its targets in a sample. A gene expression data set (X) is a matrix where X_{i,j} is the observed mRNA expression level (i ∈ Samples and j ∈ Genes), measured by microarray, RNA-seq, or single-cell RNA sequencing (scRNA-seq).
We estimate TFA by solving (Equation 3) for activity (A_{i,k}), where k ∈ TFs, and P is a prior connectivity matrix where P_{k,j} is 1 if gene j is regulated by TF k and 0 if it is not. In matrix notation, X = AP, and Â is estimated by minimizing ‖ÂP − X‖²₂. This is calculated by taking the Moore-Penrose pseudoinverse of P, denoted P†, and solving Â = XP† (Equation 3). The resulting Â is a matrix where Â_{i,k} is the estimated latent TFA for sample i and TF k. In cases where all values in P for a TF are 0, that TF is removed from P, and the expression X of that TF is used in place of activity.
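This pseudoinverse estimate can be sketched in a few lines of NumPy (a toy example, not the package code); when the prior P has full row rank, the least-squares estimate recovers the simulated activities exactly:

```python
import numpy as np

# Toy prior P (TFs x genes): P[k, j] = 1 if TF k regulates gene j, else 0.
P = np.array([[1., 1., 0.],
              [0., 0., 1.]])

# Simulated latent activities A (samples x TFs) and the expression X = AP
# they would generate under the linear model.
A_true = np.array([[1., 0.],
                   [0., 2.],
                   [1., 1.],
                   [2., 0.]])
X = A_true @ P                     # samples x genes

# Least-squares TFA estimate via the Moore-Penrose pseudoinverse of P.
A_hat = X @ np.linalg.pinv(P)
print(np.allclose(A_hat, A_true))  # True (P has full row rank here)
```

With real data and an imperfect prior, Â is only a least-squares estimate of the latent activities rather than an exact recovery.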

Inferelator Network Inference
Linear models are constructed separately for each gene j, modeling its expression as a weighted sum of the estimated TF activities (Equation 4):

X_j = Σ_{k ∈ TFs} β_{j,k} Â_k    (4)
In addition to the model selection methods described here, we have implemented a module which takes any scikit-learn regression object (for example, elastic net [76]).
Model selection and regularization techniques are applied to enforce the biological property of sparsity. If the coefficient β_{j,k} is non-zero, it is evidence for a regulatory relationship between TF k and gene j.
For each gene j, the amount of variance explained by each regulatory TF k is calculated from the ratio between the variance of the residuals in the full model and the variance of the residuals when the linear model is refit by ordinary least squares (OLS) with k left out (Equation 5):

σ²_expl(j,k) = 1 − Var(X_j − Σ_{k'} β_{j,k'} Â_{k'}) / Var(X_j − Σ_{k'≠k} β_{j,k'} Â_{k'})    (5)
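One way to compute this leave-one-out measure (a sketch under the assumption that variance explained is one minus the residual-variance ratio; not the package's exact implementation):

```python
import numpy as np

def variance_explained(x_j, A, selected):
    """For each selected TF k: 1 - Var(residual, full OLS model) /
    Var(residual, OLS model refit with TF k left out)."""
    def resid(cols):
        if not cols:
            return x_j - x_j.mean()
        beta, *_ = np.linalg.lstsq(A[:, cols], x_j, rcond=None)
        return x_j - A[:, cols] @ beta

    full_var = resid(list(selected)).var()
    return {k: 1.0 - full_var / resid([s for s in selected if s != k]).var()
            for k in selected}

# Toy gene driven almost entirely by TF 0; TF 1 contributes nothing.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 2))
x_j = 2.0 * A[:, 0] + 0.05 * rng.normal(size=100)
ev = variance_explained(x_j, A, [0, 1])
print(ev)  # ev[0] near 1, ev[1] near 0
```

Leaving out the true driver (TF 0) inflates the residual variance dramatically, so its variance explained is close to 1; leaving out the irrelevant TF changes almost nothing.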
In order to mitigate the effect of outliers and sampling error, model selection is repeated multiple times using input expression data X that has been bootstrapped (resampled with replacement). Predicted TF-gene interactions are ranked for each bootstrap by the amount of variance explained, and then rank-combined into a unified network prediction. Confidence scores are assigned based on the combined rank for each interaction, and the overall network is compared to a gold standard, with performance evaluated by the area under the precision-recall curve.
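The rank-combination step can be sketched as follows (an illustration that averages ranks across bootstraps, with lower combined rank meaning higher confidence; not the package's exact tie-handling):

```python
import numpy as np

def rank_combine(bootstrap_scores):
    """Rank TF-gene interactions within each bootstrap (1 = best score)
    and average the ranks across bootstraps."""
    ranks = []
    for scores in bootstrap_scores:
        flat = scores.ravel()
        order = flat.argsort()[::-1]        # descending by score
        r = np.empty(flat.size)
        r[order] = np.arange(1, flat.size + 1)
        ranks.append(r.reshape(scores.shape))
    return np.mean(ranks, axis=0)           # lower = higher confidence

# Two bootstraps of variance-explained scores for a 2 TF x 2 gene problem.
b1 = np.array([[3., 1.], [2., 0.]])
b2 = np.array([[3., 2.], [1., 0.]])
print(rank_combine([b1, b2]))  # [[1.  2.5] [2.5 4. ]]
```

Interactions that rank highly in every bootstrap keep a low combined rank, while interactions driven by sampling noise are pushed down the list.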
The effects of setting hyperparameters can be tested by cross-validation on the prior and gold standard networks.This strategy holds out a subset of genes (rows) from the prior knowledge network P. Network inference performance is then evaluated on only those held-out genes, using the gold standard network.

Model Selection: Bayesian Best Subset Regression
Bayesian Best Subset Regression (BBSR) is a model selection method described in detail in [73]. Initial feature selection for this method is necessary, as best subset regression on all possible combinations of hundreds of TF features is computationally intractable. We therefore select the ten TF features with the highest context likelihood of relatedness between the expression of each gene and the activity of each TF. This method is described in detail in [72].
First, gene expression and TF activity are discretized into equal-width bins (n=10), and mutual information between the expression of each gene and the activity of each TF is calculated from their discrete probability distributions (Equation 6) to create a mutual information matrix M^dyn:

M^dyn_{j,k} = Σ p(X_j, Â_k) log( p(X_j, Â_k) / ( p(X_j) p(Â_k) ) )    (6)

Mutual information is also calculated between the activities of each pair of TFs (Equation 7) to create a mutual information matrix M^stat:

M^stat_{k1,k2} = Σ p(Â_{k1}, Â_{k2}) log( p(Â_{k1}, Â_{k2}) / ( p(Â_{k1}) p(Â_{k2}) ) )    (7)
A mixed context likelihood of relatedness score is then calculated as a pseudo-z-score by calculating Z^dyn (Equation 8) and Z^stat (Equation 9). Any values less than 0 in Z^dyn or Z^stat are set to 0, and they are then combined into a mixed context likelihood of relatedness matrix Z^mixed (Equation 10). For each gene j, the 10 TFs with the highest mixed context likelihood of relatedness values are selected for regression.
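The binned mutual information of Equations 6-7 can be sketched with NumPy (an illustration of the calculation, not the package code; base 2 is used here, although the base only rescales the scores):

```python
import numpy as np

def binned_mutual_information(x, y, bins=10):
    """Mutual information (bits) between two vectors after discretizing
    each into equal-width bins, from the joint discrete distribution."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float((p_xy[nz] * np.log2(p_xy[nz] / (p_x * p_y)[nz])).sum())

x = np.repeat([0.0, 1.0], 50)
print(binned_mutual_information(x, x))              # 1.0 (two equiprobable states)
print(binned_mutual_information(x, np.zeros(100)))  # 0.0 (independent)
```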
For best subset regression, a linear model is fit with OLS for every combination of the selected predictor variables.
We define β_0 as our null prior for the model parameters (zeros), β_OLS as the model coefficients from OLS, SSR as the sum of squared residuals, and G as a g-prior diagonal matrix whose diagonal values weight each predictor variable. g-prior weights in G close to 0 favor β values close to β_0; large g-prior weights favor β values close to β_OLS. By default, we select g-prior weights of 1 for all predictor variables. From the joint posterior distribution (Equation 11) we can calculate the marginal posterior distribution of σ² (Equation 12), where IG is the inverse-gamma distribution. The Bayesian Information Criterion (BIC) is calculated for each model, where n is the number of observations and k is the number of predictors (Equation 13).
We calculate the expected posterior distribution of σ² (Equation 14) for each subset of predictors and use it to determine the model BIC (Equation 15). We then select the model with the smallest E[BIC]; the predictors in the selected subset model for gene j are the TFs inferred to regulate its expression.
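The exhaustive subset search itself can be illustrated with plain BIC in place of the full Bayesian g-prior machinery (a sketch under that simplification; `best_subset_bic` is a hypothetical name):

```python
import itertools
import numpy as np

def best_subset_bic(X, y):
    """Exhaustive best-subset OLS over all predictor combinations, scored by
    the classic BIC, n*log(SSR/n) + k*log(n) (cf. Equation 13).
    Returns the predictor indices of the lowest-BIC model."""
    n, p = X.shape
    best = (np.inf, ())
    for k in range(1, p + 1):
        for subset in itertools.combinations(range(p), k):
            Xs = X[:, subset]
            beta, _, _, _ = np.linalg.lstsq(Xs, y, rcond=None)
            resid = y - Xs @ beta
            ssr = float(resid @ resid)
            bic = n * np.log(ssr / n + 1e-300) + k * np.log(n)
            if bic < best[0]:
                best = (bic, subset)
    return best[1]
```

Because the search is exponential in the number of predictors, the preceding CLR filter to 10 TFs is what keeps this tractable.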

Model Selection: StARS-LASSO
Least Absolute Shrinkage and Selection Operator (LASSO) [77] combined with the Stability Approach to Regularization Selection (StARS) [78] is a model selection method described in detail in [38]. In short, the StARS-LASSO approach selects the optimal λ parameter for (Equation 16). N random subsamples of X and Â are drawn without replacement, and subnetworks S_{n,λ} are defined as the non-zero coefficients β_{n,λ} after LASSO regression on each subsample. Initially, λ is set large, so that each subnetwork S_n is highly sparse, and is then decreased, resulting in increasingly dense networks.
Edge instability is calculated as the fraction of times subnetworks disagree about the presence of a network edge. As λ decreases, the subnetworks are expected to show increasing edge instability at first and then decreasing edge instability as λ approaches 0, where (Equation 16) reduces to OLS and each subnetwork becomes dense. We choose the largest value of λ such that the edge instability is less than 0.05, which is interpretable as all subnetworks sharing > 95% of edges. This selection balances increasing the network size against minimizing the instability that occurs when data is subsampled.
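The instability calculation can be sketched as follows (a minimal illustration using the StARS definition of edge instability, 2p(1 − p), the probability that two subsamples disagree about an edge; the function name is hypothetical):

```python
import numpy as np

def edge_instability(subnetworks):
    """Mean edge instability across subsampled networks.

    subnetworks: list of boolean (genes x TFs) adjacency matrices, one per
    LASSO fit on a subsample. For each edge, p is the fraction of subsamples
    containing the edge; 2*p*(1-p) is the chance two subsamples disagree.
    Returns the average over all edges."""
    stack = np.stack([s.astype(float) for s in subnetworks])
    p = stack.mean(axis=0)
    return float((2.0 * p * (1.0 - p)).mean())
```

λ would then be chosen as the largest value whose mean instability stays below the 0.05 threshold.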

Multiple Task Network Inference
We separate biological samples that represent different states into separate tasks, learn networks from these tasks, and then combine the task-specific networks into an ensemble network. One approach is to apply a single-task method for network inference (i.e., Section 5.4.1 or 5.4.2) to each task sequentially; the networks generated for each task are then rank-combined into a unified network. The Adaptive Multiple Sparse Regression (AMuSR) method, described in detail in [45], instead uses a multi-task learning framework in which all tasks are solved together.
In (Equation 17), B is a block-sparse weight matrix in which the weights for any feature are the same across all tasks. S is a sparse weight matrix in which the weights for features can vary between tasks. The combination W of B and S (Equation 18) gives the model weights representing regulatory interactions between TFs and genes. In short, this method uses adaptive penalties to favor regulatory interactions shared across multiple tasks in B, while recognizing data-set-specific interactions in S.
Model hyperparameters λ_s and λ_b are identified by grid search, selecting the model that minimizes the extended Bayesian Information Criterion (eBIC) (Equation 19), where D is the number of task data sets and, for data set d, n_d is the number of observations, X is the gene expression for gene i, Â^(d) is the TF activity estimate, W_{*,d} is the model weights, k_d is the number of non-zero predictors, and p_d is the total number of predictors. For this work, we set the eBIC parameter γ to 1.
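For a single task, the eBIC can be sketched as the classic BIC plus a model-space penalty of 2γ log C(p, k) (a simplified single-task form; the AMuSR objective in Equation 19 sums this over all D tasks, and the function name is hypothetical):

```python
from math import lgamma, log

def ebic(n, ssr, k, p, gamma=1.0):
    """Extended BIC for one task: n*log(SSR/n) + k*log(n) + 2*gamma*log C(p, k).

    n: observations, ssr: sum of squared residuals, k: non-zero predictors,
    p: total predictors, gamma: the eBIC parameter (set to 1 in this work).
    """
    # log of the binomial coefficient C(p, k) via log-gamma, to avoid overflow
    log_binom = lgamma(p + 1) - lgamma(k + 1) - lgamma(p - k + 1)
    return n * log(ssr / n) + k * log(n) + 2.0 * gamma * log_binom
```

With γ = 0 this reduces to the ordinary BIC; γ = 1 adds the strongest standard penalty against dense models when p is large.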

Network Performance Metrics
Prior work has used the area under the Precision (Equation 20)-Recall (Equation 21) curve to determine performance, comparing the learned network to a known gold-standard network. Here we add two metrics: the Matthews Correlation Coefficient (MCC) [79] (Equation 22) and the F1 score (Equation 23). MCC can be calculated directly from the confusion matrix True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) values.

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

F1 = 2TP / (2TP + FP + FN)

We compute an MCC and F1 score for each cutoff along the ranked interactions, generating MCC and F1 scores for all possible networks in growing ranked order. The maximum MCC along the ranked interactions gives the subnetwork with maximum similarity to the comparison network, accounting for TP, FP, TN, and FN. The maximum F1 along the ranked interactions gives the subnetwork with maximum similarity to the comparison network, accounting for TP, FP, and FN.
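Scanning MCC along the ranked interactions can be sketched as (illustrative; `max_mcc` is a hypothetical helper that assumes edges are ranked by descending confidence):

```python
import math

def max_mcc(ranked_edges, gold, n_possible):
    """Walk down ranked predicted edges, updating the confusion matrix at each
    cutoff, and return the maximum Matthews Correlation Coefficient (Eq. 22).

    ranked_edges: edge identifiers sorted by descending confidence.
    gold: set of true edges. n_possible: number of possible edges."""
    tp = fp = 0
    fn = len(gold)
    tn = n_possible - len(gold)
    best = -1.0
    for edge in ranked_edges:
        if edge in gold:
            tp, fn = tp + 1, fn - 1
        else:
            fp, tn = fp + 1, tn - 1
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        if denom > 0:  # MCC is undefined when any margin is zero
            best = max(best, (tp * tn - fp * fn) / denom)
    return best
```

The cutoff achieving this maximum defines the subnetwork reported as most similar to the comparison network.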

Network Inference in Bacillus subtilis
Microarray expression data for Bacillus subtilis was obtained from NCBI GEO: GSE67023 [36] (n=268) and GSE27219 [43] (n=266). The Inferelator (v0.5.4) learned GRNs using each expression data set separately in conjunction with a known prior network [36] (Supplemental Data 1). Performance was evaluated by AUPR on ten replicates by holding 20% of the genes in the known prior network out, learning the GRN, and then scoring on the held-out genes. Baseline shuffled controls were performed by randomly shuffling the labels on the known prior network.
Multi-task network inference uses the same B. subtilis prior for both tasks, with 20% of genes held out for scoring. Individual task networks are learned and rank-combined into an aggregate network. Performance was evaluated by AUPR on the held-out genes.

Network Inference in Saccharomyces cerevisiae
A large microarray data set was obtained from NCBI GEO and normalized for a previous publication [40] (n=2,577). It is available on Zenodo with DOI 10.5281/zenodo.3247754. In short, this data was preprocessed with limma [80] and quantile normalized. A second microarray data set consisting of a large dynamic perturbation screen [16] was obtained from NCBI GEO accession GSE142864 (n=1,693). This data set is the log2 fold change of an experimental channel over a control channel that is the same for all observations. The Inferelator (v0.5.4) learned GRNs using each expression data set separately in conjunction with a known YEASTRACT prior network [65,66] (Supplemental Data 1). Performance was evaluated by AUPR on ten replicates by holding 20% of the genes in the known prior network out, learning the GRN, and then scoring on the held-out genes in a separate gold standard [40]. Baseline shuffled controls were performed by randomly shuffling the labels on the known prior network.
Multi-task network inference uses the same YEASTRACT prior for both tasks, with 20% of genes held out for scoring. Individual task networks are learned and rank-combined into an aggregate network, which is then evaluated by AUPR on the held-out genes in the separate gold standard.

Single-Cell Network Inference in Saccharomyces cerevisiae
Single-cell expression data for Saccharomyces cerevisiae was obtained from NCBI GEO (GSE125162 [22] and GSE144820 [47]). Individual cells (n=44,343) are organized into one of 14 groups based on experimental metadata and used as separate tasks in network inference. Genes were filtered such that any gene with fewer than 2,217 total counts across all cells (1 count per 20 cells) was removed. Data was used as raw, unmodified counts, was Freeman-Tukey transformed (√(x + 1) + √x − 1), or was log2 pseudocount transformed (log2(x + 1)). Data was either not normalized, or depth normalized by scaling so that the sum of all counts for each cell is equal to the median of the sum of counts over all cells. For each set of parameters, network inference is run 10 times, using the YEASTRACT network as prior knowledge with 20% of genes held out for scoring. For noise-only controls, gene expression counts are simulated randomly such that for each gene i, x_i ∼ N(µ_{x_i}, σ_{x_i}), and the sum for each cell is equal to the sum in the observed data. For shuffled controls, the gene labels on the prior knowledge network are randomly shuffled. SCANPY was used to preprocess and cluster the mouse scRNA-seq data set. Genes present in fewer than 2% of cells were removed. Cells were filtered out when fewer than 1,000 genes were detected, the cell had more than 20,000 total gene counts, or the cell had more than 7% of gene counts assigned to mitochondrial transcripts.
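The transforms and depth normalization described above can be sketched as (a minimal illustration; function names are hypothetical):

```python
import numpy as np

def depth_normalize(counts):
    """Scale each cell (row) so its total count equals the median cell depth."""
    depth = counts.sum(axis=1, keepdims=True)
    return counts * (np.median(depth) / depth)

def freeman_tukey(x):
    """Freeman-Tukey transform, sqrt(x + 1) + sqrt(x) - 1, so that FTT(0) = 0."""
    return np.sqrt(x + 1.0) + np.sqrt(x) - 1.0

def log2_pseudocount(x):
    """log2(x + 1) pseudocount transform."""
    return np.log2(x + 1.0)
```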
Transcript counts were then log transformed, normalized, and scaled. Cells were assigned to a mitotic or post-mitotic phase based on cell cycle marker genes using score_genes_cell_cycle [81]. In order to focus on neuronal cells, all 374,369 mitotic cells were removed. Aggregated chromatin accessibility profiles were used with TRANSFAC v2020.1 motifs and the inferelator-prior (v0.3.0) pipeline to create prior knowledge connectivity matrices between TFs and target genes for excitatory neurons, interneurons, and glial cells. Vascular cells were not sufficiently represented in the scATAC data to allow construction of a vascular cell prior with this method, so vascular cells were assigned the glial prior for network inference.
GRNs were learned using AMuSR on log2 pseudocount transformed count data for each of 36 cell-type-specific clusters as separate tasks, with the appropriate prior knowledge network. An aggregate network was created by rank-summing each cell type GRN. MCC was calculated for this aggregate network based on a comparison to the union of the three prior knowledge networks, and the confidence score that maximized MCC was selected as a threshold to determine the size of the final network.
5.11 Inferelator 3.0 Single-Cell Computational Speed Profiling
144,682 mouse cells from the mouse neuronal subcluster EXC IT 1 were used with the mouse excitatory neuron prior knowledge network to determine Inferelator 3.0 runtime. To benchmark the Dask engine, the Inferelator was deployed to 5 28-core (Intel Xeon E5-2690) nodes for a total of 140 CPU cores. To benchmark the Python-based multiprocessing engine, the Inferelator was deployed to a single 28-core (Intel Xeon E5-2690) node. Either all 144,682 mouse cells were used, or a subset was randomly selected for each run, and used to learn a single GRN. Runtime was determined by the length of workflow execution, which includes loading data, running all regressions, and producing output files.

Visualization
Figures were generated with R [85] and the common ggplot2 [86], umap [87], and tidyverse packages [88]. Additional figures were generated with Python using scanpy [29], matplotlib [89], and seaborn [90].

The Inferelator produces an inferred network from a combination of gene expression data and a prior network constructed from existing knowledge about known gene regulation. Curated databases of regulator-gene interactions culled from domain-specific literature are an excellent source for prior networks. While some model systems have excellent databases of known interactions, these resources are unavailable for most organisms or cell types. In these cases, chromatin accessibility determined by a standard Assay for Transposase-Accessible Chromatin (ATAC) can be used.

Overlapping motif hits are resolved by taking the maximum IC for each overlapping base, penalized with the ℓ2-norm of differences from the motif consensus sequence. To account for cooperative TF binding effects, any motif hits within 100 bases (25 bases for yeast) are combined, and their EIC scores are summed. The TF-gene binding score is the maximum TF EIC after accounting for overlapping and adjacent TF motifs, and all TF-gene scores are assembled into a Genes × TFs score matrix. This unfiltered TF-gene score matrix is not sparse, as motifs for many TFs are expected to occur often by chance, and TF-gene scores for each TF are not comparable to scores for other TFs, as motif position-weight matrices have differing information content. Scores for each TF are clustered using the density-based k-nearest neighbors algorithm DBSCAN [70] (MinPts = 0.001 × number of genes, eps = 1). The cluster of TF-gene edges with the highest score values, and any high-score outliers, are retained in the connectivity matrix, and other TF-gene edges are discarded.
Remaining cells were clustered by Leiden clustering (resolution = 0.5) using the first 300 principal components of the 2,000 most highly variable genes. Broad cell types were assigned to each cluster based on the expression of the marker genes Neurod6 for excitatory neurons, Gad1 for interneurons, and Apoe for glial cells. Cells from each broad cell type were then re-clustered based on the 2,000 most highly variable genes within the cluster. Specific cell types were assigned to each subcluster based on the expression of marker genes [82]. Ambiguous clusters were discarded, removing 151,765 cells and leaving 36 specific cell type clusters that consist of 766,402 total cells. Single-cell ATAC data from Mus musculus brain samples taken at E18 was obtained from 10x Genomics; data sets are from samples prepared fresh (https://support.10xgenomics.com/single-cell-atac/datasets/1.2.0/atac_v1_E18_brain_fresh_5k), samples dissociated and cryopreserved (https://support.10xgenomics.com/single-cell-atac/datasets/1.2.0/atac_v1_E18_brain_cryo_5k), and samples flash-frozen (https://support.10xgenomics.com/single-cell-atac/datasets/1.2.0/atac_v1_E18_brain_flash_5k). ChromA [83] and SnapATAC [84] were used to process the scATAC-seq data sets. Consensus peaks were called on the 3 data sets using ChromA. Each data set was then run through the SnapATAC pipeline using the consensus peaks. Cells were clustered, and labels from the scRNA-seq object were transferred to the scATAC data. Cells that did not have an assignment score ≥ 0.5 were discarded. Assigned barcodes were split by cell class (EXC, IN, or GL). ChromA was run again for each cell class, generating 3 sets of cell-class-specific peaks.

Figure 1 :
Figure 1: Learning Gene Regulatory Networks with the Inferelator. (A) The response to the sugar galactose in Saccharomyces cerevisiae is mediated by the Gal4 and Gal80 TFs, a prototypical mechanism for altering cellular gene expression in response to stimuli. (B) Gal4 and Gal80 regulation represented as an unsigned directed graph connecting regulatory TFs to target genes. (C) Genome-wide Gene Regulatory Networks (GRNs) are inferred from gene expression data and prior knowledge about network connections using the Inferelator, and the resulting networks are scored by comparison with a gold standard of known interactions.

Figure 2 :
Figure 2: Network Inference Performance on Multiple Model Organism Data Sets. (A) Schematic of the Inferelator workflow and a brief summary of the differences between GRN model selection methods. (B) Results from 10 replicates of GRN inference for each modeling method on (i) Bacillus subtilis GSE67023 (B1) and GSE27219 (B2), and (ii) Saccharomyces cerevisiae GSE142864 (S1) and [40] (S2). Precision-recall curves are shown for replicates where 20% of genes are held out of the prior and used for evaluation, with a smoothed consensus curve. AUPR is plotted for each cross-validation result in gray, with mean ± standard deviation in color. Experiments labeled with (S) are shuffled controls, where the labels on the prior adjacency matrix have been randomly shuffled. 10 shuffled replicates are shown as gray dots, with mean ± standard deviation in black. (C) Results from 10 replicates of GRN inference using two data sets as two network inference tasks on (i) Bacillus subtilis and (ii) Saccharomyces cerevisiae. AMuSR is a multi-task learning method; BBSR and StARS-LASSO are run on each task separately and then combined into a unified GRN. Precision-recall curves and AUPR are plotted as in B.

Figure 3 :
Figure 3: Construction and Performance of Network Connectivity Priors Using TF Motif Scanning. (A) Schematic of the inferelator-prior workflow, scanning identified regulatory regions (e.g., by ATAC) for TF motifs to construct adjacency matrices. (B) Jaccard similarity index between Saccharomyces cerevisiae prior adjacency matrices generated by the inferelator-prior package, by the CellOracle package, and obtained from the YEASTRACT database. Prior matrices were generated using TF motifs from the CIS-BP, JASPAR, and TRANSFAC databases with each pipeline. (C) The performance of Inferelator network inference using each motif-derived prior. Performance is evaluated by AUPR, scoring against genes held out of the prior adjacency matrix, based on inference using 2,577 genome-wide microarray experiments. Experiments labeled with (S) are shuffled controls, where the labels on the prior adjacency matrix have been randomly shuffled.

Figure 4:
Figure 5:
Figure 4: Network Inference Performance Using Single-Cell Saccharomyces cerevisiae Expression Data. (A) Uniform Manifold Approximation and Projection (UMAP) plot of single-cell yeast data, colored by the experimental grouping of individual cells (tasks). (B) The effect of preprocessing methods on network inference using BBSR model selection on 14 task-specific expression data sets, as measured by AUPR. Colored dots represent mean ± standard deviation of all replicates. Data is either untransformed (raw counts), transformed by the Freeman-Tukey Transform (FTT), or transformed by the log2(x + 1) pseudocount. Non-normalized data is compared to data normalized so that all cells have identical count depth. Network inference performance is compared to two baseline controls: data which has been replaced by Gaussian noise (N) and network inference using shuffled labels in the prior network (S). (C) Performance evaluated as in B on StARS-LASSO model selection. (D) Performance evaluated as in B on AMuSR model selection. (E) Precision-recall of the recovery of the prior on a final network constructed using FTT-transformed, non-normalized AMuSR model selection. (F) Matthews Correlation Coefficient (MCC) of the same network as in E.

Figure 6 :
Figure 6: Learned GRN for Mouse Brain. (A) MCC for the aggregate network based on Inferelator prediction confidence. The dashed line shows the confidence score which maximizes MCC; network edges at and above this line are retained in the final network. (B) Aggregate GRN learned. (C) Network edges which are present in every individual task. (D) Jaccard similarity index between each task network. (E) Network targets of the EGR1 TF in neurons. (F) Network targets of the EGR1 TF in both neurons and glial cells. (G) Network targets of the EGR1 TF in glial cells. (H) Network of the ATF4 TF, where blue edges are neuron-specific, orange edges are glial-specific, and black edges are present in both categories.
Network diagrams were created with the Python package jp_gene_viz (https://github.com/simonsfoundation/jp_gene_viz). Schematic figures were created in Adobe Illustrator, and other figures were adjusted in Illustrator to improve panelling and layout.