Bayesian information sharing enhances detection of regulatory associations in rare cell types

Abstract

Motivation: Recent advances in single-cell RNA-sequencing (scRNA-seq) technologies promise to enable the study of gene regulatory associations at unprecedented resolution in diverse cellular contexts. However, identifying unique regulatory associations observed only in specific cell types or conditions remains a key challenge; this is particularly so for rare transcriptional states whose sample sizes are too small for existing gene regulatory network inference methods to be effective.

Results: We present ShareNet, a Bayesian framework for boosting the accuracy of cell type-specific gene regulatory networks by propagating information across related cell types via an information sharing structure that is adaptively optimized for a given single-cell dataset. The techniques we introduce can be used with a range of general network inference algorithms to enhance the output for each cell type. We demonstrate the enhanced accuracy of our approach on three benchmark scRNA-seq datasets. We find that our inferred cell type-specific networks also uncover key changes in gene associations that underpin the complex rewiring of regulatory networks across cell types, tissues and dynamic biological processes. Our work presents a path toward extracting deeper insights about cell type-specific gene regulation in the rapidly growing compendium of scRNA-seq datasets.

Availability and implementation: The code for ShareNet is available at http://sharenet.csail.mit.edu and https://github.com/alexw16/sharenet.

Supplementary information: Supplementary data are available at Bioinformatics online.

We compared the accuracy of dendritic cell networks inferred with Pearson correlation, GENIE3 and PIDC, with and without ShareNet, with the dendritic cell population downsampled to 50, 100 and 200 cells. We tested a range of mixture component numbers for ShareNet across the set of downsampling conditions, comparing the inferred dendritic cell networks with the corresponding cell type-specific ChIP-seq networks. While increasing the number of mixture components appears to improve the accuracy of networks inferred with Pearson correlation at a population size of 50, the effect of ShareNet on network accuracy is generally robust to the number of mixture components used, as long as more than one component is used.

The TreeMap plot was generated using REVIGO on the significant GO terms (FDR q < 0.01) identified in the enrichment analysis.

Derivation of ShareNet
In this section, we derive the parameter updates in the coordinate ascent variational inference procedure used in ShareNet.

Model Description
We first present the generative process for the Bayesian information sharing framework in full.
The observed score $e^{(c)}_{i,j}$ for the edge connecting gene $i$ to gene $j$ in cell type $c$ is modeled as a sample from a univariate Gaussian $\mathcal{N}(z^{(c)}_{i,j}, \sigma^{(c)2}_{i,j})$. The mean $z^{(c)}_{i,j}$ of the Gaussian is a latent variable that represents the true score of the edge if calculated in an ideal scenario absent of noise. The variance parameter $\sigma^{(c)2}_{i,j}$ captures the degree of variation in the observed edge score. We treat this variance parameter as a hyperparameter, estimating it by bootstrap sampling each cell type in the dataset of interest, applying a chosen network inference algorithm to the bootstrapped samples, and then calculating the standard error, which we use as the standard deviation in this part of the model.
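The bootstrap procedure for estimating these standard errors can be sketched as follows, using Pearson correlation as the base network inference method for concreteness. The function name and defaults here are illustrative, not ShareNet's actual implementation.

```python
import numpy as np

def bootstrap_edge_se(X, n_boot=50, seed=0):
    """Estimate per-edge standard errors by bootstrap resampling cells.

    X: (cells x genes) expression matrix for one cell type.
    Returns a (genes x genes) matrix of bootstrap standard errors of the
    Pearson-correlation edge scores.
    """
    rng = np.random.default_rng(seed)
    n_cells = X.shape[0]
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n_cells, size=n_cells)  # resample cells with replacement
        scores.append(np.corrcoef(X[idx].T))          # edge scores on the bootstrap sample
    # standard deviation across bootstrap replicates = standard error estimate
    return np.std(np.stack(scores), axis=0)
```

The resulting matrix supplies the per-edge standard deviations $\sigma^{(c)}_{i,j}$ for one cell type; any other network inference method can be substituted for `np.corrcoef`.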
Moving up the hierarchical model, the $C$-dimensional vector of latent variables $z_{i,j} = (z^{(1)}_{i,j}, \ldots, z^{(C)}_{i,j})$ containing the true edge scores for edge $(i, j)$ across the $C$ cell types is a sample drawn from a mixture of multivariate Gaussian distributions with $K$ mixture components. Each of the $K$ Gaussians is parameterized by a mean $\mu_k$ and an inverse covariance $\Sigma_k^{-1}$ drawn independently from the same Normal-Wishart prior. The set of $K$ covariance matrices $\Sigma_1, \ldots, \Sigma_K$ represents the set of cell type-to-cell type sharing patterns.
Taking all of these distributions into consideration, the joint density for the full hierarchical model can be written as follows. For simplicity of representation, we denote each (regulator gene $i$, target gene $j$) combination with a unique subscript $n$:
$$p(e, z, u, \mu, \Sigma^{-1}) = \prod_{k=1}^{K} p(\mu_k, \Sigma_k^{-1}) \prod_{n=1}^{N} p(u_n) \left[ \prod_{k=1}^{K} \mathcal{N}(z_n \mid \mu_k, \Sigma_k)^{u_{n,k}} \right] \mathcal{N}(e_n \mid z_n, D_n^{-1}),$$
where $p(\mu_k, \Sigma_k^{-1})$ is the Normal-Wishart prior, $u_n$ is the one-hot mixture assignment for edge $n$, and $D_n = \mathrm{diag}(1/\sigma_n^{(1)2}, \ldots, 1/\sigma_n^{(C)2})$ is the diagonal matrix containing the reciprocals of the squared standard error terms for edge $n$ across cell types $1$ to $C$.
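A minimal simulation of this generative process can help make the hierarchy concrete. The Normal-Wishart hyperparameters (zero mean, identity scale, integer degrees of freedom) and the fixed noise level below are illustrative choices, not the values used by ShareNet.

```python
import numpy as np

def sample_sharenet_generative(N=1000, C=4, K=3, nu=6, seed=0):
    """Sample observed edge scores from the hierarchical model described above.

    N edges, C cell types, K mixture components; nu is the (integer) Wishart
    degrees of freedom. Returns observed scores e, true scores z and mixture
    assignments u.
    """
    rng = np.random.default_rng(seed)
    mus, covs = [], []
    for _ in range(K):
        # Wishart(nu, I) sample via sum of outer products (valid for integer nu >= C)
        A = rng.normal(size=(nu, C))
        cov = np.linalg.inv(A.T @ A)                 # Sigma_k from sampled precision
        mus.append(rng.multivariate_normal(np.zeros(C), cov))  # mu_k | Sigma_k
        covs.append(cov)
    u = rng.integers(0, K, size=N)                   # mixture assignments u_n (uniform)
    z = np.stack([rng.multivariate_normal(mus[k], covs[k]) for k in u])  # true scores
    sigma = 0.1 * np.ones((N, C))                    # per-edge noise std (hyperparameter)
    e = z + rng.normal(size=(N, C)) * sigma          # observed edge scores
    return e, z, u
```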

Variational Inference
As the latent variable z represents the true edge scores, we are interested in calculating the posterior distribution of this latent variable given the observed edge scores.
We are similarly interested in calculating the posterior distributions of the remaining latent variables (µ, Σ −1 , u) given the observed edge scores. However, exact computation of these posterior distributions is intractable, as computing each of these distributions requires computing the evidence term p(e) in the denominator.
We can see that computing this evidence term requires integrating over u, which is intractable to compute for even moderate sizes of N and K. As a result, we resort to variational inference as a tool for approximating these distributions.

Setup
In variational inference, we propose a family of approximating (variational) distributions and then seek to identify the member of this family that minimizes the KL divergence to the posterior distribution $p(U \mid x)$ that we aim to approximate.
Here, we define $U = (\mu, \Sigma^{-1}, u, z)$ to be the set of latent variables associated with our model, and $x = e$ denotes the observed edge scores. Unfortunately, computation of the KL divergence objective function is also intractable because it requires computing $\log p(e)$.
We can, however, calculate an alternative objective, the ELBO (evidence lower bound), which is equivalent to the negative KL divergence up to an added constant.
We, therefore, seek to maximize the ELBO as our objective.
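The relationship between the ELBO and the KL divergence can be stated explicitly via the standard decomposition of the log evidence:

```latex
\log p(e)
= \underbrace{\mathbb{E}_{q}\left[\log p(e, U)\right] - \mathbb{E}_{q}\left[\log q(U)\right]}_{\mathrm{ELBO}(q)}
\; + \; \mathrm{KL}\!\left(q(U)\,\|\,p(U \mid e)\right)
```

Since $\log p(e)$ does not depend on $q$, maximizing the ELBO is equivalent to minimizing the KL divergence to the posterior.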

Variational family of distributions
In our setup, we assume that the variational distribution factorizes as
$$q(\mu, \Sigma^{-1}, u, z) = \prod_{k=1}^{K} q(\mu_k, \Sigma_k^{-1}) \prod_{n=1}^{N} q(u_n)\, q(z_n).$$
Given this family of distributions, our variational inference goal is to identify the member of this family that maximizes the ELBO.

Coordinate ascent variational inference: general
Let us first consider a general solution for an optimal factor in variational inference. For each factor in the variational distribution, its optimal distribution is proportional to the exponential of the expected log of the joint, where the expectation is taken with respect to the latent variables described by the remaining variational factors (18).
Here, each $\theta_j$ represents the set of latent variables described by variational factor $j$. We can apply this solution to determine the form of each of the factors in our variational distribution.
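In symbols, this general coordinate ascent solution (see, e.g., reference (18)) reads:

```latex
\log q_j^{*}(\theta_j) = \mathbb{E}_{i \neq j}\left[\log p(x, \theta)\right] + \text{const},
\qquad\text{i.e.}\qquad
q_j^{*}(\theta_j) \propto \exp\!\left(\mathbb{E}_{i \neq j}\left[\log p(x, \theta)\right]\right),
```

where the expectation is taken with respect to all variational factors other than $q_j$.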

Coordinate ascent variational inference: $q(u_n)$
Starting with the $q(u_n)$ factor in our variational distribution, we first state the general form of the optimal solution.
Writing out this general solution, we have the following, where terms that do not depend on $u_n$ are absorbed into a constant.
If we exponentiate both sides, our optimal solution takes the form $q(u_n) \propto \prod_{k=1}^{K} \rho_{n,k}^{u_{n,k}}$, which when normalized is a categorical distribution with parameters $\phi_{n,k} = \rho_{n,k} / \sum_{k'=1}^{K} \rho_{n,k'}$.
From this result, we also have $\mathbb{E}_{q(u)}[u_{n,k}] = \phi_{n,k}$, which will be used in our derivations of the other variational factors below. We will also return to the expectation terms within Equation 1 in the derivations below.
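Normalizing the unnormalized weights $\rho_{n,k}$ into the responsibilities $\phi_{n,k}$ is a softmax over mixture components; a numerically stable sketch working in log space:

```python
import numpy as np

def responsibilities(log_rho):
    """Normalize unnormalized log mixture weights log(rho_{n,k}) into
    responsibilities phi_{n,k} = rho_{n,k} / sum_k' rho_{n,k'}.

    log_rho: (N, K) array. Subtracting the row-wise max (log-sum-exp trick)
    avoids overflow without changing the normalized result.
    """
    m = log_rho.max(axis=1, keepdims=True)
    phi = np.exp(log_rho - m)
    return phi / phi.sum(axis=1, keepdims=True)
```

The result is invariant to adding a constant to each row of `log_rho`, which is why the constant absorbed during the derivation can be ignored.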
Coordinate ascent variational inference: $q(\mu_k, \Sigma_k^{-1})$
We apply a similar procedure to determine the optimal solution for the $q(\mu_k, \Sigma_k^{-1})$ factor. The general solution for this factor is as follows.
Because our prior over $(\mu_k, \Sigma_k^{-1})$ is a Normal-Wishart distribution and the distribution over each $z_n$ in our hierarchical model is Gaussian, our optimal solution is a Normal-Wishart distribution parameterized as follows.
Similar to the above, we will return to the expectation terms expressed in these equations in a subsequent section.
Coordinate ascent variational inference: $q(z_n)$
Because the prior over $z_n$ in the hierarchical model is Gaussian and the likelihood of the observed scores given $z_n$ is also Gaussian, $q^*(z_n)$ takes on the following Gaussian distribution.

Full set of coordinate ascent updates
Using the above expressions for the various factors in our variational distribution, we can write out the complete set of corresponding variational parameter update equations. For the variational factor $q(\mu_k, \Sigma_k^{-1})$, the update equations for $\tilde{a}_k$ and $\tilde{B}_k^{-1}$ are as follows.
For the $q(u_n)$ factor, we had Equation 1, which contained the following expectation terms that we can now write out.
Using the above two equations, our revised expression for Equation 1 becomes
$$\log \rho_{n,k} = \log \pi_k + \frac{1}{2} \mathbb{E}_q\!\left[\log |\Sigma_k^{-1}|\right] - \frac{C}{2} \log(2\pi) - \frac{1}{2} \mathbb{E}_q\!\left[(z_n - \mu_k)^\top \Sigma_k^{-1} (z_n - \mu_k)\right],$$
where $\pi_k$ denotes the prior mixture weight of component $k$. We now have a complete set of coordinate ascent update equations for our variational parameters.
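The full inference procedure then cycles through these updates until the ELBO converges. A generic, schematic coordinate ascent loop is sketched below; the update and ELBO functions are placeholders supplied by the caller, not ShareNet's actual update equations.

```python
def coordinate_ascent(updates, params, elbo, max_iter=200, tol=1e-8):
    """Generic coordinate ascent loop: apply each variational update in turn
    and monitor the objective for convergence.

    updates: list of functions, each taking and returning the full parameter
             dict (one per variational factor).
    elbo:    function scoring the current parameters.
    """
    prev = -float("inf")
    for _ in range(max_iter):
        for update in updates:
            params = update(params)       # one coordinate update per factor
        cur = elbo(params)
        if abs(cur - prev) < tol:         # ELBO change below tolerance: stop
            break
        prev = cur
    return params
```

Because each coordinate update maximizes the ELBO over one factor with the others held fixed, the objective is non-decreasing across iterations, which makes the ELBO change a natural convergence criterion.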

Model Description
Our derivation of the Bayesian variable selection model with ShareNet as a prior contains many of the same terms as above, as we again leverage a mixture of Gaussians to share information across cell types. As a result, the distributions used in this model overlap to some degree with the ones described earlier. In the Bayesian variable selection with ShareNet model, the generative process is as follows.
Here, $\beta^{(c)}_{i,j}$ represents the coefficient of the linear model describing the effect of regulator gene $i$ on target gene $j$ in cell type $c$, and $\gamma^{(c)}_{i,j}$ is a corresponding binary indicator variable that indicates whether or not regulator gene $i$ is included in the network model for cell type $c$. We also place "spike and slab" priors (17) on the coefficients $\beta^{(c)}_{i,j}$.

Variational Inference
Similar to the previous model, we seek to define a family of variational distributions from which we identify an optimal approximating distribution. In this scenario, the variational distribution factorizes as follows.
We also define the distributions for the following set of variational factors.

ELBO
Given our joint density and the general form of our variational distributions, we can write out the ELBO.
We split up the first and second terms in the ELBO and consider them separately at first for simplicity.

Expectation Terms
From the above expression of the ELBO, we observe that there are a number of expectation terms that we must evaluate in order to compute the ELBO. We write out this set of expectation equations, which will be useful for writing out the form of the ELBO explicitly.

Approximating the Expectation of Log Sigmoid
Here, we seek to approximate $\mathbb{E}_q[\log g(z^{(c)}_{j,i})]$ by first defining a grid of samples from the factor in our variational distribution that approximates the distribution of $z_{j,i}$. In our current setup, this factor corresponds to a multivariate Gaussian centered at $\tilde{m}_{j,i}$ with covariance $\tilde{S}_{j,i}$. Because we are approximating a single scalar value $\mathbb{E}_q[\log g(z^{(c)}_{j,i})]$, we need only consider the components of the multivariate Gaussian that directly correspond to element $(c)$, namely $\tilde{m}^{(c)}_{j,i}$ in the mean and $\tilde{S}^{c,c}_{j,i}$ in the covariance. For simplicity, we refer to these values as $\tilde{m}$ and $\tilde{s}$, respectively, below.
We choose a grid of samples $\tilde{m}, \tilde{m} \pm \eta\sqrt{\tilde{s}}, \tilde{m} \pm 2\eta\sqrt{\tilde{s}}, \ldots, \tilde{m} \pm K\eta\sqrt{\tilde{s}}$, where $\eta$ and $K$ are hyperparameters. Using this grid of samples, we approximate our expectation term as a weighted average of $\log g(\cdot)$ over the grid points,
$$\mathbb{E}_q[\log g(z^{(c)})] \approx \sum_{t=-K}^{K} w_t \log g(\tilde{m} + \eta t \sqrt{\tilde{s}}),$$
where the weights $w_t$ are proportional to the Gaussian density of each grid point under $\mathcal{N}(\tilde{m}, \tilde{s})$ and normalized to sum to one. With these approximations, let us now write out the ELBO.
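A sketch of this grid approximation follows, assuming the grid samples are weighted by their normalized Gaussian densities under $\mathcal{N}(\tilde{m}, \tilde{s})$; this is one reasonable weighting choice for illustration, and the default values of `eta` and `K` are arbitrary.

```python
import numpy as np

def expect_log_sigmoid(m, s, eta=0.5, K=4):
    """Approximate E[log g(z)] for z ~ N(m, s) on the grid
    m, m +/- eta*sqrt(s), ..., m +/- K*eta*sqrt(s).

    Each grid point is weighted by its Gaussian density under N(m, s),
    with the weights normalized to sum to one.
    """
    t = np.arange(-K, K + 1)
    grid = m + eta * t * np.sqrt(s)
    w = np.exp(-0.5 * (grid - m) ** 2 / s)   # unnormalized N(grid; m, s) density
    w /= w.sum()
    log_g = -np.log1p(np.exp(-grid))         # log sigmoid, numerically stable form
    return float(np.sum(w * log_g))
```

As a sanity check, when the mass of $\mathcal{N}(\tilde{m}, \tilde{s})$ sits far in the positive tail, $g(z) \approx 1$ and the approximation should be close to zero.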

Coordinate ascent variational inference: $\tilde{S}_{j,i}$
Let us first write out the terms in the ELBO that contain $\tilde{S}_{j,i}$ or any of its elements.
Now, we take the partial derivative with respect to $\tilde{S}_{j,i}$.
We simplify each of the terms in the first line to clean up the calculation of the partial derivative, replacing $\tilde{S}^{c,c}_{j,i}$ with $\tilde{s}$ and making the following substitutions: $a = e^{-\tilde{m}^{(c)}_{j,i}}$ and $k_t = \eta t$. We also apply the sigmoid function $g$ to each grid sample $\tilde{m}^{(c)}_{j,i} + \eta t \sqrt{\tilde{s}}$. Substituting these terms back into our expression for $\partial \mathcal{L} / \partial \tilde{S}_{j,i}$ yields the following. Now that we have an expression for the gradient, we can use gradient ascent to optimize $\tilde{S}^{c,c}_{j,i}$.

Coordinate ascent variational inference: $\tilde{m}_{j,i}$
Similar to before, we start by isolating the terms in the ELBO that contain $\tilde{m}_{j,i}$.
Taking the partial derivative with respect to $\tilde{m}_{j,i}$, we get the following.

We then calculate the remaining partial derivative terms in the expression for $\partial \mathcal{L} / \partial \tilde{m}_{j,i}$ above. Our expression for the partial derivative can now be written as follows.
As with the optimization scheme for $\tilde{S}_{j,i}$, we can employ this expression for the gradient in a gradient ascent optimization procedure to identify an optimal value for $\tilde{m}_{j,i}$.

Runtime and Memory Usage
ShareNet's runtime and memory usage depend on a variety of factors, most notably the number of cell types and the number of putative edges in a network. The main bottleneck for both computational resources is the $\tilde{S}$ variational parameter, an $N \times C \times C$ array, where $N$ is the number of putative edges in a network and $C$ is the number of cell types. The memory associated with this parameter therefore scales as $O(NC^2)$, while the matrix inversion step in the coordinate ascent update for $\tilde{S}$ (Eq. 8) scales as $O(NC^3)$. As a result, the number of network edges and cell types considered can dramatically influence both runtime and memory consumption, a phenomenon we observe across the three datasets, which spanned a broad range of these factors. For the BEELINE analyses, we considered 267,750 putative edges for each of the five cell types in the dataset. In the Tabula Muris analyses, we inferred networks for 52 cell types, each of which contained 2,640,456 putative edges. Lastly, the mouse blood lineage dataset contained 24 cell types, and each network was associated with 1,207,152 putative edges. The average total runtime of ShareNet was 2 minutes, 116 minutes and 411 minutes for the BEELINE, mouse blood lineage and Tabula Muris datasets, respectively; peak memory usage was approximately 664 MB, 11.9 GB and 111 GB for the same three datasets. These statistics do not include the runs of the network inference algorithms required to obtain the initial network estimates and standard deviation estimates that ShareNet uses as input. All methods were run on a 2.40 GHz Intel Xeon E5-2695v2 central processing unit.
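A back-of-envelope calculation makes the $O(NC^2)$ scaling concrete. The helper below is illustrative: it counts only the $\tilde{S}$ parameter itself at float64 precision and ignores all other allocations, so it gives a lower bound rather than the observed peak usage.

```python
def sharenet_s_memory_gb(n_edges, n_cell_types, bytes_per_float=8):
    """Rough memory footprint (in GB) of the N x C x C variational parameter
    S~ stored as float64, illustrating the O(N C^2) scaling discussed above."""
    return n_edges * n_cell_types ** 2 * bytes_per_float / 1e9
```

For the Tabula Muris setting ($N$ = 2,640,456, $C$ = 52) this gives roughly 57 GB for $\tilde{S}$ alone, consistent in order of magnitude with the reported 111 GB peak, while the BEELINE setting ($N$ = 267,750, $C$ = 5) needs well under 0.1 GB.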
While the runtime and memory usage statistics may appear sizeable for larger datasets, we note that both of these computational resources can be substantially reduced by leveraging prior biological knowledge. For example, the number of putative edges considered in a network can be drastically reduced to consider only edges supported by transcription factor binding motifs or chromatin accessibility instead of all possible transcription factor-target gene pairs as was done in this study. In addition, we observed that cell types that are dissimilar tend to engage in less information sharing with each other in ShareNet (Fig. 5a, 6b). Retaining only cell types that are more similar to each other for use in ShareNet rather than using an entire collection of varied cell types could, therefore, also reduce the computational burden for such large dataset scenarios without diminishing performance.