Bayesian inference on stochastic gene transcription from flow cytometry data

Abstract

Motivation: Transcription in single cells is an inherently stochastic process, as mRNA levels vary greatly between cells, even for genetically identical cells under the same experimental and environmental conditions. We present a stochastic two-state switch model for the population of mRNA molecules in single cells, where genes stochastically alternate between a more active ON state and a less active OFF state. We prove that the stationary solution of such a model can be written as a mixture of a Poisson and a Poisson-beta probability distribution. This finding facilitates inference for single cell expression data, observed at a single time point, from flow cytometry experiments such as FACS or fluorescence in situ hybridization (FISH), as it allows one to sample directly from the equilibrium distribution of the mRNA population. We hence propose a Bayesian inferential methodology using a pseudo-marginal approach and a recent approximation to integrate over the unobserved states associated with measurement error.

Results: We provide a general inferential framework which can be widely used to study transcription in single cells from the kind of data arising in flow cytometry experiments. The approach allows us to distinguish between the intrinsic stochasticity of the molecular dynamics and the measurement noise. The methodology is tested in simulation studies, and results are obtained for multiple experimental single cell expression datasets from FISH flow cytometry experiments.

Availability and implementation: All analyses were implemented in R. Source code and the experimental data are available at https://github.com/SimoneTiberi/Bayesian-inference-on-stochastic-gene-transcription-from-flow-cytometry-data.

Supplementary information: Supplementary data are available at Bioinformatics online.


S1.1 Unbiased approximation of products of integrals
We use a pseudo-marginal approach where the marginal probability of the i-th observation, P(Y_i = y_i | θ), is approximated by an unbiased estimate, f(y_i | θ).
Given the independent and identically distributed (iid) nature of the data, the likelihood for a sample of N observations, y = (y_1, ..., y_N), is defined by:

P(y | θ) = ∏_{i=1}^{N} P(Y_i = y_i | θ).   (S1)

The standard way to approximate (S1) is to generate, for each observation, a distinct iid sample of size S, z_i = (z_i^(1), ..., z_i^(S)), for i = 1, ..., N, from the latent model structure of X | θ in (11). These samples can be used to compute the following unbiased estimator:

f(y | θ) = ∏_{i=1}^{N} (1/S) Σ_{s=1}^{S} P(Y_i = y_i | z_i^(s), θ).

This estimator is highly inefficient, however, as it requires drawing S × N elements, called particles, which can be computationally prohibitive. At the same time, using the same sample of S particles for all N observations would result in a biased estimator [Lee et al., 2017]. We therefore use a recently developed estimator [Lee et al., 2017] which allows us to use the same sample of S elements to estimate (S1) while preserving unbiasedness, where it is required that S ≥ N. The algorithm proceeds as follows: 1. Sample z = (z_1, ..., z_S) from P(X | θ) in (11), and set l_0 = 1 and S = {1, ..., S}.
Above, f_N(· | µ, σ^2) denotes the density of the normal distribution with mean µ and variance σ^2. At every iteration of 2), in step 2.c), one particle is sampled, and then removed from S, according to the sampling probabilities, where #S indicates the number of elements in S.
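The standard S × N estimator described above can be sketched in a few lines. The following Python fragment is an illustration, not the authors' R implementation: it uses a deliberately simple hypothetical latent structure, X ~ Poisson(5) with Gaussian observation noise, in place of the model in (11).

```python
import numpy as np

rng = np.random.default_rng(1)

def naive_product_estimator(y, sample_latent, lik, S):
    """Unbiased estimate of prod_i P(Y_i = y_i | theta): for each
    observation draw a distinct iid latent sample of size S and average
    the conditional likelihoods, then take the product over i."""
    est = 1.0
    for yi in y:
        z = sample_latent(S)            # z_i^(1), ..., z_i^(S) ~ P(X | theta)
        est *= np.mean(lik(yi, z))      # (1/S) sum_s P(y_i | z_i^(s), theta)
    return est

# Toy latent structure (hypothetical, in place of (11)):
# X ~ Poisson(5), and Y | X = z ~ N(z, 1).
sample_latent = lambda S: rng.poisson(5.0, S)
lik = lambda yi, z: np.exp(-0.5 * (yi - z) ** 2) / np.sqrt(2.0 * np.pi)

y = rng.poisson(5.0, 10) + rng.normal(0.0, 1.0, 10)   # 10 toy observations
estimate = naive_product_estimator(y, sample_latent, lik, S=10**4)
```

Drawing a fresh sample of S particles per observation is exactly the S × N cost that the Lee et al. [2017] estimator avoids by sharing one sample of S particles across all observations.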

S1.2 MCMC algorithm
We infer the parameters of our model via a Metropolis-within-Gibbs algorithm [Metropolis and Ulam, 1949, Metropolis et al., 1953, Hastings, 1970]: in each iteration of the MCMC, we alternately sample the hierarchical parameters and the hyperparameters from their conditional distributions, as illustrated below. The hyperparameters Θ_1, ..., Θ_4 are sampled via a Gibbs sampler, while Θ_5 is sampled via a Metropolis-Hastings step owing to the truncated prior for κ. Note that, for µ_5 and τ_5, the proposal and target distributions are very close, so the respective acceptance rates are almost always 1. The measurement error hyperparameters are not sampled since we use a constant informative prior, as explained in Section S2.2.
After initialising the parameters in Θ and θ, we update them according to the following scheme for R iterations.
For r = 1, ..., R:

Update Θ | θ:

a) For j = 1, ..., 4 (Gibbs sampler), we sample µ_j and τ_j from their conjugate full conditional distributions, with N(µ, σ^2) being the univariate normal distribution with mean µ and variance σ^2, and G(a, b) being the gamma random variable with shape and rate parameters a and b, respectively, i.e. with mean a/b and variance a/b^2.

b) For µ_5 (Metropolis-Hastings sampler), we target the full conditional density of µ_5. We propose a move µ* and accept it with probability a(µ*, µ_5) = min(1, r), where the ratio r involves the normalising constant c(µ, τ) = 1 / ∫_1^∞ f_log-N(x | µ, 1/τ) dx, with f_log-N(· | µ, 1/τ) being the density of the log-normal distribution with mean µ and variance 1/τ, and min(a, b) is the minimum between a and b.

c) For τ_5 (Metropolis-Hastings sampler), we target the full conditional density of τ_5. We propose a move τ* ∼ G(c_5 + 4/2, d_5 + (1/2) Σ_{k=1}^{4} (log κ^(k) − µ_5)^2) and accept it with probability a(τ*, τ_5) = min(1, r).

Update θ | Θ, Y:

For k = 1, ..., 4 (Metropolis sampler):

a) We propose a change for the first block of parameters, θ_b1 = (log α_0^(k), log α_1^(k), log k_1^(k), log k_0^(k), log κ^(k)), from the normal distribution:

θ*_b1 ∼ N(θ_b1, Σ_rb1^(k)),

where Σ_rb1^(k) is the covariance matrix for the first block of parameters at the r-th iteration of the algorithm. The acceptance probability is defined as:

a(θ*, θ) = min(1, [f(y^(k) | θ*) π_b1(θ* | Θ)] / [f(y^(k) | θ) π_b1(θ | Θ)]),

where f(y^(k) | ·) is the unbiased estimate of the marginal likelihood for the k-th replicate, π_b1(θ | Θ) = ∏_{j=1}^{5} f_N(θ_j | µ_j, 1/τ_j) is the prior for the hierarchical parameters of the first block and θ* = (θ*_b1, θ_b2), with θ_b2 being the parameters of the second block, also in the logarithmic space.

b) We propose a change for the second block of parameters, θ_b2 = (log µ^(k), log σ^(k)), from the normal distribution:

θ*_b2 ∼ N(θ_b2, Σ_rb2^(k)),

where Σ_rb2^(k) is the covariance matrix for the second block of parameters at the r-th iteration of the algorithm. The acceptance probability is defined analogously, where π_b2(θ | Θ) = ∏_{j=6}^{7} f_N(θ_j | µ_j, 1/τ_j) is the prior for the hierarchical parameters of the second block and θ* = (θ_b1, θ*_b2).
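As an aside, the normalising constant c(µ, τ) of the log-normal truncated to (1, ∞) can be evaluated directly through the normal survival function, since P(X > 1) = P(log X > 0) for X ~ logN(µ, 1/τ). A minimal Python sketch (not part of the authors' code):

```python
from math import erfc, log, sqrt

def trunc_const(mu, tau):
    """Normalising constant c(mu, tau) = 1 / P(X > 1) for the log-normal
    X ~ logN(mu, 1/tau) truncated to (1, infinity).  Since
    P(X > 1) = P(log X > 0), the normal survival function gives
    P(X > 1) = 0.5 * erfc((0 - mu) * sqrt(tau) / sqrt(2))."""
    surv = 0.5 * erfc((log(1.0) - mu) * sqrt(tau) / sqrt(2.0))
    return 1.0 / surv

# With mu = 0, half of the log-normal mass lies above 1, so c = 2
# regardless of tau:
c = trunc_const(0.0, 4.0)   # c = 2.0
```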
In the first 200 iterations, we propose the hierarchical parameters from a simple random walk, where, for k = 1, ..., 4, Σ_rb1^(k) = diag(10^-6, 5) and Σ_rb2^(k) = diag(10^-6, 2), with diag(a, n) indicating the n × n diagonal matrix with a on the main diagonal. We then compute the covariance matrices of the posterior chains in the first and second block which, for the k-th replicate and at the r-th iteration of the MCMC, we define as Cov_rb1^(k) and Cov_rb2^(k), respectively. After the first 10^5 iterations, which are excluded from the analyses as burn-in, we re-compute the covariance matrices excluding the values of the first 0.5 × 10^5 iterations. The covariances are re-computed every 100 iterations until the end of the MCMC, thus respecting the diminishing adaptation requirement [Haario et al., 2001]. These covariances are used to compute the respective covariance matrices for the proposal values of the adaptive random walk (ARW). To estimate the marginal likelihood of the data, for N = 10^3 data points, we use S = 10^5 particles, since we noticed, in simulation studies, that this value led to good mixing of the posterior chains.
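The adaptation scheme above, a fixed diagonal proposal early on followed by periodic re-computation of the empirical covariance of the chain, can be sketched as follows. The target, tuning constants and refresh schedule here are illustrative, not the ones used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_rw(log_target, theta0, R, warmup=200, refresh=100, eps=1e-6):
    """Adaptive random-walk Metropolis: a fixed proposal covariance for
    the first `warmup` iterations, then the empirical covariance of the
    chain so far (plus a small diagonal jitter), refreshed every
    `refresh` iterations -- a diminishing-adaptation scheme in the
    spirit of Haario et al. [2001]."""
    d = len(theta0)
    chain = np.empty((R, d))
    theta = np.asarray(theta0, dtype=float)
    lp = log_target(theta)
    cov = 0.1 * np.eye(d)                          # fixed early proposal
    for r in range(R):
        if r >= warmup and r % refresh == 0:
            cov = np.cov(chain[:r].T) + eps * np.eye(d)
        prop = rng.multivariate_normal(theta, cov)
        lp_prop = log_target(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # symmetric proposal
            theta, lp = prop, lp_prop
        chain[r] = theta
    return chain

# Toy target (hypothetical): a standard bivariate normal.
chain = adaptive_rw(lambda t: -0.5 * float(np.sum(t ** 2)), [2.0, -2.0], R=5000)
```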
As specified in Sections 3 and 4, the algorithm is typically run for R = 6 × 10^5 iterations, of which the first 10^5 are discarded as burn-in.

Figure S1: Densities of the simulated data from the six simulation studies.

S1.3 Simulation study set-up
In this Section, we illustrate the set-up of the simulation study and the parameters used to simulate the data. We carry out six independent simulation studies; in each we sample a dataset whose structure matches the experimental data available: four hierarchical replicates of 1,000 data points each.
In each simulation study, we initially choose a set of hyperparameter values, Θ, and sample 4 independent parameter vectors, θ^(k) ∼ p(· | Θ), for k = 1, ..., 4 and j = 1, ..., 5. The last two elements of θ^(k), i.e. the measurement error mean and standard deviation, are kept fixed at their prior mode; the prior values are shown in Table S3. In the k-th replicate, given θ^(k), we independently sample the mRNA populations via equation (11); we then sample the measurement error for each of the 10^3 observations. By iterating this scheme over the 4 replicates, we obtain a simulated hierarchical dataset. The procedure is repeated independently for each of the six simulations to obtain six such simulated datasets. The densities of the simulated data for the six simulations are shown in Figure S1.
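A single simulated replicate can be sketched as below. This is an illustrative Python fragment, not the authors' R code: the parameter values are hypothetical, and the measurement equation y = κx + ε used here is an assumption for illustration (the paper's equation (12) defines the actual measurement model).

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_replicate(a0, a1, k1, k0, kappa, err_mu, err_sd, n=1000):
    """Draw n cells at stationarity: X is the sum of a Poisson(a0) count
    and a Poisson-beta count with conditional rate a1 * P, where
    P ~ Beta(k1, k0).  The measurement equation y = kappa * x + eps is
    an illustrative assumption, with eps ~ N(err_mu, err_sd^2)."""
    p = rng.beta(k1, k0, n)                  # per-cell ON-state occupancy
    x = rng.poisson(a0, n) + rng.poisson(a1 * p)
    eps = rng.normal(err_mu, err_sd, n)      # background measurement error
    return kappa * x + eps

# Hypothetical parameter values for one replicate of 1,000 cells:
y = simulate_replicate(a0=1.0, a1=50.0, k1=0.3, k0=0.7, kappa=5.0,
                       err_mu=770.0, err_sd=120.0)
```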
The hierarchical and hyperparameter values of each simulation study are reported in Tables S1 and S2, respectively; the hyperparameters were chosen to reproduce densities similar to those of the experimental data.

Table S1: Randomly drawn hierarchical parameter values used in the simulation study.
S2 Experimental data

S2.1 Measurement process

In this Section we graphically illustrate the measurement process, as described in Section 4.1, via Figure S2.

S2.2 Exploratory analysis of the measurement error
This Section focuses on a preliminary study of background noise data, which allows us to infer the measurement error parameters and, in turn, to formulate an informative prior for the error mean and standard deviation in each replicate, µ^(k) and σ^(k), k = 1, ..., 4. Since both experimental conditions belonging to the same replicate are measured in the same experiment, it is reasonable to assume, a priori, that they share the same measurement error distribution. In other words, the prior obtained from the background data of the k-th replicate is matched with the k-th replicate, for both experimental conditions, for k = 1, ..., 4. A posteriori, however, we allow all hierarchical parameters to vary in each experimental condition and replicate. In order to obtain the background noise data, we use the CRISPR/Cas9 technology to enzymatically cut out the HIV-1 env gene from the DNA of the observed cells. Then the measurement collection process is performed, identically to the other experimental conditions, as described in Section 4.1. In this way, the population of mRNA is almost absent and the observations approximately correspond to the background noise only; therefore, the measurements are taken as a proxy for the measurement error. The numbers of observations available for the current analysis are 6,257, 3,536, 5,487 and 10,891, for replicates 1 to 4, respectively. It is reasonable to assume that the distribution of the observations for the background of the k-th replicate, which we define as Z^(k), is approximately normal with mean µ^(k) and standard deviation σ^(k). Thus, the posterior densities for the parameters of Z^(k), inferred from the background data, are used to formulate informative priors for µ^(k) and σ^(k). Figure S3 shows the histograms of the background data in the four replicates, which seem to be reasonably well approximated by a normal density.
We implement a Bayesian hierarchical model to infer the measurement error parameters. The conjugate prior distribution choices are the same as in the general analysis, as specified in Section S1.2. In particular, a priori, µ^(k) ∼ logN(µ_a, 1/τ_a) and σ^(k) ∼ logN(µ_b, 1/τ_b), where logN(a, b) denotes the log-normal distribution with mean a and variance b. We assume vague prior distributions for the hyperparameters: µ_a | τ_a ∼ N(0, 10^4/τ_a), µ_b | τ_b ∼ N(0, 10^4/τ_b) and τ_a, τ_b ∼ G(0.001, 0.001).
We define the hyperparameter vector for the analysis of this Section as Θ = (Θ_a, Θ_b), where Θ_a = (µ_a, τ_a)^T and Θ_b = (µ_b, τ_b)^T. We also gather the hierarchical parameters of all replicates in µ = (µ^(1), ..., µ^(4)) and σ = (σ^(1), ..., σ^(4)). Figure S4 represents the graphical model associated with the background data.
In order to infer the parameters of this model, we employ a Metropolis-within-Gibbs algorithm [Metropolis and Ulam, 1949, Metropolis et al., 1953, Hastings, 1970], similar to the one described in Section S1.2, where the hierarchical and hyperparameters are alternately sampled from their conditional distributions. Owing to the conjugacy of the hyperprior choices, we use a Gibbs sampler to sample from the posterior distribution of the hyperparameters, conditional on the hierarchical ones:

µ_a | τ_a, µ, σ ∼ N( τ_a Σ_{k=1}^{4} log µ^(k) / (τ_a/10^4 + 4τ_a), (τ_a/10^4 + 4τ_a)^{-1} ),

τ_a | µ_a, µ, σ ∼ G( 0.001 + 4/2, 0.001 + (1/2) Σ_{k=1}^{4} (log µ^(k) − µ_a)^2 ),

with analogous updates for µ_b and τ_b based on log σ^(k). Instead, the hierarchical parameters are sampled, conditional on the observations and on the hyperparameters, µ, σ | Θ, Z, from a Metropolis sampler [Metropolis and Ulam, 1949, Metropolis et al., 1953] where the proposals follow an adaptive random walk (ARW) scheme [Haario et al., 2001]. Parameters are sampled, separately for each replicate, in the logarithmic space. In particular, the MCMC is run for 10^5 + 100 iterations from a random walk with a normal proposal before computing, in each replicate, the covariance of the chains for log µ^(k) and log σ^(k), excluding the first 10^5 values, which, at the r-th iteration of the MCMC, we call Σ_ARW^(kr). The covariances are then re-computed every 50 iterations, always excluding the initial 10^5 iterations, hence respecting the diminishing adaptation criterion [Haario et al., 2001]. These covariances are used to build the covariance matrix for the proposal values in the Metropolis algorithm, which, for the k-th replicate and at the r-th iteration is Σ_prop^(kr) = diag(10^-9, 2) + 0.01 × Σ_ARW^(kr), where diag(a, 2) indicates the 2 × 2 diagonal matrix with a on the main diagonal.

Figure S4: Graphical model for the hierarchical measurement error analysis. At the top of the graph we have the hyperparameters, Θ, which generate the hierarchical parameters; from these, the observations Z^(1), ..., Z^(4) are sampled.
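The conjugate Gibbs updates for one pair of hyperparameters can be sketched as follows. This is a Python illustration, not the authors' R implementation, of standard normal-gamma full conditionals under the vague priors µ_a | τ_a ∼ N(0, 10^4/τ_a) and τ_a ∼ G(0.001, 0.001); the input values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def gibbs_hyper(x, n_iter=2000, v0=1e4, a=0.001, b=0.001):
    """Gibbs sampler for a hypermean/hyperprecision pair given K
    replicate-level values x (e.g. log mu^(k)), under the vague priors
    mu | tau ~ N(0, v0/tau) and tau ~ G(a, b).  The mu update has
    posterior precision tau/v0 + K*tau; the tau update uses the sum of
    squared deviations of x around mu."""
    K = len(x)
    mu, tau = 0.0, 1.0
    draws = np.empty((n_iter, 2))
    for r in range(n_iter):
        prec = tau / v0 + K * tau                 # precision of mu | tau, x
        mu = rng.normal(tau * np.sum(x) / prec, 1.0 / np.sqrt(prec))
        rate = b + 0.5 * np.sum((x - mu) ** 2)
        tau = rng.gamma(a + K / 2.0, 1.0 / rate)  # numpy uses scale = 1/rate
        draws[r] = mu, tau
    return draws

# Illustrative input: four replicate-level values on the log scale.
draws = gibbs_hyper(np.array([6.65, 6.69, 6.87, 6.79]))
```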
We apply this algorithm to our background data: chains are run for 5.5 × 10^5 iterations, of which the initial 2 × 10^5 are discarded as burn-in. The posterior densities for the hierarchical parameters, in the logarithmic space, log µ^(k) and log σ^(k), are displayed in Figure S5. From each of these densities, we compute the posterior mean and standard deviation: the values are listed in Table S3. These values are used, in the analysis in Section 4, as constant informative priors for the measurement error parameters, with a distinct prior for each replicate.

S2.3 Mean and variance of P, X and Y
In this Section we show how, from the parameters in θ, it is possible to derive the mean and variance of P, X and Y, which are very useful pieces of information.

Parameter      Replicate 1      Replicate 2      Replicate 3      Replicate 4
µ^(k)_µ        6.65             6.69             6.87             6.79
µ^(k)_σ        5.80             5.74             5.98             5.72
σ^(k)_µ        5.54 × 10^-3     6.45 × 10^-3     5.34 × 10^-3     3.37 × 10^-3
σ^(k)_σ        8.80 × 10^-3     11.85 × 10^-3    9.18 × 10^-3     6.91 × 10^-3

Table S3: Posterior mean and standard deviation for log µ^(k) and log σ^(k), in replicates k = 1, ..., 4.

To keep notation simple, we drop the indicators i and k from parameters and random variables. Given the beta structure of P, as shown in (5), the mean and variance of P are equal to

E(P) = k_1 / (k_1 + k_0),   Var(P) = k_1 k_0 / [ (k_1 + k_0)^2 (k_1 + k_0 + 1) ].

Considering equations (3)-(6), and owing to the independence of A and B, we can write the mean and variance of X as E(X) = E(A) + E(B) and Var(X) = Var(A) + Var(B). Clearly E(B) = Var(B) = α_0 follows from the Poisson distribution of B, while we know that [Johnson et al., 2005]:

E(A) = α_1 E(P),   Var(A) = α_1 E(P) + α_1^2 Var(P).

From the equations above, we can easily obtain the following formulations for the mean and variance of X:

E(X) = α_0 + α_1 E(P),   Var(X) = α_0 + α_1 E(P) + α_1^2 Var(P).

Finally, following the measurement equation in (12), and given the independence between the measurement error, ε, and the mRNA population, X, the mean and variance of the observations, Y, can be obtained by combining the moments of X and ε. (S27)

Figure S6 shows the densities for the experimental data available, i.e. Y^(k), for both experimental conditions in all four replicates. Table S4 shows the average level of the observations, Σ_i y_i^(k) / N_k, in every experimental condition, for each of the four replicates k = 1, ..., 4.
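The Poisson-beta moments can be checked by Monte Carlo. The following Python sketch (parameter values hypothetical) compares the analytic mean and variance of A, where A | P ∼ Poisson(α_1 P) and P ∼ Beta(k_1, k_0), with empirical estimates; the formulas E(A) = α_1 E(P) and Var(A) = α_1 E(P) + α_1^2 Var(P) follow from the laws of total expectation and total variance.

```python
import numpy as np

rng = np.random.default_rng(11)

# Monte Carlo check of the Poisson-beta moments: for A | P ~ Poisson(a1 * P)
# with P ~ Beta(k1, k0), the laws of total expectation/variance give
#   E(A)   = a1 * E(P),
#   Var(A) = a1 * E(P) + a1**2 * Var(P).
# Parameter values below are hypothetical.
a1, k1, k0, n = 40.0, 0.5, 1.5, 10**6
p = rng.beta(k1, k0, n)
a = rng.poisson(a1 * p)

ep = k1 / (k1 + k0)                                  # E(P)   = 0.25
vp = k1 * k0 / ((k1 + k0) ** 2 * (k1 + k0 + 1))      # Var(P) = 0.0625
mean_theory = a1 * ep                                # = 10.0
var_theory = a1 * ep + a1 ** 2 * vp                  # = 110.0
```

With 10^6 draws, `a.mean()` and `a.var()` agree with the analytic values up to Monte Carlo error.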

S2.4 Additional Figures and Tables
Table S5 reports the 0.95 level highest posterior density (HPD) credible intervals (CIs) for the exponential of the hypermean parameters, which represent the posterior medians of the hierarchical parameters.

Figure S6: Densities of the experimental data, Y^(k). The black solid and red dotted lines refer to cells stimulated with 5 and 10 ng/mL of tetracycline, respectively. The four densities per image refer to the four replicates of each experimental condition.
Tables S6-S7 contain the 0.95 level HPD CIs for the hierarchical parameters and informative re-parametrizations. In particular, for every replicate, k = 1, ..., 4, we consider the following elements: the ratio between the transcription in the OFF and in the ON states, the mean and standard deviation of Y^(k), and the ratio between the variance and the mean of X. All HPD CIs are computed in R [R Core Team, 2016] via the HPDinterval function of the package coda [Plummer et al., 2016].

Table S5: 0.95 level HPD CIs for the exponential of the hypermean parameters, in both experimental conditions (5 and 10 ng/mL of tetracycline). e^µ_1, ..., e^µ_5 refer to parameters α_0, α_1, k_1, k_0, κ. "LB" and "UB" represent the limits of the HPD CIs and stand for lower bound and upper bound, respectively.

Figure S7 reports the posterior densities for the four re-parametrizations of the hierarchical parameters, with k = 1, ..., 4. It is clear that the mean and standard deviation of Y^(k) observed in the experimental data, denoted by the vertical dotted lines, always lie in a central area of the respective posterior density. It is also interesting to note that the ratio between the variance and mean of X^(k), i.e. the latent population of mRNA molecules, is very far from 1, which is the assumption of the Poisson model; in particular, the lower bounds of all 0.95 level HPD CIs are bigger than 10. Figure S8 plots, for each replicate and experimental condition, the densities obtained in 100 simulations from 100 posterior values of the parameters (represented in green). They closely match the densities of the real data from the corresponding experimental condition (in black), indicating that the inferred parameter values, and the model used, are able to reproduce patterns very similar to the experimental data. Note that the plotted curves are not CIs; nevertheless, they provide a visual indication of how well the model can reproduce data compatible with the observed data.
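The HPD CIs are obtained with coda's HPDinterval in R; essentially the same computation, the shortest interval containing the required fraction of the sorted posterior draws, appropriate for unimodal samples, can be sketched in Python as follows (an illustration, not the authors' code):

```python
import numpy as np

def hpd_interval(samples, prob=0.95):
    """Empirical HPD interval for a unimodal posterior sample: the
    shortest window containing `prob` of the sorted draws."""
    x = np.sort(np.asarray(samples))
    n = len(x)
    m = int(np.ceil(prob * n))            # number of draws inside the window
    widths = x[m - 1:] - x[: n - m + 1]   # all candidate window widths
    j = int(np.argmin(widths))            # leftmost shortest window
    return x[j], x[j + m - 1]

# Sanity check on a standard normal sample: the 0.95 level HPD interval
# should be close to (-1.96, 1.96).
rng = np.random.default_rng(5)
lo, hi = hpd_interval(rng.normal(0.0, 1.0, 100_000))
```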
Figure S9 shows the posterior densities for the hyperprecision parameters, τ_1, ..., τ_5, referring to the kinetic parameters α_0, α_1, k_1, k_0 and κ.
In Figures S10-S11, we plot, for both experimental conditions, the posterior chains, excluding burn-in, for the mean and standard deviation of Y^(k), i.e. µ_Y^(k) and σ_Y^(k).

Figure S8: Densities for the data, Y^(k), in each replicate k = 1, ..., 4: in green, the densities for data simulated from 100 posterior values of the parameters; in black, the density for the corresponding experimental data. The bottom and top four plots refer to the cells induced with 5 and 10 ng/mL of tetracycline, respectively.

Figure S9: Posterior densities for the hyperprecision parameters, τ_1, ..., τ_5, referring to the kinetic parameters α_0, α_1, k_1, k_0 and κ. The black solid and red dotted lines refer to cells stimulated with 5 and 10 ng/mL of tetracycline, respectively.

Figure S11: Posterior chains for the mean and standard deviation of Y^(k), k = 1, ..., 4, for cells induced with 10 ng/mL of tetracycline. The horizontal green lines represent the empirical mean and standard deviation observed in the respective sample. The chains are thinned: only one value in every 100 is plotted.