Abstract

Motivation: Gene regulatory networks (GRNs) are stochastic, thus, do not have attractors, but can remain in confined regions of the state space, i.e. the ‘noisy attractors’, which define the cell type and phenotype.

Results: We propose a gamma-Bernoulli mixture model clustering algorithm (ΓBMM), tailored for quantizing states from gamma and Bernoulli distributed data, to determine the noisy attractors of stochastic GRN. ΓBMM uses multiple data sources, naturally selects the number of states and can be extended to other parametric distributions according to the number and type of data sources available. We apply it to protein and RNA levels, and promoter occupancy state of a toggle switch and show that it can be bistable, tristable or monostable depending on its internal noise level. We show that these results are in agreement with the patterns of differentiation of model cells whose pathway choice is driven by the switch. We further apply ΓBMM to a model of the MeKS module of Bacillus subtilis, and the results match experimental data, demonstrating the usability of ΓBMM.

Availability: Implementation software is available upon request.

Contact:andre.sanchesribeiro@tut.fi; xiaofeng.dai@tut.fi

Supplementary information:Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

In the deterministic framework, it was hypothesized that cell types are the attractors of gene regulatory network (GRN) dynamics (Kauffman, 1969). To be biologically plausible, the attractors need to be confined patterns of gene expression since even in the simple case of a Boolean network with 20 000 nodes, its state space size is 220000 (Kauffman, 1993). An important criticism to this hypothesis (Aldana et al., 2003) is that noise renders attractors a poor model of cell types. This is important given the stochasticity of gene expression (Arkin et al., 1998). To address this, the concept of ‘ergodic set’ was proposed as a set of states from which the GRN, once entering, does not leave for a time longer than the cell lifetime in the absence of external perturbations (Ribeiro and Kauffman, 2007). An ergodic set can be composed of one or several ‘noisy attractors’—the set of states where the GRN spends most time while on that ergodic set. Stochastic models of GRN (Ribeiro et al., 2006) can remain in confined regions of the state space for long period of time, in agreement with the hypothesis that ergodic sets corresponds to cell types (Ribeiro and Kauffman, 2007). One example of noise-driven transitions between noisy attractors might be the observed transient, and the probabilistic transition between two phenotypes in Bacillus subtilis which is controlled by the MeKS module (Suel et al., 2006).

While techniques to find noisy attractors and ergodic sets in deterministic Boolean networks are known (Ribeiro and Kauffman, 2007), in stochastic GRN this is not a simple task. In Ribeiro and Kauffman (2007), the ‘states’ composing the noisy attractors of a stochastic GRN were found by binarizing the proteins' temporal expression levels with K-means clustering algorithm (MacQueen, 1967). This is not the most appropriate approach since the number of clusters was predefined, possibly causing information loss. Stochastic GRN's ‘logic’ can be more complex than binary [e.g. the two genes of a toggle switch (TS) can express at half its maximum rate, simultaneously, for long periods of time (Ribeiro, 2008)]. Also, observing the proteins' level alone might not be sufficient to fully capture the behavior of GRN [e.g. miRNAs are known to have a regulatory role (Zhang et al., 2007)]. Hence, one problem that needs to be addressed is how to define noisy attractors of a stochastic GRN in a realistic biological setting. For this, the key questions are what features of the GRN dynamics should be taken into account to determine whether a GRN is in a noisy attractor, and how to reliably identify noisy attractors from the GRN dynamics given the choice of features to consider.

In the delayed stochastic model of GRN (Ribeiro et al., 2006), able to match gene expression at the single molecule level (Yu et al., 2006; Zhu et al., 2007), what changes with time are the number of proteins and RNAs, and genes' promoter occupancy state if transcription factors can bind to their target genes' promoters and affect transcription. Here, we use these three sources of information to determine noisy attractors. While proteins (Elowitz et al., 2002) and RNA levels (Golding et al., 2005) can be measured in real time, the promoters' occupancy states are not directly measurable, currently. However, it can be detected indirectly in vivo when a transcription factor is bound to a gene's promoter region (King et al., 2007; Thompson et al., 2004), which is one assumption made here.

Inspired by the joint mixture model clustering framework used in gene clustering (Dai et al., 2008), we built a gamma-Bernoulli mixture model (ΓBMM) to quantize and detect noisy attractors of stochastic GRNs from multiple data sources. The algorithm accounts gamma and Bernoulli distributed data simultaneously and quantizes their joint effort into state(s), assuming that protein and RNA levels are the summation (including subtraction) of several exponential processes, i.e. gamma distributions, and that the promoter occupancy state, which is composed of binary values, is of Bernoulli distribution. Each distribution for a gene is modeled as an independent event, and the quantization is done based on the multiple levels to make the detection of changes in the dynamics of the GRN more robust than when observing protein levels alone. The number of noisy attractors are determined by model selection, where four approximation-based criteria are compared: the Akaike information criterion (AIC) (Akaike, 1974), a modified AIC (AIC3) (Biernacki and Govaert, 1999; Bozdogan, 1987), Bayesian information criterion (BIC) (Pan, 2006; Schwarz, 1978) and integrated classification likelihood-BIC (ICL-BIC) (Ji et al., 2005). Since BIC and ICL performed better and in a similar way, they are employed here.

To define noisy attractors as biologically meaningful entities in cell dynamics, we assume that by being in a certain noisy attractor the cell will opt for a specific differentiation pathway, or adopt a phenotype out of several possible choices (Kauffman, 1969). We improve the previous work (Ribeiro and Kauffman, 2007) by introducing a more robust clustering algorithm that uses data fusion to detect noisy attractors from the GRN dynamics and validating its results by adding a stochastic model of cell differentiation.

In this study, we apply ΓBMM to the TS and GRN that drives differentiation in Bacillus subtilis, the MeKS module (Suel et al., 2006), and validate the results by confronting them with the cell-type distributions resulting from the differentiation. The results show that the clustering algorithm is robust to high levels of noise in GRN dynamics by using multiple data sources to determine the noisy attractors, and that the modeling strategy provides a good match with experimentally measured differentiation patterns. The results provide a better understanding of how stochastic GRNs can regulate their dynamics to determine cell population differentiation patterns and how multi-modal distributions of phenotypic expression can emerge from the stochastic nature of gene expression and regulation.

2 METHODS

2.1 Mixture model-based clustering and expectation maximization algorithm

In the model-based clustering methods, each observation oj, where j = 1,…, n and n is the number of genes, is drawn from a finite mixture distribution with the prior probability πi, component-specific distribution fi(g) and its parameters θi. The formula is (Mclachlan and Peel, 2000)  

formula
where θ = {(πi, θi) : i = 1,…, g} denotes all unknown parameters, with the restriction that 0 < πi ≤ 1 for any i and ∑i=1g πi = 1. Note that g is the number of components in this model. From here onwards, we ignore the superscript (g) from fi(g) for simplicity.

In ΓBMM, we define θ = [π, θ1, θ2]T, π = [π1,…, πg]T, θ1 = [α11,…, αgp1, β11,…, βgp1]T and θ2 = [q11,…, qgp2]T, where p1 and p2 each represents the dimension of the observations in gamma and Bernoulli mixture models, respectively. X and Y denote the observations of gamma and Bernoulli distributed data, respectively, function f(x, y) as the density function of gamma and Bernoulli distribution and o = [xT, yT]T.

ΓBMM is a joint mixture model of gamma and Bernoulli distributions, which assumes that for each component i the data of the two distributions are independent. Each component is assumed to be the product of p1 independent gamma distributions, whose probability density function is defined as  

(1)
formula
where θ1i = [αi1,…, αip1, βi1,…, βip1] and x = [x1,…, xp1]T. Each component is assumed to follow a Bernoulli distribution, with the probability density function for each gene defined as (θ2i = [qi1,…, qip2])  
(2)
formula

The expectation maximization (EM) algorithm is applied to estimate the parameters θ iteratively. Equation (3) is the data log-likelihood (natural logarithm is referred throughout the article), given O = {oj : j = 1,…, n} (its direct maximization, however, is difficult):  

(3)
formula

To make the maximization of Equation (3) tractable, the problem is casted in the framework of incomplete data. Since we assume that data from different distributions are independent, Lc can be factored as  

formula
If we define cj ∈ {1,…, g} as the clustering membership of oj, then the complete data log-likelihood can be written as  
formula
where χ(cj = i) is the indicator function of whether oj is from the i-th component or not. In the EM algorithm, E-step computes the expectation of the complete data log-likelihood  
(4)
formula
where θ(m) represents the parameters estimated in the m-th iteration. By computing the expectation, according to Bayes rule, Equation (4) becomes,  
(5)
formula
where τji(m) is the estimated posterior probability of oj from component i at iteration m 
formula
We can assign each oj to its components based on {i0ji0 = maxiτji}. Equations (4) and (5) show that our assumption of the gamma and Bernoulli distributed data being independent carries over to the expected log-likelihood as well. To derive the closed form or numerical optimization formula for updating parameters in ΓBMM, we used Lagrange multipliers to solve this constrained optimization problem, with the Lagrangian shown in Equation (6) as  
(6)
formula

Parameters of the ΓMM part, α and β, can be optimized by Newton–Raphson method. Let θ1i = (αi, βi), then the parameters are updated by  

formula
with the constraint θ1i1, where H−1(θ1i(m)) is the Hessian matrix evaluated at θ1i(m) and ℒ(θ1i(m)) is the Lagrangian function of Q(θ1i(m)). H−1 and ∇θ1 ℒ(θ) can be obtained by plugging Equation (1) in Equation (6) and taking the derivatives:  
formula
and then forming  
formula
Ψ and Ψ′ represent the digamma and trigamma functions, respectively, which are the first and second logarithmic derivatives of the gamma function. Parameter updating of q is done by plugging Equation (2) in Equation (6), and taking the derivative of q 
formula
After setting forumla, we have: forumla.

Optimization of the prior probability of each gene clustering membership, π, can be derived by  

formula
and since  
formula
we have forumla.

2.2 Model selection criteria

Four approximation-based model selection criteria, BIC (Pan, 2006; Schwarz, 1978), ICL (Ji et al., 2005), AIC (Akaike, 1974; Biernacki and Govaert, 1999) and AIC3 (Biernacki and Govaert, 1999; Bozdogan, 1987), are compared in ΓBMM, from which the best performing one is chosen. Calculations for the above criteria are defined as  

formula
where d is the number of free parameters, M is the total amount of data (M = ∑w=1WMw, Mw is the size of the dataset w and W is the number of input datasets) and −2 ∑j=1ni=1g τji log(τji) is the estimated entropy of the fuzzy classification matrix Cji = (τji) (Ji et al., 2005). The number of free parameters d in ΓBMM is dRB = 2p1g + p2g + g − 1, since there are p1g-free αius, p1g-free βius, p2g-free qivs and (g − 1)-free πis.

2.3 Data generation

We use SGN Sim (Ribeiro and Lloyd-Price, 2007) to simulate the models of GRNs. GRNs consist of genes coupled via protein–protein and protein–operator sites interactions. Transcription and translation are multiple time-delayed events (Ribeiro et al., 2006). The dynamics is driven by the ‘delayed SSA’ (Roussel and Zhu, 2006), a modified version of the Stochastic Simulation Algorithm (SSA) (Gillespie, 1977). The GRN models studied here (and how to implement them in SGN Sim) are given in the Supplementary Material.

The time series of proteins, RNAs and promoter binding states are used as the inputs of ΓBMM. The outputs are the number of noisy attractors and the noisy attractor membership and log-likelihood of each data point (Fig. 1).

Fig. 1.

Input and output of ΓBMM. NA, noisy attractor.

Fig. 1.

Input and output of ΓBMM. NA, noisy attractor.

3 RESULTS AND DISCUSSIONS

We first compared the performance of ΓBMM with its component models, ΓMM and BMM, using controllable gamma and Bernoulli distributed data (data and results in the Supplementary Material). After observing an advantage in using multiple data sources compared with using single data sources, we used ΓBMM to detect the noisy attractors of two GRNs, TS (Gardner et al., 2000) and the MeKS module, which is an excitable genetic circuit of B.subtilis (Suel et al., 2006).

3.1 The TS

The TS consists of reactions 7–12 (Ribeiro, 2008), where i ∈ {1, 2} (when only index i is present), or i, j ∈ {1, 2} with ij (when both indices are present). Transcription (reaction 7) and translation (reaction 8) are modeled by time-delayed reactions (Ribeiro et al., 2006) where Proi is the promoter of gene i, and Rp is an RNA polymerase. The delays (τi) account for the time duration of the processes involved in transcription and translation and model how long it takes for a product to be produced after the reaction has occurred. Each promoter Proi controls the expression of an RNA, Ri, i.e. once its ribosome binding site is formed, can start being translated by a ribosome Rib into a protein, Pi. Reaction 11 models the binding and unbinding of the repressor protein to the other gene promoter, defining the TS. Proteins decay via reactions 9, 10 and 12.  

(7)
formula
 
(8)
formula
 
(9)
formula
 
(10)
formula
 
(11)
formula
 
(12)
formula

Values for delays (fixed in all simulations) are extracted from measurements (Yu et al., 2006). τ1 accounts for the transition from the closed promoter complex to the open promoter complex which is included in the transcription initiation process (McClure, 1980). We set τ1 = 40 s, within the range from 10 s to several minutes (McClure, 1980). The length of the gene tsr-venus driven by a Lac promoter studied in Escherichia coli is 2500 nt. Since the average rate of transcriptional elongation in E.coli is ∼ 50 nt/s, it follows that τ2 is, on average, 90 s (= τ1 + 2500/50). The time of the ribosome binding site clearance in translation initiation, τ3, is set to 2 s, according to experimental estimations (Draper, 1996). The average translation rate is ∼15 amino acids/s, thus τ4 = τ3 + 2500 nt/(45 nt/s) = 58 s. The post-translational protein assembly process takes 420± 140 s in Yu et al. (2006), thus the delay of Pi is drawn from a Gaussian distribution with a mean of τ5 and SD of τ5std each time the reaction occurs. All delays, except for τ5, are assumed constant (Zhu et al., 2007) since the measurements suggest that they are not highly variable between transcription events of a single gene (McClure, 1980). For example, τ1, although varies from one transcription event to another, measurements on an unrepressed Lac promoter (Lutz et al., 2001) suggest that it follows a Gaussian distribution with a mean of 40 s and a SD of 4 s. We tested the effects of variable delays following a normal distribution with the same mean and comparably small SDs (with respect to the mean), and the TS dynamics was not significantly affected, which means, qualitatively, our conclusions hold when using a model with variable delays.

Reaction rates (in s−1) set following the TS model in Zhu et al. (2007), are built to reproduce the dynamics of the genetic TS synthesized in E.coli (Gardner et al., 2000). Specifically, ktr = 0.00042, rbsd = 0.01, krep = 1, and kunrep = 0.1. Given these values, the transcription rate kt and protein decay rate kd are varied as discussed later.

Each ‘cell’ simulation is initialized without proteins (P1, P2) or RNAs (R1, R2), and with one promoter of each gene (Pro1, Pro2), 40 RNA polymerases (Rp), and 100 ribosomes (Rib). Values are set according to Zhu et al. (2007).

The noise level of TS can be varied by multiplying the transcription rate kt and protein decay rate kd with a pair of factors, while imposing equal mean values of (P1 + P2) to ensure that the noise varies from varying the toggling rate alone (Ribeiro, 2008). Both these rates are evolvable and vary from gene to gene. The initiation rate is sequence dependent, while degradation rates are tunable and vary from one protein to another (Bernstein et al., 2002). The noise level affects the TS bistability. Excess of noise makes the TS monostable, with both genes expressing simultaneously (if transcription is more ‘propense’ than the reaction by which the promoter is repressed) (Ribeiro, 2008). Thus, the TS model allows testing the clustering algorithm for various levels of noise.

To test the ability of ΓBMM in quantizing gene expression levels, we simulate nine ‘cell populations’. Each cell consists of a TS model. Cells of different populations have TSs with identical mean level of proteins, but different noise level which is defined as the SD over the mean of (P1 + P2) time series. Each cell population has a unique pair of multiplication factors {ν1, ω1,…, ν9, ω9} of the values of kt and kd, respectively. These multiplication factor pairs cannot be varied by constant amounts each time to attain equal mean protein levels due to the delay τ1 at each transcription event, which accounts for the promoter occupancy time by an Rp (McClure, 1980). We first fix ν's, where ν = 1 is the ‘baseline’ and four inverse pairs are arranged on each side in ascending order; then ω's are searched by an heuristic method such that the mean value of (P1 + P2) does not differ >4% among the cells of different populations. The nine pairs of multiplicative values of kt and kd are (0.01, 0.2768), (0.1, 1.0935), (0.25, 1.25), (0.5, 1.2), (1, 1), (2, 0.9226), (4, 1.08), (10, 1.73) and (100, 3.4599), respectively. Originally, kt = 0.01 and kd = 0.001.

We generate 1000 cells for each population with SGN Sim. Given 106 s as the cell lifetime, we sampled the state with a sampling frequency of 1000 s. This unrealistic lifetime allows better statistics without affecting the results qualitatively (alternatively one could follow the dynamics of cells for many generations). For each population, we quantize protein, RNA and promoter binary binding data of a cell by ΓBMM, and use the log-likelihood [Equation (3)] of each cluster within each noisy attractor and their summation to visualize the number of noisy attractors and the ergodic set's stability.

Note that, for this study, we simulate nine cell populations, each of which contains 1000 identically initialized cells; and cells of different population differ. The variation of parameter values across cell populations allows testing the robustness of the clustering algorithm under various degrees of GRN's internal noise. The simulation of many initially identical cells, each with the same internal parameter values and amount of each substance, is necessary to estimate the GRN's differentiation pattern for a given set of parameter values, given the stochastic dynamics of the genetic networks driven by the delayed SSA. These patterns are then confronted with the quantization results for validation purposes.

Log-likelihood distribution of the ergodic set for nine cell populations are shown in Figure 2a, where each log-likelihood is the sum of each state of each noisy attractor. The higher the log-likelihood is the more stable the quantized states are. Also shown is the noise in cells of the nine populations (Fig. 2b). The TS noise increases as kt and kd increase due to the increase of toggling frequency, and then decreases due to the loss of bistability (Ribeiro, 2008). The turning point of log-likelihood (population 4, Fig. 2a) corresponds to the loss of bistability and maximum noise (Fig. 2b).

Fig. 2.

Log-likelihood and noise distributions for nine TSs. The x-axis represents the nine cell populations, y-axis shows the (a) summation of the log-likelihood of all the data points and (b) noise of each population.

Fig. 2.

Log-likelihood and noise distributions for nine TSs. The x-axis represents the nine cell populations, y-axis shows the (a) summation of the log-likelihood of all the data points and (b) noise of each population.

The quantization results of each noisy attractor are shown in Supplementary Figure S2, with the most representative ones (populations 1, 3, 4 and 9) shown in Figure 3. Note that there are two noisy attractors (of the same ergodic set) in population 1 (Fig. 3a), which starts toggling as noise increases (Fig. 3b), begins loosing its bistability in cells of population 4 (Fig. 3c) and becomes monostable beyond this point (Fig. 3d). Importantly, Figure 3c shows that cells of population 4 can be in three noisy attractors. The magenta and blue dots each represent one noisy attractor, and the red and green circles together represent a third noisy attractor. To observe the changes in the number of attractors of the TS as noise increases, we draw a transition graph (Fig. 4) by defining that two noisy attractors are distinct (and not a single noisy attractor) if the difference between their medians is larger than the mean of their SDs. The noisy attractors, defined as the set of states where the GRN spends most time when on an ergodic set, correspond to the states where sufficient data points share similar log-likelihood as shown in Figure 4. The result, which agrees with Figure 3, shows that as noise increases, the TS changes from bistable to tristable and to monostable, verifying the claim in Ribeiro (2008) that the TS dynamical regime changes with noise level, and showing the suitability of ΓBMM to characterize the dynamics of stochastic GRNs.

Fig. 3.

Quantization results of ΓBMM for cell populations (a) 1, (b) 3, (c) 4 and (d) 9, respectively. NA, noisy attractor. ‘Ensemble’ is the ensemble of the time series of proteins, RNAs and promoter binding states. The x-axis is time (unit is × 103), and y-axis is log-likelihood of each quantized data point.

Fig. 3.

Quantization results of ΓBMM for cell populations (a) 1, (b) 3, (c) 4 and (d) 9, respectively. NA, noisy attractor. ‘Ensemble’ is the ensemble of the time series of proteins, RNAs and promoter binding states. The x-axis is time (unit is × 103), and y-axis is log-likelihood of each quantized data point.

Fig. 4.

Noisy attractors detected by ΓBMM in cells of each population. The number of bars in each of the nine populations equals the number of different noisy attractors detected (x-axis). The summation of the log-likelihood of the data points in each noisy attractor is shown in the y-axis.

Fig. 4.

Noisy attractors detected by ΓBMM in cells of each population. The number of bars in each of the nine populations equals the number of different noisy attractors detected (x-axis). The summation of the log-likelihood of the data points in each noisy attractor is shown in the y-axis.

To verify the existence of these regimes, we consider the TS as a differentiation pathway switch (Monod and Jacob, 1961) by adding in model cells reactions 13–16. These are activated at the end of a cell's lifetime and the amount of P1 and P2 in the cell at that moment will stochastically determine the pathway (‘Cell0’ represents an undifferentiated cell). For example, if only P1 is present when these reactions become active, the cell will most likely differentiate into cell type 1 via reaction 13, if P1 and P2 are both present with significant amounts, Cell0 will most likely differentiate into cell type 3 (reaction 15) and if both proteins are absent, the cell chooses pathway 4 (reaction 16).  

(13)
formula
 
(14)
formula
 
(15)
formula
 
(16)
formula

Figure 5 shows the distribution of differentiation pathway choices of 1000 cells of four of the nine populations (complete results in Supplementary Fig. S3). As seen in Figure 5b and c, cells of populations 4 and 5 each have three almost equally likely differentiation pathways, which accord well with the three noisy attractors found in cells of these populations (Fig. 4). Also in agreement, cells differentiate mostly into one or two type(s) when the TS only has one or two noisy attractor(s), respectively (Fig. 5d and a).

Fig. 5.

Distribution of the differentiated cell types after 1 000 000 s of cell populations (a) 1, (b) 4, (c) 5 and (d) 9. The x-axis represents the cell types, e.g. ‘1’ means ‘cell type 1’, and y-axis shows the number of cells differentiated into each cell type. Each cell population has 1000 cells.

Fig. 5.

Distribution of the differentiated cell types after 1 000 000 s of cell populations (a) 1, (b) 4, (c) 5 and (d) 9. The x-axis represents the cell types, e.g. ‘1’ means ‘cell type 1’, and y-axis shows the number of cells differentiated into each cell type. Each cell population has 1000 cells.

While the bistable and monostable regimes are visually detectable from the protein time series, the tristable regime is not, as seen in Supplementary Figure S4c and d, where there is no visible difference between the TS dynamics of cells of populations 3 and 4 although the transition from bistable to tristable occurs at that point (Fig. 3b and c). This is because the ‘state’, where both proteins express simultaneously, is unstable in the deterministic setting. Interestingly, for a given noise level, this unstable state is the most commonly occupied region of the state space by the TS (Fig. 5c).

3.2 An excitable circuit of B.subtilis

Under nutrient limitations, a minority of the cells of a population of B.subtilis becomes competent for DNA uptake, while the majority commits to sporulation (Suel et al., 2006). Cells can switch probabilistically and transiently between competence and vegetative growth via a ‘MeKS’ module (Fig. 6) (Suel et al., 2006). In this module, protein ComK activates genes needed for competence. Protein ComS induces competence by allowing accumulation of ComK protein that indirectly enables the self-activation of gene comK by inhibiting ComK degradation by protein complex MecA.

Fig. 6.

MeKS module. Square and oval nodes represent proteins (or genes). The arrow head represents activation and the inverse arrow repression.

Fig. 6.

MeKS module. Square and oval nodes represent proteins (or genes). The arrow head represents activation and the inverse arrow repression.

We model the ‘MeKS’ module in the delayed stochastic modeling framework following the ordinary differential equation (ODE) model proposed in Suel et al. (2006) (both models are described in detail in the Supplementary Material). The rate constants and the initial amounts of all substances are heuristically tuned to match both the ODE model dynamics and the measurements of the time series of proteins PS and PG reported in Suel et al. (2006). Time delays are set as previously described for the transcription and translation reactions in the TS, since such information is not available for this genetic circuit.

Each cell is initialized with 100 MecA, 1 Proi but not Pi or Ri, where ‘i’ stands for ‘S’, ‘K’ and ‘G’ for proteins, and represents ‘S’ and ‘K’ for RNAs and promoters. We simulated 103 cells for 100 h and sampled each time series at each 103 s. The time series of proteins ComS and ComG of one cell is shown in Figure 7a, where a negative correlation is apparent between them, in agreement with the experimental and modeling results in Suel et al. (2006).

Fig. 7.

Validation of the MeKS model. (a) Time series of ComS and ComG in one cell. (b) Distribution of differentiated cell types (of a total of 1000 cells). The x-axis is (a) time (unit is × 105) and (b) cell types after differentiation. The y-axis are (a) protein expression levels, and (b) number of cells.

Fig. 7.

Validation of the MeKS model. (a) Time series of ComS and ComG in one cell. (b) Distribution of differentiated cell types (of a total of 1000 cells). The x-axis is (a) time (unit is × 105) and (b) cell types after differentiation. The y-axis are (a) protein expression levels, and (b) number of cells.

To validate the model we add reactions 18 and 17 to each cell to drive differentiation. Sporulation (incompetent cells) depends on the amounts of proteins ComS and ComG, and all cells differentiate during their lifetime. CellS and CellG represent the incompetent and competent cells, respectively.  

(17)
formula
 
(18)
formula

Figure 7b shows the distribution of competent and incompetent cells (1000 cells in total). Of the cells, 4% are competent as shown in the experiments (3.6± 0.7%) (Suel et al., 2006). We applied ΓBMM to protein, RNA and promoter binding time series of comS and comK, since these are the genes regulating the transient differentiation. The quantization of one of the time series is shown in Figure 8, where two noisy attractors are observed (in this case the competent state takes 4.04% of the sampled data). From a sample of 20 cells, 3.06–4.52% of the time is spent by the cells on the competent state, in the range of the experimental results, i.e. 2.9–4.3% (Suel et al., 2006), and in accord with the fraction of competent cells in the differentiation model.

Fig. 8.

Quantization results from ΓBMM for the MeKS module. Time in x-axis, unit is × 103, and log-likelihood of each data point in y-axis.

Fig. 8.

Quantization results from ΓBMM for the MeKS module. Time in x-axis, unit is × 103, and log-likelihood of each data point in y-axis.

4 CONCLUSIONS

In pluricellular organisms, different cell types have identical GRNs. These, although being stochastic (Arkin et al., 1998) and subject to external perturbations, can stably maintain their type once fully differentiated. Reverse differentiation and metaplasias, while possible, are rare, indicating that cell types are ergodic sets. If the hypothesis of ergodic sets as cell types holds true, then from the dynamics of GRN, one ought to be able to predict patterns of differentiation and distributions of phenotypic expression of cell populations.

Here, we used a clustering method, ΓBMM, to detect noisy attractors of GRN from their dynamics. ΓBMM uses multiple time series of the GRN, and its applications to the TS and the MeKS module matched the patterns of cell differentiation. Data fusion improved the matching, suggesting that the proteins levels alone might be insufficient to characterize the dynamics of stochastic GRNs.

The method is applicable to the analysis of real cells since the data required is experimentally attainable and the modeling strategy of GRNs (Ribeiro et al., 2006) used to test it matches gene expression measurements at the single RNA and protein levels (Yu et al., 2006; Zhu et al., 2007) and the model of differentiation of the MeKS module matched the measurements as well. While 1000 time series were used to test the accuracy of the algorithm, once tested, this amount of data is not needed in subsequent applications. The algorithm can then be applied to real time series directly, requiring only moderate amounts of experimental data.

ACKNOWLEDGEMENTS

We thank Härri Lähdesmaki for reviewing the clustering algorithm.

Funding: Tampere Graduate School in Information Science and Engineering (X.D.); Academy of Finland, projects no. 213462 (Finnish Centre of Excellence program 2006-11) and project no. 126803 (to A.S.R.); the FiDiPro programme of Finnish Funding Agency for Technology and Innovation (40284/08, to X.D. and A.S.R.).

Conflict of Interest: none declared.

REFERENCES

Akaike
H
A new look at the statistical identification model
IEEE Trans. Automat. Control
 , 
1974
, vol. 
19
 (pg. 
716
-
723
)
Aldana
M
, et al.  . 
Kaplan
E
Boolean dynamics with random couplings
Perspectives and Problems in Nonlinear Science
 , 
2003
Berlin
Springer Applied Mathematical Sciences Series
Arkin
A
, et al.  . 
Stochastic kinetic analysis of developmental pathway bifurcation in phage λ-infected Escherichia coli cells
Genetics
 , 
1998
, vol. 
149
 (pg. 
1633
-
1648
)
Bernstein
J
, et al.  . 
Global analysis of mRNA decay and abundance in Escherichia coli at single-gene resolution using two-color fluorescent DNA microarrays
Proc. Natl Acad. Sci. USA
 , 
2002
, vol. 
99
 (pg. 
9697
-
9702
)
Biernacki
C
Govaert
G
Choosing models in model-based clustering and discriminant analysis
J. Stat. Comput. Simul.
 , 
1999
, vol. 
64
 (pg. 
49
-
71
)
Bozdogan
H
Model selection and Akaike information criterion (AIC): the general theory and its analytic extensions
Psychometrika
 , 
1987
, vol. 
52
 (pg. 
345
-
370
)
Dai
XF
, et al.  . 
BGMM: a Beta-Gaussian mixture model for clustering genes with multiple data sources
WCSB 2008
 , 
2008
(pg. 
25
-
28
)
Draper
DE
Translation initiation
Escherichia coli and Salmonella
 , 
1996
Washington, DC
ASM Press
(pg. 
902
-
908
)
Elowitz
MB
, et al.  . 
Stochastic gene expression in a single cell
Science
 , 
2002
, vol. 
297
 (pg. 
1183
-
1186
)
Fraley
C
Raftery
AE
Model-based clustering, discriminant analysis, and density estimation
J. Am. Stat. Assoc.
 , 
2002
, vol. 
97
 (pg. 
611
-
631
)
Gardner
T
, et al.  . 
Construction of a genetic toggle switch in Escherichia coli
Nature
 , 
2000
, vol. 
403
 (pg. 
339
-
342
)
Golding
I
, et al.  . 
Real-time kinetics of gene activity in individual bacteria
Cell
 , 
2005
, vol. 
123
 (pg. 
1025
-
1036
)
Gillespie
DT
Exact stochastic simulation of coupled chemical reactions
J. Phys. Chem.
 , 
1977
, vol. 
81
 (pg. 
2340
-
2361
)
Huang
S
, et al.  . 
Cell fate as a high-dimensional attractor of a complex gene regulatory network
Phys. Rev. Lett.
 , 
2005
, vol. 
94
 pg. 
128701
 
Kauffman
SA
Metabolic stability and epigenesis in randomly constructed genetic nets
J. Theor. Biol.
 , 
1969
, vol. 
22
 (pg. 
437
-
467
)
Kauffman
SA
The Origins of Order
 , 
1993
New York
Oxford University Press
King
K
, et al.  . 
A high-throughput microfluidic real-time gene Expression living cell array
Lab. Chip.
 , 
2007
, vol. 
7
 (pg. 
77
-
85
)
Ji
Y
, et al.  . 
Applications of beta-mixture models in bioinformatics
Bioinformatics
 , 
2005
, vol. 
21
 (pg. 
2118
-
2122
)
Lutz
R
, et al.  . 
Dissecting the functional program of Escherichia coli promoters: the combined mode of action of Lac repressor and AraC activator
Nucleic Acids Res.
 , 
2001
, vol. 
29
 (pg. 
3873
-
3881
)
MacQueen
JB
Some methods for classification and analysis of multivariate observations
Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley
 , 
1967
, vol. 
1
 
Berkeley and Los Angeles, CA
University of California Press
(pg. 
281
-
297
)
McClure
WR
Rate-limiting steps in RNA chain initiation
Proc. Natl Acad. Sci. USA
 , 
1980
, vol. 
77
 (pg. 
5634
-
5638
)
Mclachlan
G
Peel
D
Finite Mixture Models
 , 
2000
New York
John Wiley & Sons
Monod
J
Jacob
F
Teleonomic mechanisms in cellular metabolism, growth, and differentiation
Cold Spring Harb. Symp. Quant. Biol.
 , 
1961
, vol. 
26
 (pg. 
389
-
401
)
Pan
W
Incorporating gene functions as priors in model-based clustering of microarray gene expression data
Bioinformatics
 , 
2006
, vol. 
22
 (pg. 
795
-
801
)
Ribeiro
AS
, et al.  . 
A general modelling strategy for gene regulatory networks with stochastic dynamics
J. Comput. Biol.
 , 
2006
, vol. 
13
 (pg. 
1630
-
1639
)
Ribeiro
AS
Kauffman
SA
Noisy attractors and ergodic sets in models of gene regulatory networks
J. Theor. Biol.
 , 
2007
, vol. 
247
 (pg. 
743
-
755
)
Ribeiro
AS
Lloyd-Price
J
SGN Sim, a Stochastic Genetic Networks Simulator
Bioinformatics
 , 
2007
, vol. 
23
 (pg. 
777
-
779
)
Ribeiro
AS
Dynamics and evolution of stochastic bistable gene networks with sensing in fluctuating environments
Phys. Rev. E
 , 
2008
, vol. 
78
 pg. 
061902
 
Roussel
M
Zhu
R
Validation of an algorithm for delay stochastic simulation of transcription and translation in prokaryotic gene expression
Phys. Biol.
 , 
2006
, vol. 
3
 (pg. 
274
-
284
)
Schwarz
G
Estimating the dimension of a model
Ann. Stat.
 , 
1978
, vol. 
6
 (pg. 
461
-
464
)
Suel
GM
, et al.  . 
An excitable gene regulatory circuit induces transient cellular differentiation
Nature
 , 
2006
, vol. 
440
 (pg. 
545
-
550
)
Thompson
D
, et al.  . 
Dynamic gene expression profiling using a microfabricated living cell array
Anal. Chem.
 , 
2004
, vol. 
76
 (pg. 
4098
-
4103
)
Yu
J
, et al.  . 
Probing gene expression in live cells, one protein molecule at a time
Science
 , 
2006
, vol. 
311
 (pg. 
1600
-
1603
)
Zhang
B
, et al.  . 
MicroRNAs and their regulatory roles in animals and plants
J. Cell Physiol.
 , 
2007
, vol. 
210
 (pg. 
279
-
289
)
Zhu
R
, et al.  . 
Studying genetic regulatory networks at the molecular level: delayed reaction stochastic models
J. Theor. Biol.
 , 
2007
, vol. 
246
 (pg. 
725
-
745
)

Author notes

Associate Editor: Trey Ideker

Comments

0 Comments