## Abstract

**Motivation:** Gene regulatory networks (GRNs) are stochastic, thus, do not have attractors, but can remain in confined regions of the state space, i.e. the ‘noisy attractors’, which define the cell type and phenotype.

**Results:** We propose a gamma-Bernoulli mixture model clustering algorithm (ΓBMM), tailored for quantizing states from gamma and Bernoulli distributed data, to determine the noisy attractors of stochastic GRN. ΓBMM uses multiple data sources, naturally selects the number of states and can be extended to other parametric distributions according to the number and type of data sources available. We apply it to protein and RNA levels, and promoter occupancy state of a toggle switch and show that it can be bistable, tristable or monostable depending on its internal noise level. We show that these results are in agreement with the patterns of differentiation of model cells whose pathway choice is driven by the switch. We further apply ΓBMM to a model of the MeKS module of *Bacillus subtilis*, and the results match experimental data, demonstrating the usability of ΓBMM.

**Availability:** Implementation software is available upon request.

**Contact:**andre.sanchesribeiro@tut.fi; xiaofeng.dai@tut.fi

**Supplementary information:**Supplementary data are available at *Bioinformatics* online.

## 1 INTRODUCTION

In the deterministic framework, it was hypothesized that cell types are the attractors of gene regulatory network (GRN) dynamics (Kauffman, 1969). To be biologically plausible, the attractors need to be confined patterns of gene expression since even in the simple case of a Boolean network with 20 000 nodes, its state space size is 2^{20000} (Kauffman, 1993). An important criticism to this hypothesis (Aldana *et al*., 2003) is that noise renders attractors a poor model of cell types. This is important given the stochasticity of gene expression (Arkin *et al*., 1998). To address this, the concept of ‘ergodic set’ was proposed as a set of states from which the GRN, once entering, does not leave for a time longer than the cell lifetime in the absence of external perturbations (Ribeiro and Kauffman, 2007). An ergodic set can be composed of one or several ‘noisy attractors’—the set of states where the GRN spends most time while on that ergodic set. Stochastic models of GRN (Ribeiro *et al*., 2006) can remain in confined regions of the state space for long period of time, in agreement with the hypothesis that ergodic sets corresponds to cell types (Ribeiro and Kauffman, 2007). One example of noise-driven transitions between noisy attractors might be the observed transient, and the probabilistic transition between two phenotypes in *Bacillus subtilis* which is controlled by the MeKS module (Suel *et al*., 2006).

While techniques to find noisy attractors and ergodic sets in deterministic Boolean networks are known (Ribeiro and Kauffman, 2007), in stochastic GRN this is not a simple task. In Ribeiro and Kauffman (2007), the ‘states’ composing the noisy attractors of a stochastic GRN were found by binarizing the proteins' temporal expression levels with K-means clustering algorithm (MacQueen, 1967). This is not the most appropriate approach since the number of clusters was predefined, possibly causing information loss. Stochastic GRN's ‘logic’ can be more complex than binary [e.g. the two genes of a toggle switch (TS) can express at half its maximum rate, simultaneously, for long periods of time (Ribeiro, 2008)]. Also, observing the proteins' level alone might not be sufficient to fully capture the behavior of GRN [e.g. miRNAs are known to have a regulatory role (Zhang *et al*., 2007)]. Hence, one problem that needs to be addressed is how to define noisy attractors of a stochastic GRN in a realistic biological setting. For this, the key questions are what features of the GRN dynamics should be taken into account to determine whether a GRN is in a noisy attractor, and how to reliably identify noisy attractors from the GRN dynamics given the choice of features to consider.

In the delayed stochastic model of GRN (Ribeiro *et al*., 2006), able to match gene expression at the single molecule level (Yu *et al*., 2006; Zhu *et al*., 2007), what changes with time are the number of proteins and RNAs, and genes' promoter occupancy state if transcription factors can bind to their target genes' promoters and affect transcription. Here, we use these three sources of information to determine noisy attractors. While proteins (Elowitz *et al*., 2002) and RNA levels (Golding *et al*., 2005) can be measured in real time, the promoters' occupancy states are not directly measurable, currently. However, it can be detected indirectly *in vivo* when a transcription factor is bound to a gene's promoter region (King *et al*., 2007; Thompson *et al*., 2004), which is one assumption made here.

Inspired by the joint mixture model clustering framework used in gene clustering (Dai *et al*., 2008), we built a gamma-Bernoulli mixture model (ΓBMM) to quantize and detect noisy attractors of stochastic GRNs from multiple data sources. The algorithm accounts gamma and Bernoulli distributed data simultaneously and quantizes their joint effort into state(s), assuming that protein and RNA levels are the summation (including subtraction) of several exponential processes, i.e. gamma distributions, and that the promoter occupancy state, which is composed of binary values, is of Bernoulli distribution. Each distribution for a gene is modeled as an independent event, and the quantization is done based on the multiple levels to make the detection of changes in the dynamics of the GRN more robust than when observing protein levels alone. The number of noisy attractors are determined by model selection, where four approximation-based criteria are compared: the Akaike information criterion (AIC) (Akaike, 1974), a modified AIC (AIC3) (Biernacki and Govaert, 1999; Bozdogan, 1987), Bayesian information criterion (BIC) (Pan, 2006; Schwarz, 1978) and integrated classification likelihood-BIC (ICL-BIC) (Ji *et al*., 2005). Since BIC and ICL performed better and in a similar way, they are employed here.

To define noisy attractors as biologically meaningful entities in cell dynamics, we assume that by being in a certain noisy attractor the cell will opt for a specific differentiation pathway, or adopt a phenotype out of several possible choices (Kauffman, 1969). We improve the previous work (Ribeiro and Kauffman, 2007) by introducing a more robust clustering algorithm that uses data fusion to detect noisy attractors from the GRN dynamics and validating its results by adding a stochastic model of cell differentiation.

In this study, we apply ΓBMM to the TS and GRN that drives differentiation in *Bacillus subtilis*, the MeKS module (Suel *et al*., 2006), and validate the results by confronting them with the cell-type distributions resulting from the differentiation. The results show that the clustering algorithm is robust to high levels of noise in GRN dynamics by using multiple data sources to determine the noisy attractors, and that the modeling strategy provides a good match with experimentally measured differentiation patterns. The results provide a better understanding of how stochastic GRNs can regulate their dynamics to determine cell population differentiation patterns and how multi-modal distributions of phenotypic expression can emerge from the stochastic nature of gene expression and regulation.

## 2 METHODS

### 2.1 Mixture model-based clustering and expectation maximization algorithm

In the model-based clustering methods, each observation **o**_{j}, where *j* = 1,…, *n* and *n* is the number of genes, is drawn from a finite mixture distribution with the prior probability π_{i}, component-specific distribution *f*_{i}^{(g)} and its parameters θ_{i}. The formula is (Mclachlan and Peel, 2000)

**θ**= {(π

_{i}, θ

_{i}) :

*i*= 1,…,

*g*} denotes all unknown parameters, with the restriction that 0 < π

_{i}≤ 1 for any

*i*and ∑

_{i=1}

^{g}π

_{i}= 1. Note that

*g*is the number of components in this model. From here onwards, we ignore the superscript (

*g*) from

*f*

_{i}

^{(g)}for simplicity.

In ΓBMM, we define **θ** = [π, **θ**_{1}, **θ**_{2}]^{T}, π = [π_{1},…, π_{g}]^{T}, **θ**_{1} = [α_{11},…, α_{gp1}, β_{11},…, β_{gp1}]^{T} and **θ**_{2} = [*q*_{11},…, *q*_{gp2}]^{T}, where *p*_{1} and *p*_{2} each represents the dimension of the observations in gamma and Bernoulli mixture models, respectively. *X* and *Y* denote the observations of gamma and Bernoulli distributed data, respectively, function *f*(**x**, **y**) as the density function of gamma and Bernoulli distribution and **o** = [**x**^{T}, **y**^{T}]^{T}.

ΓBMM is a joint mixture model of gamma and Bernoulli distributions, which assumes that for each component *i* the data of the two distributions are independent. Each component is assumed to be the product of *p*_{1} independent gamma distributions, whose probability density function is defined as

_{1i}= [α

_{i1},…, α

_{ip1}, β

_{i1},…, β

_{ip1}] and

**x**= [

*x*

_{1},…,

*x*

_{p1}]

^{T}. Each component is assumed to follow a Bernoulli distribution, with the probability density function for each gene defined as (θ

_{2i}= [

*q*

_{i1},…,

*q*

_{ip2}])

The expectation maximization (EM) algorithm is applied to estimate the parameters **θ** iteratively. Equation (3) is the data log-likelihood (natural logarithm is referred throughout the article), given *O* = {**o**_{j} : *j* = 1,…, *n*} (its direct maximization, however, is difficult):

To make the maximization of Equation (3) tractable, the problem is casted in the framework of incomplete data. Since we assume that data from different distributions are independent, *L*_{c} can be factored as

*c*

_{j}∈ {1,…,

*g*} as the clustering membership of

**o**

_{j}, then the complete data log-likelihood can be written as where χ(

*c*

_{j}=

*i*) is the indicator function of whether

**o**

_{j}is from the

*i*-th component or not. In the EM algorithm, E-step computes the expectation of the complete data log-likelihood where θ

^{(m)}represents the parameters estimated in the

*m*-th iteration. By computing the expectation, according to Bayes rule, Equation (4) becomes, where τ

_{ji}

^{(m)}is the estimated posterior probability of

**o**

_{j}from component

*i*at iteration

*m*We can assign each

**o**

_{j}to its components based on {

*i*

_{0}|τ

_{ji0}= max

_{i}τ

_{ji}}. Equations (4) and (5) show that our assumption of the gamma and Bernoulli distributed data being independent carries over to the expected log-likelihood as well. To derive the closed form or numerical optimization formula for updating parameters in ΓBMM, we used Lagrange multipliers to solve this constrained optimization problem, with the Lagrangian shown in Equation (6) as

Parameters of the ΓMM part, α and β, can be optimized by Newton–Raphson method. Let **θ**_{1i} = (α_{i}, β_{i}), then the parameters are updated by

**θ**

_{1i}≥

**1**, where

*H*

^{−1}(

**θ**

_{1i}

^{(m)}) is the Hessian matrix evaluated at

**θ**

_{1i}

^{(m)}and ℒ(

**θ**

_{1i}

^{(m)}) is the Lagrangian function of

*Q*(

**θ**

_{1i}

^{(m)}).

*H*

^{−1}and ∇

_{θ1}ℒ(θ) can be obtained by plugging Equation (1) in Equation (6) and taking the derivatives: and then forming Ψ and Ψ′ represent the digamma and trigamma functions, respectively, which are the first and second logarithmic derivatives of the gamma function. Parameter updating of

*q*is done by plugging Equation (2) in Equation (6), and taking the derivative of

*q*After setting , we have: .

Optimization of the prior probability of each gene clustering membership, π, can be derived by

and since we have .### 2.2 Model selection criteria

Four approximation-based model selection criteria, BIC (Pan, 2006; Schwarz, 1978), ICL (Ji *et al*., 2005), AIC (Akaike, 1974; Biernacki and Govaert, 1999) and AIC3 (Biernacki and Govaert, 1999; Bozdogan, 1987), are compared in ΓBMM, from which the best performing one is chosen. Calculations for the above criteria are defined as

*d*is the number of free parameters,

*M*is the total amount of data (

*M*= ∑

_{w=1}

^{W}

*M*

_{w},

*M*

_{w}is the size of the dataset

*w*and

*W*is the number of input datasets) and −2 ∑

_{j=1}

^{n}∑

_{i=1}

^{g}τ

_{ji}log(τ

_{ji}) is the estimated entropy of the fuzzy classification matrix

*C*

_{ji}= (τ

_{ji}) (Ji

*et al*., 2005). The number of free parameters

*d*in ΓBMM is

*d*

_{RB}= 2

*p*

_{1}

*g*+

*p*

_{2}

*g*+

*g*− 1, since there are

*p*

_{1}

*g*-free α

_{iu}s,

*p*

_{1}

*g*-free β

_{iu}s,

*p*

_{2}

*g*-free

*q*

_{iv}s and (

*g*− 1)-free π

_{i}s.

### 2.3 Data generation

We use *SGN Sim* (Ribeiro and Lloyd-Price, 2007) to simulate the models of GRNs. GRNs consist of genes coupled via protein–protein and protein–operator sites interactions. Transcription and translation are multiple time-delayed events (Ribeiro *et al*., 2006). The dynamics is driven by the ‘delayed SSA’ (Roussel and Zhu, 2006), a modified version of the Stochastic Simulation Algorithm (SSA) (Gillespie, 1977). The GRN models studied here (and how to implement them in *SGN Sim*) are given in the Supplementary Material.

The time series of proteins, RNAs and promoter binding states are used as the inputs of ΓBMM. The outputs are the number of noisy attractors and the noisy attractor membership and log-likelihood of each data point (Fig. 1).

## 3 RESULTS AND DISCUSSIONS

We first compared the performance of ΓBMM with its component models, ΓMM and BMM, using controllable gamma and Bernoulli distributed data (data and results in the Supplementary Material). After observing an advantage in using multiple data sources compared with using single data sources, we used ΓBMM to detect the noisy attractors of two GRNs, TS (Gardner *et al*., 2000) and the MeKS module, which is an excitable genetic circuit of *B.subtilis* (Suel *et al*., 2006).

### 3.1 The TS

The TS consists of reactions 7–12 (Ribeiro, 2008), where *i* ∈ {1, 2} (when only index *i* is present), or *i*, *j* ∈ {1, 2} with *i* ≠ *j* (when both indices are present). Transcription (reaction 7) and translation (reaction 8) are modeled by time-delayed reactions (Ribeiro *et al*., 2006) where *Pro*_{i} is the promoter of gene *i*, and *R*_{p} is an RNA polymerase. The delays (τ_{i}) account for the time duration of the processes involved in transcription and translation and model how long it takes for a product to be produced after the reaction has occurred. Each promoter *Pro*_{i} controls the expression of an RNA, *R*_{i}, i.e. once its ribosome binding site is formed, can start being translated by a ribosome *Rib* into a protein, *P*_{i}. Reaction 11 models the binding and unbinding of the repressor protein to the other gene promoter, defining the TS. Proteins decay via reactions 9, 10 and 12.

Values for delays (fixed in all simulations) are extracted from measurements (Yu *et al*., 2006). τ_{1} accounts for the transition from the closed promoter complex to the open promoter complex which is included in the transcription initiation process (McClure, 1980). We set τ_{1} = 40 s, within the range from 10 s to several minutes (McClure, 1980). The length of the gene *tsr-venus* driven by a *Lac* promoter studied in *Escherichia coli* is 2500 nt. Since the average rate of transcriptional elongation in *E.coli* is ∼ 50 nt/s, it follows that τ_{2} is, on average, 90 s (= τ_{1} + 2500/50). The time of the ribosome binding site clearance in translation initiation, τ_{3}, is set to 2 s, according to experimental estimations (Draper, 1996). The average translation rate is ∼15 amino acids/s, thus τ_{4} = τ_{3} + 2500 nt/(45 nt/s) = 58 s. The post-translational protein assembly process takes 420± 140 s in Yu *et al*. (2006), thus the delay of *P*_{i} is drawn from a Gaussian distribution with a mean of τ_{5} and SD of τ_{5std} each time the reaction occurs. All delays, except for τ_{5}, are assumed constant (Zhu *et al*., 2007) since the measurements suggest that they are not highly variable between transcription events of a single gene (McClure, 1980). For example, τ_{1}, although varies from one transcription event to another, measurements on an unrepressed *Lac* promoter (Lutz *et al*., 2001) suggest that it follows a Gaussian distribution with a mean of 40 s and a SD of 4 s. We tested the effects of variable delays following a normal distribution with the same mean and comparably small SDs (with respect to the mean), and the TS dynamics was not significantly affected, which means, qualitatively, our conclusions hold when using a model with variable delays.

Reaction rates (in *s*^{−1}) set following the TS model in Zhu *et al*. (2007), are built to reproduce the dynamics of the genetic TS synthesized in *E.coli* (Gardner *et al*., 2000). Specifically, *k*_{tr} = 0.00042, rbsd = 0.01, *k*_{rep} = 1, and *k*_{unrep} = 0.1. Given these values, the transcription rate *k*_{t} and protein decay rate *k*_{d} are varied as discussed later.

Each ‘cell’ simulation is initialized without proteins (*P*_{1}, *P*_{2}) or RNAs (*R*_{1}, *R*_{2}), and with one promoter of each gene (*Pro*_{1}, *Pro*_{2}), 40 RNA polymerases (*R**p*), and 100 ribosomes (*Rib*). Values are set according to Zhu *et al*. (2007).

The noise level of TS can be varied by multiplying the transcription rate *k*_{t} and protein decay rate *k*_{d} with a pair of factors, while imposing equal mean values of (*P*_{1} + *P*_{2}) to ensure that the noise varies from varying the toggling rate alone (Ribeiro, 2008). Both these rates are evolvable and vary from gene to gene. The initiation rate is sequence dependent, while degradation rates are tunable and vary from one protein to another (Bernstein *et al*., 2002). The noise level affects the TS bistability. Excess of noise makes the TS monostable, with both genes expressing simultaneously (if transcription is more ‘propense’ than the reaction by which the promoter is repressed) (Ribeiro, 2008). Thus, the TS model allows testing the clustering algorithm for various levels of noise.

To test the ability of ΓBMM in quantizing gene expression levels, we simulate nine ‘cell populations’. Each cell consists of a TS model. Cells of different populations have TSs with identical mean level of proteins, but different noise level which is defined as the SD over the mean of (*P*_{1} + *P*_{2}) time series. Each cell population has a unique pair of multiplication factors {ν_{1}, ω_{1},…, ν_{9}, ω_{9}} of the values of *k*_{t} and *k*_{d}, respectively. These multiplication factor pairs cannot be varied by constant amounts each time to attain equal mean protein levels due to the delay τ_{1} at each transcription event, which accounts for the promoter occupancy time by an *R**p* (McClure, 1980). We first fix ν's, where ν = 1 is the ‘baseline’ and four inverse pairs are arranged on each side in ascending order; then ω's are searched by an heuristic method such that the mean value of (*P*_{1} + *P*_{2}) does not differ >4% among the cells of different populations. The nine pairs of multiplicative values of *k*_{t} and *k*_{d} are (0.01, 0.2768), (0.1, 1.0935), (0.25, 1.25), (0.5, 1.2), (1, 1), (2, 0.9226), (4, 1.08), (10, 1.73) and (100, 3.4599), respectively. Originally, *k*_{t} = 0.01 and *k*_{d} = 0.001.

We generate 1000 cells for each population with *SGN Sim*. Given 10^{6} s as the cell lifetime, we sampled the state with a sampling frequency of 1000 s. This unrealistic lifetime allows better statistics without affecting the results qualitatively (alternatively one could follow the dynamics of cells for many generations). For each population, we quantize protein, RNA and promoter binary binding data of a cell by ΓBMM, and use the log-likelihood [Equation (3)] of each cluster within each noisy attractor and their summation to visualize the number of noisy attractors and the ergodic set's stability.

Note that, for this study, we simulate nine cell populations, each of which contains 1000 identically initialized cells; and cells of different population differ. The variation of parameter values across cell populations allows testing the robustness of the clustering algorithm under various degrees of GRN's internal noise. The simulation of many initially identical cells, each with the same internal parameter values and amount of each substance, is necessary to estimate the GRN's differentiation pattern for a given set of parameter values, given the stochastic dynamics of the genetic networks driven by the delayed SSA. These patterns are then confronted with the quantization results for validation purposes.

Log-likelihood distribution of the ergodic set for nine cell populations are shown in Figure 2a, where each log-likelihood is the sum of each state of each noisy attractor. The higher the log-likelihood is the more stable the quantized states are. Also shown is the noise in cells of the nine populations (Fig. 2b). The TS noise increases as *k*_{t} and *k*_{d} increase due to the increase of toggling frequency, and then decreases due to the loss of bistability (Ribeiro, 2008). The turning point of log-likelihood (population 4, Fig. 2a) corresponds to the loss of bistability and maximum noise (Fig. 2b).

The quantization results of each noisy attractor are shown in Supplementary Figure S2, with the most representative ones (populations 1, 3, 4 and 9) shown in Figure 3. Note that there are two noisy attractors (of the same ergodic set) in population 1 (Fig. 3a), which starts toggling as noise increases (Fig. 3b), begins loosing its bistability in cells of population 4 (Fig. 3c) and becomes monostable beyond this point (Fig. 3d). Importantly, Figure 3c shows that cells of population 4 can be in three noisy attractors. The magenta and blue dots each represent one noisy attractor, and the red and green circles together represent a third noisy attractor. To observe the changes in the number of attractors of the TS as noise increases, we draw a transition graph (Fig. 4) by defining that two noisy attractors are distinct (and not a single noisy attractor) if the difference between their medians is larger than the mean of their SDs. The noisy attractors, defined as the set of states where the GRN spends most time when on an ergodic set, correspond to the states where sufficient data points share similar log-likelihood as shown in Figure 4. The result, which agrees with Figure 3, shows that as noise increases, the TS changes from bistable to tristable and to monostable, verifying the claim in Ribeiro (2008) that the TS dynamical regime changes with noise level, and showing the suitability of ΓBMM to characterize the dynamics of stochastic GRNs.

To verify the existence of these regimes, we consider the TS as a differentiation pathway switch (Monod and Jacob, 1961) by adding in model cells reactions 13–16. These are activated at the end of a cell's lifetime and the amount of *P*_{1} and *P*_{2} in the cell at that moment will stochastically determine the pathway (‘*Cell*_{0}’ represents an undifferentiated cell). For example, if only *P*_{1} is present when these reactions become active, the cell will most likely differentiate into cell type 1 via reaction 13, if *P*_{1} and *P*_{2} are both present with significant amounts, *Cell*_{0} will most likely differentiate into cell type 3 (reaction 15) and if both proteins are absent, the cell chooses pathway 4 (reaction 16).

Figure 5 shows the distribution of differentiation pathway choices of 1000 cells of four of the nine populations (complete results in Supplementary Fig. S3). As seen in Figure 5b and c, cells of populations 4 and 5 each have three almost equally likely differentiation pathways, which accord well with the three noisy attractors found in cells of these populations (Fig. 4). Also in agreement, cells differentiate mostly into one or two type(s) when the TS only has one or two noisy attractor(s), respectively (Fig. 5d and a).

While the bistable and monostable regimes are visually detectable from the protein time series, the tristable regime is not, as seen in Supplementary Figure S4c and d, where there is no visible difference between the TS dynamics of cells of populations 3 and 4 although the transition from bistable to tristable occurs at that point (Fig. 3b and c). This is because the ‘state’, where both proteins express simultaneously, is unstable in the deterministic setting. Interestingly, for a given noise level, this unstable state is the most commonly occupied region of the state space by the TS (Fig. 5c).

### 3.2 An excitable circuit of *B.subtilis*

Under nutrient limitations, a minority of the cells of a population of *B.subtilis* becomes competent for DNA uptake, while the majority commits to sporulation (Suel *et al*., 2006). Cells can switch probabilistically and transiently between competence and vegetative growth via a ‘MeKS’ module (Fig. 6) (Suel *et al*., 2006). In this module, protein ComK activates genes needed for competence. Protein ComS induces competence by allowing accumulation of ComK protein that indirectly enables the self-activation of gene *comK* by inhibiting ComK degradation by protein complex MecA.

We model the ‘MeKS’ module in the delayed stochastic modeling framework following the ordinary differential equation (ODE) model proposed in Suel *et al*. (2006) (both models are described in detail in the Supplementary Material). The rate constants and the initial amounts of all substances are heuristically tuned to match both the ODE model dynamics and the measurements of the time series of proteins *P*_{S} and *P*_{G} reported in Suel *et al*. (2006). Time delays are set as previously described for the transcription and translation reactions in the TS, since such information is not available for this genetic circuit.

Each cell is initialized with 100 MecA, 1 *Pro*_{i} but not *P*_{i} or *R*_{i}, where ‘*i*’ stands for ‘S’, ‘K’ and ‘G’ for proteins, and represents ‘S’ and ‘K’ for RNAs and promoters. We simulated 10^{3} cells for 100 h and sampled each time series at each 10^{3} s. The time series of proteins ComS and ComG of one cell is shown in Figure 7a, where a negative correlation is apparent between them, in agreement with the experimental and modeling results in Suel *et al*. (2006).

To validate the model we add reactions 18 and 17 to each cell to drive differentiation. Sporulation (incompetent cells) depends on the amounts of proteins ComS and ComG, and all cells differentiate during their lifetime. *Cell*_{S} and *Cell*_{G} represent the incompetent and competent cells, respectively.

Figure 7b shows the distribution of competent and incompetent cells (1000 cells in total). Of the cells, 4% are competent as shown in the experiments (3.6± 0.7%) (Suel *et al*., 2006). We applied ΓBMM to protein, RNA and promoter binding time series of *comS* and *comK*, since these are the genes regulating the transient differentiation. The quantization of one of the time series is shown in Figure 8, where two noisy attractors are observed (in this case the competent state takes 4.04% of the sampled data). From a sample of 20 cells, 3.06–4.52% of the time is spent by the cells on the competent state, in the range of the experimental results, i.e. 2.9–4.3% (Suel *et al*., 2006), and in accord with the fraction of competent cells in the differentiation model.

## 4 CONCLUSIONS

In pluricellular organisms, different cell types have identical GRNs. These, although being stochastic (Arkin *et al*., 1998) and subject to external perturbations, can stably maintain their type once fully differentiated. Reverse differentiation and metaplasias, while possible, are rare, indicating that cell types are ergodic sets. If the hypothesis of ergodic sets as cell types holds true, then from the dynamics of GRN, one ought to be able to predict patterns of differentiation and distributions of phenotypic expression of cell populations.

Here, we used a clustering method, ΓBMM, to detect noisy attractors of GRN from their dynamics. ΓBMM uses multiple time series of the GRN, and its applications to the TS and the MeKS module matched the patterns of cell differentiation. Data fusion improved the matching, suggesting that the proteins levels alone might be insufficient to characterize the dynamics of stochastic GRNs.

The method is applicable to the analysis of real cells since the data required is experimentally attainable and the modeling strategy of GRNs (Ribeiro *et al*., 2006) used to test it matches gene expression measurements at the single RNA and protein levels (Yu *et al*., 2006; Zhu *et al*., 2007) and the model of differentiation of the MeKS module matched the measurements as well. While 1000 time series were used to test the accuracy of the algorithm, once tested, this amount of data is not needed in subsequent applications. The algorithm can then be applied to real time series directly, requiring only moderate amounts of experimental data.

## ACKNOWLEDGEMENTS

We thank Härri Lähdesmaki for reviewing the clustering algorithm.

*Funding*: Tampere Graduate School in Information Science and Engineering (X.D.); Academy of Finland, projects no. 213462 (Finnish Centre of Excellence program 2006-11) and project no. 126803 (to A.S.R.); the FiDiPro programme of Finnish Funding Agency for Technology and Innovation (40284/08, to X.D. and A.S.R.).

*Conflict of Interest*: none declared.

## Comments