## Abstract

**Motivation:** Signaling events that direct mouse embryonic stem (ES) cell self-renewal and differentiation are complex and accordingly difficult to understand in an integrated manner. We address this problem by adapting a Bayesian network learning algorithm to model proteomic signaling data for ES cell fate responses to external cues. Using this model we were able to characterize the signaling pathway influences as quantitative, logic-circuit type interactions. Our experimental dataset includes measurements for 28 signaling protein phosphorylation states across 16 different factorial combinations of cytokine and matrix stimuli as reported previously.

**Results:** The Bayesian network modeling approach allows us to uncover previously reported signaling activities related to mouse ES cell self-renewal, such as the roles of LIF and STAT3 in maintaining undifferentiated ES cell populations. Furthermore, the network predicts novel influences such as between ERK phosphorylation and differentiation, or RAF phosphorylation and differentiated cell proliferation. Visualization of the influences detected by the Bayesian network provides intuition about the underlying physiology of the signaling pathways. We demonstrate that the Bayesian networks can capture the linear, nonlinear and multistate logic interactions that connect extracellular cues, intracellular signals and consequent cell functional responses.

**Availability:** Datasets and software are available online from http://sysbio.engin.umich.edu/~pwoolf/mouseES/

**Contact:**pwoolf@umich.edu

**Supplementary information:**http://sysbio.engin.umich.edu/~pwoolf/mouseES/

## INTRODUCTION

Large datasets for biological systems are becoming increasingly available due to ongoing improvements in high-throughput measurement methods. However, similarly improved computational methods are needed if we are to gain useful insights from these datasets. Among prominent examples of this current challenge in biology is in understanding how extracellular cues influence highly interconnected and complex cell signaling pathways to yield cell behavioral responses (Kitano, 2002).

Current high-throughput experimental protocols allow us to measure an increasingly large number of variables describing cells or cellular populations. For example, large-scale phosphorylation screens (Lund-Johansen *et al*., 2000) protein–protein binding screens (Cagney *et al*., 2000) and migration assays (Maliakal, 2002) have demonstrated that it is possible to simultaneously measure the activity of tens to thousands of variables in a biological system. Future advances in microfluidics (Burns *et al*., 1998) and high-throughput mass spectrometry (Morand *et al*., 2001) for example, promise even more measurements will be possible at a fraction of their current cost.

Analysis of these data has generally focused on identifying a small number of pathway components involved in governing a particular cell behavior. Growing interest is focusing on how to use these multivariate data to determine integrated pathways in, for instance, signal transduction (Gardner *et al*., 2003) and metabolic pathways (Price *et al*., 2003). Early examples of these computational algorithms, however, were often hampered by unrealistic assumptions about the underlying biological system, e.g. linear interactions and processes.

In this work, we have chosen to use a more flexible modeling tool called a Bayesian network (BN) to reverse engineer protein-level signal transduction cascades. BNs can be broadly described as probabilistic models that depict causal relationships between variables (Jensen, 1998; Pearl, 1988; Pearl, 2000; Neapolitan, 2003). These networks were able to capture both linear and nonlinear interactions between groups of variables, while greatly reducing the number of parameters required to describe the model in comparison with other modeling approaches. The reason for this reduction in parameters is due to the Markov condition. The Markov condition states that in order to predict the value of some variable, *X*, we only need to know the values of the variables that directly influence *X*. Other, non-descendant variables to *X* are independent and as such can be treated separately.

An additional advantage of BNs is that they are probabilistic. Probabilistic type models treat measured quantities as uncertain estimates, and as such are able to tolerate measurement noise in an automated and systematic way (Neapolitan, 2003).

Here, BN network structures representing cellular processes were learnt from experimental data. In general, learning from the data has been proven to be an NP-complete problem, meaning that computation time increases exponentially with problem size (Chickering, 1996). The result of learning a BN from the data is a directed graph that is often directly interpretable and can be used to make quantitative predictions of outcomes and error estimates. General reviews of BN analysis can be found elsewhere (Jensen, 1998; Heckerman, 1999). In other fields of science, technology and business, BNs have demonstrated their ability to uncover useful causal relationships in spite of large and noisy datasets (Heckerman, 1999). For analysis of biological signaling data, although, the BN-based modeling approach is relatively new (Hartemink *et al*., 2002; Friedman, 2003; Beer and Tavazoie, 2004). Earlier efforts have been hampered by insufficient data, difficulties in data preprocessing and the large computational demands. Here, we partially overcome the first two issues by presenting a novel resampling/discretization scheme that allows us to integrate data from many experiments into a single, coherent dataset. We address computational speed issues via parallelization. Recent proof of principle studies applying BNs to the analysis of protein signaling pathways suggested that BNs may be capable of addressing this challenging class of reverse engineering problems (Sachs *et al*., 2002).

Accordingly, in this work we show how BNs can be gainfully used to analyze a proteomic dataset characterizing signaling events that affect stem cell differentiation and proliferation responses to extracellular stimuli. Our problem offers an excellent test for this approach, since some but not all of the key signaling pathway influences are known. Therefore, this dataset simultaneously provides us with a way to validate our BN predictions and also produces novel hypotheses for future experiments.

As an experimental system, we focus here on mouse embryonic stem (ES) cell self-renewal and differentiation. ES cells are pluripotent, meaning that they are able to differentiate into any cell type in the adult body. However, there is currently a little understanding of how to control ES cell fate decisions. A small number of important factors have been identified, but an integrated picture of how different factors work together to govern fate decisions has not emerged.

Mouse ES cells can be maintained undifferentiated indefinitely in culture by adding leukemia inhibitory factor (LIF) to the growth media. LIF binds to the heterodimer of LIFR and gp130 to activate both the JAK/signal transducer and activator of transcription (STAT) and MAPK pathways (Ernst *et al*., 1996). Although the roles of these signaling pathways are not fully understood, it has been demonstrated that STAT3 is required for ES cell self-renewal (Niwa *et al*., 1998; Matsuda *et al*., 1999; Raz *et al*., 1999). Upon removal of LIF, these cells rapidly differentiate, implying that LIF is sufficient to maintain these cells in an undifferentiated state. In contrast, LIF does not maintain human ES cells in an undifferentiated state. Clearly, it is thus necessary to go beyond identifying single cytokines and/or matrix cues that are effective under some conditions and not others, and pursue an objective of identifying signaling pathways that proximally govern cell behavioral responses.

To facilitate our identification of which network interactions are responsible for ES cell fate decisions, we previously generated an experimental dataset that includes 49 measurements of the phosphorylation states of 28 cellular signaling proteins, along with cell proliferation and differentiation responses, across a landscape of stimulatory extracellular matrix and cytokine cues (Prudhomme *et al*., 2004a). In addition to the effects of LIF, this dataset includes accompanying effects of fibroblast growth factor-4 (FGF4), fibronectin (FN) and/or laminin (LAM). This multivariate ‘cue-signal-response’ approach is well suited for applying a BN methodology toward the goal of more comprehensive understanding which network pathway interactions and influences are involved in maintaining mouse ES cells in an undifferentiated state.

## SYSTEMS AND METHODS

### Experimental procedures

A primary objective of this work is to determine the signaling pathway influences that are responsible for mouse ES cell self-renewal versus differentiation fate responses to extracellular stimuli. Toward this end, a factorial screen was conducted to test the effects of LIF, FN, LAM and FGF4 on ES cell growth and differentiation. For each condition, differentiation was measured using Oct4 as a differentiation marker, and proliferation was measured by seeding single cells and counting total cell numbers. Measurement of most signaling protein phosphorylation states were performed using western blot analysis from the KPSS 1.1 screen from Kinexus (Vancouver, Canada). These data have been reported previously (Prudhomme *et al*., 2004a). We also undertook additional western blot measurement of STAT3 phosphorylation on tyrosine 705 by standard methods (Prudhomme, 2003). Complete datasets are available online at http://sysbio.engin.umich.edu/pwoolf/mouseES/data.

Using the model by (Prudhomme *et al*., 2004b) we determined rate constants for differentiated and undifferentiated cell growth and for differentiation by fitting our data to the following two state model:Here U is the undifferentiated cell population, D is the differentiated cell population, *k*_{U} and *k*_{D} are proliferation rate constants for undifferentiated and differentiated cells and *k*_{C} is the rate constant governing differentiation. This model is expressed in the following system of coupled ordinary differential equations:

*k*

_{U},

*k*

_{C}and

*k*

_{D}) and calculated the error associated with that value via nonlinear regression. Error estimates were later used to define a sampling distribution for each parameter. Overall, the differential equation model fit the experimental data with an average

*r*

^{2}value of 0.61. These parameter values in turn were then used to derive instantaneous rates of undifferentiated and differentiated cell growth and differentiation (

*R*

_{U},

*R*

_{D}and

*R*

_{C}, respectively) for each time point and condition. These instantaneous rates were then used as the primary readouts of cellular behavior. Other readouts, such as the total number of cells (

*N*

_{f}) and the fraction of differentiated cells (D) were used as secondary measures of cellular behavior.

The dataset is summarized in Table 1 and in further detail in Supplementary Table 1.

### Data preprocessing

To make subsequent analysis computationally tractable, the raw data were converted from continuous to discrete values by using an approach similar to resampling. In this approach, each measurement is first converted from its putative ‘precise’ central value to a more representative probability distribution that also describes the uncertainty in the value. The data were found to be approximately normally distributed via Lilliefors' test, thereby allowing us to model the error distributions using only the mean and standard deviation of a measurement. Because the protein phosphorylation experiments were not run in replicates, we take the standard deviation of the measurement according to Kinexus specifications to be 25% of the mean value. After converting each measurement into a distribution, we then populate the dataset by sampling 1000 points from each distribution. In this way, we convert a single observation into 1000 individually sampled points that reflect the information contained in the original distribution. We propose that this approach reliably represents the data and its associated errors, and is accordingly most useful for making the relative comparisons required for the BN analysis. Finally, the expanded dataset is discretized by distributing all of the data into 25 bins of the same size. Because we have no information a priori about the locations or sizes of the bins we set them all to be of the same size. Computationally, same sized bins have the advantage that they protect against data sparsity, and thereby prevent spurious correlations (Kozlov and Koller, 1997). In addition, discretizing into same sized bins reduces the influence of outliers, resulting in less sensitivity to noise.

The sampling rate and number of bins used in this dataset were chosen empirically. In the limit of an infinite sampling rate and infinite number of bins, this discretization scheme would reproduce the given probability density exactly. However, increasing the sampling rate or increasing the number of bins increases the time and memory required to analyze the dataset. By varying both parameters, we found that a sampling rate of 1000 points per measurement and 25 bins allowed us to find the same high-scoring BN (described below) as those found using datasets with larger sampling rates and more bins. In contrast, data discretized with less than 10 bins produced an unsatisfactorily coarse representation of the raw data.

Although the datasets do have a time component (e.g. measurements taken at 0, 2, 3 and 5 days), we chose instead to model the dataset as an unordered series of steady states, and thereby omit the time measure. We chose not to include time as a variable for two reasons. First, many of the known biochemical processes underlying the measured take place at a timescale of minutes or hours, thus we do not expect kinetic connections between measurements made on the scale of days to be meaningful. Second, the available dataset is not sufficiently large or structured to model as an unrolled temporal BN. To be modeled as a kinetic BN, the dataset would have to have many more measures as a function of time and ideally measures would be made at evenly spaced time intervals (Takikawa *et al*., 2002). Because our dataset satisfies neither of these criteria, we opted to use a steady-state model.

### Bayesian network analysis

A goal of this project is to assess the suitability of BNs for identifying signal transduction pathways. For this purpose, we constructed a search algorithm that attempts to find a network structure that satisfies the Markov condition and fits the data as well as possible.

The Markov condition states that the value of a variable is conditionally independent of its non-descendants given its parents (Neapolitan, 2003). By way of an example of the Markov condition, consider a set of three random variables, *A*, *B* and *C*. The joint probability distribution of these variables can be written in general as:

*C*alone determines or influences the state of both

*A*and

*B*, then the joint probability distribution simplifies to

*A*is conditionally independent of

*B*(a variable non-descendant of

*A*) if we know the value of the parent,

*C*. This reduced version of the joint probability distribution can then be represented using a directed acyclic graph (DAG) with two arrows pointing from

*C*to

*A*and

*B*.

The Markov condition is at the heart of a BN analysis for it allows us to reduce a complex joint distribution down to a smaller, and more tractable set of conditional probability distributions. This smaller set of conditional probability distributions is desirable because it means that we need fewer parameters to describe our model, and hence need less data to characterize those parameters.

Applications of the Markov condition are commonly encountered in other fields of biology. For example, in genetics the Markov condition says that the genetic state of an individual is independent of the genetic state of the individual's grandparents if the genetic state of the parents and children are known. For the signal transduction pathways studied in this work, the Markov condition also holds. In signal transduction, the phosphorylation state of a protein, for example, is directly governed by a finite set of other proteins. Thus, knowing the activity states of these other proteins should be sufficient to make predictions.

To find a BN model that fits the data as well as possible, we need to calculate the probability of a model given data. In Bayesian terms, the probability that a model is correct given data can be written as

*P*(Model | Data). The term

*P*(Model) is the prior probability of the model,

*P*(Data) is the prior probability of the data and

*P*(Data | Model) is the probability of the data given a model. Each of these terms is discussed below.

To explain how *P*(Data | Model) is calculated, let us first introduce some notation. Assume we have a dataset, *D*, with *m* entries. Each entry describes a state of *n* discrete valued variables. For each variable, *i*, let *r*_{i} equal the number of states (i.e. arity) of the variable. Given a network model, each node represents a variable, and arrows represent apparently causal connections between nodes. Thus, each node, *i*, will have a list of parents called π_{i}, which can take on a total of *q*_{i} possible combinations of values. Finally, let *N*_{ijk} equal the total number of cases in the dataset where variable *i* is in state *k* and the parent nodes are in state *j*. Similarly, let *N*_{ij} equal the number of cases in the dataset where parents of the variable *i* are in state *j*. Given these terms, we can write a closed form expression for *P*(Data | Model) as was first derived elsewhere (Cooper and Herskovits, 1992):

*i*in a particular state,

*k*, given parents in some state,

*j*. Combinations of parents that are more informative, or predictive, of the child values will have a higher probability. A more detailed discussion of Equation (5) can be found elsewhere (Cooper and Herskovits, 1992).

Biologically, each node in the network is a value that we can measure, such as the phosphorylation level of a particular protein or the rate of cell differentiation. Arrows between nodes indicate apparent causal relationships uncovered within the data. It is important to note that arrows between nodes do not differentiate among positive, negative or more complicated interactions, but only indicate a directional relationship between variables.

We have defined the prior probability of a model, *P*(Model), such that it can take on values of either 1 or 0, signifying that the network is or is not allowed. For a model to be allowed [*P* (Model) = 1], it must satisfy the following four conditions:Condition 1 is required for a BN to maintain logical consistency. Violation of this constraint would lead to circular reasoning (e.g. *A* causes *B*, *B* causes *C* and *C* causes *A*), which is not meaningful in a causal network. Condition 2 limits the number of parents and children and serves two purposes. First, by limiting the number of connections a node may have, we limit the search space, thereby speeding up the network optimization algorithm. Second, networks with fewer strong connections are more interpretable than a strictly more accurate network with more connections. A motivation for constructing a BN representation of an experimental dataset is to produce a human interpretable result. By limiting the number of connections, we select for the strongest connections between nodes, as these configurations describe the data with the highest probability. Such capping conditions are commonly employed for a BN analysis, as described elsewhere (Heckerman, 1999; Friedman, 2003).

Graph is acyclic.

No node has more than three parents and/or children.

Nodes designated as cue have no parents.

Nodes designated as response have no children.

Conditions 3 and 4 further limit the network search space by imposing structural limitations on certain nodes based on physical and biological insight. Every node in the network is assigned as a cue, signal or response. Cue nodes are nodes that are controlled externally, and as such cannot be altered by the system via parent connections. Cue nodes include the experimental addition or removal of growth factors or changes to the matrix onto which the cell are plated. Signal nodes are defined as nodes that can influence or be influenced by other nodes. An example of a signal node is the concentration of a phosphorylated protein inside a cell. Response nodes are high-level, phenomenological outputs of a system. We assume that these outputs do not cause signals, but are instead the outcome of signals. Examples of response nodes include the rates of cell proliferation and differentiation. A list of the nodes used in this work and their associated structural limitations is provided in Supplementary Table 1.

The procedure for searching for high-scoring networks is shown conceptually in Figure 1a. To generate the networks in this work, we randomly initialized approximately 10^{9} different seed networks. This initial Monte Carlo sampling step provides a wide coverage of the network space, and thereby increases the chances of finding a globally optimal network structure. Each of these networks was then subject to 10^{3} rounds of local optimization. Of the resulting set of networks, the top 100 networks were selected and subjected to additional local optimization runs until no further changes could be made to increase the probability of the network. From this resulting list, the top-scoring network was analyzed. While this approach does not guarantee that we will find the network with the global maximum score, it will result in a network that is likely and as such captures many, if not all of the relationships present in the original dataset. A sample analysis with a small network is shown in Figure 1b.

The software program used for searching and scoring the network is a heavily modified version of the open source BNJ libraries (Hsu *et al*., 2003, http://bndev.sourceforge.net) and is available from the authors upon request. Network models were visualized using the program dot (Gansner and North, 2003, http://www.research.att.com/sw/tools/graphviz/). Model searching and scoring took approximately two weeks of computing time on a 64-processor IBM pSeries 655 server.

### Model validation via data shuffling

To assess the confidence level of our model in predicting our data, we generated 10^{4} randomly shuffled datasets and calculated the *P*(Model | Data) for each set. By shuffling our data for each variable, we destroy correlations between variables, while retaining characteristics of the original set (e.g. ranges and histogram distribution of values). In this way, we predict the probability that our model structure would be chosen from a similar, yet randomly structured dataset. If this resulting score difference is small, this indicates that the underlying relationships in the data are weak or that the model does not discriminate signal from noise. In contrast, large score differences between shuffled and unshuffled datasets indicate that the model structure is optimized for a particular data configuration and can make reliable predictions. Similar data shuffling algorithms have been used elsewhere for model validation (Mavroudi *et al*., 2002; Janes *et al*., 2004).

### Model visualization

An advantage of using a BN to represent data is that it can capture linear, nonlinear and multistate interactions between variables. However, a directed graph representation of a BN gives no indication of the underlying interactions between variables, thus we have developed additional visualization tools to describe these interactions.

In the simplest case of a node with only one parent, we visualize the interaction between the child and parent node by plotting their probability densities on the *x* and *y* coordinates, respectively. In this way, high-probability parent/child configurations can be easily distinguished from regions of low probability.

For a child node with two parents, we visualize the interaction by plotting the maximum likelihood of the child node as a function of the two parent nodes on the *x* and *y* coordinates. Although this method does not present the full distribution of values of the child node it does provide a fairly accurate representation of the underlying distribution. Children with three parents were not visualized.

### Model predictions

Given a BN, it is possible to perform *what if* experiments to make quantitative predictions about conditions not tested in the original dataset. For example, given a network and a dataset, we can predict how setting or observing one part of the network would influence our beliefs about the rest of the network.

We make this prediction by evaluating the likelihood of various instantiations of the unobserved entries using a Gibbs sampling approach. Briefly, Gibbs sampling is a Markov chain Monte Carlo method, and has been used extensively for evaluating missing or unobserved entries in BNs (Heckerman, 1999; Ghahramani, 2001). Using a Gibbs sampling model, all entries with unobserved values are initially instantiated at random. Next, each unobserved entry is taken in turn and its value is updated by sampling from the conditional probability distribution dictated by its parents. This process is repeated for 10 000 iterations as a preliminary *burn in* period, after which approximately 400 samples of the states of the unobserved entry are taken at 1000 iteration intervals for each condition. Averages of these observations are then taken as expectation values for the entry given the available data. A schematic representation of this algorithm is shown in Figure 2.

Note that we differentiate between data entries that are observed and entries that are set. Entries that are observed (e.g. by a biological assay) can be scored normally as our observation does not change the system's ability to influence the entry. In contrast, if a data entry is known because we have intervened and set it (e.g. by adding an inhibitor or knocking out a gene), then that entry must be scored as if its associated node had no parents, for the state of the system has no influence over the value of the entry. This approach for handling observational versus interventional data in BNs has been well studied and is elaborated elsewhere (Cooper and Yoo, 1999; Tong and Koller, 2001).

To test the accuracy of our model, we predicted the effects of inhibiting the phosphorylation of PKCε and RAF1 (S259). The readout of this *in silico* experiment was the rate of differentiated cell proliferation *R*_{D}. Simulations were run under conditions where FN present, LAM present, and both LAM and FN absent (e.g. gelatin only).

## RESULTS

### Best fit network model

The output of the BN learning procedure is a single network that has the highest observed probability of describing the data, as shown in Figure 3. When this top-scoring network is compared to the second highest scoring network, we find that they differ by only one connection between RAF (S259) and RAF (S338), which is present only in the top-scoring network. Resampling the data and repeating the network learning procedure resulted in identical highest and second highest scoring networks, indicating that the network structure learning algorithm is robust to variations in sampling. In addition, enumeration of all permitted arc removals, additions or reversals on the highest scoring network failed to produce a score improvement, demonstrating that this network structure is at least a local maximum.

When the data were shuffled, the new *P*(Model | Data) value for the model in Figure 3 changed significantly in comparison to the probability score for unshuffled data (*P* ≪ 0.001). This difference in probability between shuffled and unshuffled data indicates that the BN in Figure 3 does detect signals within the data well above the noise level. This data shuffling approach also implies that our model can effectively predict features within the dataset.

### Visualization of one- and two-parent interactions

A key advantage of a BN is that it can detect complicated relationships between variables. To illustrate the types of possible interactions that are detected by this algorithm, we have plotted all of the predicted one- and two-parent interactions for our dataset as described in the Systems and methods section.

For one-parent interactions (Fig. 4a–n), we have plotted the probability density for the phosphorylation state of each of the variables. Such plots demonstrate that the BN can detect a variety of linear, nonlinear and multistate interactions. Examples of near linear relationships include the connections between ERK1 and ERK2 (Fig. 4k), and between GSK3β and MEK1/2 (Fig. 4n). Nonlinear relationships are also detected such as the biphasic relationships between STAT3 (S727) and MEK3/6 (Fig. 4c), or the relationship between AKT1 (T308) and p38MAPK (Fig. 4a). Multistate interactions represent a particularly difficult area for other modeling paradigms; however, the BN is able to uncover a number of possible two-state (Fig. 4e and j) and three-state (Fig. 4d and g) interactions.

Interestingly, a few of the connections appear to describe faint, noisy relationships, such as Figure 4l. If we compare the relative contribution of such noisy relationships to the total network, we find that their presence only slightly improves the probability of the network (data not shown). This connection between the strength of the pattern in the interaction plot and relative contribution to the probability of the network highlights how such a visualization tool can help to improve our intuition about connections in the network.

For two-parent interactions (Fig. 4aa–nn), we instead plot the maximum-likelihood phosphorylation value of the child node as a function of the two parents. These plots also reveal linear, nonlinear and multistate behavior. For example, Figure 4hh shows a near linear increase in the expected rate of undifferentiated cell proliferation (*R*_{U}) as a function of p38α MAPK, but only in the presence of LIF. Note that the linear relationship does not hold in the absence of LIF (bottom half of Fig. 4hh). Similar near-linear behavior is observed for the phosphorylation level of PKCα as a function of PKCε, but only in the absence of FN (Fig. 4jj). Consistent nonlinear relationships between the phosphorylation data of various proteins are also uncovered, such as the effects of RAF1 (S338) and GSK3β on RAF1 (S259) (Fig. 4nn), or PKCε and RAF1 (S259) on AKT1 (S472) (Fig. 4kk). Multistate behaviors are also detected, such as the two states of SRC (Y529) phosphorylation as a function of AKT1 (S473) phosphorylation in the presence of FN (Fig. 4dd). Similarly, Figure 4ll and mm show almost three possible states of *N*_{f} and *R*_{C}, respectively.

### Comparison to known signaling pathways

In the following sections, we highlight two signal transduction pathways and compare their known biochemical topologies to the BN topologies predicted from data alone. These comparisons serve as an external validation of the BN model, but also identify some important differences between biochemical and BN representations of signal transduction.

### Comparison to known signaling pathways: LIF signaling via STAT

When LIF is added to the growth medium of mouse ES cells, the cells can be grown in an undifferentiated state indefinitely (Smith *et al*., 1988; Williams *et al*., 1988). Based on this information, we hypothesized that the signal transduction pathways that are activated by LIF should also be implicated in either undifferentiated stem cell growth or inhibition of differentiation.

Two primary routes of LIF signal transduction are via the protein STAT3 and the AKT pathway. Earlier work suggested that STAT3 phosphorylation is both necessary and sufficient for mouse ES cell self-renewal (Matsuda *et al*., 1999) implying that LIFs primary effect on ES cells is via STAT3. A schematic representation of LIF's effect on STAT3 and AKT is shown in Figure 5a and is compared to the Bayesian network prediction in Figure 5b. Because the experimental dataset did not contain all members of the signaling pathway, we only compared the observed species.

Briefly, LIF is thought to bind to a heterodimer of gp130 and LIFR, which in turn induces the phosphorylation of JAK. In its phosphorylated form, JAK then phosphorylates the tyrosine 705 residue of STAT3, which induces STAT3 dimerization and translocation to the nucleus. Once in the nucleus, STAT3 is phosphorylated at the serine 727 residue (Decker and Kovarik, 2000) possibly by kinases in the MAPK pathway. This serine phosphorylated form of STAT3 is then active and able to regulate gene transcription. In a parallel pathway, JAK is also thought to activate AKT. AKT in turn activates p70 S6, which directly regulates gene expression (Schiemann *et al*., 1995; Oh *et al*., 1998; Noda *et al*., 2000).

When the literature pathway is compared to the BN, we see good agreement. The BN directly connects LIF to STAT3, although the causal pattern of phosphorylation is reversed. The BN predicts that the presence of LIF determines the phosphorylation state of STAT3 at the S727 site, which in turn determines the phosphorylation state of the Y705 site. Experimental evidence suggests that STAT3 phosphorylation on the Y705 and S727 sites takes place in the order of tens of minutes after the addition of LIF (data not shown), while the measurements taken in our datasets were taken in the order of days after the addition of LIF. Given this disparity of time scales, it is not surprising that the BN is unable to resolve the sequence of STAT3 phosphorylation events as the BN was constructed using data that are essentially at steady state, while the signal transduction pathway describes a transient event.

In line with experimental findings the BN also predicts a relationships between AKT1 (T308) and P70 S6K phosphorylation, the data for which are visualized in Figure 4b. However, the BN finds no direct or indirect connection between LIF and AKT1 phosphorylation. This lack of connection could be because the phosphorylation state of JAK (the signaling intermediate between LIF and ATK1) was not measured. Another possibility for this lack of connection is that other signaling pathways played a more dominant role in determining AKT1 phosphorylation.

### Comparison to known signaling pathway: MAPK/ERK signaling

The MAPK/ERK pathway encompasses a large family of signal transduction pathways that have been implicated in a variety of cellular processes. In Figure 6a and b, we show what is known about the MAPK/ERK signaling proteins in our dataset along with the predictions made by the BN.

When we compare the BN and the known biochemical pathway, we find that the sequence of signaling events is preserved. For example, the canonical MAPK/ERK pathway proceeds from RAF to MEK to ERK—an order that the BN also extracts from the proteomic dataset. However, the BN also predicts unusual lateral connections between species, such as a relationship between the phosphorylation states of both sites of RAF1, or the connection between ERK1 and ERK2.

Interestingly, the BN shows a connection between MEK3/6 and MEK1/2. Although there is no experimental evidence that MEK3/6 directly affects MEK1/2, there is evidence that both targets are phosphorylated by a common kinase MEKK1 (Xu *et al*., 1995; Balasubramanian *et al*., 2002). Given this common connection, it is not surprising that MEK3/6 and MEK1/2 are connected in the network as their phosphorylation appears to be co-regulated by an unobserved species.

One area of disagreement between the BN and the biochemical network is the upstream parent of ERK1 and ERK2. Biochemical data suggest that ERK1 and ERK2 are both regulated by MEK1/2 (Johnson and Lapadat, 2002) while the BN predicts that ERK1 is regulated by MEK3/6 and ERK2 is regulated by ERK1. The density plot for ERK1 and ERK2 shows that ERK1 and ERK2 are almost linearly related to each other, and often are either both on or both off (Fig. 4k). This finding suggests that ERK1 and ERK2 are nearly interchangeable and as such explains why the BN connected them. The impact of MEK3/6 on ERK1 is more complicated, as MEK3/6, GSK3α (S21) and RB (S807/S811) are all parents of ERK1. From this list of parent nodes, it appears that ERK1 is an integrating node for many signals—only one of which is a MAPK pathway element.

One additional way to judge the agreement between the BN and the experimental conditions is by examining which connections are not present. For example, it is known that MEK6 does not activate ERK1 or ERK2, but instead activate ERK5 or ERK6 (Pearson *et al*., 2001) neither of which were represented in our original dataset. Similarly, in our model we do not find any connections between MEK6 to ERK1 or ERK2, which agrees with experimental findings.

### Primary effectors of cell proliferation and differentiation

One goal of this work is to determine what signaling pathways influence ES cell proliferation and differentiation. By creating a BN of this dataset, we can find the most direct effectors of proliferation and differentiation by looking at the direct parent nodes of the undifferentiated cell proliferation rate (*R*_{U}), differentiated cell proliferation rate (*R*_{D}) and differentiation rate (*R*_{C}), as shown in Figure 7.

The rate of differentiated cell proliferation, *R*_{D}, is found to be governed by LAM, and the phosphorylation states of RAF1 and p38αMAPK. LAM and RAF have been shown to influence both proliferation and differentiation in a number of cell types (Grant *et al*., 1990; Vaudry *et al*., 2002; Park *et al*., 2003) making them both good candidates for predictors of *R*_{D}. In addition, RAF has been suggested to play an anti-apoptotic role in the differentiated cells (Cox and Der, 2003)—a role that could also influence the apparent differentiated cell proliferation rate. Similarly, p38αMAPK has been linked to proliferation of a variety of cell types, including B lymphocytes and melanoma cells (Craxton *et al*., 1998; Recio and Merlino, 2002; Ogasawara *et al*., 2003).

The rate of undifferentiated cell proliferation, *R*_{U}, is found to be best predicted by LIF and p38αMAPK phosphorylation. This two-parent interaction is visualized in Figure 4hh. For similar reasons as stated above, we were not surprised to find that p38αMAPK is implicated in undifferentiated cell proliferation, as p38αMAPK’s role in proliferation in general is well established. However, we were surprised to find a direct connection between LIF and *R*_{U}. This connection could be interpreted as meaning that LIF activates another, unknown or unmeasured pathway, and in doing so affects the rate of undifferentiated cell growth. Although it has been suggested by other groups that LIF's primary role is to act as an anti-differentiation agent (Zandstra *et al*., 2000) our analysis suggests that LIF influences undifferentiated cell growth, but in a p38αMAPK-dependent manner. Examination of the raw data (Fig. 4hh) reveals that increasing p38αMAPK phosphorylation has a dose-dependent effect on *R*_{U}, but only in the presence of LIF. This finding suggests a synergistic interaction between LIF and p38αMAPK that has not been recognized previously.

Finally, the rate of conversion from undifferentiated to differentiated cells, *R*_{C}, is found to be best predicted by the phosphorylation state of Adducin α and ERK2. A plot of this two-parent interaction is shown in Figure 4mm. The mechanistic role of Adducin α and ERK2 in governing ES cell differentiation is unclear. Adducin α is a cytoskeletal protein that is known to bind with high affinity to Ca^{2+}/calmodulin and is a substrate for the kinases PKC and PKA (Suriyapperuma *et al*., 2000). ERK2 is a member of the MAPK pathway and is activated in response to RAS and RAF activation (Johnson and Lapadat, 2002). Both proteins act as integration points for a variety of signaling pathways, making it unclear which specific pathways are responsible for this differentiation process. One possible reason for this generality is that the experimental conditions used in this study do not induce differentiation into a specific cell type, but instead induce differentiation into a mixture of distinct cell types. Because the signaling pathways leading to distinct cell types are likely different, the network might have chosen a point of most conservation between pathways—namely the signal integration points of Adducin α and ERK2. Examination of the density plots for the influence of Adducin α and ERK2 on *R*_{C} (Fig. 4mm) reveal three distinct regions of {low Adducin α; low ERK2}, {high Adducin α; low ERK2} and {low Adducin α; high ERK2}. One interpretation of this three-state model is that each of these regions corresponds to different cell fates via distinct pathways, all of which have Adducin α and ERK2 as a common signal integration point.

### Model predictions of differentiated cell proliferation

A BN representation of a signaling network allows us to infer predictions about outcomes that were not observed in the original dataset. As a test, we predicted the rate of differentiated cell proliferation (*R*_{D}) for a number of different conditions and compared this result to earlier experimental measurements (Prudhomme *et al*., 2004a). The result of this analysis is shown in Figure 8.

Overall the experimentally observed trends are largely consistent, at least qualitatively, with the *a priori* model predictions. From Figure 3 RAF1 (S259) is a parent node of *R*_{D}, and therefore is expected to have a strong influence on the state of *R*_{D}. Indeed, sampling from the BN predicts that inhibiting RAF1 should lead to a decrease in *R*_{D}, independent of the growth substrate. This same trend is seen experimentally, although the decrease in *R*_{D} is small when cells are grown on FN. The BN predicts that untreated cells should have the lowest differentiated growth rate when grown on FN, which we also observe experimentally, although the effect is more dramatic in the experimental data.

In addition, we predicted the influence of PKCε on *R*_{D}. In the BN, PKCε is not located near *R*_{D}, and as such is predicted to act indirectly via other signaling pathways. The BN predicts when PKCε is inhibited, differentiated cell growth rates will be lowest in the presence of LAM, then FN, and highest when grown on gelatin alone. This same trend is observed experimentally for FN and gelatin, however no data were available for LAM. When we compare the proliferation rates of cells on a single substrate, such as gelatin, with and without PKCε, the BN predicts that the rate will not change. However, the experimental results do not confirm this prediction, and instead show that PKCε inhibition results in a slight increase in proliferation on gelatin and a slight decrease in proliferation on FN.

## DISCUSSION

The preceding results have shown that BNs can provide useful insight into complex, biological processes. In generating a BN of the signaling processes active in ES cells, we were able to identify which pathways most govern self-renewal and differentiation. Furthermore, we demonstrated that other connections made by the signaling pathway map onto known signal transduction pathways. In the following sections we differentiate between BNs and biochemical networks, and discuss how BNs can be further applied to systems biology and drug discovery.

### Bayesian networks versus biochemical networks

Although both BNs and biochemical networks are portrayed as directed graphs, they represent different yet related concepts. As described in earlier sections, connections between nodes in a BN denote apparent causal relationships between variables at a single point in time. As such, a BN answers questions such as *What is the probability that a node is in a particular state given observations of other nodes?* or *To predict the state of this node, what other nodes should be observed?* In contrast, connections between nodes in a biochemical pathway imply a kinetic, or time varying process. Thus, biochemical pathways answer the question *What happens next?*

From these differences in BNs and biochemical pathways, we expect to find some overlap between the two approaches. In general, when trying to predict the state of a target protein, a reasonable guess would be to look at the states of the proteins nearby on the signaling pathway. However, if the data were taken at a time scale that differed from that of the signaling pathway, we would not expect there to be a close relationship. For the data in this work, measurements were taken on the period of days, while some events, such as LIF-induced STAT3 phosphorylation, take place over a time scale of minutes to hours. Similarly, if our protein of interest were a common element of many signaling events, then the relationship between the phosphorylation state of that protein and that of its immediate neighbors in the signal transduction pathway may be difficult to determine.

Biochemical and Bayesian representations do often align in cases such as when a reaction is at or near steady state. For example, take the case of a biochemical reaction between two proteins, A and B, to form a complex, AB. Before steady state has been reached, the concentration of AB cannot be predicted by the concentrations of A or B, because the concentration of AB is overwhelmed by its initial concentration. This kind of unsteady state relationship would likely not be detected by a BN, even though the underlying biochemical network is simple. At steady state, however, the concentration of AB increases proportionally to the product of A and B. This steady-state relationship would result in reproducible patterns of concentrations, much like those shown in Figure 4 and could be detected using a BN. The result is that at steady state, biochemical pathways and BNs both produce very similar networks.

Given these differences, there may be some applications where BNs may be a more useful representation of a cell’s signaling state than a traditional biochemical signaling pathway. Because BNs describe apparent causal relationships between variables, they provide a natural route to suggest control points, or lever points (Holland, 1995) in the signaling pathway. When given a biochemical pathway, it is difficult to know which proteins are the most logical points for intervention because nonlinear effects such as feedback loops and signal amplification can produce a wide variety of behaviors. In contrast, a BN automatically filters the data to find causal relationships between variables, no matter how nonlinear the connection. In addition, BNs can also be used when only steady-state data are available—a common situation where kinetic biochemical networks are less useful.

### Impact of restricting parent and child connections

In choosing to restrict the number of parent and child connections a node can have, we are attempting to balance the accuracy of the BN model with its interpretability. It has been shown that many biological networks are scale-free, meaning that most nodes have few connections while a few nodes have many connections (Jeong and Tombor, 2000). For the BN models presented in this work, this scale-free property means that restricting the number of connections a node can have should yield accurate predictions for most nodes, but will be inaccurate for rare, highly connected nodes. One logical way to increase the accuracy of a BN would be to increase the number of connections allowed for a node, or better yet to remove the restriction altogether.

However, a key advantage of representing a biological dataset as a BN is that the resulting network is human interpretable. More empirically based methods such as spline fits, neural networks and principal component analysis are able to numerically fit a given dataset at least as well as a BN. However, these other methods do not easily discriminate between weaker, possibly spurious relationships in the data and stronger, clearer relationships. Also, in fitting the data these algorithms provide no insight into the physical or biochemical relationships that drive the system, but instead provide only a black box description.

By restricting the number of parent and child connections in a BN, we overcome both of these problems of interpretability while maintaining reasonable accuracy. First, when the network search algorithm restricts the number of possible connections for each node, it then selects only the networks that illustrate strong relationships in the data. These strong relationships are selected because networks that contain them have higher probability scores. Second, restricting the number of allowed connections per node results in a simplified, and therefore more interpretable network. By way of a counterexample, imagine if a network contained a node with five parents. How can we interpret the contribution of these parents? Admittedly, it is possible that all five parents do play a mechanistic role. However, the mechanistic relationship may be extremely difficult to intuitively understand if the underlying relationship is as complex and nonlinear as we already see for two and three parent cases in Figure 4. Another alternative is that only a few of the five parents may have any real mechanistic importance, while the remaining nodes were included because of weak and or spurious relationships in the data.

Restricting the number of connections to and from a node is a simple method to select relatively accurate, yet interpretable networks. Finding algorithms that balance between accuracy and interpretability of a BN is an active area of research. Other methods, such as network structure averaging (Neapolitan, 2003) and bootstrap confidence estimation (Friedman *et al*., 1999) approaches have also been used to varying degrees of success.

### Applications of Bayesian network approach to cell signaling

BNs as we have implemented them have a number of key advantages as a way of storing, representing, and analyzing large and complex biological networks. Data for these networks are stored as an individual measurement coupled with the error of that measurement. Because errors are included in these measurements, we automatically place more weight on accurate measurements, thereby properly combining data from different sources. In addition to the ability to store errors with data, BNs also allow us to include entries that were not measured (Heckerman, 1999).

By representing our data as a BN, we gain flexibility. For example, a goal in this work was to represent only the most important connections between phosphorylation targets, therefore we limited the number of children and parents that can be connected to any given node. If we were to relax this assumption, we could obtain a more complete representation of the data, but at the cost of requiring a more complex model. Because BNs impartially find connections between proteins, the method allows the data to speak for itself. The model makes no assumptions of linearity or any particular functional form, therefore we are not constrained to describe only relationships that can be easily cast as algebraic or differential equations.

The main disadvantage to BNs is that they are computationally expensive to run. Searching for high-scoring networks is an NP-hard problem (Heckerman, 1999) meaning that for large problems there is no way to be sure when the calculation is done. However, with approximations like those used in this work and access to large parallel computing resources, the computing limitation can, in part, be overcome to make BNs practical. We expect that future advances in both algorithms and computational speed will rapidly overcome this limitation, and thereby make BNs more widely accessible.

From a practical standpoint, BN generated hypotheses provide a logical starting point for drug target selection. For example, the BN in this work could be used to predict the effect of various kinase inhibitors on high-level physiological processes such as proliferation and differentiation. Because the network automatically accounts for noise, the predictions would also be produced as a distribution of possible responses based on the uncertainty of the original data. If an inhibitor were selected and tested, then the results of this test could be included into the original dataset used to generate the BN and the results re-derived to see how the experiment alters our understanding. By cycling from model to experiment and back again, we can then generate a closed-loop system for predicting drug effects and incorporating in new findings as they become available.

## CONCLUSION

This work shows that BNs can be productively employed for the analysis of cell signaling problems, in application to multivariate proteomic datasets. Many of the current directions in biology are focused on gathering more measurements of an increasingly complex set of conditions. We have shown that BNs are able to organize these kinds of data at two levels of abstraction. At the first level, the BN itself is a directed graph that reflects many known, physiological connections in the original source data. In addition, by visualizing data as a directed graph, large and noisy biological datasets can be summarized in a human interpretable form. On a second level of abstraction, the connections between nodes can, in some cases, be plotted. These plots yield reveal linear, nonlinear or multistate, and in doing so allow us to better understand the underlying biochemical mechanisms. Finally, given a network, we have shown that we are able to make credible predictions of new conditions. By using such an inference approach, we can perform experiments *in silico* first to identify combinations of input parameters likely to yield interesting and useful results.

**Table 1**

Time (days) | 0 | 2 | 3 | 5 |
---|---|---|---|---|

# Conditions | 1 | 16 | 16 | 16 |

LIF | + | ± | ± | ± |

FGF | — | ± | ± | ± |

Fibronectin | — | ± | ± | ± |

Laminin | — | ± | ± | ± |

Time (days) | 0 | 2 | 3 | 5 |
---|---|---|---|---|

# Conditions | 1 | 16 | 16 | 16 |

LIF | + | ± | ± | ± |

FGF | — | ± | ± | ± |

Fibronectin | — | ± | ± | ± |

Laminin | — | ± | ± | ± |

Cells were cultured under a combination of different cytokine and matrix stimuli, and data taken at one of four time points. In total, 49 measurements were made. For each measurement, the phosphorylation level of 28 different proteins was measured and 8 different metrics of cell proliferation and differentiation. A more detailed description of each measurement can be found in Supplementary Table 1.

**Fig. 1**

**Fig. 1**

**Fig. 2**

**Fig. 2**

**Fig. 3**

**Fig. 3**

**Fig. 4**

**Fig. 4**

**Fig. 5**

**Fig. 5**

**Fig. 6**

**Fig. 6**

**Fig. 7**

**Fig. 7**

**Fig. 8**

**Fig. 8**

We would like to thank Wing Yung and Christopher Vincent from IBM for their help in parallelizing the BN search code. We would also like to thank Peter Zandstra and Sampsa Hautaniemi for helpful discussions. This work has been supported by the NIGMS Cell Migration Consortium Computational Modeling Initiative and the NSF Biotechnology Process Engineering Center ERC program. Computing resources were provided by the MIT Computational and Systems Biology (CSBi) high-performance computing center.

## REFERENCES

*in silico*models: the constraints-based approach.

## Comments