-
PDF
- Split View
-
Views
-
Cite
Cite
Jongrae Kim, Declan G. Bates, Ian Postlethwaite, Pat Heslop-Harrison, Kwang-Hyun Cho, Linear time-varying models can reveal non-linear interactions of biomolecular regulatory networks using multiple time-series data, Bioinformatics, Volume 24, Issue 10, May 2008, Pages 1286–1292, https://doi.org/10.1093/bioinformatics/btn107
- Share Icon Share
Abstract
Motivation: Inherent non-linearities in biomolecular interactions make the identification of network interactions difficult. One of the principal problems is that all methods based on the use of linear time-invariant models will have fundamental limitations in their capability to infer certain non-linear network interactions. Another difficulty is the multiplicity of possible solutions, since, for a given dataset, there may be many different possible networks which generate the same time-series expression profiles.
Results: A novel algorithm for the inference of biomolecular interaction networks from temporal expression data is presented. Linear time-varying models, which can represent a much wider class of time-series data than linear time-invariant models, are employed in the algorithm. From time-series expression profiles, the model parameters are identified by solving a non-linear optimization problem. In order to systematically reduce the set of possible solutions for the optimization problem, a filtering process is performed using a phase-portrait analysis with random numerical perturbations. The proposed approach has the advantages of not requiring the system to be in a stable steady state, of using time-series profiles which have been generated by a single experiment, and of allowing non-linear network interactions to be identified. The ability of the proposed algorithm to correctly infer network interactions is illustrated by its application to three examples: a non-linear model for cAMP oscillations in Dictyostelium discoideum, the cell-cycle data for Saccharomyces cerevisiae and a large-scale non-linear model of a group of synchronized Dictyostelium cells.
Availability: The software used in this article is available from http://sbie.kaist.ac.kr/software
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
Biomolecular regulatory networks can contain complicated non-linear interactions amongst many different molecules, where the result is not simply the sum of multiples of the inputs. Functional regulation in such systems involves multiple interactions between inputs and intermediates, forming a non-linear network. Depending on the number of different interactions involved, such networks can be modelled using stochastic equations, ordinary differential equations, or undirected/directed graphs with nodes and edges (Bornholdt, 2005). The problem of inferring biomolecular networks from time-series expression profiles so as to obtain a quantitative understanding of the underlying biomolecular mechanisms is a fundamental one for Systems Biology, and has been geared up by the ever increasing amount of high-throughput experimental data from real biomolecular networks (Cho et al., 2007).
Many different methods for inferring biomolecular networks have been proposed in the recent literature. They can roughly be divided into two main categories. The first uses perturbation experiments about a steady state to generate a Jacobian matrix which is then used to infer the network interactions (Bansal et al., 2006; Cho et al., 2005; Kholodenko et al., 2002; Schmidt et al., 2005; Sontag et al., 2004; Tegner et al., 2003). The perturbation-based approach requires several assumptions to be made about the system, such as the existence of a steady state and the possibility of conducting experiments which achieve just the right level of perturbation in the system. This means that such approaches are not suitable for application to biomolecular regulatory networks which do not have a stable steady state, such as those which display oscillatory dynamics, e.g. limit cycles. It is also well known that the size of the perturbation is critical for the success of these methods. If the perturbation is too large, the resulting kinetics are no longer near the original steady state and the inferred network may display completely different characteristics. On the other hand, if it is too small, then the resulting data may not contain sufficient information for the algorithm to infer the network interactions correctly. In addition, each additional perturbation experiment entails additional time and effort, and in some cases, multiple experiments may not even be possible.
The second category of inference methods employ probabilistic approaches, such as Bayesian methods, which uncover the network interactions using causal relations between the states. However, many such approaches can only infer network structures correctly when there are no feedback loops in the networks, i.e. the resulting networks are directed acyclic graphs (Chen et al., 2006; Friedman, 2004). Although dynamical Bayesian methods can overcome these problems, the resulting problem formulation needs to be solved using global optimization algorithms such as greedy search, simulated annealing and genetic algorithms (Yu et al., 2004), which are computationally expensive. This problem was partially addressed in Zhao et al. (2006) where dynamical Bayesian networks and probabilistic Boolean networks are combined to reduce the parametric search space for inferring the network. However, the amount of calculation required still quickly expands as the number of nodes increases, thus dividing large networks into several relatively small networks.
Since most biomolecular networks are inherently non-linear and the average interactions of individual molecules generally behave in a continuous fashion, it would seem appropriate to formulate such problems as a non-linear system identification problem using ordinary differential equations. However, unlike the identification problem for linear time-invariant systems, non-linear system identification problems are ill-posed in general, because not only the coefficients in the system but also the system structure must be estimated. System identification using linear time-invariant models, on the other hand, simply involves finding an optimal set of coefficients for a given model, since the model structure is fixed in advance. In the literature, there are many well-developed theoretical results for the linear system identification problem (Ljung, 1999). Any model of the system derived using purely linear approaches will, however, not be able to accurately represent the true non-linear behaviour of the real biomolecular network. For example, many biomolecular regulatory networks produce robust and stable limit cycle dynamics, circadian rhythms (Goldbeter, 1995), cAMP oscillation in Dictyostelium discoideum (Laub and Loomis, 1998), Ca2 + oscillations (Dellen et al., 2005). Such oscillatory dynamics cannot be produced by purely linear time-invariant systems, and therefore the underlying network can never be accurately identified using linear time-invariant models.
In previous work, Cho et al. (2006) proposed a simple but highly effective algorithm to infer biomolecular networks from just one set of time-series data using phase portrait analysis, where the phase portrait is a state trajectory on a plane spanned by the state values. Because of its simplicity it can be applied to very large-scale networks, but non-linear actions still cause difficulty for this approach as it is based purely on the use of linear models. In addition, any self interactions in the network cannot be identified using this approach. In order to address the above issues, in this article we adopt linear time-varying systems as our model for inferring biomolecular networks. Although linear time-varying systems are still in a linear form, they have a much richer range of dynamic responses than linear time-invariant systems. Hence, a wider range of time-series expression profiles, including oscillatory trajectories, can be approximated by such models.
Using linear time-varying models, the network inference problem is formulated as an optimization problem. The cost, which is the sum of the error between the measurement data and the states generated by the model, is minimized using an optimization algorithm and the optimal set of coefficients for the model is obtained. To the authors' knowledge, this is the first time that the use of linear time-varying systems for inferring biomolecular networks has been proposed. Moreover, to generate the cost function for the optimization algorithm, we use only time-series expression profiles that may be obtained by a single experiment. To reduce the possible number of solutions for the network, we propose an algorithm based on random perturbations, where the perturbations are numerical ones to the measurement data. Previous efforts to address this problem have attempted to exploit the sparsity property of biomolecular networks to reduce the number of alternative solutions. While many biomolecular networks including gene networks, protein–protein interactions, and metabolic networks are indeed sparsely connected (Gardner et al., 2005; Yeung et al., 2002), the levels of sparsity will vary from network to network, and moreover, measurement data from biological experiments will include noise whose statistical properties are not generally known. In addition, it is impossible to infer all interactions correctly using a single set of data. Therefore, robust inferring algorithms must give consistent answers for lightly perturbed measurements. In our approach, phase portraits are used to quantify the relative effect of each element in the network, and based on the results from several random perturbation the inconsistent network interactions are identified and declared as unidentifiable nodes. Inferring is then performed for the remaining consistent network connections.
Three biomolecular network systems are used as examples in the article for testing the proposed inferring algorithm. The first is an in silico example, the Laub–Loomis model for D.discoideum cAMP oscillation (Laub and Loomis, 1998). This model captures many of the most important features of several classes of biomolecular networks that oscillate with constant frequencies. The second is an in vivo example, the cell cycle of the budding yeast cell, Saccharomyces cerevisiae, for which we use the actual experimental data for inferring the cell-cycle network (Spellman et al., 1998). The third is a large size network example, a group of D.discoideum cells. Note that the biomolecular network underlying cAMP oscillations in D.discoideum cannot be inferred from any of the current perturbation-based approaches found in the literature, since, during this particular phase of its development it does not have steady states but displays limit cycle oscillations. Hence, the Jacobian, which gives the local network interactions, is not constant but periodically time varying (Kim et al., 2006; Ma and Igiesias, 2002), and there is currently no method that can identify a time-varying Jacobian from time-series data. For the second example, the cell-cycle data is generated from time-series expression profiles of gene levels for which there are no experimental perturbation data provided. Hence, perturbation-based approaches are also not applicable for this system.
2 METHODS
2.1 Approximating non-linear interactions using linear time-varying models






2.2 Optimization problem and reducing the problem dimension and measurement noise effects


Note that the cost function is normalized by the maximum value of the measurements of each state. The optimization problem is formulated separately for each dxi(t)/dt in order to reduce the rate of increase in the number of parameters to be estimated as the dimension of xi(t) increases. If the problem is formulated for all xi(t), the number of parameters increases according to 3n² + n + 1. The price to be paid for reducing the number of parameters to be computed in this way is the increased effect of noise, since in order to solve (4) all the states except xi(t) have to be interpolated from the measurements and this will necessarily introduce the direct effect of noise on the estimate. To reduce this effect some elements of the noise could be filtered out before the data are used in the interpolation, e.g. by using Principal Component Analysis (Sanguinetti, et al., 2005), or the noisy measurements could be replaced by the solution of the differential equation, (4). After the optimal solution is obtained for (6), if the optimal cost is smaller than a certain bound, for example 10% of the maximum of the measurements, the measurements are replaced by the solution of the differential equations, under the assumption that the model gives less noisy data without deteriorating the original measurements significantly. As the optimization problem is solved from x1 to xn, more measurements may be replaced. At the final stage, all measurements except the measurements for xn could be replaced by the filtered states. To remove the unbalanced noise effect, the same procedure is repeated in the opposite direction, i.e. starting from xn − 1 to x1 since the earlier states may be affected more by the noise. The cost bound should be a bigger number when the measurement noise is expected to be large. However, if the measurement is too noisy, then the inferring results cannot be good. On the other hand, if the measurement noise is mild, the high frequency noise can be removed by replacing the measurement by the model output. The 10% bound is the reasonable level for removing the mild noise effect and it is fixed to 0.1.
Note that (6) is a very flexible expression. It can cope with cases where the sampling time is not evenly distributed, and weighting can also be used in the cost when the error bar at each sampling time is given. The optimization problem is, however, non-linear, and hence it may have many local solutions. If biologically plausible ranges for the parameters are known, these can be used in choosing a better initial guess for the optimization, in order to improve the chances of finding the globally optimal solution. Initial values for the parameters may also be chosen by inspecting the finite difference magnitude of , since, although the data are corrupted by noise, the rate of change of xi(t) should still not be very different from the magnitude of the finite difference. The initial guess for ω can be obtained by calculating the dominant frequency of the measurement data using Fourier transforms (Proakis and Manolakis, 2006).
2.3 Measuring the strength of interactions using phase portraits
The phase portrait of a system is a trajectory of states drawn on a plane. Phase portraits have long been used as a useful graphical tool in dynamical systems theory. Usually, a 2D phase portrait is drawn from two states and several dynamic properties such as the location and the stability of equilibriums, the region of attraction of equilibrium points, etc, are inspected (Khalil, 2002). The idea of inferring biomolecular interactions using phase portraits was first presented in Cho et al. (2006). In the article, the slope index of two states, xi(t) and xj(t), was defined as the average slope of the curve drawn by xi(t) and xj(t) on the phase portrait. By inspecting the sign of the slope for two given time points, i.e. [xj(tk + 1) − xj(tk)] / [xi(tk + 1) − xi(tk)], it can be decided whether the slope is tilted to the right or left. The average value is then calculated from multiple points on the curve. Depending on the sign of the slope index, it can hence be deduced whether xi(t) activates or inhibits xj(t).



Let dx1(t)/dt be the interaction to be identified and assume there are three molecules involved in the dynamic reaction, namely, x1(t), x2(t) and x3(t). The linear time-varying model for the first state is given by , where a1j(t) = α 1j sin (ω t + φ 1j) + β 1j for j = 1, 2, 3 and α1j, β1j, φ1j and ω are obtained by solving the optimization problem, (6). The slopes for dx1(t)/dt by x1(t) and x2(t) are positive and negative, respectively. On the other hand, the slope for dx1(t)/dt by x3(t) is almost zero. How much each state affects dx1(t)/dt is indicated by ρ11,ρ12 and ρ13.

2.4 Inferring the biomolecular interactions using random perturbations





if IM(i , j) is empty, it is declared as unidentifiable,
if sign[IM(i,j)] > 0, xj(t) activates xi(t),
if sign[IM(i,j)] < 0, xj(t) inhibits xi(t) and
if IM(i, j) is zero, there is no direct interaction, from xi(t) to xj(t).
3 RESULTS
The algorithm is tested using two small size examples: a D.discoideum cAMP oscillations model and the S.cerevisiae cell-cycle data, and one large size example: 15 synchronized Dictyostelium cells. Dictyostelium and Saccharomyces were also used to test an inferring algorithm based on the use of epistasis analysis in Van Driessche et al. (2005) and St Onge et al. (2007). Further details about the method of epistasis analysis can be found in Huges (2005). In the following results, the optimization problem, (6), was solved using a simplex search method in the MATLAB (Mathworks, 2007).
3.1 In silico example: Dictyostelium discoideum cAMP oscillations
Dictyostelium discoideum cells signal each other by emitting stable oscillations of cAMP at the beginning of the aggregation phase of their development. The oscillations continue during chemotaxis towards higher gradients of cAMP concentration as D.discoideum aggregate to survive. A model for the biomolecular network underlying the stable oscillations in cAMP was postulated by Laub and Loomis (1998). The corresponding non-linear differential equations are given in the Supplementary Material.
The Laub–Loomis model exhibits a stable limit cycle, i.e. whenever the states are perturbed, they immediately converge back into the limit cycle. This type of trajectory cannot be modelled using a linear time-invariant system. In fact, the Jacobian along the limit cycle is periodically time varying, and hence the estimation of the Jacobian is also a very difficult problem. Here, without any perturbation experiment, a total of 16 measurements for each state are generated with a sampling time interval of 2 min, where the measurement noise vi(tk) is the zero-mean Gaussian distribution with the variance equal to the 5% of the mean value of {xi(t1),xi(t2),…,xi(tp)}. Note that the statistical properties of vi(tk) are unknown in general. Some measurement examples are shown in Figure S1 (Supplementary Material). Using this set of measurements, the non-linear system is assumed to be in the linear time-varying form given by (4) with the coefficients (5) and the optimization problem, (6), is then solved with the numerically perturbed measurement given by (10). The artificial perturbation wi(tk) is added, which is the zero-mean Gaussian and the variance equal to the 5% of the mean value of . After the optimization problem is solved, the amount of effect from each xj(t) to xi(t), i.e. ρij, and the slope index is identified. This procedure is repeated 10 times (N = 10) and the inferring procedure is performed with 10-IS matrices.
The variation in the success ratios with respect to the thresholds are shown in Figure 2 after performing 100 Monte Carlo simulations with a different set of measurements for each simulation. The first column shows the success ratio between the number of nodes that are inferred as activation and the nodes that are indeed activation. The second column shows the success ratio for the inhibition and the third column shows the success ratio for no connection. Each row gives the mean, the worst and the best results for the 100 Monte Carlo simulations. The worst is the 3σ bound, which is three times the SD of the success ratios for the Monte Carlo simulations, and the corresponding threshold values guarantee that the success ratio is higher than the worst bound with the probability 99.73%. The true positive and negative success ratios from the dynamic Bayesian approaches are 22% and 76%, where activation and inhibition are not distinguished but only the existence of connections are (Cho et al., 2006). These boundaries are indicated by the solid lines in the first row. For a wide range of the threshold values, the success ratios could be higher than those generated by the Bayesian approaches.

Success ratios of the inferring algorithm for true activation, true inhibition and true negative with respect to two threshold values, γs and γm, are shown in three different metrics, i.e. the mean, the worst and the best performance, where the worst is given by the 3σ bound. The results are obtained from 100 Monte Carlo simulations of the algorithms for the given measurement data. The true negative ratio is relatively less sensitive to the threshold values because of the sparse nature of the interaction network. The solid lines on the first row are the boundary of success ratio of the Bayesian algorithm for the best rate. To display the results more clearly, the boundary lines are shown in the first row rather than the third row.
3.2 In vivo example: Saccharomyces cerevisiae cell cycle
The cell cycle, in which a cell duplicates its contents and divides into two, is an essential activity of all life and involves some highly elaborate regulatory mechanisms. The cell cycle is divided into four phases, which are genome duplication (S-phase), nuclear division (mitosis or M-phase) and two gap phases (G1- and G2-phase) (Bähler, 2005). In this example, S.cerevisiae cell-cycle microarray data is used to infer the underlying biomolecular network interactions (Spellman, 1998). The measurements include a total of 17 gene-expression levels with 16 samples each, and the complete network interactions are shown in Figure S2 (Supplementary Material). This network is chosen because their interactions are well studied (Bähler, 2005). The samples are from α Factor-based synchronization experiment. As noted by Friedman (2004), this is a very difficult problem, due to the fact that we are attempting to infer the interaction networks solely from the gene-expression levels, without any information on protein activity levels. This is a problem because many activations and inhibitions are performed through post-transcriptional protein modifications. Such information is, however, usually unavailable from current measurement technology, as in the case of the cell-cycle data used here. This represents a fundamental limitation when biomolecular networks are to be inferred from gene-expression levels only (Friedman, 2004).
The inferring is performed for the given experimental data, where N = 10 and wi(tk) is the zero-mean Gaussian with the variance equal to 5% of the mean of . The inferring result with γs = 0.6 and γm = 0.18 is shown in Figure S2 (Supplementary Material) and a part of the inferring matrix is shown in Table 1. The true activation/inhibition and the true negative ratios for different threshold values are shown in Figure S3 (Supplementary Material) The figure is generated with 50 Monte Carlo simulations of the algorithm for the same measurement set. On average, the true activation ratio is above a 20% success ratio for a wide region of the threshold values. The true inhibition ratio has a smaller region where the ratio is above 20%. On the other hand, because of the sparse property of the network, the true negative ratio remains high for most of the threshold values.
A part of the IM is shown for γs = 0.6 and γm = 0.18, where 0 denotes no connection, 1 denotes activation, −1 denotes inhibition and Ø denotes no conclusion
Destination . | Sources . | ||||
---|---|---|---|---|---|
. | Ace2 . | Clb1 . | Clb2 . | Clb6 . | Cln1 . |
Ace2 | Ø | 1 | 1 | Ø | Ø |
Clb1 | 0 | 0 | 0 | 1 | 0 |
Clb2 | Ø | Ø | Ø | 1 | Ø |
Clb6 | Ø | 0 | 0 | 0 | 0 |
Cln1 | 0 | 0 | 0 | Ø | Ø |
Destination . | Sources . | ||||
---|---|---|---|---|---|
. | Ace2 . | Clb1 . | Clb2 . | Clb6 . | Cln1 . |
Ace2 | Ø | 1 | 1 | Ø | Ø |
Clb1 | 0 | 0 | 0 | 1 | 0 |
Clb2 | Ø | Ø | Ø | 1 | Ø |
Clb6 | Ø | 0 | 0 | 0 | 0 |
Cln1 | 0 | 0 | 0 | Ø | Ø |
The incorrect inferring results are given in grey and the correct ones are given in bold. The matrix is also used for designing some perturbation experiments. For example, since the first and the fifth columns have the most number of inconclusive elements, we may want to perturb either Ace2 or Cln1 to maximize the possible information to be obtained.
A part of the IM is shown for γs = 0.6 and γm = 0.18, where 0 denotes no connection, 1 denotes activation, −1 denotes inhibition and Ø denotes no conclusion
Destination . | Sources . | ||||
---|---|---|---|---|---|
. | Ace2 . | Clb1 . | Clb2 . | Clb6 . | Cln1 . |
Ace2 | Ø | 1 | 1 | Ø | Ø |
Clb1 | 0 | 0 | 0 | 1 | 0 |
Clb2 | Ø | Ø | Ø | 1 | Ø |
Clb6 | Ø | 0 | 0 | 0 | 0 |
Cln1 | 0 | 0 | 0 | Ø | Ø |
Destination . | Sources . | ||||
---|---|---|---|---|---|
. | Ace2 . | Clb1 . | Clb2 . | Clb6 . | Cln1 . |
Ace2 | Ø | 1 | 1 | Ø | Ø |
Clb1 | 0 | 0 | 0 | 1 | 0 |
Clb2 | Ø | Ø | Ø | 1 | Ø |
Clb6 | Ø | 0 | 0 | 0 | 0 |
Cln1 | 0 | 0 | 0 | Ø | Ø |
The incorrect inferring results are given in grey and the correct ones are given in bold. The matrix is also used for designing some perturbation experiments. For example, since the first and the fifth columns have the most number of inconclusive elements, we may want to perturb either Ace2 or Cln1 to maximize the possible information to be obtained.
3.3 In silico large size example: multiple Dictyostelium discoideum cells cAMP oscillations
To test the algorithm for large size problems, the extended model for Dictyostelium cAMP oscillations is used (Kim et al., 2007). The model includes the effect of interactions between multiple cells (See Supplementary Material). For 15 cells, the number of states is 105 (15 × 7) and the number of edges to be inferred is 11 025 (105 × 105). The algorithm calculation time for N = 10 is about 64 hours in Windows XP, Intel Core2 Quad CPU, 2.66 GHz and 2.75 GB of RAM and the simplex method is used for solving the optimization problem. The calculation time will be reduced significantly by using some parallel computing technique and/or randomization approach for the optimization. The inferring results is shown in Figure S4 (Supplementary Material). The true activation and the true inhibition rate are lower compared to the previous examples as we expected since the number of possible network connections are increased. However, the true negative ratio stays high since the network satisfies the sparse property, which can be applied for many biomolecular networks.
4 DISCUSSION
We presented a novel algorithm for inferring biomolecular networks, which uses linear time-varying models. The algorithm uses phase portraits to construct a systematic filtering method to reject the inconsistent elements in the network, based on numerical perturbations. The algorithm was tested on three examples: the Laub–Loomis model for D.discoideum cAMP oscillation, the cell cycle of budding yeast cell, S.cerevisiae and the group of Dictyostelium cells. The success ratios are sensitive to two threshold values in the algorithm, which are for finding nodes that are not connected or inconsistent with the perturbation. If the first threshold is too small, most of interactions will be considered to be connected. On the other hand, if it is too large, most of them will be considered to be disconnected. Hence, ideally, the threshold for finding no connection has to be between 0.1 and 0.5 given that in general biological networks are not fully connected. The value of the threshold for rejection depends on the richness of the given measurement data and the number of types of genes. It should be inversely proportional to the richness of the data and proportional to the number of types of genes. Since there is no clear way to quantify the richness of signals, the threshold can in practice be decided purely based on the number of types of genes, since the greater the number of types of genes, the higher is the number of possible network connections. Hence, the threshold must be larger as the number of types of genes increases. This can be clearly seen in three examples considered in this article. Dictyostelium has seven different type of molecules while Saccharomyces has 17 and the group of Dictyostelium has 105. As shown in the figures, the highest success ratios for Saccharomyces and the group of Dictyostelium are achieved with a higher value of the rejection bound than was seen for Dictyostelium.
The proposed algorithm has many practical advantages and potential extensions. First, the error bars or noise in the measurement can be weighted appropriately when the cost function is formulated. Second, the measurements do not need to be evenly spaced in time. Third, the algorithm can infer not only feedback paths but also self activations or inhibitions in networks. Fourth, the number of unknowns to be estimated by the optimization algorithm does not increase exponentially with the number of nodes in the network. Fifth, only time-series expression profiles are used and no perturbation experiments are required. Sixth, inferring results can be used to design some optimal perturbation experiments, Finally, the linear time-varying models employed by the algorithm can effectively capture many non-linear interactions, which cannot be approximated by linear time-invariant models, including limit cycle oscillations.
ACKNOWLEDGEMENTS
This study was initiated during a visit of the the first author to the K.-H.Cho's; Systems Biology Laboratory in Korea. The first author appreciates the hospitality and many helpful and interesting discussions with the lab members.
Funding: This work was carried out under UK EPSRC Research Grant GR/S61874/01 and BBSRC Research Grant BB/D015340/1. It was supported by a Korea Science and Engineering Foundation (KOSEF) grant funded by the Korean government (MOST) (M10503010001-07N030100112) and also supported by the Korea Ministry of Science and Technology through a Nuclear Research Grant (M20708000001-07B0800-00110) and the 21C Frontier Microbial Genomics and Application Center Program.
Conflict of Interest: none declared.
REFERENCES
Author notes
Associate Editor: Martin Bishop