Partially non-homogeneous dynamic Bayesian networks based on Bayesian regression models with partitioned design matrices

Abstract

Motivation: Non-homogeneous dynamic Bayesian networks (NH-DBNs) are a popular modelling tool for learning cellular networks from time series data. In systems biology, time series are often measured under different experimental conditions, and it is not rare that only some network interaction parameters depend on the condition while the other parameters stay constant across conditions. For this situation, we propose a new partially NH-DBN, based on Bayesian hierarchical regression models with partitioned design matrices. With regard to our main application to semi-quantitative (immunoblot) time course data from mammalian target of rapamycin complex 1 (mTORC1) signalling, we also propose a Gaussian process-based method to solve the problem of non-equidistant time series measurements.

Results: On synthetic network data and on yeast gene expression data the new model leads to improved network reconstruction accuracies. We then use the new model to reconstruct the topologies of the circadian clock network in Arabidopsis thaliana and the mTORC1 signalling pathway. The inferred network topologies show features that are consistent with the biological literature.

Availability and implementation: All datasets have been made available with earlier publications. Our Matlab code is available upon request.

Supplementary information: Supplementary data are available at Bioinformatics online.


Part A - Deriving the full conditional distributions
A sample from the posterior distribution in Equation (1) can be generated by Markov chain Monte Carlo (MCMC) simulations. In this subsection we derive the full conditional distributions (FCDs) for the model parameters β_B, λ₁², and λ₂². For σ² and μ we implement collapsed Gibbs sampling moves, in which some of the other variables are integrated out analytically from the FCDs. Collapsed Gibbs sampling steps are known to be more efficient than standard Gibbs sampling steps. Within a Gibbs MCMC sampling scheme, all parameters are iteratively re-sampled, either from their FCDs or by a collapsed Gibbs sampling step.
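To make the collapsed step concrete, the following Python sketch (the paper's own code is in Matlab) implements a collapsed Gibbs draw of σ² for a generic Bayesian regression with a zero-mean Gaussian prior on the coefficients and an inverse-gamma prior IG(a, b) on σ²; the prior settings and variable names are our illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def collapsed_sigma2_step(y, X, Sigma_prior, a=2.0, b=0.2, rng=rng):
    """Collapsed Gibbs step for sigma^2 in the regression y = X beta + eps,
    with beta ~ N(0, sigma^2 * Sigma_prior) and eps ~ N(0, sigma^2 * I).
    With beta integrated out, y | sigma^2 ~ N(0, sigma^2 * C), where
    C = I + X Sigma_prior X', so an IG(a, b) prior on sigma^2 stays
    conjugate and the step draws from IG(a + T/2, b + y' C^{-1} y / 2)."""
    T = len(y)
    C = np.eye(T) + X @ Sigma_prior @ X.T
    quad = y @ np.linalg.solve(C, y)
    # draw from the inverse-gamma posterior via its gamma reciprocal
    return 1.0 / rng.gamma(a + T / 2.0, 1.0 / (b + quad / 2.0))
```

Because β_B never has to be instantiated inside this step, the draw mixes over all coefficient vectors at once, which is what makes the collapsed step more efficient than a standard Gibbs update of σ² given β_B.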
The densities of the FCDs are proportional to the factorized joint density in Equation (1). From the shapes of these densities we can read off the FCDs.
For the full conditional distribution of β_B, the density is proportional to the product of the Gaussian likelihood and the Gaussian prior of β_B, and from the shape of this density we conclude that the FCD of β_B is Gaussian (Equation (3)).

For the full conditional distributions of λ₁² and λ₂², the corresponding densities likewise determine the FCDs directly.

For the noise variance parameter σ² we implement a collapsed Gibbs sampling step with β_B integrated out. Applying a standard rule for Gaussian integrals (see, e.g., Section 2.3.2 of Bishop (2006)) to marginalize over β_B yields a density whose shape implies the collapsed Gibbs sampling step for σ².

For the FCD of μ we also use a collapsed Gibbs sampling step with β_B integrated out (cf. Equation (5)). The resulting density is proportional to the density of a Gaussian, so that the FCD of μ is Gaussian (Equation (6)).

An important model property is that the marginal likelihood, with β_B and σ² integrated out, can be computed in closed form. The marginalization rule from Section 2.3.7 of Bishop (2006) yields Equation (9), where T := Σ_{k=1}^{K} (T_k − 1) and C := I + X_B Σ̃ X_B^T.
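As a numerical sanity check of the marginalization rule (again in Python rather than Matlab), the sketch below simulates from a small hierarchy with hypothetical X, Σ, μ and σ², and confirms that the first two moments of the simulated responses match the closed-form marginal N(Xμ, σ²C) with C = I + XΣXᵀ.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical small instances of the model ingredients (illustration only):
T, d = 6, 2
X = rng.normal(size=(T, d))          # stands in for the design matrix X_B
Sigma = np.diag([0.5, 2.0])          # stands in for the prior covariance
mu = np.array([1.0, -0.5])           # stands in for the prior mean of the coefficients
sigma2 = 0.3

# If beta ~ N(mu, sigma^2 * Sigma) and y | beta ~ N(X beta, sigma^2 * I),
# the marginalization rule gives y ~ N(X mu, sigma^2 * C), C = I + X Sigma X'.
C = np.eye(T) + X @ Sigma @ X.T

# Monte Carlo check: simulate from the hierarchy and compare moments
n = 200_000
betas = rng.multivariate_normal(mu, sigma2 * Sigma, size=n)
ys = betas @ X.T + np.sqrt(sigma2) * rng.normal(size=(n, T))
assert np.allclose(ys.mean(axis=0), X @ mu, atol=0.05)
assert np.allclose(np.cov(ys, rowvar=False), sigma2 * C, atol=0.05)
```

The same identity underlies the marginal likelihood in Equation (9), where the prior mean and covariance are built from μ and Σ̃ according to the partitioned design.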

Part B - Blocked Metropolis-Hastings moves for inferring the covariate set and the covariate types
We note that the indicator vector δ depends on the covariate set Π, as it contains one indicator variable for each covariate in Π. Moreover, the expressions X_B, μ̃, Σ̃, and C all depend on both Π and δ, though we do not make that explicit in our notation. As described in the main paper, given the covariate set Π and the corresponding indicator vector δ, the partitioned design matrix X_B = X_{B,Π,δ} can be built, and the marginal likelihood p(y|λ₁², λ₂², μ, Π, δ) can be computed with Equation (9). Based on this marginal likelihood (with σ² and β_B integrated out), we obtain the posterior distribution for the extended model in Equation (10), where p(μ|Π, δ) is a Gaussian whose dimension is the number of component-specific coefficients, p(Π) is a uniform distribution truncated to the maximal cardinality |Π| ≤ 3, and p(δ|Π) has been specified in Section 3.4 of the main paper.
Now it can be seen why the marginal likelihood from Equation (9) is of great importance for model inference. Because of the analytic expression for the marginal likelihood, we can avoid the notorious dimension-matching issues of Reversible Jump Markov chain Monte Carlo (RJMCMC; Green, 1995). The continuous regression coefficient vector β_B, whose elements correspond to the covariates in Π, has been integrated out, and therefore β_B does not appear in the posterior distribution in Equation (10). Consequently, the acceptance probabilities of the Metropolis-Hastings criterion can be formulated in terms of the marginal likelihood and do not depend on specific β_B vectors. When the Metropolis-Hastings acceptance probabilities are based on marginal likelihoods, they implicitly take all possible regression coefficient instantiations into account, so that no specific coefficient instantiations are required for newly introduced covariates.
To formulate that differently: in our new partially NH-DBN there is a one-to-one correspondence between the covariates in Π and the regression coefficients in β_B. For a given covariate set Π, the marginal likelihood value can be thought of as a weighted average across all possible β_B vectors for Π. A change of the discrete covariate set Π implies a change of the continuous regression coefficient vector β_B, but w.r.t. the marginal likelihood there is no need for any specific regression coefficients. The marginal likelihood always takes all possible β_B vectors for Π into account. The resulting MCMC algorithm, presented below, is therefore effectively a Reversible Jump Markov chain Monte Carlo (RJMCMC) scheme (Green, 1995) in the discrete space of covariate sets, where the Jacobian in the Metropolis-Hastings acceptance probabilities is always equal to 1 and can be omitted (Green, 1995).
We use the marginal likelihood from Equation (9) to implement two Metropolis-Hastings moves, in which we make use of the concept of blocking (Liang et al., 2010). Blocking is a technique by which variables are not sampled separately, but are merged into blocks that are sampled together. We form two blocks, grouping δ with μ, and grouping Π with δ and μ. The vector δ is then always sampled jointly with μ, and the covariate set Π is always sampled jointly with δ and μ.
In Bayesian statistics, the concept of blocking is a widely applied technique to improve convergence of MCMC inference algorithms, see, e.g., Liang et al. (2010). Convergence problems often result from correlations between variables. To resolve this problem, the key idea is to merge correlated variables into blocks and to sample the variables within each block together, rather than to sample them separately. 1
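The effect described in footnote 1 can be reproduced on a toy example. In the Python sketch below, the target is a bivariate Gaussian with correlation ρ = 0.99; a single-site random-walk move on A is compared with a blocked move that additionally re-draws B from its full conditional, in which case the proposal density for B cancels and the acceptance ratio involves only the marginal of A. All settings are illustrative.

```python
import numpy as np

# Toy target: (A, B) bivariate standard normal with correlation rho.
# p(a | b) = N(rho * b, 1 - rho^2); the marginal of A is N(0, 1).
rho, step, n_iter = 0.99, 0.5, 20_000

def acceptance_rate(blocked, seed=2):
    rng = np.random.default_rng(seed)
    a, b, accepted = 0.0, 0.0, 0
    for _ in range(n_iter):
        a_new = a + rng.uniform(-step, step)   # random-walk proposal for A
        if blocked:
            # blocked move: b' is drawn from p(b | a'), so the proposal
            # density for B cancels and only the marginal of A remains
            log_ratio = -0.5 * a_new**2 + 0.5 * a**2
        else:
            # single-site move: accept/reject A against p(a | b), B held fixed
            log_ratio = (-0.5 * (a_new - rho * b)**2
                         + 0.5 * (a - rho * b)**2) / (1 - rho**2)
        if np.log(rng.uniform()) < log_ratio:
            a, accepted = a_new, accepted + 1
        b = rng.normal(rho * a, np.sqrt(1 - rho**2))  # Gibbs update of B
    return accepted / n_iter

rate_single, rate_blocked = acceptance_rate(False), acceptance_rate(True)
```

With these settings the blocked chain accepts markedly more often than the single-site chain, because its proposals are judged against the broad marginal of A rather than against the very narrow conditional p(a | b).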

First Metropolis-Hastings move:
Each move on [δ, μ] randomly selects one element δ_i of δ and proposes to switch its value, i.e. to replace δ_i by 1 − δ_i. This yields a new candidate vector δ•, for which we re-sample μ• from its full conditional distribution in Equation (6). The move is accepted with a Metropolis-Hastings acceptance probability in which the Hastings ratio H is equal to the ratio of the full conditional densities of μ.

Second Metropolis-Hastings move:
Each move on [Π, δ, μ] consists of one of three sub-moves:
• In the deletion move (D) we randomly select one covariate X ∈ Π and propose to remove this covariate from Π. This yields Π•. Removing X makes the corresponding element of δ redundant, so that we remove it as well to obtain δ•.
• In the addition move (A) we randomly select one covariate X ∉ Π and propose to add this covariate to Π. This yields Π•. We flip a coin to determine the type (δ ∈ {0, 1}) of the new covariate. Adding the corresponding element to δ yields δ•.
• In the exchange move (E) we randomly select one covariate in Π and propose to replace it by a randomly selected new covariate X ∉ Π. This yields Π•. We then flip a coin to determine the type (δ ∈ {0, 1}) of the new covariate X. Removing the element of δ that corresponds to the replaced covariate and adding the element for X yields δ•.
Each sub-move (D, A and E) yields a pair [Π•, δ•], which we complete to a triple by sampling a new μ•, conditional on Π• and δ•, from the full conditional distribution in Equation (6). When the move type (D, A or E) is selected at random, the move is accepted with a Metropolis-Hastings acceptance probability in which the Hastings ratio H is equal to the ratio of the full conditional densities of μ multiplied by a move-specific proposal factor HR (for D, A and E, respectively), where N is the number of potential covariates, |.| denotes the cardinality, and the factors 2 and 0.5 stem from flipping a coin to determine the type (δ ∈ {0, 1}) of a new covariate X.

1 To see why 'blocking' is useful, imagine that there are two correlated variables A and B with current values a and b. We further assume that A can only be sampled via a Metropolis-Hastings step, while B can be sampled from its full conditional distribution (i.e. via a Gibbs sampling step). Let's say the model demands a and b to be similar. When the values of A are sampled separately, a newly proposed value for A, say a′, will have a low acceptance probability if it is no longer similar to the current value b of B. That is, new values for A are forced to stay similar to the current value of B, and vice versa. When blocking both variables, not only a new value a′ for A is proposed, but within the same move also a new value b′ for B is proposed, where the latter is sampled conditional on the candidate a′. The 'blocked' move thus proposes to replace (a, b) by (a′, b′). When the candidate b′ is sampled from its full conditional distribution, conditional on A = a′, then b′ is encouraged to be similar to a′, so that the new state (a′, b′) is likely to contain similar values again. Therefore, the blocked move from (a, b) to (a′, b′) usually has a higher acceptance probability than the move from (a, b) to (a′, b).
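The move-specific factors HR are not reproduced cleanly in this extraction. The sketch below gives one standard instantiation that is consistent with the ingredients stated above (uniform selection among candidate covariates, and the coin-flip factors 2 and 0.5 for the type of a new covariate); it should be read as our reconstruction, not as the paper's exact formulas.

```python
def hr_factor(move, N, card):
    """Move-specific proposal factor HR = q(old | new) / q(new | old) for
    the sub-moves on the covariate set Pi; our reconstruction, assuming
    covariates are drawn uniformly and the coin flip for the type of a new
    covariate contributes the factors 2 and 0.5 mentioned in the text.
    N: number of potential covariates, card: |Pi| before the move."""
    if move == "D":   # forward: pick 1 of card; reverse: add 1 of N-card+1 and flip a coin
        return (0.5 / (N - card + 1)) / (1.0 / card)
    if move == "A":   # forward: add 1 of N-card and flip a coin; reverse: delete 1 of card+1
        return (1.0 / (card + 1)) / (0.5 / (N - card))
    if move == "E":   # forward and reverse proposals have identical form and cancel
        return 1.0
    raise ValueError(f"unknown move type: {move}")
```

A useful consistency check is that an addition followed by the deletion that undoes it has an HR product of 1, as required for detailed balance.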

Part C - The Markov Chain Monte Carlo (MCMC) algorithm
To generate samples from the posterior distribution in Equation (10), we use a Markov chain Monte Carlo (MCMC) algorithm which combines the Gibbs sampling steps from Part A with the Metropolis-Hastings steps from Part B. We initialize all entities, e.g. Π = {}, δ = (δ_0) = (0), μ = 0, λ₁² = 1, λ₂² = 1, before we iterate among seven sampling steps.

Gibbs part: Given Π and δ, we re-sample the parameters σ², β_B, λ₁², λ₂², and μ. Although the parameters σ² and β_B can be marginalized out, and thus do not appear in the posterior in Equation (10), the FCDs of λ₁², λ₂², and μ depend on them. Therefore σ² and β_B have to be sampled too, but they can be withdrawn after sampling step (5). The Gibbs sampling steps have been derived in Part A of this supplementary paper. Each step updates one parameter, and each subsequent step is conditional on the newest parameter combination; Σ_i (1 − δ_i) is the number of covariates with component-specific coefficients.
Metropolis-Hastings part: Withdraw σ² and β_B, and keep λ₁², λ₂², and μ. Perform the two blocked Metropolis-Hastings moves from Part B. The MCMC algorithm thus generates a sample from the posterior distribution. As described in the main paper, we run the MCMC algorithm for W = 100,000 (W = 100k) iterations. We set the burn-in phase to 50k iterations and sample every 100th graph during the sampling phase. This yields R = 500 posterior samples for each response Y.
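The bookkeeping implied by these settings, together with the usual way such posterior samples are turned into edge (covariate) scores as inclusion fractions (our reading; the helper name is hypothetical), can be sketched as:

```python
W, burn_in, thin = 100_000, 50_000, 100

# keep every 100th MCMC iteration after the burn-in phase
kept = list(range(burn_in + thin, W + 1, thin))
R = len(kept)   # number of posterior samples per response Y

# edge scores estimated as the fraction of the R sampled covariate
# sets Pi that contain each potential covariate
def edge_scores(sampled_sets, potential_covariates):
    return {x: sum(x in Pi for Pi in sampled_sets) / len(sampled_sets)
            for x in potential_covariates}
```

With W = 100k, a 50k burn-in and thinning of 100, this yields exactly R = 500 retained samples, matching the figure quoted above.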

Part D - Gaussian process smoothing
In this part of the supplementary paper we provide additional results for the proposed Gaussian process (GP) smoothing method; see Section 2.5 of the main paper. The proposed GP smoothing method can be implemented in four ways:
1. The GP parameters can either be sampled via Markov chain Monte Carlo simulations ('MCMC') or their maximum a posteriori estimates ('MAP') can be computed.
2. Given the GP parameters, the required response vectors y_k can either be sampled from Equation (11) of the main paper ('PRED-SAMPLING'), or y_k can be set to its predictive expectation ŷ_k, see Equation (12) of the main paper ('PRED-EXPECTATION').
For our study, we implemented all four combinations:
• MAP and PRED-EXPECTATION (used in the main paper)
• MAP and PRED-SAMPLING
• MCMC and PRED-EXPECTATION
• MCMC and PRED-SAMPLING
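The two prediction options can be illustrated with a minimal GP regression sketch in Python. The parameter names (σ², ξ², l) follow this supplement; the squared-exponential kernel, the zero prior mean and all numerical values are our assumptions, and the returned mean and draw correspond loosely to 'PRED-EXPECTATION' and 'PRED-SAMPLING' (Equations (12) and (11) of the main paper).

```python
import numpy as np

def gp_predict(t_obs, y_obs, t_new, sigma2=0.01, xi2=1.0, ell=1.0, rng=None):
    """GP regression on time points t_obs with noisy observations y_obs.
    sigma2: noise variance, xi2: signal variance, ell: length scale.
    Returns the predictive expectation at t_new ('PRED-EXPECTATION') and,
    if rng is given, one draw from the predictive distribution
    ('PRED-SAMPLING')."""
    def kern(s, t):
        return xi2 * np.exp(-0.5 * (s[:, None] - t[None, :]) ** 2 / ell ** 2)
    K = kern(t_obs, t_obs) + sigma2 * np.eye(len(t_obs))
    K_star = kern(t_new, t_obs)
    mean = K_star @ np.linalg.solve(K, y_obs)                      # predictive expectation
    cov = kern(t_new, t_new) - K_star @ np.linalg.solve(K, K_star.T)
    cov += 1e-10 * np.eye(len(t_new))                              # numerical jitter
    draw = rng.multivariate_normal(mean, cov) if rng is not None else None
    return mean, draw

# usage: map non-equidistant measurements onto an equidistant grid
t_obs = np.array([0.0, 0.5, 1.5, 3.0, 4.0])
y_obs = np.sin(t_obs)
mean, draw = gp_predict(t_obs, y_obs, np.linspace(0.0, 4.0, 9),
                        rng=np.random.default_rng(3))
```

Predicting on an equidistant grid, as in the usage lines, is how GP smoothing resolves the non-equidistant measurement problem mentioned in the abstract.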
We compute the MAP estimates of the GP parameters (σ², ξ², and l) using the Matlab package 'GPstuff' (Vanhatalo et al., 2013). For MCMC sampling of the GP parameters we implemented three Metropolis-Hastings moves. In the i-th move (i = 1, 2, 3) a small random number, uniformly distributed on [−ε, ε], is added to the current instantiation of the i-th GP parameter. (We used ε = 0.01.) This yields a new candidate value for parameter i. If positive, the candidate is accepted with the Metropolis-Hastings acceptance probability. If the candidate is negative or not accepted, the i-th parameter is left unchanged.
As initial values for our MCMC sampling scheme we use the MAP estimates from 'GPstuff'.
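A minimal Python sketch of this random-walk scheme, with a placeholder log posterior standing in for the actual GP posterior (the helper names and numerical values are ours):

```python
import numpy as np

def mh_sweep(params, log_post, eps=0.01, rng=None):
    """One sweep of the three Metropolis-Hastings moves described above:
    in move i, a uniform [-eps, eps] perturbation is added to the i-th GP
    parameter (sigma^2, xi^2 or l); negative candidates are rejected
    outright, otherwise the usual acceptance probability is applied."""
    rng = rng or np.random.default_rng()
    params = np.asarray(params, dtype=float).copy()
    for i in range(len(params)):
        candidate = params.copy()
        candidate[i] += rng.uniform(-eps, eps)
        if candidate[i] > 0 and np.log(rng.uniform()) < log_post(candidate) - log_post(params):
            params = candidate   # accept; otherwise parameter i stays unchanged
    return params

# usage with a placeholder log posterior peaked at (0.1, 1.0, 2.0)
log_post = lambda p: -np.sum((p - np.array([0.1, 1.0, 2.0])) ** 2)
rng = np.random.default_rng(4)
state = np.array([0.05, 1.0, 1.0])   # e.g. MAP estimates as initial values
for _ in range(200):
    state = mh_sweep(state, log_post, rng=rng)
```

Rejecting negative candidates outright keeps all three variance and length-scale parameters in their valid positive range without requiring a transformation.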
Additional results for synthetic data: In Section 4.1 of the main paper, the GP smoothing method is applied to synthetic autoregressive (AR) and moving average (MA) data. As a supplement to Figure 2 of the main paper, we here show and compare the performances of the four GP approaches. Figure 1 shows the results for the AR data, and Figure 2 shows the results for the MA data. In both figures the top left panel refers to the results of 'MAP and PRED-EXPECTATION'; those results are also shown in Figure 2 of the main paper. Figures 1-2 show that the four GP variants perform similarly and yield comparable covariate scores. The two true covariates X_1 and X_2 yield higher scores and can clearly be distinguished from the irrelevant variables X_3, ..., X_8. Tables 1-2 provide the method-specific average scores of the true covariates and the average scores of the irrelevant variables. Although the differences are neither significant nor substantial, the method 'MAP and PRED-EXPECTATION', used in the main paper, appears to perform slightly better than the other three GP methods: the difference between the average scores of the true and the false variables is slightly higher.

Additional results for mTORC1 data: In Section 4.5 of the main paper, the new partially non-homogeneous DBN model was combined with the GP method 'MAP and PRED-EXPECTATION' to reconstruct the mTORC1 network topology. We also inferred the mTORC1 network using our new model in combination with the three alternative GP methods, and computed the GP-method-specific edge scores. Figure 3 shows scatter plots in which the scores of the reference method 'MAP and PRED-EXPECTATION' (vertical axes) are plotted against those of the three alternative GP methods (horizontal axes). Table 3 lists all edges that have been inferred and indicates with which of the GP methods each edge was extracted.
From the scatter plots it can be seen that the method-specific edge scores are correlated (Pearson correlation coefficients: ρ = 0.86, ρ = 0.84, and ρ = 0.73), and Table 3 shows that the four networks share seven edges. Moreover, the networks for 'MAP and PRED-EXPECTATION' and 'MCMC and PRED-EXPECTATION' are similar (three mismatches), and the networks for 'MAP and PRED-SAMPLING' and 'MCMC and PRED-SAMPLING' are similar (also three mismatches). We conclude that it does not make a large difference here whether the GP parameters are MCMC sampled or MAP estimated. It can also be seen from Table 3 that the networks of the latter group ('... and PRED-SAMPLING') have fewer edges (they lose 5 edges and gain 1-2 edges). Further diagnostics (not shown) suggest that the sparsity of these networks can be explained as follows: the mTORC1 data are rather sparse (only T_k = 10 data points for each of K = 2 conditions), so that there is a high amount of uncertainty in the predictive curves. The predictive distributions (see Equation (11) of the main paper) are therefore rather diffuse, with high variances of the individual elements. Sampling from diffuse predictive distributions yields response curves that can vary substantially from MCMC iteration to MCMC iteration. For the '... and PRED-SAMPLING' approaches the edge scores are thus averages across all possible response vectors (with weights being their predictive probabilities), leading to lower edge scores and, hence, to sparser network topologies.

Part E - The topology of the RAF signalling pathway

Figure 4 shows the topology of the RAF signalling pathway, as reported in Sachs et al. (2005). The RAF pathway consists of N = 11 nodes (phosphorylated proteins) and possesses 20 directed edges. We used this network topology for generating synthetic network data. See Section 4.2 of the main paper for further details.

Part F - Scalability and convergence analysis (additional results)
In this part of the supplementary material we provide additional figures for Section 4.3 of the main paper. Figures 5-7 show superimposed edge score scatter plots for the network sizes N = 10, N = 25 and N = 50. These figures correspond to the left panel of Figure 4 (main paper). The top right panel of Figure 4 (main paper) shows average AUC trace plots. Figure 8 of this supplement shows the individual AUC trace plots, from which the average curves in the top right panel of Figure 4 (main paper) were computed. We refer to these figures in Section 4.3 of the main paper.