Yiping Yuan, Xiaotong Shen, Wei Pan, Zizhuo Wang, Constrained likelihood for reconstructing a directed acyclic Gaussian graph, Biometrika, Volume 106, Issue 1, March 2019, Pages 109–125, https://doi.org/10.1093/biomet/asy057
SUMMARY
Directed acyclic graphs are widely used to describe directional pairwise relations. Such relations are estimated by reconstructing a directed acyclic graph’s structure, which is challenging when the ordering of nodes of the graph is unknown. In such a situation, existing methods such as the neighbourhood and search-and-score methods have high estimation errors or computational complexities, especially when a local or sequential approach is used to enumerate edge directions by testing or optimizing a criterion locally, as a local method may break down even for moderately sized graphs. We propose a novel approach to simultaneously identifying all estimable directed edges and model parameters, using constrained maximum likelihood with nonconvex constraints. We develop a constraint reduction method that constructs a set of active constraints from super-exponentially many constraints. This, coupled with an alternating direction method of multipliers and a difference convex method, permits efficient computation for large-graph learning. We show that the proposed method consistently reconstructs identifiable directions of the true graph and achieves the optimal performance in terms of parameter estimation. Numerically, the method compares favourably with competitors. A protein network is analysed to demonstrate that the proposed method can make a difference in identifying the network’s structure.
1. Introduction
Directed acyclic graphical models are widely used to represent directional relations or parent-child relations among interacting units. The models encode graph-to-distribution correspondences, defined by a local Markov property (Edwards, 2012). Motivated by network analysis for uncovering gene-gene directional relations, as in Sachs et al. (2005), we reconstruct a directed acyclic graph without a prespecified total ordering of a set of units.
Most existing methods are designed for situations in which the graph is small relative to the sample size. The major approaches are of two types. The first uses local conditional independence tests sequentially (de Campos, 1998; Spirtes et al., 2000; Cheng et al., 2002) to check the existence of directed edges. Such methods usually have worst-case complexity that is exponential in the number of nodes. The popular PC-algorithm (Spirtes et al., 2000) has a complexity of order |$O(p^q)$| with |$p$| nodes and maximal neighbourhood size |$q$| (Kalisch & Bühlmann, 2007). The algorithm is efficient for a graph that has sparse neighbourhoods with, say, |$q \leqslant 3$|, but becomes inefficient when |$q>4$|. Yet the neighbourhood sizes for many natural graphs are rarely uniformly bounded by a small |$q$| (Faloutsos et al., 1999), while the maximal neighbourhood size |$q$| is expected to increase with |$p$|. The second approach, referred to as search-and-score, optimizes a goodness-of-fit measure for possible directions in a neighbourhood of interest (Chickering, 2003; Brown et al., 2005; de Campos, 2006). Computationally, such methods suffer from the curse of dimensionality. The major difficulty for both categories of methods comes from acyclicity, as pointed out in Cooper & Herskovits (1992), Chickering et al. (2004) and Malone et al. (2014). As a result, the development of directed acyclic graph models lags behind that of undirected ones (Liu et al., 2009) in high-dimensional situations. Shojaie & Michailidis (2010) proposed an |$L_1$|-penalization method for learning a directed acyclic graphical model with a known topological ordering of the nodes. Fu & Zhou (2013) used an |$L_1$|-penalization method for interventional data. Bühlmann et al. (2014) developed a high-dimensional additive structural equation model based on order search and penalized regression.
Exact learning of a directed acyclic graph can be achieved through dynamic programming (Koivisto & Sood, 2004), which guarantees a global minimizer but has exponential time complexity and is feasible only for small graphs, as shown in a 2005 Carnegie Mellon University technical report by Singh and Moore. An exact learning approach based on mixed integer linear programming (Jaakkola et al., 2010; De Campos & Ji, 2011; Cussens et al., 2013) may reduce the complexity to a polynomial order when the maximal number of parent nodes is restricted to one or two in the search, allowing dynamic programming that would otherwise be infeasible (Koivisto & Sood, 2004). Such a restriction limits the scope of application and may destroy the acyclicity requirement of a directed acyclic graph and hence the local Markov property (Edwards, 2012) of the graphical model. In this article, we employ an iterative difference convex algorithm to seek a good local minimizer with fast convergence. A global version of our difference convex algorithm, the outer approximation method (Breiman & Cutler, 1993), guarantees globality of a solution but converges slowly. As suggested by our numerical analysis, our difference convex algorithm yields satisfactory globality results.
In addition to the computational difficulties in the high-dimensional situation, there are theoretical challenges. First, the reconstruction error of a directed acyclic graph may grow super-exponentially in |$p$| when many local tests are performed. Roughly, this error is no less than |$\min [1,\,\exp\{p \log p - b(\varepsilon) n\} ]$| in view of Bahadur’s lower bound for the error of each test (Bahadur et al., 1980), where |$b(\varepsilon)$| is a constant describing the least favourable situation defined by the Kullback–Leibler information. In other words, any local and sequential method, and in particular the PC-algorithm and its variants, may fail, which occurs roughly when |$p \log p$| significantly exceeds |$n$|. Second, there is little theory to guide practice for reconstructing a directed acyclic graph. One relevant theory is consistent reconstruction of a directed acyclic graph’s structure by the PC-algorithm (Kalisch & Bühlmann, 2007), which relies on strong faithfulness (Spirtes et al., 2000; Zhang & Spirtes, 2002). Unfortunately, this assumption is rather restrictive, because it induces a small set of distributions (Uhler et al., 2013).
In this article, we develop a simultaneous reconstruction approach to jointly estimating the configuration of a directed acyclic graph and model parameters, which overcomes the difficulties of the local and sequential approaches. We propose a constrained likelihood with nonconvex constraints, to quantify the parameter space involving acyclicity. Our treatment of this crucial acyclicity requirement uses a property of doubly stochastic matrices to derive an equivalent form involving only |$p^3-p^2$| active constraints. This, combined with a constrained alternating direction method of multipliers (Gabay & Mercier, 1976) and difference convex programming, makes it possible to solve this bottleneck, thus leading to efficient computation involving a complexity of order |$O(p^3+np^2)$|. We develop theory to quantify what the proposed method can accomplish, focusing on a situation of equal error variance for identifiable directed acyclic graphical models (Peters & Bühlmann, 2014). We show that our method consistently reconstructs the true directed acyclic graph under a degree-of-reconstructability assumption, (18). With regard to estimating model parameters, it recovers the optimal performance of the oracle estimator. This degree-of-reconstructability assumption, similar to the beta-min condition (van de Geer & Bühlmann, 2013) or strong faithfulness, requires that the candidate and true models be well separated, i.e., that the minimal degree of separation should exceed a certain threshold; however, the exact connection between them remains unavailable. Raskutti & Uhler (2014) have shown that a weaker condition than strong faithfulness yields consistency for their score-based method.
2. Formulation
2.1. Gaussian directed acyclic graph and its parameter space
A directed acyclic graphical model encodes the joint probability distribution of a random vector |$(X_1,\dots, X_p)$|, whose nodes and directed edges represent, respectively, the variables |$X_1,\dots, X_p$| and the parent-child dependence relations between any two variables. The parents of |$X_j$|, denoted by |$\text{pa}_j$|, are the variables with an arrow towards |$X_j$| in the graph; |$\text{pa}_j$| is defined to be the empty set if |$X_{j}$| has no parents. The model factorizes the joint distribution of |$(X_1,\dots, X_p)$|, |${\rm pr}(X_1,\dots, X_p)$|, into a product of conditional distributions of each variable given its parents, that is, |$\prod_{j=1}^p {\rm pr}(X_j\mid\text{pa}_j)$|. This factorization property is equivalent to the local Markov property (Edwards, 2012) of a directed acyclic graph.
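As a concrete illustration (ours, not taken from the paper's examples): for the three-node chain |$X_1 \to X_2 \to X_3$|, the factorization reads |${\rm pr}(X_1,X_2,X_3)={\rm pr}(X_1)\,{\rm pr}(X_2\mid X_1)\,{\rm pr}(X_3\mid X_2)$|, since |$X_1$| has no parents, |$\text{pa}_2=\{X_1\}$| and |$\text{pa}_3=\{X_2\}$|.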
Directed acyclic graph models have been under-studied compared with undirected graphs (Liu et al., 2009). The major challenge stems from acyclicity and the corresponding nonconvex parameter space, defined by nonconvex constraints enforcing acyclicity of a graph (van de Geer & Bühlmann, 2013). Characterization of such a parameter space remains an open problem. In contrast, the parameter space for Gaussian undirected graphical models (Friedman et al., 2008) is a positive-semidefinite cone |$\mathbb{S}^{p}_{+}= \{ \Omega \in \mathbb{R}^{p\times p} : \Omega= \Omega^{{\rm T}}, \,\Omega \succeq 0 \}$|, which is convex.
Next we present our main result on constraint reduction, which reduces the |$O(p^{\,p})$| constraints in (3) to the |$p^3-p^2$| constraints in (4). This result is based on duality and properties of permutation matrices, and can be used with any objective function.
In (4), there are |$p^3-p^2$| constraints on |$( A, \lambda)$|. This allows us not only to reduce the super-exponentially many constraints on |$ A$| to |$p^3-p^2$| active constraints over |$( A, \lambda)$| but also to achieve simplicity in the sense that each constraint involves only one parameter in |$ A$| and is linear in |$\lambda_{jk}$| and |$I(A_{ij} \neq 0)$|.
This nonuniqueness of |$\lambda$| in Theorem 1 does not affect our algorithm: to remove the redundancy, we may fix, for example, the first row by setting |$\lambda_{1k}=1$|.
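For intuition about what these constraints must enforce, acyclicity of a candidate adjacency matrix can be verified directly by repeatedly peeling off parentless nodes. The R sketch below is ours and is distinct from the constraint formulation in (3) and (4), which embeds acyclicity into the optimization itself; here |$A_{jk}\neq 0$| encodes an edge from node |$k$| to node |$j$|.

```r
# Direct acyclicity check by repeatedly removing parentless nodes
# (a graph is acyclic if and only if this process exhausts all nodes).
is_acyclic <- function(A) {
  B <- (A != 0) * 1                      # unweighted adjacency; row j lists the parents of j
  repeat {
    if (nrow(B) == 0) return(TRUE)       # all nodes removed: no cycle
    parentless <- which(rowSums(B) == 0) # nodes with no remaining parents
    if (length(parentless) == 0) return(FALSE)  # every node has a parent: cycle
    B <- B[-parentless, -parentless, drop = FALSE]
  }
}
```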
2.2. Constrained maximum likelihood
For simplicity we suppose that the |$\sigma_j$| (|$j=1,\ldots,p$|) in (2) are equal, as equal variances for the |$Z_j$| yield identifiable directed acyclic graphs (Peters & Bühlmann, 2014) and we can avoid an unnecessary detour involving equivalence classes. A case with different error variances is discussed in § 4 and in the Supplementary Material.
In practice, |$\tau$| and |$K$| are treated as tuning parameters, optimized over a grid on their domain using an independent validation set or by cross-validation. Minimization of (11) subject to (12) and (13) involves |$p^3-p^2+1$| nonconvex constraints in (13); however, in contrast to the original formulation with indicator functions, |$J_{\tau}(\cdot)$| is piecewise linear and can be decomposed into the difference of two convex functions. We solve this constrained minimization through difference convex programming and the alternating direction method of multipliers.
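To see the decomposition concretely, suppose |$J_{\tau}(z)=\min(|z|/\tau,1)$|, the usual truncated-|$L_1$| form; the paper's exact scaling may differ. Then |$J_\tau(z)=|z|/\tau-\max(|z|/\tau-1,0)$| is a difference of two convex functions, as the following R check (ours) confirms numerically.

```r
# Numerical check that the truncated L1 function equals a difference of
# two convex pieces (assuming J_tau(z) = min(|z|/tau, 1)).
tau <- 0.1
z  <- seq(-1, 1, by = 0.001)
J  <- pmin(abs(z) / tau, 1)        # truncated L1 penalty
g1 <- abs(z) / tau                 # convex piece
g2 <- pmax(abs(z) / tau - 1, 0)    # convex piece subtracted off
stopifnot(max(abs(J - (g1 - g2))) < 1e-12)
```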
3. Computation
Our computational strategy for solving (11) subject to (12) and (13) consists of two steps. First, we relax (12) and (13) using a sequence of approximations involving convex constraints, where each approximation is refined iteratively. Then we solve each convex subproblem with |$p^3-p^2+1$| linear constraints by employing a constrained alternating direction method of multipliers. The underlying process iterates until convergence.
Step 1. Supply an initial estimate |${\hat{A}}^{(0)}$|, such as |$\hat{A}^{(0)}= 0$|.
Step 2. At iteration |$m$|, compute |${\hat{A}}^{(m)}$| by minimizing (16).
Step 3. Terminate when |$l(\hat { A}^{(m-1)})-l(\hat{A}^{(m)}) \leqslant \varepsilon$|.
Here |$\varepsilon$| is the precision tolerance. Algorithm 1 gives the final estimate |$\hat{A}=\hat{A}^{(m^*)}$|, where |$m^*$| is the smallest index at which the termination criterion is met.
The computational complexity of Algorithm 1 in one iteration over the six blocks in (17) is roughly |$O(p^3+np^2)$|. Usually the alternating direction method of multipliers converges with modest accuracy within a few tens of iterations, although it can be slow to converge to high accuracy (Boyd et al., 2011). This contrasts with the fast convergence of the difference convex part, which has a finite termination property (Shen et al., 2013). Our difference convex step converged within ten iterations in our examples. The algorithm is insensitive to the choice of initial values, although better initial values lead to faster convergence.
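A skeletal R version of the outer loop of Algorithm 1 is sketched below. Here solve_convex_subproblem and negloglik are hypothetical placeholders for the ADMM solve of the convex approximation (16) and the evaluation of the negative loglikelihood; this is not the authors' implementation, which is available at the repository cited in § 5.1.

```r
# Difference-convex outer loop (Steps 1-3), with the convex subproblem
# solver abstracted away; both function arguments are placeholders.
dc_outer_loop <- function(X, eps = 1e-4, max_iter = 100,
                          solve_convex_subproblem, negloglik) {
  p <- ncol(X)
  A <- matrix(0, p, p)                    # Step 1: initial estimate A^(0) = 0
  l_old <- negloglik(A, X)
  for (m in seq_len(max_iter)) {
    A <- solve_convex_subproblem(A, X)    # Step 2: minimize (16) at iteration m
    l_new <- negloglik(A, X)
    if (l_old - l_new <= eps) break       # Step 3: terminate on a small decrease
    l_old <- l_new
  }
  A                                       # final estimate A-hat = A^(m*)
}
```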
4. Theory
In this section we develop theory for constrained maximum likelihood estimation with respect to reconstruction of a directed acyclic graph structure. We will proceed under the equal variance assumption, which implies that the distributions from different directed acyclic graphical models are identifiable (Peters & Bühlmann, 2014). We will then show that the proposed method recovers the true directed acyclic graph’s structure under (18).
Let |$\mathcal{B} = (G, \theta)$| be a parameterized directed acyclic graph, where |$ \theta = ( A, \sigma)$| and |$G=G( A)$| is the directed acyclic graph induced by the parameters. Let |$E=E( A)= \{(i,j): A_{ji}\neq 0 \}$| be the set of nonzero elements of |${ A}$|, which is equivalent to the edge set of the graph. Let |$|E|$| denote the number of elements in the set |$E$| and |$E_1\backslash E_2$| the set difference of |$E_1$| and |$E_2$|. Let |$\mathcal F_0=\{ \theta=( A, \sigma): c_{\min}( \Omega) \geqslant M_1>0, \,\sup_{1 \leqslant j \leqslant p} |\Omega_{jj}| \leqslant M_2\}$| be the parameter space containing |$( A^0, \sigma^0)$|, where |$M_1,M_2>0$| are constants, |$ \Omega= ( I- A)^{{\rm T}} ( I- A)/\sigma^2$| is the precision matrix, and |$c_{\min}( \Omega) $| is the smallest eigenvalue of |$ \Omega$|. The assumptions that |$\inf_{ A \in \mathcal F_0} c_{\min}( \Omega) \geqslant M_1>0$| and |$\sup_{1 \leqslant j \leqslant p} |\Omega_{jj}| \leqslant M_2$| suffice to ensure that the likelihood function is bounded; |$\mathcal{B}^0$|, |$G^0$|, |$ \theta^0$|, |$ A^0$|, |$E^0$| and |$ \Omega^0$| represent the corresponding true values.
Below we derive a finite-sample probability error bound for reconstructing the true directed acyclic graph by the proposed method, which not only reconstructs the true directed acyclic graph when identifiable but also recovers the optimal performance of the oracle estimator |$(\hat{A}^{\text{OR}} ,\hat{\sigma}^{\text{OR}} )$|, defined as the maximum likelihood estimator assuming that the true edge set |$E^0$| is known a priori.
Assumption 1. For some positive constants |$M_1$| and |$M_2$|, |$\inf_{ \Omega} c_{\min}( \Omega) \geqslant M_1>0$| and |$\sup_{1 \leqslant k \leqslant p} |\Omega_{kk}| \leqslant M_2$|, where |$c_{\min}( \Omega )$| is the smallest eigenvalue of |$ \Omega$| and |$\Omega_{kk}$| is the |$k$|th diagonal element of |$\Omega$|.
Assumption 1 is a condition on boundedness of entries of |$ \Omega$|. It implies that |$\sigma^2 \geqslant M^{-1}_2$| since |$\Omega_{jj}= 1/\sigma^2 + \sum_{k \neq j} A^2_{jk}/\sigma^2$| for an adjacency matrix. Moreover, from (2), |$\sigma^2 \leqslant \mbox{var}(X_j)=\Sigma_{jj}= e_j^{{\rm T}} \Sigma e_j \leqslant 1/c_{\min}( \Omega) \leqslant 1/M_1$|, where |$ e_j$| is the standard basis vector with |$j$|th element equal to |$1$| and the remaining elements |$0$|. Therefore, |$\sigma^2$| is bounded above and below.
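These bounds are easily checked numerically; the following R snippet (ours, purely illustrative) verifies them for a random strictly lower-triangular adjacency matrix.

```r
# Quick numeric illustration of the bounds above: sigma^2 lies
# between 1/M_2 and 1/M_1 for a random strictly lower-triangular A.
set.seed(1)
p <- 5; sigma <- 0.8
A <- matrix(0, p, p)
A[lower.tri(A)] <- rbinom(p * (p - 1) / 2, 1, 0.3) * 0.5
Omega <- t(diag(p) - A) %*% (diag(p) - A) / sigma^2   # precision matrix
M1 <- min(eigen(Omega, symmetric = TRUE)$values)      # c_min(Omega)
M2 <- max(diag(Omega))                                # sup_j Omega_jj
stopifnot(sigma^2 >= 1 / M2, sigma^2 <= 1 / M1)
```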
The degree of reconstructability measures the overall degree of difficulty of reconstruction, which is roughly determined by two terms. While |$\gamma_{\min}^2( A^0)$| reflects the signal strength in terms of the minimal nonzero size of |$ \Omega^0$|, |$c_{\min}( H)$| can be thought of as the local curvature, or second derivative, of the loglikelihood at |$ \Omega^0$|, which measures dependency among entries of |$ \Omega^0$|. Assumption 2 below requires the degree of reconstructability to exceed a certain threshold proportional to |$n^{-1} \max(\log p,|E^0|)$|, or roughly |$c^* \gamma^2_{\min} c_{\min}( H) \geqslant n^{-1} \max(\log p,|E^0|)$|. This aspect differs from the existing conditions for non-likelihood-based methods; unfortunately, the exact connection between these conditions remains to be clarified.
For reconstruction, we show that the proposed method consistently reconstructs the true directed acyclic graph and attains optimal estimation, provided that the degree of reconstructability exceeds a certain level, |$n^{-1} \log p$|.
Theorem 2 below says that the oracle estimator |$\hat{\Omega}^{\rm OR} = ( I - \hat{A}^{\rm OR})^{{\rm T}}( I - \hat{A}^{\rm OR} )/(\hat{\sigma}^{\rm OR})^2$| can be reconstructed by a global minimizer |$\hat{\theta}^{L_0} = (\hat{A}^{L_0},\hat{\sigma}^{L_0})$| of (8) subject to (9) and (10). As |$n \rightarrow \infty$|, the reconstructed directed acyclic graph |$\hat{G}^{L_0}$| is consistent for the true directed acyclic graph |$G^0$|; moreover, the optimal performance |$E h^2(\hat{\Omega}^{\rm OR} , \Omega^{\rm 0 })$|, as measured by the Hellinger risk, is recovered by |$E h^2(\hat{\Omega}^{L_0}, \Omega^{\rm 0})$|.
Theorem 2. Under Assumption 2, |${\rm pr}(\hat{G}^{L_0} \neq G^{0}) \rightarrow 0$| and |$E\{ h^2(\hat{\Omega}^{L_0}, \Omega^{0})\}/ E \{h^2(\hat{\Omega}^{\rm OR}, \Omega^{0}) \} \rightarrow 1$| as |$n \rightarrow \infty$|, where |$p$| and |$|E^0|$| may depend on |$n$|.
A similar result can be established for the proposed estimator, the minimizer |$\hat{\theta}^{\rm {TL}} = (\hat{A}^{\rm {TL}} , \hat{D}^{\rm {TL}} )$| of (11) subject to (12) and (13), with the following additional assumption.
Theorem 3. Under Assumption 2, |${\rm pr}(\hat{G}^{\rm {TL}} \neq G^0) \rightarrow 0$| and |$E \{h^2( \hat{\Omega}^{\rm {TL}} , \Omega^0)\}/E\{ h^2( \hat{\Omega}^{\rm {OR}} , \Omega^0) \} \rightarrow 1$| as |$n, p, |E^0| \rightarrow \infty$|.
The graphical structure is recovered by |$\hat{\theta}^{L_0}$| and its computational surrogate |$\hat{\theta}^{\rm {TL}} $|, and the optimal performance of the oracle estimator is achieved.
To see how (20) and (19) are related, we examine the following example.
In contrast, (19) only requires the minimal absolute value of three conditional covariances, |${\rm cov}(X_1,X_2\mid X_3)$|, |${\rm cov}(X_1,X_3\mid X_2)$| and |${\rm cov}(X_2,X_3\mid X_1)$|, to exceed a threshold of order |$\{n^{-1} \max(\log p, |E^0|)\}^{1/2}$|. This suggests that (20) is more stringent than (19).
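For reference, such conditional covariances are computed from |$\Sigma$| by a Schur complement: |${\rm cov}(X_i,X_j\mid X_k)=\Sigma_{ij}-\Sigma_{ik}\Sigma_{kk}^{-1}\Sigma_{kj}$|. The R sketch below (ours) evaluates the three quantities for the complete graph |$X_1\to X_2$|, |$X_1\to X_3$|, |$X_2\to X_3$| with coefficients |$0.5$| and unit error variance; this is an illustrative choice, not the example's actual model.

```r
# Conditional covariance cov(X_i, X_j | X_k) via the Schur complement of Sigma.
cond_cov <- function(Sigma, i, j, k) {
  drop(Sigma[i, j] - Sigma[i, k] %*% solve(Sigma[k, k]) %*% Sigma[k, j])
}
# Covariance of the complete 3-node graph with weights 0.5 and sigma = 1.
Sigma <- matrix(c(1,    0.5,   0.75,
                  0.5,  1.25,  0.875,
                  0.75, 0.875, 1.8125), 3, 3)
cond_cov(Sigma, 1, 2, 3)   # cov(X1, X2 | X3)
cond_cov(Sigma, 1, 3, 2)   # cov(X1, X3 | X2)
cond_cov(Sigma, 2, 3, 1)   # cov(X2, X3 | X1); all three are nonzero here
```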
The proposed method is applicable to nonequal error variances in (2), although the present paper focuses on the identifiable situation assuming equal error variances. This error variance assumption seems sensible for consistent directed acyclic graph reconstruction and is natural for applications with variables from a similar domain. When the error variances in (2) are not equal, our method continues to work with a minor modification of the likelihood function; see the Supplementary Material. In such a situation, directed acyclic graphical models are not identifiable, and the equivalence classes are estimated; see Andersson et al. (1997).
5. Numerical examples
5.1. General set-up
In this section we study the operating characteristics of the proposed constrained maximum likelihood estimator, comparing its performance against competitors in terms of the estimation accuracy of directed edges and of parameters. The proposed method is compared with four top performers: the test-based high-dimensional PC-algorithm (Spirtes et al., 2000); two search-and-score methods, namely the greedy equivalent search method (Chickering, 2003) and a hill climbing method (Jensen, 2004); and a hybrid version, max-min hill climbing (Tsamardinos et al., 2006). We coded our method in C, and based on this code we also developed an R routine (R Development Core Team, 2019), available at https://github.com/ciryiping/TLPDAG. For the PC-algorithm and greedy equivalent search, we use the R package pcalg (Kalisch et al., 2012); both output partially directed acyclic graphs, so parameter estimation is not available directly. To make a fair comparison, we extend a partially directed acyclic graph to a directed acyclic graph using the function pdag2dag in the R package pcalg and then fit parameters given the extended directed structure. For the hill climbing and max-min hill climbing methods, we use the R package bnlearn (Scutari, 2010). Two simulated examples are considered in § 5.2, followed by a real-data example in § 5.3.
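For concreteness, the competing fits can be obtained along the following lines in R. This is a sketch only: the interfaces of pcalg and bnlearn evolve across versions, so argument names may need adjusting, and an |$n \times p$| data matrix X is assumed.

```r
library(pcalg)
library(bnlearn)

colnames(X) <- paste0("X", seq_len(ncol(X)))   # ensure named columns

# PC-algorithm with Gaussian conditional-independence tests at level alpha.
suffStat <- list(C = cor(X), n = nrow(X))
pc_fit <- pc(suffStat, indepTest = gaussCItest, alpha = 0.01,
             labels = colnames(X))

# Greedy equivalent search on an L0-penalized Gaussian score.
ges_fit <- ges(new("GaussL0penObsScore", data = X))

# Both return partially directed graphs; extend to a directed acyclic graph
# before fitting parameters, as described in the text.
ext <- pdag2dag(pc_fit@graph)

# Hill climbing and max-min hill climbing from bnlearn.
df <- as.data.frame(X)
hc_fit   <- hc(df)
mmhc_fit <- mmhc(df)
```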
For accuracy of estimating directed edges of a graph, we consider the false positive and false discovery rates, together with the Matthews correlation coefficient and the structural Hamming distance; for parameter estimation, we use the Kullback–Leibler loss.
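These criteria can be computed directly from the estimated and true adjacency matrices. The R sketch below (ours) uses the standard definitions, which may differ from the paper's exact conventions, for example in how reversed edges are counted.

```r
# Edge-recovery metrics from estimated and true adjacency matrices,
# where A[j, k] != 0 encodes an edge k -> j; diagonal entries are ignored.
edge_metrics <- function(Aest, Atrue) {
  off <- !diag(nrow(Aest))                 # off-diagonal positions only
  est <- (Aest != 0)[off]
  tru <- (Atrue != 0)[off]
  # as.numeric avoids integer overflow in the MCC product for large p
  TP <- as.numeric(sum(est & tru));  FP <- as.numeric(sum(est & !tru))
  FN <- as.numeric(sum(!est & tru)); TN <- as.numeric(sum(!est & !tru))
  mcc <- (TP * TN - FP * FN) /
    sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
  c(FPR = FP / (FP + TN), FDR = FP / max(TP + FP, 1), MCC = mcc)
}
```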
The max-min hill climbing method and the PC-algorithm require one tuning parameter, |$\alpha$|, controlling the significance level for independence tests, which is selected from |$\{0.0001, 0.001, 0.005, 0.01, 0.05\}$|. For our method, |$\tau$| and |$K$| are chosen from |$\{0.001, 0.01, 0.1\}$| and the integers from |$1$| to |$150$|, respectively; they control the threshold and the number of edges. The best tuning parameters maximize the predicted loglikelihood (6), based on an independent tuning set |$ X_{\rm val}$| of size 1000, over the specified set of |$\alpha$| or |$(\tau,K)$|. The hill climbing and greedy equivalent search algorithms do not require tuning.
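The grid search itself is straightforward; in the R sketch below (ours), fit_cmle and predicted_loglik are hypothetical placeholders for the constrained maximum likelihood fit and the predicted loglikelihood (6) on the validation set.

```r
# Select (tau, K) maximizing the predicted loglikelihood on X_val;
# fit_cmle() and predicted_loglik() are hypothetical placeholders.
best <- list(score = -Inf)
for (tau in c(0.001, 0.01, 0.1)) {
  for (K in 1:150) {
    fit   <- fit_cmle(X_train, tau = tau, K = K)
    score <- predicted_loglik(fit, X_val)
    if (score > best$score)
      best <- list(score = score, tau = tau, K = K, fit = fit)
  }
}
```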
5.2. Simulated examples
Examples 2 and 3 involve respectively a sparse neighbourhood graph, where each node has a sparse neighbourhood (Kalisch & Bühlmann, 2007), and a sparse graph with a nonsparse neighbourhood.
Example 2. Consider a directed acyclic graph of 100 nodes, generated randomly using the mechanism described in Kalisch & Bühlmann (2007). First we construct the true adjacency matrix |$ A$|. To begin with, set |$ A= 0$|. Given a prespecified ordering of the 100 nodes, we replace every entry in the lower triangle of |$ A$| by an independent Bernoulli draw of |$0$| or |$1$| with success probability |$s=0.02$|, where |$1$| indicates an edge connection and |$s$| controls the degree of sparseness of the graph. Then we parameterize |$ A$| by replacing all the entries of value |$1$| with |$0.5$|, a value indicating the signal strength. Finally, given |$ A$|, a random sample is generated according to (5) with |$\sigma=1$|; a sketch of this generator is given below. Table 1 shows the results.
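A minimal R version of this generator (ours; the authors' implementation may differ in details):

```r
# Generate n samples from a random 100-node Gaussian directed acyclic graph:
# lower-triangular Bernoulli(s) edges with weight 0.5, then X = (I - A)^{-1} Z.
set.seed(1)
n <- 100; p <- 100; s <- 0.02; sigma <- 1
A <- matrix(0, p, p)
A[lower.tri(A)] <- rbinom(p * (p - 1) / 2, 1, s) * 0.5
Z <- matrix(rnorm(n * p, sd = sigma), n, p)
X <- Z %*% t(solve(diag(p) - A))    # each row X_i solves X_i = A X_i + Z_i
```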
Table 1. Results from Example 2: average false positive rate, false discovery rate, Matthews correlation coefficient, Kullback–Leibler loss and structural Hamming distance, with standard errors in parentheses, for five competing methods based on |$100$| replications

| |$n$| | |$p$| | Method | FPR (%) | FDR (%) | MCC | KL | SHD |
|---|---|---|---|---|---|---|---|
| 50 | 100 | CMLE | 0.96 (0.27) | 62.9 (6.5) | 0.29 (0.05) | 22.6 (1.4) | 105.4 (8.5) |
| | | PC | 0.29 (0.07) | 90.5 (10.5) | 0.03 (0.05) | 24.8 (1.0) | 106.7 (2.3) |
| | | GES | 4.41 (0.17) | 69.4 (2.5) | 0.29 (0.04) | 42.2 (2.9) | 239.8 (11.8) |
| | | HC | 17.4 (1.93) | 91.9 (1.0) | 0.19 (0.02) | 6.1E13 (8.5E13) | 859.8 (92.7) |
| | | MMHC | 0.95 (0.13) | 49.7 (4.8) | 0.45 (0.04) | 24.1 (2.1) | 96.2 (7.7) |
| 100 | 100 | CMLE | 0.59 (0.23) | 24.2 (8.0) | 0.79 (0.05) | 10.6 (1.1) | 82.4 (13.7) |
| | | PC | 0.74 (0.19) | 71.9 (6.8) | 0.18 (0.07) | 17.6 (1.7) | 94.5 (5.1) |
| | | GES | 3.32 (0.19) | 57.0 (2.9) | 0.49 (0.04) | 15.4 (0.8) | 165 (10.5) |
| | | HC | 4.57 (0.43) | 67.6 (2.1) | 0.55 (0.02) | 25.6 (2.8) | 234.5 (25.2) |
| | | MMHC | 0.64 (0.08) | 25.4 (2.6) | 0.79 (0.02) | 9.4 (0.7) | 57.8 (6.2) |
| 1000 | 100 | CMLE | 0.05 (0.03) | 2.3 (1.4) | 0.98 (0.01) | 0.3 (0.1) | 2.5 (1.6) |
| | | PC | 1.32 (0.31) | 53.7 (6.9) | 0.47 (0.06) | 8.7 (0.4) | 64.3 (15.1) |
| | | GES | 1.39 (0.16) | 33.7 (3.4) | 0.71 (0.03) | 6.1 (0.1) | 67.8 (7.8) |
| | | HC | 1.20 (0.14) | 35.4 (2.7) | 0.79 (0.02) | 0.7 (0.1) | 58.2 (6.8) |
| | | MMHC | 0.42 (0.10) | 15.8 (3.3) | 0.91 (0.02) | 0.4 (0.0) | 20.2 (4.8) |

CMLE, constrained maximum likelihood estimation; PC, PC-algorithm; GES, greedy equivalent search; HC, hill climbing; MMHC, max-min hill climbing; FPR, false positive rate; FDR, false discovery rate; MCC, Matthews correlation coefficient; KL, Kullback–Leibler loss; SHD, structural Hamming distance. Smaller values of FPR, FDR, KL and SHD indicate better results, whereas larger values of MCC are better.
Example 3. This example is modified from Example 2 to generate a directed acyclic graph of 100 nodes with a hub node. Instead of generating the edge set by Bernoulli sampling, the only directed edges are those from the first node, i.e., the hub node, to the next 49 nodes. The other 50 nodes are independent variables. Such highly connected hub nodes are of special interest because they are the backbones of the network architecture. The neighbourhood of the first node is no longer sparse, but the overall graph remains sparse; a sketch of the corresponding adjacency matrix is given below. Table 2 shows the results.
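The hub adjacency matrix is simple to write down; a minimal sketch (ours):

```r
# Hub graph: directed edges from node 1 to nodes 2-50 with weight 0.5,
# so A[j, 1] != 0 encodes the edge 1 -> j; nodes 51-100 are isolated.
p <- 100
A <- matrix(0, p, p)
A[2:50, 1] <- 0.5
```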
Table 2. Results from Example 3: average false positive rate, false discovery rate, Matthews correlation coefficient, Kullback–Leibler loss and structural Hamming distance, with standard errors in parentheses, for five competing methods based on |$100$| replications

| |$n$| | |$p$| | Method | FPR (%) | FDR (%) | MCC | KL | SHD |
|---|---|---|---|---|---|---|---|
| 50 | 100 | CMLE | 0.29 (0.26) | 70.4 (11) | 0.21 (0.13) | 11.4 (1.7) | 56.4 (8.7) |
| | | PC | 0.03 (0.02) | 85 (32.8) | 0.03 (0.07) | 16.4 (1.2) | 49.7 (1.3) |
| | | GES | 5.26 (0.12) | 97.7 (0.2) | 0.01 (0.01) | 50.5 (3.3) | 300.9 (5.9) |
| | | HC | 20.2 (2.23) | 98.1 (0.5) | 0.05 (0.03) | 5.6E13 (7.6E13) | 1014.2 (110.3) |
| | | MMHC | 1.48 (0.14) | 94.2 (1.0) | 0.06 (0.01) | 25.9 (1) | 116.9 (6.3) |
| 100 | 100 | CMLE | 0.53 (0.14) | 49.6 (12) | 0.52 (0.13) | 5.7 (1.5) | 47.5 (12.3) |
| | | PC | 0.01 (0.02) | 34 (41.1) | 0.16 (0.10) | 15.4 (0.7) | 47.4 (1.6) |
| | | GES | 4.49 (0.15) | 97.3 (0.2) | 0.02 (0.01) | 20.9 (0.8) | 262.9 (7.1) |
| | | HC | 5.15 (0.27) | 86.4 (0.9) | 0.32 (0.03) | 27.7 (3.0) | 259.6 (12) |
| | | MMHC | 1.77 (0.10) | 92.2 (1.6) | 0.10 (0.02) | 15.9 (1.0) | 128.1 (5.8) |
| 1000 | 100 | CMLE | 0.01 (0.01) | 1.0 (1.1) | 0.99 (0.01) | 0.2 (0.05) | 0.5 (0.5) |
| | | PC|$\ast$| | N/A | N/A | N/A | N/A | N/A |
| | | GES | 3.20 (0.07) | 96.3 (0.1) | 0.04 (0.001) | 10.6 (0.1) | 199.8 (3.5) |
| | | HC | 0.94 (0.13) | 48.4 (3.1) | 0.72 (0.02) | 0.6 (0.1) | 46.2 (6.2) |
| | | MMHC|$\ast$| | N/A | N/A | N/A | N/A | N/A |

CMLE, constrained maximum likelihood estimation; PC, PC-algorithm; GES, greedy equivalent search; HC, hill climbing; MMHC, max-min hill climbing; FPR, false positive rate; FDR, false discovery rate; MCC, Matthews correlation coefficient; KL, Kullback–Leibler loss; SHD, structural Hamming distance; |$\ast$|, failed to converge within 24 hours. Smaller values of FPR, FDR, KL and SHD indicate better results, whereas larger values of MCC are better.
With regard to structure estimation, in Example 2 the proposed method outperforms its competitors when |$n$| is large but performs slightly worse than the best performer when |$n$| is small, in terms of the Matthews correlation coefficient and the structural Hamming distance. In Example 3 it continues to be the best performer, whereas the greedy equivalent search, hill climbing and max-min hill climbing algorithms, which perform reasonably in Example 2, deteriorate in Example 3 because of the nonsparse neighbourhood of the hub node.
Moreover, the PC-algorithm and the max-min hill climbing algorithm fail to converge within 24 hours when |$n=1000$| in Example 3. The hill climbing and greedy equivalent search algorithms perform poorly in almost all situations; the performance of hill climbing can be partly attributed to inappropriate use of the Bayesian information criterion for model selection in a high-dimensional setting.
With regard to parameter estimation, the proposed method performs best in most situations, as measured by the Kullback–Leibler loss.
Overall, our constrained maximum likelihood method compares favourably against the other methods.
5.3. Analysis of cell signalling data
We use our constrained maximum likelihood estimation method to analyse the multivariate flow cytometry data in Sachs et al. (2005). Data were collected after a series of stimulatory cues and inhibitory interventions, with cell reactions terminated by fixation 15 minutes after stimulation, to profile the effects of each condition on the intracellular signalling networks of human primary naive CD4+ T-cells. Flow cytometry measurements were taken on 11 phosphorylated proteins and phospholipids under nine experimental conditions. Such simultaneous measurements allow researchers to infer causal influences in cellular signalling. Among the nine experimental conditions, we are particularly interested in the one where anti-CD3/CD28, i.e., anti-cluster of differentiation 3/28, and intercellular adhesion molecule-2 were used as general perturbations, because they tend to activate cell signalling.
This is a well-studied example and we use the representation of the consensus network in Sachs et al. (2005) as our true network, shown in Fig. 1(a). An arrow from one node to another is interpreted as a directional influence between the corresponding proteins.
Fig. 1. The true network and the networks reconstructed by the proposed constrained maximum likelihood method and four other methods, with abbreviations as given in the captions of Tables 1 and 2; correct discoveries are represented by solid black lines, false discoveries are displayed as black dotted lines, and incorrectly identified directions with correct discoveries of the skeleton of the true graph are marked as solid grey lines.
There are 902 samples. We fit the five competing methods, using one-third of the samples for training and two-thirds for tuning. The learned networks are shown in Figs. 1(b)–(f).
Our constrained maximum likelihood method, the PC-algorithm, greedy equivalent search, the hill climbing method and the max-min hill climbing method respectively identified 4, 1, 0, 7 and 6 directed edges correctly, in addition to identifying 5, 6, 8, 1 and 1 edges whose skeleton was correct but whose direction was wrong. In other words, they discovered 9, 7, 8, 8 and 7 edges of the skeleton correctly. With regard to wrong discoveries, all five methods identified one edge from P38 to pjnk, which is not in the true graph. Finally, all the methods missed several known edges, perhaps because many directional relations in protein signalling networks exhibit nonlinear behaviour, which cannot be captured by a linear model. An analysis based on discretized data (Sachs et al., 2005; Schmidt & Murphy, 2009) tends to capture such nonlinearity. Another possible explanation is that not all transcription activities are activated under the general perturbations. Therefore, recovering more edges may require analysing inhomogeneous data under different experimental conditions, as in Sachs et al. (2005).
Acknowledgement
This research was supported in part by the U.S. National Science Foundation and National Institutes of Health. The authors thank the editors and reviewers for helpful comments and suggestions.
Supplementary material
Supplementary material available at Biometrika online includes an extension to the case of nonequal error variance, additional simulation results, technical details and proofs, and computational code.
References
R Development Core Team (2019). R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.