SUMMARY

Directed acyclic graphs are widely used to describe directional pairwise relations. Such relations are estimated by reconstructing a directed acyclic graph’s structure, which is challenging when the ordering of the graph’s nodes is unknown. In this situation, existing methods such as neighbourhood and search-and-score methods suffer from high estimation error or computational complexity, especially when a local or sequential approach is used to enumerate edge directions by testing or optimizing a criterion locally; such a local method may break down even for moderately sized graphs. We propose a novel approach to simultaneously identifying all estimable directed edges and model parameters, using constrained maximum likelihood with nonconvex constraints. We develop a constraint reduction method that constructs a set of active constraints from super-exponentially many constraints. This, coupled with an alternating direction method of multipliers and a difference convex method, permits efficient computation for large-graph learning. We show that the proposed method consistently reconstructs identifiable directions of the true graph and achieves the optimal performance in terms of parameter estimation. Numerically, the method compares favourably with competitors. A protein network is analysed to demonstrate that the proposed method can make a difference in identifying the network’s structure.

1. Introduction

Directed acyclic graphical models are widely used to represent directional relations or parent-child relations among interacting units. The models encode graph-to-distribution correspondences, defined by a local Markov property (Edwards, 2012). Motivated by network analysis for uncovering gene-gene directional relations, as in Sachs et al. (2005), we reconstruct a directed acyclic graph without a prespecified total ordering of a set of units.

Most existing methods are designed for situations in which the graph is small relative to the sample size. The major approaches are of two types. The first uses local conditional independence tests sequentially (de Campos, 1998; Spirtes et al., 2000; Cheng et al., 2002) to check the existence of directed edges. Such methods usually have worst-case complexity that is exponential in the number of nodes. The popular PC-algorithm (Spirtes et al., 2000) has a complexity of order |$O(p^q)$| with |$p$| nodes and maximal neighbourhood size |$q$| (Kalisch & Bühlmann, 2007). The algorithm is efficient for a graph that has sparse neighbourhoods with, say, |$q \leqslant 3$|⁠, but becomes inefficient when |$q>4$|⁠. Yet the neighbourhood sizes for many natural graphs are rarely uniformly bounded by a small |$q$| (Faloutsos et al., 1999), while the maximal neighbourhood size |$q$| is expected to increase with |$p$|⁠. The second approach, referred to as search-and-score, optimizes a goodness-of-fit measure for possible directions in a neighbourhood of interest (Chickering, 2003; Brown et al., 2005; de Campos, 2006). Computationally, such methods suffer from the curse of dimensionality. The major difficulty for both categories of methods comes from acyclicity, as pointed out in Cooper & Herskovits (1992), Chickering et al. (2004) and Malone et al. (2014). As a result, the development of directed acyclic graph models lags behind that of undirected ones (Liu et al., 2009) in high-dimensional situations. Shojaie & Michailidis (2010) proposed an |$L_1$|-penalization method for learning a directed acyclic graphical model with a known topological ordering of the nodes. Fu & Zhou (2013) used an |$L_1$|-penalization method for interventional data. Bühlmann et al. (2014) developed a high-dimensional additive structural equation model based on order search and penalized regression.

Exact learning of a directed acyclic graph can be achieved through dynamic programming (Koivisto & Sood, 2004), which guarantees a global minimizer but has exponential time complexity and is feasible only for small graphs, as shown in a 2005 Carnegie Mellon University technical report by Singh and Moore. An exact learning approach based on mixed integer linear programming (Jaakkola et al., 2010; De Campos & Ji, 2011; Cussens et al., 2013) may reduce the complexity to a polynomial order when the maximal number of parent nodes is restricted to one or two in the search, allowing dynamic programming that would otherwise be infeasible (Koivisto & Sood, 2004). Such a restriction limits the scope of application and may destroy the acyclicity requirement of a directed acyclic graph and hence the local Markov property (Edwards, 2012) of the graphical model. In this article, we employ an iterative difference convex algorithm to seek a good local minimizer with fast convergence. A global version of our difference convex algorithm, the outer approximation method (Breiman & Cutler, 1993), guarantees globality of a solution but converges slowly. As suggested by our numerical analysis, our difference convex algorithm yields satisfactory globality results.

In addition to the computational difficulties in the high-dimensional situation, there are theoretical challenges. First, the reconstruction error of a directed acyclic graph may grow super-exponentially in |$p$| when many local tests are performed. Roughly, this error is no less than |$\min [1,\,\exp\{p \log p - b(\varepsilon) n\} ]$| in view of Bahadur’s lower bound for the error of each test (Bahadur et al., 1980), where |$b(\varepsilon)$| is a constant describing the least favourable situation defined by the Kullback–Leibler information. In other words, any local and sequential method, and in particular the PC-algorithm and its variants, may fail, which occurs roughly when |$p \log p$| significantly exceeds |$n$|⁠. Second, there is little theory to guide practice for reconstructing a directed acyclic graph. One relevant theory is consistent reconstruction of a directed acyclic graph’s structure by the PC-algorithm (Kalisch & Bühlmann, 2007), which relies on strong faithfulness (Spirtes et al., 2000; Zhang & Spirtes, 2002). Unfortunately, this assumption is rather restrictive, because it induces a small set of distributions (Uhler et al., 2013).

In this article, we develop a simultaneous reconstruction approach to jointly estimating the configuration of a directed acyclic graph and the model parameters, which overcomes the difficulties of the local and sequential approaches. We propose a constrained likelihood with nonconvex constraints to characterize the parameter space defined by acyclicity. Our treatment of this crucial acyclicity requirement uses a property of doubly stochastic matrices to derive an equivalent form involving only |$p^3-p^2$| active constraints. This, combined with a constrained alternating direction method of multipliers (Gabay & Mercier, 1976) and difference convex programming, removes the computational bottleneck and leads to efficient computation with complexity of order |$O(p^3+np^2)$|⁠. We develop theory to quantify what the proposed method can accomplish, focusing on a situation of equal error variance, in which directed acyclic graphical models are identifiable (Peters & Bühlmann, 2014). We show that our method consistently reconstructs the true directed acyclic graph under a degree-of-reconstructability assumption, (18). With regard to estimating model parameters, it recovers the optimal performance of the oracle estimator. This degree-of-reconstructability assumption, similar to the beta-min condition (van de Geer & Bühlmann, 2013) or strong faithfulness, requires that the candidate and true models be well separated, i.e., that the minimal degree of separation should exceed a certain threshold; however, the exact connection between these conditions remains unavailable. Raskutti & Uhler (2014) have shown that a weaker condition than strong faithfulness yields consistency for their score-based method.

2. Formulation

2.1. Gaussian directed acyclic graph and its parameter space

A directed acyclic graphical model encodes the joint probability distribution of a random vector |$(X_1,\dots, X_p)$| through a graph whose nodes and directed edges represent, respectively, the variables |$X_1,\dots, X_p$| and the parent-child dependence relations between pairs of variables. The parents of |$X_j$|⁠, denoted by |$\text{pa}_j$|⁠, are the variables with an arrow towards |$X_j$| in the graph; |$\text{pa}_j$| is defined to be the empty set if |$X_{j}$| has no parents. The model factorizes the joint distribution |${\rm pr}(X_1,\dots, X_p)$| into a product of conditional distributions of each variable given its parents, |$\prod_{j=1}^p {\rm pr}(X_j\mid\text{pa}_j)$|⁠. This factorization property is equivalent to the local Markov property (Edwards, 2012) of a directed acyclic graph.

We embed directional effects induced by directed edges into a structural equation model,  
\begin{equation} \label{E:markov} X_j=f_j(\text{pa}_j,Z_j) \quad (j=1,\dots,p), \end{equation}
(1)
where |$Z_j$| is the latent error representing unexplained variation in node |$j$|⁠. Now consider a Gaussian structural equation model in which each |$f_j(\cdot\,,\cdot)$| in (1) becomes linear in |$(\text{pa}_j,Z_j)$| and |$Z_j$| follows an independent |$ N(0,\sigma_j^2)$| distribution,  
\begin{equation} \label{E:linear} X_j=\sum_{k:k\neq j} A_{jk} X_k + Z_j, \quad Z_j \sim N(0,\sigma_j^2) \quad (j=1,\dots,p), \end{equation}
(2)
where |$ A$| is the parameterized adjacency matrix in which a nonzero |$(j,k)$| element |$A_{jk}$| corresponds to a directed edge from parent node |$k$| to child node |$j$|⁠, with its value |$A_{jk}$| indicating the strength of the relation; |$A_{jk}=0$| when |$k\not\in \text{pa}_j$|⁠. In (2), our objective is to estimate |$ A$| subject to the requirement that |$ A$| defines a directed acyclic graph. This enables us to identify all parent-child relations and to estimate their strengths simultaneously through estimation of |$A$|⁠. Individual means can be incorporated by adding intercepts to (2). For simplicity, in what follows we set the means to zero.
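To make (2) concrete, the following minimal sketch, an illustrative addition rather than the authors' code, draws samples from the model: stacking the |$p$| equations gives |$X = AX + Z$|⁠, hence |$X = (I-A)^{-1}Z$|⁠, consistent with the precision matrix |$\Omega = (I-A)^{{\rm T}}(I-A)/\sigma^2$| used in § 4. The helper name `sample_sem` is hypothetical.

```python
import numpy as np

def sample_sem(A, sigma, n, seed=None):
    """Draw n samples from model (2): X = A X + Z with independent
    Z_j ~ N(0, sigma^2), i.e. X = (I - A)^{-1} Z.  For a directed acyclic
    graph, I - A is invertible, since some permutation of the nodes makes A
    strictly lower triangular."""
    p = A.shape[0]
    rng = np.random.default_rng(seed)
    Z = rng.normal(scale=sigma, size=(n, p))       # one row of errors per sample
    return np.linalg.solve(np.eye(p) - A, Z.T).T   # rows are draws of (X_1,...,X_p)

# Toy chain 1 -> 2 -> 3; A[j, k] != 0 encodes an edge from parent k to child j.
A = np.zeros((3, 3))
A[1, 0] = A[2, 1] = 0.5
X = sample_sem(A, sigma=1.0, n=1000, seed=0)
```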

Directed acyclic graph models have been under-studied compared with undirected graphs (Liu et al., 2009). The major challenge stems from acyclicity and the corresponding nonconvex parameter space, defined by nonconvex constraints enforcing acyclicity of a graph (van de Geer & Bühlmann, 2013). Characterization of such a parameter space remains an open problem. In contrast, the parameter space for Gaussian undirected graphical models (Friedman et al., 2008) is a positive-semidefinite cone |$\mathbb{S}^{p}_{+}= \{ \Omega \in \mathbb{R}^{p\times p} : \Omega= \Omega^{{\rm T}}, \,\Omega \succeq 0 \}$|⁠, which is convex.

To introduce acyclicity constraints, denote a directed cycle by |$(j_1,\dots,j_L)$|⁠, a sequence of indices of nodes, where |$L$| is the length of the directed cycle and |$j_1$| is required to be the smallest index, |$j_1 = \min_{1\leqslant k\leqslant L}j_k$|⁠, so that the directed cycle is uniquely defined. For example, the directed cycle |$1 \rightarrow2 \rightarrow 3 \rightarrow 1$| is denoted by |$(1,2,3)$|⁠. A directed acyclic graph, by definition, does not contain a directed cycle. This implies that, within every possible directed cycle, the number of directed edges present must be smaller than the number of nodes involved (Grötschel et al., 1985). Let |$I(\cdot)$| be the indicator function. To prevent a directed cycle, say |$(1,2,3)$|⁠, from occurring, we introduce a constraint |$ I(A_{2\,1} \neq 0) + I(A_{3\, 2} \neq 0) + I(A_{1\, 3} \neq 0) \leqslant 2$| to require that the number of directed edges be no greater than the number of involved nodes minus 1, which equals 2 in this example. On this basis, we introduce constraints on the entries of |$ A$| to enforce acyclicity, ruling out directed cycles of any length:  
\begin{align} &\sum_{j_{L+1}=j_1: \, 1 \leqslant k \leqslant L} I(A_{ j_{k+1}j_{k}} \neq 0) \leqslant L-1, \label{loop} \\ &\{ (j_1,\dots,j_L): j_1 = \min_{1\leqslant k\leqslant L}j_k, \: j_1\neq \dots \neq j_L, \, L=2,\dots,p\}, \notag \end{align}
(3)
where the order |$L$| is the number of nodes in a directed cycle. The total number of constraints in (3) is  
$$\binom{p}{2}+\binom{p}{3}2!+ \dots + \binom{p}{p}(p-1)! = O(p^{\,p}),$$
which is super-exponential in |$p$|⁠. The parameter space of Gaussian directed acyclic graphical models is thus defined by the |$O(p^{\,p})$| acyclicity constraints on |$ A$|⁠.
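As a quick check of this count, the small sketch below, an addition for illustration, evaluates the sum for a few values of |$p$| and compares it with the reduced count |$p^3-p^2$| obtained in Theorem 1 below.

```python
from math import comb, factorial

def n_cycle_constraints(p):
    """Number of acyclicity constraints in (3): for each cycle length L there
    are C(p, L) * (L - 1)! distinct directed cycles."""
    return sum(comb(p, L) * factorial(L - 1) for L in range(2, p + 1))

for p in (3, 5, 10):
    print(p, n_cycle_constraints(p), p**3 - p**2)
# p = 10 already gives 1,112,073 constraints in (3), versus 900 after Theorem 1.
```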

Next we present our main result on constraint reduction to reduce the |$O(p^{\,p})$| constraints in (3) to |$p^3-p^2$| constraints in (4). This result is based on duality and properties of permutation matrices, and can be used with any optimizing function.

 

Theorem 1.
The adjacency matrix |$ A$| satisfies the acyclicity constraints in (3) if and only if there exists a |$p \times p$| variable matrix |$ \lambda=(\lambda_{jk})_{p \times p}\in \mathbb{R}^{p\times p}$| such that the following constraints are satisfied by |$ A$|⁠: 
\begin{eqnarray} & \lambda_{ik} + I(j \neq k) - \lambda_{jk} \geqslant I(A_{ij}\neq 0)\quad (i,j,k=1,\dots,p,\: i\neq j). \label{p3constraintsl0} \end{eqnarray}
(4)

In (4), there are |$p^3-p^2$| constraints on |$( A, \lambda)$|⁠. This allows us not only to reduce the super-exponentially many constraints on |$ A$| to |$p^3-p^2$| active constraints over |$( A, \lambda)$| but also to achieve simplicity in the sense that each constraint involves only one parameter in |$ A$| and is linear in |$\lambda_{jk}$| and |$I(A_{ij} \neq 0)$|⁠.

Theorem 1 requires only the existence of such a matrix |$\lambda =(\lambda_{ik})$|⁠. For an adjacency matrix |$ A$| meeting the acyclicity requirement in (3), the matrix |$\lambda$| satisfying (4) may not be unique. In fact, if such a matrix exists, the matrix |$(\lambda_{ik} + c_{k})$|⁠, with its |$k$|th column shifted by a constant |$c_k$|⁠, also satisfies (4), as  
\begin{eqnarray*} & (\lambda_{ik} +c_{k}) + I(j \neq k) - (\lambda_{jk}+c_{k}) =\lambda_{ik} + I(j \neq k) - \lambda_{jk} \geqslant I(A_{ij}\neq 0) \\ &\qquad (i,j,k=1,\dots,p,\: i\neq j). \end{eqnarray*}

This nonuniqueness of |$\lambda$| in Theorem 1 does not affect our algorithm; one may, for example, set the first row |$\lambda_{1k}$| to 1 to remove the redundancy.
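Theorem 1 can be verified numerically on small graphs. In the sketch below, an illustrative addition under our reading of (4), one feasible choice for an acyclic |$ A$| (among many, by the nonuniqueness above) is |$\lambda_{ik}=I(\text{node } i \text{ is reachable from node } k)$|⁠, which the code confirms by brute force; for a cyclic |$ A$|⁠, chaining the constraints in (4) around a cycle containing node |$k$| forces |$\lambda_{kk} \geqslant \lambda_{kk}+1$|⁠, so no |$\lambda$| can exist, in line with the theorem. The helper names are hypothetical.

```python
import numpy as np
from itertools import product

def reachability(E):
    """Boolean transitive closure: R[k, i] True iff a directed path of length
    >= 1 runs from node k to node i.  E[k, i] True iff there is an edge k -> i."""
    p = E.shape[0]
    R = E.copy()
    for m in range(p):                       # Floyd-Warshall style closure
        R = R | (R[:, m:m+1] & R[m:m+1, :])
    return R

def check_theorem1(A, lam):
    """Brute-force check of the constraints (4) on the pair (A, lam)."""
    p = A.shape[0]
    return all(lam[i, k] + (j != k) - lam[j, k] >= (A[i, j] != 0)
               for i, j, k in product(range(p), repeat=3) if i != j)

# Acyclic example: edges 1 -> 2 -> 3 (A[i, j] != 0 means edge j -> i).
A = np.zeros((3, 3)); A[1, 0] = A[2, 1] = 0.5
E = (A != 0).T                               # E[k, i]: edge k -> i
lam = reachability(E).T.astype(float)        # lam[i, k] = I(i reachable from k)
print(check_theorem1(A, lam))                # True: (4) is satisfiable

# Cyclic example 1 -> 2 -> 1: chaining (4) along the cycle with k = 1 gives
# lam[0, 0] >= lam[0, 0] + 1, so no lam exists; in particular the
# reachability recipe now fails the check.
A[0, 1] = 0.5
lam = reachability((A != 0).T).T.astype(float)
print(check_theorem1(A, lam))                # False
```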

2.2. Constrained maximum likelihood

For simplicity we suppose that the |$\sigma_j$| (⁠|$j=1,\ldots,p$|⁠) in (2) are equal, as equal variances for the |$Z_j$| yield identifiable directed acyclic graphs (Peters & Bühlmann, 2014) and we can avoid an unnecessary detour involving equivalence classes. A case with different error variances is discussed in § 4 and in the Supplementary Material.

If the error variances are equal, (2) becomes  
\begin{equation} \label{E:linear2} X_j=\sum_{k:\,k\neq j} A_{jk} X_k + Z_j \quad (j=1,\dots,p), \end{equation}
(5)
where all the |$Z_j$| are independent and identically distributed according to |$N(0, \sigma^2)$| with |$\sigma > 0$|⁠.
Next, given an |$n\times p$| data matrix |$\{x_{ij}\}_{n\times p}$| sampled from (5), where its |$ij$|th entry |$x_{ij}$| is the |$i$|th observation on the |$j$|th node, the negative loglikelihood, after dropping a constant term, is  
\begin{equation} l( A)=\frac{1}{2}\sum_{j=1}^p \sum_{i=1}^n \Biggl(x_{ij}-\sum_{k:\,k\neq j} x_{ik}A_{jk}\Biggr)^2, \label{E:loglik} \end{equation}
(6)
which is convex in |$ A$|⁠.
To estimate |$ A$|⁠, we impose a sparsity constraint in addition to the acyclicity constraints defined in (3). This constraint bounds the number of nonzero entries of |$ A$|⁠,  
\begin{equation} \sum_{i,j:\,i\neq j} I(A_{ij} \neq 0) \leqslant K, \nonumber \end{equation}
where |$I(\cdot)$| is the indicator function and |$K$| is an integer-valued tuning parameter that controls the degree of sparsity.
The directed acyclic graph learning problem can be formulated as  
\begin{align} \min_{ A}& \; l( A) \label{min6}\\ \mbox{subject to }& \sum_{i,j:\,i\neq j} I(A_{ij} \neq 0) \leqslant K, \nonumber \\ &\sum_{1 \leqslant k \leqslant L:\, j_{L+1}=j_1 } I(A_{j_{k+1} j_{k} } \neq 0) \leqslant L-1, \nonumber \\ & \{ (j_1,\dots,j_L): j_1 = \min_{1\leqslant k\leqslant L}j_k, \: j_1\neq \dots \neq j_L, \, L=2,\dots,p\}. \nonumber \end{align}
(7)
By Theorem 1, (7) is equivalent to  
\begin{align} \min_{( A, \lambda)}&\; l( A) \label{min7}\\[-2pt] \end{align}
(8)
 
\begin{align} \mbox{subject to } & \sum_{i,j:i\neq j} I(A_{ij} \neq 0) \leqslant K, \label{sparsity} \\[-2pt] \end{align}
(9)
 
\begin{align} & \lambda_{ik} + I(j \neq k) - \lambda_{jk} \geqslant I(A_{ij}\neq 0)\quad (i,j,k=1,\dots,p,\: j\neq i) . \label{p3L0} \end{align}
(10)
Next, we approximate the indicator functions by piecewise-linear functions to circumvent the discontinuity, so that difference convex programming will be applicable to our nonconvex minimization. Specifically, in (9) and (10), we replace the indicator function by its computational surrogate |$J_{\tau}(\cdot)$|⁠, where |$J_{\tau}(x)=\min({|x|}/{\tau},1)$| is the truncated |$L_1$|-function (Shen et al., 2012), which approximates the indicator function as |$\tau \rightarrow 0^+$|⁠. This yields  
\begin{align} \min_{( A, \lambda)}&\;l( A) \label{min10}\\[-2pt] \end{align}
(11)
 
\begin{align} \mbox{subject to }& \sum_{i,j:\,i\neq j}J_{\tau}(A_{ij})\leqslant K, \label{sparseTLP} \\[-2pt] \end{align}
(12)
 
\begin{align} & \lambda_{ik} + I(j \neq k) - \lambda_{jk} \geqslant J_{\tau}(A_{ij})\quad (i,j,k=1,\dots,p,\: j\neq i). \label{p3TLP} \end{align}
(13)

In practice, |$\tau$| and |$K$| are treated as tuning parameters, optimized using an independent validation set or by crossvalidation over a grid on their domain. Minimization of (11) subject to (12) and (13) involves |$p^3-p^2+1$| nonconvex constraints; however, unlike the original indicator functions, |$J_{\tau}(\cdot)$| is piecewise linear and can be decomposed into the difference of two convex functions. We solve this constrained minimization through difference convex programming and the alternating direction method of multipliers.

3. Computation

Our computational strategy for solving (11) subject to (12) and (13) consists of two steps. First, we relax (12) and (13) using a sequence of approximations involving convex constraints, where each approximation is refined iteratively. Then we solve each convex subproblem with |$p^3-p^2+1$| linear constraints by employing a constrained alternating direction method of multipliers. The underlying process iterates until convergence.

For convex relaxation of constraints (12) and (13), we employ difference convex programming. In particular, we decompose |$J_{\tau}$| into a difference of two convex functions, |$J_{\tau}(z)=S_1(z)-S_2(z) = {|z|}/{\tau} - \max({|z|}/{\tau}-1,0)$|⁠, and construct a sequence of convex approximating sets iteratively by replacing |$S_2$| in the decomposition at iteration |$m$| by its affine minorization at iteration |$(m-1)$|⁠, |$S_2(z^{(m-1)})+ \nabla S_2(z^{(m-1)})^{{\rm T}}(z-z^{(m-1)})$|⁠, where |$\nabla S_2$| is a subgradient of |$S_2$| at |$z^{(m-1)}$|⁠; minorizing |$S_2$| majorizes |$J_{\tau}=S_1-S_2$|⁠, so each approximate constraint set is convex. Specifically, at iteration |$m$| we solve  
\begin{align} \min_{( A, \lambda)} &\; l( A) \label{min13}\\[-2pt] \mbox{subject to} & \sum_{i,\,j:\,i\neq j}|A_{ij}| w_{ij}^{(m-1)} \leqslant Z^{(m-1)}, \nonumber \\[-2pt] & \tau\lambda_{ik} + \tau I(j \neq k) - \tau\lambda_{jk} \geqslant |A_{ij}| w^{(m-1)}_{ij} + \tau(1-w_{ij}^{(m-1)})\nonumber \\[-2pt] & (i,j,k=1,\dots,p,\: j\neq i),\nonumber \end{align}
(14)
where |$w^{(m-1)}_{ij}=I( |\skew6\hat{A}^{(m-1)}_{ij}|< \tau)$| (⁠|$1 \leqslant i,j \leqslant p$|⁠), with |$\hat{A}^{(m-1)}$| being the solution at iteration |$(m-1)$|⁠, and |$Z^{(m-1)}=\tau\{K - \sum_{i,\,j:\, i\neq j} (1-w_{ij}^{(m-1)})\}$|⁠. This minimization problem with |$p^3-p^2+1$| convex constraints can be solved using a generic quadratic programming solver or by an alternating direction method of multipliers. Based on our limited numerical experiments, Algorithm 1 below may significantly reduce the runtime compared to quadratic programming.
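As an illustration, not the authors' code, the surrogate and the quantities driving (14) can be written directly: the weights |$w^{(m-1)}_{ij}$| flag entries below the threshold |$\tau$| at the previous iterate, and the majorized constraint replaces |$J_{\tau}(A_{ij})$| by |$|A_{ij}|/\tau$| on flagged entries and by the constant 1 elsewhere. The helper names below are hypothetical.

```python
import numpy as np

def J_tau(x, tau):
    """Truncated L1 surrogate J_tau(x) = min(|x|/tau, 1) for the indicator I(x != 0)."""
    return np.minimum(np.abs(x) / tau, 1.0)

def dc_parts(x, tau):
    """Difference convex decomposition J_tau = S1 - S2 with convex pieces
    S1(x) = |x|/tau and S2(x) = max(|x|/tau - 1, 0)."""
    return np.abs(x) / tau, np.maximum(np.abs(x) / tau - 1.0, 0.0)

def majorized_sparsity(A, A_prev, tau, K):
    """Both sides of the relaxed sparsity constraint in (14).  With weights
    w_ij = I(|A_prev_ij| < tau), the majorant replaces J_tau(A_ij) by
    |A_ij|/tau where w_ij = 1 and by the constant 1 where w_ij = 0; since the
    majorant dominates J_tau, lhs <= rhs implies constraint (12)."""
    off = ~np.eye(A.shape[0], dtype=bool)          # sums run over i != j
    w = (np.abs(A_prev) < tau) & off
    lhs = np.abs(A)[w].sum()
    rhs = tau * (K - np.count_nonzero(off & ~w))
    return lhs, rhs
```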
To solve (14), we consider its equivalent form for efficient computation,  
\begin{align} \min_{( A,\lambda)} &\;\; l( A) + \mu \sum_{i,\,j:\,i\neq j}|A_{ij}| w^{(m-1)}_{ij} \label{original_uncon}\\ \mbox{subject to }\; & \tau\lambda_{ik} + \tau I(j \neq k) - \tau\lambda_{jk} \geqslant |A_{ij}|w^{(m-1)}_{ij} + \tau(1-w_{ij}^{(m-1)}) \nonumber\\ & (i,j,k=1,\dots,p,\: j\neq i), \nonumber \end{align}
(15)
where |$\mu$| is a nonnegative regularizer corresponding to |$Z^{(m-1)}$| in (14). Their correspondence is as follows: a solution |$\{\skew6\tilde{A}(\mu), \tilde{ \lambda}(\mu)\}$| of (15) is also a global minimizer of (14), where |$Z^{(m-1)}=\sum_{i,\,j:\,i\neq j}|\skew6\tilde{A}_{ij} (\mu)| w^{(m-1)}_{ij}$|⁠, and vice versa.
To proceed, we introduce |$ B_{p\times p}$| to separate the differentiable and nondifferentiable parts involving the |$L_1$|-norm. In addition, we introduce a slack variable tensor |$\xi =\{\xi_{ijk}\}_{p\times p \times p}$| to convert inequality to equality constraints. Then (15) becomes  
\begin{align} \min_{{( A, \lambda)}} &\;\; l( A) + \mu\sum_{i,\,j:\,i\neq j}|B_{ij}|w^{(m-1)}_{ij} \label{ADMMlasso2} \\ \mbox{subject to }\; & \tau\lambda_{ik} + \tau I(j \neq k) - \tau\lambda_{jk} - |B_{ij}|w^{(m-1)}_{ij} - \tau(1-w_{ij}^{(m-1)}) -\xi_{ijk}=0 \nonumber \\ & (i,j,k=1,\dots,p,\: j\neq i), \nonumber \\ & \xi_{ijk}\geqslant 0, \quad A- B = 0.\nonumber \end{align}
(16)
Following Boyd et al. (2011), we obtain an augmented Lagrangian |$L_{\rho}( A, B, \lambda, \xi, Y, U)$| by introducing the scaled dual variable tensor |$ Y= \{y_{ijk}\}_{p\times p \times p}$| and the dual variable matrix |$ U= \{u_{ij}\}_{ p \times p}$|⁠. Now |$L_{\rho}( A, B, \lambda, \xi, Y, U)$| can be written as  
\begin{align*} & l( A) +\mu\biggl(\sum_{i\neq j}|B_{ij}| w_{ij}^{(m-1)} \biggr) + \frac{\rho}{2}\bigl\| A - B + U\bigr\|^2_{\rm F} \\ & + \sum_k\sum_{i\neq j}\frac{\rho}{2} \big\{|B_{ij}|w^{(m-1)}_{ij} +\tau(1-w^{(m-1)}_{ij} )+ \xi_{ijk} - \tau\lambda_{ik} - \tau I(j\neq k) + \tau\lambda_{jk} +y_{ijk}\big\}^2. \nonumber \end{align*}
We iterate over six blocks |$( A, B, \lambda, \xi, y, U)$| until convergence. At iteration step |$s+1$|⁠,  
\begin{align} A^{(s+1)} & = \mathop{\arg\min}_{ A} \, L_{\rho}( A, B^{(s)}, \lambda^{(s)}, \xi^{(s)}, Y^{(s)}, U^{(s)}), \nonumber \\ B^{(s+1)} & = \mathop{\arg\min}_{ B} \, L_{\rho}( A^{ (s+1)}, B, \lambda^{(s)}, \xi^{(s)}, Y^{(s)}, U^{(s)}), \nonumber \\ \lambda^{(s+1)} & = \mathop{\arg\min}_{\lambda}\,L_{\rho}( A^{ (s+1)}, B^{(s+1)}, \lambda , \xi^{(s)}, Y^{(s)}, U^{(s)}), \nonumber \\ \xi^{(s+1)} & = \mathop{\arg\min}_{\{\xi_{ijk} \geqslant 0\}}\, L_{\rho}( A^{ (s+1)}, B^{(s+1)}, \lambda^{(s+1)} , \xi, Y^{(s)}, U^{(s)}), \nonumber\\ y_{ijk}^{(s+1)} & = y_{ijk}^{(s)}+|B_{ij}^{(s+1)}|w^{(m-1)}_{ij} +\tau(1-w^{(m-1)}_{ij} ) \nonumber \\ &\quad + \: \xi_{ijk}^{(s+1)} - \tau\lambda_{ik}^{(s+1)} - \tau I(j\neq k) + \tau\lambda_{jk}^{(s+1)},\nonumber \\ U^{(s+1)} & = U^{(s)}+\bigl( A^{(s+1)}- B^{(s+1)}\bigr), \nonumber \end{align}
where the formulae are given in the Supplementary Material. Overall, our computational algorithm is summarized as follows.

 

Algorithm 1.

  • Step 1. Supply an initial estimate |${\hat{A}}^{(0)}$|⁠, such as |$\hat{A}^{(0)}= 0$|⁠.

  • Step 2. At iteration |$m$|⁠, compute |${\hat{A}}^{(m)}$| by minimizing (16).

  • Step 3. Terminate when |$l(\hat { A}^{(m-1)})-l(\hat{A}^{(m)}) \leqslant \varepsilon$|⁠.

Here |$\varepsilon$| is the precision tolerance. Algorithm 1 returns the final estimate |$\hat{A}=\hat{A}^{(m^*)}$|⁠, where |$m^*$| is the smallest index at which the termination criterion is met.
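A schematic of Algorithm 1's outer difference convex loop appears below; it is an illustrative addition, not the authors' C or R implementation. The inner solve of (16) is represented by a stub, `ls_stub`, a hypothetical weighted ridge fit that uses the difference convex weights as an adaptive penalty but ignores the acyclicity constraints; it exists only so the skeleton runs end to end.

```python
import numpy as np

def negloglik(A, X):
    """Criterion (6): half the sum of squared residuals of each node on its parents."""
    R = X - X @ A.T
    return 0.5 * np.sum(R ** 2)

def ls_stub(X, w, tau, mu=1.0):
    """Hypothetical stand-in for the inner ADMM solve of (16): a weighted ridge
    fit of each node on all others, with the weights w as an adaptive penalty.
    The acyclicity constraints are ignored here."""
    n, p = X.shape
    G = X.T @ X
    A = np.zeros((p, p))
    for j in range(p):
        idx = [k for k in range(p) if k != j]
        H = G[np.ix_(idx, idx)] + mu * np.diag(w[j, idx]) + 1e-8 * np.eye(p - 1)
        A[j, idx] = np.linalg.solve(H, G[idx, j])
    return A

def algorithm1(X, solve_subproblem=ls_stub, tau=0.01, eps=1e-4, max_iter=50):
    """Outer loop of Algorithm 1: start from A^(0) = 0, refresh the weights
    w_ij = I(|A^(m-1)_ij| < tau) at each iteration, solve the convex
    subproblem, and stop once l(A^(m-1)) - l(A^(m)) <= eps."""
    p = X.shape[1]
    A, loss = np.zeros((p, p)), negloglik(np.zeros((p, p)), X)
    for _ in range(max_iter):
        w = (np.abs(A) < tau).astype(float)
        np.fill_diagonal(w, 0.0)            # diagonal entries are not parameters
        A = solve_subproblem(X, w, tau)
        loss, prev = negloglik(A, X), loss
        if prev - loss <= eps:              # Step 3 termination criterion
            break
    return A
```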

 

Proposition 1.
Algorithm 1 converges to a directed acyclic graph when |$\tau$| is small enough that |$|A_{ij}| \geqslant \tau$| for all edges |$(j,i) \in E$|⁠, and yields a local minimizer of (11) subject to (12) and (13) that satisfies the following local optimality condition for some multipliers |$\nu \geqslant 0$| and |$\{\zeta_{ijk}\geqslant 0\}_{i,\,j,\,k=1,\ldots,\,p,\,j\neq i}$|⁠: 
\begin{eqnarray} \frac{\partial l( A)}{\partial A_{ij}} + \frac{\nu}{\tau} s_{ij} + \frac{\sum_{1\leqslant k\leqslant p } \zeta_{ijk}}{\tau} s_{ij} & = & 0 \quad (i,j=1,\dots,p), \label{opt} \end{eqnarray}
(17)
where |$s_{ij}$| is the regular subdifferential of |$J_{\tau}(|A_{ij}|)$| at |$A_{ij}$|⁠, defined as |$s_{ij}={\rm sign}(A_{ij})$| if |$0<|A_{ij}| < \tau$|⁠, |$s_{ij} \in [-1,1]$| if |$A_{ij}=0$|⁠, |$s_{ij}=0$| if |$|A_{ij}| > \tau$|⁠, and |$s_{ij}=\emptyset$| (the empty set) if |$|A_{ij}|=\tau$|⁠.

The computational complexity of Algorithm 1 in one iteration over the six blocks is roughly |$O(p^3+np^2)$|⁠. Usually the alternating direction method of multipliers converges with modest accuracy within a few tens of iterations, although it can be slow to converge to high accuracy (Boyd et al., 2011). This contrasts with the fast convergence of the difference convex part, which has a finite termination property (Shen et al., 2013). Our difference convex step converged within ten iterations for our examples. The algorithm is insensitive to the choice of initial values, although better initial values lead to faster convergence.
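Three of the six block updates have simple closed forms: minimizing the augmented Lagrangian over |$\xi_{ijk} \geqslant 0$| is a separable quadratic problem, so the update is the unconstrained minimizer projected onto the nonnegative orthant, while the |$y$|- and |$U$|-updates are exactly as displayed above. The sketch below is an illustrative addition with hypothetical names; the |$ A$|-, |$ B$|- and |$\lambda$|-minimizations are left to the Supplementary Material.

```python
import numpy as np

def dual_slack_updates(A, B, lam, w, Y, U, tau):
    """Closed-form xi-, y- and U-updates for the augmented Lagrangian above.
    Arrays: A, B, lam, w, U are p x p; Y is p x p x p with axes (i, j, k).
    The constraints apply for i != j only."""
    p = B.shape[0]
    Ijk = (np.arange(p)[:, None] != np.arange(p)[None, :]).astype(float)
    g = (np.abs(B) * w + tau * (1.0 - w))[:, :, None]   # |B_ij| w_ij + tau(1 - w_ij)
    rhs = tau * (lam[:, None, :] + Ijk[None, :, :] - lam[None, :, :])
    # xi-update: separable quadratic in xi_ijk, so project the unconstrained
    # minimizer onto xi_ijk >= 0.
    xi = np.maximum(0.0, rhs - g - Y)
    # y-update, as displayed: dual ascent on the equality constraints of (16).
    Y = Y + g + xi - rhs
    # U-update: dual ascent on A - B = 0.
    U = U + (A - B)
    return xi, Y, U
```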

4. Theory

In this section we develop theory for constrained maximum likelihood estimation with respect to reconstruction of a directed acyclic graph structure. We will proceed under the equal variance assumption, which implies that the distributions from different directed acyclic graphical models are identifiable (Peters & Bühlmann, 2014). We will then show that the proposed method recovers the true directed acyclic graph’s structure under (18).

Let |$\mathcal{B} = (G, \theta)$| be a parameterized directed acyclic graph, where |$ \theta = ( A, \sigma)$| and |$G=G( A)$| is the directed acyclic graph induced by the parameters. Let |$E=E( A)= \{(i,j): A_{ji}\neq 0 \}$| be the set of nonzero elements in |${ A}$|⁠, which is equivalent to the edge set of the graph. Let |$|E|$| denote the number of elements in the set |$E$| and |$E_1\backslash E_2$| the set difference of |$E_1$| and |$E_2$|⁠. Let |$\mathcal F_0=\{ \theta=( A, \sigma): c_{\min}( \Omega) \geqslant M_1>0, \,\sup_{1 \leqslant j \leqslant p} |\Omega_{jj}| \leqslant M_2\}$| be the parameter space containing |$( A^0, \sigma^0)$|⁠, where |$M_1,M_2>0$| are constants, |$ \Omega= ( I- A)^{{\rm T}} ( I- A)/\sigma^2$| is the precision matrix, and |$c_{\min}( \Omega) $| is the smallest eigenvalue of |$ \Omega$|⁠. The assumptions that |$\inf_{ A \in \mathcal F_0} c_{\min}( \Omega) \geqslant M_1>0$| and |$\sup_{1 \leqslant j \leqslant p} |\Omega_{jj}| \leqslant M_2$| are sufficient to ensure that the likelihood function is bounded; |$\mathcal{B}^0$|⁠, |$G^0$|⁠, |$ \theta^0$|⁠, |$ A^0$|⁠, |$E^0$| and |$ \Omega^0$| represent the corresponding true values.

Below we derive a finite-sample probability error bound for reconstructing the true directed acyclic graph by the proposed method, which not only reconstructs the true directed acyclic graph when identifiable but also recovers the optimal performance of the oracle estimator |$(\hat{A}^{\text{OR}} ,\hat{\sigma}^{\text{OR}} )$|⁠, defined as the maximum likelihood estimator assuming that the true edge set |$E^0$| is known a priori.

 

Assumption 1.

For some positive constants |$M_1$| and |$M_2$|⁠, |$\inf_{ \Omega} c_{\min}( \Omega) \geqslant M_1>0$| and |$\sup_{1 \leqslant k \leqslant p} |\Omega_{kk}| \leqslant M_2$|⁠, where |$c_{\min}( \Omega )$| is the smallest eigenvalue of |$ \Omega$| and |$\Omega_{kk}$| is the |$k$|th diagonal element of |$\Omega$|⁠.

Assumption 1 is a condition on boundedness of the entries of |$ \Omega$|⁠. It implies that |$\sigma^2 \geqslant M^{-1}_2$|⁠, since |$\Omega_{jj}= 1/\sigma^2 + \sum_{k \neq j} A^2_{kj}/\sigma^2$| for an adjacency matrix. Moreover, from (2), |$\sigma^2 \leqslant \mbox{var}(X_j)=\Sigma_{jj}= e_j^{{\rm T}} \Sigma e_j \leqslant 1/c_{\min}( \Omega) \leqslant 1/M_1$|⁠, where |$ e_j$| is the standard basis vector with the |$j$|th element equal to |$1$| and the remaining elements equal to |$0$|⁠. Therefore, |$\sigma^2$| is bounded above and below.

Next we define the degree of reconstructability for directed acyclic graphs to be  
\begin{eqnarray*} C_{\min}( \Omega^0)=\inf_{\{ \Omega_E=( I- A_E)^{{\rm T}} ( I- A_E)/\sigma^2:\, A_E \neq A^0, \,|E|\leqslant |E^0|, \, A_E \text{ satisfies } (3)\}} \frac{-\log\{1-h^2( \Omega_E, \Omega^{0})\}}{\max(|E^{0}\backslash E|,1)}, \end{eqnarray*}
where  
\begin{eqnarray*} h^2( \Omega, \Omega^0)=1- \left[\frac{\{\mbox{det}(\Omega) \mbox{det}( \Omega^0)\}^{1/2}} {\mbox{det}\bigl(\frac{ \Omega+ \Omega^0}{2}\bigr)}\right]^{1/2} \end{eqnarray*}
is the Hellinger distance between |$ \Omega$| and |$ \Omega^0$| under (2). This quantity |$C_{\min}( \Omega^0)$| is bounded below by simpler but more interpretable terms. Under regularity conditions, a Taylor expansion of |$\log \det( \Omega)$| at |$ A^0$| yields  
\begin{eqnarray*} -2 \log \bigl\{1- h^2( \Omega_E, \Omega^0) \bigr\} & = & -\frac{1}{2} \big\{\log \det( \Omega_E) + \log \det( \Omega^0)\big\} + \log \det\left(\frac{ \Omega_E+ \Omega^0}{2}\right) \\ & \geqslant &\frac{1}{8} (\vec{ A}_E - \vec{ A}^0)^{{\rm T}} H (\vec{ A}_E - \vec{ A}^0) \geqslant c^* \bigl|E^0 \setminus E\bigr| \, c_{\min}( H) \gamma_{\min}^2 \end{eqnarray*}
for some constant |$c^*>0$|⁠, where |$\vec{ A}$| is a |$p^2$|-vector representation of |$ A$|⁠, |$\gamma_{\min}\equiv\gamma_{\min}( \Omega^0)= \min\{|\Omega^0_{jk}|: \Omega^0_{jk} \neq 0,\, j \neq k\}$| is the minimal absolute size of the nonzero off-diagonal entries of |$ \Omega^0$|⁠, and  
\begin{equation*} H=\left(\frac{\partial^2 [-\log \det\{( I- A)^{{\rm T}} ( I- A)/\sigma^2\}]}{\partial^2 A}\right)\bigg|_{ A= A^0} \end{equation*}
is the |$p^2 \times p^2$| Hessian matrix of |$-\log \det( \Omega)$|⁠. Then |$C_{\min}( \Omega^0) \geqslant c^* \gamma^2_{\min} c_{\min}( H)$|⁠.
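The Hellinger distance above is directly computable. The sketch below, an illustrative addition with a hypothetical helper name, evaluates |$h^2( \Omega, \Omega^0)$| via log-determinants for numerical stability, matching the displayed formula.

```python
import numpy as np

def hellinger_sq(Omega, Omega0):
    """Squared Hellinger distance between N(0, Omega^{-1}) and N(0, Omega0^{-1}):
    h^2 = 1 - [ {det(Omega) det(Omega0)}^{1/2} / det((Omega + Omega0)/2) ]^{1/2},
    evaluated with log-determinants."""
    _, ld1 = np.linalg.slogdet(Omega)
    _, ld2 = np.linalg.slogdet(Omega0)
    _, ldm = np.linalg.slogdet(0.5 * (Omega + Omega0))
    return 1.0 - np.exp(0.25 * (ld1 + ld2) - 0.5 * ldm)

# Sanity checks: h^2(Omega, Omega) = 0, and h^2 lies in [0, 1).
I3 = np.eye(3)
print(hellinger_sq(I3, I3))            # 0.0
print(hellinger_sq(I3, 2.0 * I3))      # strictly between 0 and 1
```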

The degree of reconstructability measures the overall difficulty of reconstruction, which is roughly determined by two terms. While |$\gamma_{\min}^2$| reflects the signal strength in terms of the minimal nonzero size of |$ \Omega^0$|⁠, |$c_{\min}( H)$| can be thought of as the local curvature, or second derivative, of the loglikelihood at |$ \Omega^0$|⁠, which measures dependency among entries of |$ \Omega^0$|⁠. Assumption 2 below requires the degree of reconstructability to exceed a certain threshold that is proportional to |$n^{-1} \max(\log p,|E^0|)$|⁠, or roughly |$c^* \gamma^2_{\min} c_{\min}( H) \geqslant n^{-1} \max(\log p,|E^0|)$|⁠. This aspect differs from the existing conditions for non-likelihood-based methods. Unfortunately, the exact connection between these conditions remains to be clarified.

 

Assumption 2
(Degree of reconstructability). Assume that  
\begin{eqnarray} C_{\min}( \Omega^0) \geqslant 4 c_2^{-1} n^{-1} \max(\log p,|E^0|) \label{reconstruct} \end{eqnarray}
(18)
for some constant |$c_2>0$|⁠, say |$c_2=2/26\,001$|⁠.
A sufficient condition for checking (18) is that  
\begin{eqnarray} \label{sufficient} \gamma_{\min}^2( \Omega^0) c_{\min}(H) \geqslant c^{**} n^{-1} \max(\log p,|E^0|) \end{eqnarray}
(19)
for some constant |$c^{**}>0$|⁠.

For reconstruction, we show that the proposed method consistently reconstructs the true directed acyclic graph and attains optimal estimation, provided that the degree of reconstructability exceeds a certain level, |$n^{-1} \max(\log p,|E^0|)$|⁠.

Theorem 2 below says that the oracle estimator |$\hat{\Omega}^{\rm OR} = ( I - \hat{A}^{\rm OR})^{{\rm T}}( I - \hat{A}^{\rm OR} )/(\hat{\sigma}^{\rm OR})^2$| can be reconstructed by a global minimizer |$\hat{\theta}^{L_0} = (\hat{A}^{L_0},\hat{\sigma}^{L_0})$| of (8) subject to (9) and (10). As |$n \rightarrow \infty$|⁠, the reconstructed directed acyclic graph |$\hat{G}^{L_0}$| is consistent for the true directed acyclic graph |$G^0$|⁠; moreover, the optimal performance |$E h^2(\hat{\Omega}^{\rm OR} , \Omega^{\rm 0 })$|⁠, as measured by the Hellinger risk, is recovered by |$E h^2(\hat{\Omega}^{L_0}, \Omega^{\rm 0})$|⁠.

 

Theorem 2.
Under Assumption 1, if |$K=|E^0|$| in (9), then there exists a constant |$c_2>0$|⁠, say |$c_2=2/ 26\,001$|⁠, such that for any |$(n,p,|E^0|)$|,  
\begin{eqnarray*} {\rm pr}\bigl(\hat{G}^{L_0} \neq G^{\rm 0}\bigr) &\leqslant& {\rm pr}\bigl(\hat{{\Omega}}^{L_0} \neq \hat{\Omega}^{{\rm OR}} \bigr) \nonumber \\ &\leqslant& \exp \bigl[-c_2 n C_{\min}( \Omega^{\rm 0})+ 2\log \{p(p-1)+1\} + 1\bigr]. \end{eqnarray*}

Under Assumption 2, |${\rm pr}(\hat{G}^{L_0} \neq G^{\rm 0}) \rightarrow 0$| and |$ E\{ h^2(\hat{\Omega}^{L_0}, \Omega^{\rm 0})\}/ E \{h^2(\hat{\Omega}^{\rm OR}, \Omega^{\rm 0}) \} \rightarrow 1$| as |$n \rightarrow \infty$|⁠, where |$p$| and |$|E^0|$| may depend on |$n$|.

A similar result can be established for the proposed estimator, the minimizer |$\hat{\theta}^{\rm {TL}} = (\hat{A}^{\rm {TL}} , \hat{\sigma}^{\rm {TL}} )$| of (11) subject to (12) and (13), with the following additional assumption.

 

Assumption 3.
For some positive constants |$d_1, d_2$| and |$d_3$|⁠,  
\begin{eqnarray} h^2( \Omega, \Omega^{\rm 0}) \geqslant d_1 h^2( \Omega_{\tau}, \Omega^{\rm 0})- d_3 p \tau^{d_2}, \label{eqAssumptionC} \notag \end{eqnarray}
where |$ \Omega_{\tau} =( I - A_{\tau})^{{\rm T}} ( I - A_{\tau})/\sigma^2$|⁠, with the |$ij$|th element of |$ A_{\tau}$| defined as |$A_{ij}I(|A_{ij}|\geqslant \tau)$|⁠.

 

Theorem 3.
Under Assumptions 1 and 3, if |$K=|E^0|$| and |$\tau \leqslant C_{\min}( \Omega^{\rm 0}) M_1/4 p$|⁠, then there exists a constant |$c_2>0$|⁠, say |$c_2=2/ 26\,001$|⁠, such that for any |$(n,|E^0|,p)$|,  
\begin{eqnarray*} {\rm pr}\bigl(\hat{G}^{\rm {TL}} \neq G^0\bigr) \leqslant {\rm pr}\bigl(\hat{{\Omega}}^{\rm {TL}} \neq \hat{\Omega}^{\rm {OR}} \bigr) \leqslant \exp \bigl[-c_2 n C_{\min}( \Omega^{\rm 0}) + 2\log \{p(p-1)+1\} + 1 \bigr]. \end{eqnarray*}

Under Assumption 2, |${\rm pr}(\hat{G}^{\rm {TL}} \neq G^0) \rightarrow 0$| and |$E \{h^2( \hat{\Omega}^{\rm {TL}} , \Omega^0)\}/E\{ h^2( \hat{\Omega}^{\rm {OR}} , \Omega^0) \} \rightarrow 1$| as |$n, p, |E^0| \rightarrow \infty$|.

The graphical structure is recovered by |$\hat{\theta}^{L_0}$| and its computational surrogate |$\hat{\theta}^{\rm {TL}} $|⁠, and the optimal performance of the oracle estimator is achieved.

Assumption 1 is a regularity condition for |$ \Omega$|⁠. Assumption 2 may be viewed as an alternative to strong faithfulness, which is defined as follows. Given |$\kappa \in (0,1)$|⁠, a multivariate Gaussian distribution is said to be |$\kappa$|-strong faithful to a directed acyclic graph |$G=(V,E)$| if  
\begin{equation} \label{strongfaithfulness} \min_{S \subset V\backslash\{i,j\}} \min_{i,j}\big\{\big|\text{corr}(X_i,X_j \mid X_S)\big|: \,\text{corr}(X_i,X_j\mid X_S) \neq 0, \: 1 \leqslant i \neq j \leqslant p\big\} >\kappa, \end{equation}
(20)
where |$\kappa$| is of order |$\{s_0 \log ( p)/n\}^{1/2}$|⁠, with |$s_0$| being a sparsity measure. For a pair |$(i,j)$|⁠, the number of possible sets for |$S$| is |$2^{(p-2)}$|⁠. If there is a directed edge between |$i$| and |$j$|⁠, |$\text{corr}(X_i,X_j\mid X_S) \neq 0$| for any |$S$|⁠. Therefore, for this |$(i,j)$| pair alone, (20) could require exponentially many conditions, as illustrated in Example 1. As Uhler et al. (2013) argued, (20) is very restrictive.

To see how (20) and (19) are related, we examine the following example.

 

Example 1.
Consider (5) with |$p=3$| and |$\sigma=1$|⁠:  
\begin{eqnarray*} X_1=Z_1, \quad X_2=a_{21}X_1+Z_2, \quad X_3=a_{31} X_1+ a_{32} X_2+Z_3, \end{eqnarray*}
where |$A=(A_{ij})_{3 \times 3}$| with |$A_{21}=a_{21}$|⁠, |$A_{31}=a_{31}$|⁠, |$A_{32}=a_{32}$|⁠, and |$A_{ij}=0$| otherwise. According to Uhler et al. (2013), (20) is equivalent to the minimal absolute value of six covariances and conditional covariances exceeding a threshold of order |$\kappa$|⁠, that is, (i) |${\rm cov}(X_1,X_2)=a_{21}$|⁠, (ii) |${\rm cov}(X_1,X_3)=a_{31}+a_{21} a_{32}$|⁠, (iii) |${\rm cov}(X_2,X_3)=a^2_{21} a_{32}+ a_{21} a_{31}+a_{32}$|⁠, (iv) |${\rm cov}(X_1,X_2\mid X_3) =a_{31} a_{32}-a_{21}$|⁠, (v) |${\rm cov}(X_1,X_3\mid X_2)=-a_{31}$|⁠, and (vi) |${\rm cov}(X_2,X_3\mid X_1)=-a_{32}$|⁠. For (19) we have  
\begin{equation*} \Omega=(I-A)^{{\rm T}} (I-A) = \begin{bmatrix} 1+a^2_{21}+a^2_{31} &\quad -a_{21}+a_{32}a_{31} & \quad- a_{31} \\ -a_{21}+a_{32} a_{31} &\quad 1+a^2_{32} &\quad -a_{32} \\ -a_{31} & \quad -a_{32} &\quad 1 \end{bmatrix}\!. \end{equation*}

In contrast, (19) requires only the minimal absolute value of three conditional covariances, |${\rm cov}(X_1,X_2\mid X_3)$|⁠, |${\rm cov}(X_1,X_3\mid X_2)$| and |${\rm cov}(X_2,X_3\mid X_1)$|⁠, to exceed a threshold of order |$\{n^{-1} \max(\log p, |E^0|)\}^{1/2}$|⁠. This suggests that (20) is more stringent than (19).
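A quick numerical check of this zero-pattern correspondence, an illustrative addition: choosing |$a_{21}=a_{31}a_{32}$| makes the off-diagonal entry |$\Omega_{12}=-a_{21}+a_{32}a_{31}$| vanish, and the conditional covariance |${\rm cov}(X_1,X_2\mid X_3)$| computed from |$\Sigma=\Omega^{-1}$| via a Schur complement vanishes with it.

```python
import numpy as np

a21, a31, a32 = 0.25, 0.5, 0.5            # chosen so that a21 = a31 * a32
A = np.zeros((3, 3))
A[1, 0], A[2, 0], A[2, 1] = a21, a31, a32

I = np.eye(3)
Omega = (I - A).T @ (I - A)               # precision matrix, sigma = 1
Sigma = np.linalg.inv(Omega)              # implied covariance of (X1, X2, X3)

# Conditional covariance of (X1, X2) given X3, via the Schur complement.
cond_cov12 = Sigma[0, 1] - Sigma[0, 2] * Sigma[1, 2] / Sigma[2, 2]

print(Omega[0, 1])                        # -a21 + a32*a31 = 0
print(cond_cov12)                         # 0 as well: the zero patterns agree
```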

The proposed method is applicable to nonequal error variances in (2), although the present paper focuses on the identifiable situation assuming equal error variances. This error variance assumption seems sensible for consistent directed acyclic graph reconstruction and is natural for applications with variables from a similar domain. When the error variances in (2) are not equal, our method continues to work with a minor modification of the likelihood function; see the Supplementary Material. In such a situation, directed acyclic graphical models are not identifiable, and the equivalence classes are estimated; see Andersson et al. (1997).

5. Numerical examples

5.1. General set-up

In this section we study operating characteristics of the proposed constrained maximum likelihood estimator, comparing its performance against competitors in terms of estimation accuracy of directed edges and of parameters. The proposed method is compared with four top performers: the test-based high-dimensional PC-algorithm (Spirtes et al., 2000); two score-and-search methods, namely the greedy equivalent search method (Chickering, 2003) and a hill climbing method (Jensen, 2004); and a hybrid version of max-min hill climbing (Tsamardinos et al., 2006). We coded our method in C, and based on this code we also developed an R routine (R Development Core Team, 2019) available at https://github.com/ciryiping/TLPDAG. For the PC-algorithm and greedy equivalent search, we use the R package pcalg (Kalisch et al., 2012); both output partially directed acyclic graphs, so parameter estimation is not available directly. To make a fair comparison, we extend a partially directed acyclic graph to a directed acyclic graph using the function pdag2dag in the R package pcalg and then fit parameters given the extended directed structure. For the hill climbing and max-min hill climbing methods, we use the R package bnlearn (Scutari, 2010). Two simulated examples are considered in § 5.2, followed by a real-data example in § 5.3.

For accuracy of estimating directed edges of a graph, we consider the false positive and false discovery rates.

For overall accuracy of estimating the directed acyclic graph structure, we use the Matthews correlation coefficient as an overall performance metric. This is commonly used in binary classification and is defined to be  
\begin{equation} \label{E:MCC} \frac{{\small{\text{TP}}}\times {\small{\text{TN}}} - {\small{\text{FP}}} \times {\small{\text{FN}}}} {\left\{({\small{\text{TP}}}+{\small{\text{FP}}})({\small{\text{TP}}}+{\small{\text{FN}}})({\small{\text{TN}}}+{\small{\text{FP}}})({\small{\text{TN}}}+{\small{\text{FN}}}) \right\}^{1/2}}, \end{equation}
(21)
where |${\small{\text{TP}}}$|⁠, |${\small{\text{FP}}}$|⁠, |${\small{\text{TN}}}$| and |${\small{\text{FN}}}$| are the true positive, false positive, true negative and false negative rates of edge estimation. A large value of (21) indicates a good fit, with |$1$| and |$-1$| being best and worst. We also employ the structural Hamming distance, which measures the number of edge insertions, deletions or flips needed to transform one graph into another (Tsamardinos et al., 2006). A smaller structural Hamming distance implies closeness of two graphs. The reader may refer to the R package pcalg for computation.
For accuracy of parameter estimation, we use the Kullback–Leibler loss to measure the discrepancy between an estimator |$\hat{\theta}$| and the truth |$ \theta^0$|⁠:  
\begin{eqnarray} {{\small{\text{KL}}}}(\hat{\theta}, \theta^0)={\rm{tr}}( \Sigma^0 \hat{\Omega})- \log | \Sigma^0 \hat{\Omega}|-p, \label{parameter estimation} \notag \end{eqnarray}
where |$ \Sigma^0$| is the true covariance matrix.
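For concreteness, the sketch below, not from the paper and with hypothetical helper names, computes the edge-classification counts, the coefficient (21) and the Kullback–Leibler loss from an estimated and a true adjacency matrix; the structural Hamming distance is left to the R package pcalg, as noted above.

```python
import numpy as np

def edge_metrics(A_hat, A_true):
    """Off-diagonal edge-classification counts and the Matthews correlation
    coefficient (21)."""
    off = ~np.eye(A_true.shape[0], dtype=bool)
    est, tru = (A_hat != 0)[off], (A_true != 0)[off]
    tp = float(np.sum(est & tru)); fp = float(np.sum(est & ~tru))
    tn = float(np.sum(~est & ~tru)); fn = float(np.sum(~est & tru))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return dict(TP=tp, FP=fp, TN=tn, FN=fn, MCC=mcc)

def kl_loss(Sigma0, Omega_hat):
    """Kullback-Leibler loss tr(Sigma0 Omega_hat) - log|Sigma0 Omega_hat| - p."""
    p = Sigma0.shape[0]
    M = Sigma0 @ Omega_hat
    return np.trace(M) - np.linalg.slogdet(M)[1] - p
```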

The max-min hill climbing method and the PC-algorithm require one tuning parameter |$\alpha$| controlling the significance level for independence tests, which is selected from |$\{0.0001, 0.001, 0.005, 0.01, 0.05\}$|⁠. For our method, |$\tau$| and |$K$| are chosen from |$\{0.001, 0.01, 0.1\}$| and the integers from |$1$| to |$150$|⁠, respectively, and control the threshold and the number of edges. The best tuning parameters maximize the predicted loglikelihood (6) based on an independent tuning set |$ X_{\rm val}$| of size 1000 over the specified set of |$\alpha$| or |$(\tau,K)$|⁠. The hill climbing and greedy equivalent search algorithms do not require tuning.

5.2. Simulated examples

Examples 2 and 3 involve respectively a sparse neighbourhood graph, where each node has a sparse neighbourhood (Kalisch & Bühlmann, 2007), and a sparse graph with a nonsparse neighbourhood.

 

Example 2.

Consider a directed acyclic graph of 100 nodes, generated randomly using the generation mechanism described in Kalisch & Bühlmann (2007). First we construct the true adjacency matrix |$ A$|⁠, starting from |$ A= 0$|⁠. Given a prespecified ordering of the 100 nodes, we replace every entry in the strict lower triangle of |$ A$| by an independent Bernoulli draw with success probability |$s=0.02$|⁠, where |$1$| indicates an edge connection and |$s$| controls the degree of sparseness of the graph. Then we parameterize |$ A$| by replacing all the entries of value |$1$| with |$0.5$|⁠, a value indicating the signal strength. Finally, given |$ A$|⁠, a random sample is generated according to (5) with |$\sigma=1$|⁠; a sketch of this mechanism follows. Table 1 shows the results.
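The following sketch, an illustrative addition with the hypothetical helper `random_dag_data` rather than the authors' C or R code, reproduces this generation mechanism.

```python
import numpy as np

def random_dag_data(p=100, s=0.02, strength=0.5, sigma=1.0, n=1000, seed=0):
    """Generation mechanism of Example 2: a strictly lower-triangular
    Bernoulli(s) edge pattern under a fixed ordering, edge weights set to
    `strength`, then n samples drawn from model (5) via X = (I - A)^{-1} Z."""
    rng = np.random.default_rng(seed)
    A = np.tril(rng.random((p, p)) < s, k=-1).astype(float) * strength
    Z = rng.normal(scale=sigma, size=(n, p))
    X = np.linalg.solve(np.eye(p) - A, Z.T).T
    return A, X

A_true, X = random_dag_data()
```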

Table 1.

Results from Example 2: average false positive rate, false discovery rate, Matthews correlation coefficient, Kullback–Leibler loss and structural Hamming distance, together with their standard errors in parentheses, for five competing methods based on 100 replications

 n     p     Method   FPR (%)       FDR (%)       MCC           KL                SHD
 50    100   CMLE      0.96 (0.27)   62.9 (6.5)    0.29 (0.05)   22.6 (1.4)        105.4 (8.5)
             PC        0.29 (0.07)   90.5 (10.5)   0.03 (0.05)   24.8 (1.0)        106.7 (2.3)
             GES       4.41 (0.17)   69.4 (2.5)    0.29 (0.04)   42.2 (2.9)        239.8 (11.8)
             HC       17.4  (1.93)   91.9 (1.0)    0.19 (0.02)   6.1E13 (8.5E13)   859.8 (92.7)
             MMHC      0.95 (0.13)   49.7 (4.8)    0.45 (0.04)   24.1 (2.1)         96.2 (7.7)
 100   100   CMLE      0.59 (0.23)   24.2 (8.0)    0.79 (0.05)   10.6 (1.1)         82.4 (13.7)
             PC        0.74 (0.19)   71.9 (6.8)    0.18 (0.07)   17.6 (1.7)         94.5 (5.1)
             GES       3.32 (0.19)   57.0 (2.9)    0.49 (0.04)   15.4 (0.8)        165   (10.5)
             HC        4.57 (0.43)   67.6 (2.1)    0.55 (0.02)   25.6 (2.8)        234.5 (25.2)
             MMHC      0.64 (0.08)   25.4 (2.6)    0.79 (0.02)    9.4 (0.7)         57.8 (6.2)
 1000  100   CMLE      0.05 (0.03)    2.3 (1.4)    0.98 (0.01)    0.3 (0.1)          2.5 (1.6)
             PC        1.32 (0.31)   53.7 (6.9)    0.47 (0.06)    8.7 (0.4)         64.3 (15.1)
             GES       1.39 (0.16)   33.7 (3.4)    0.71 (0.03)    6.1 (0.1)         67.8 (7.8)
             HC        1.20 (0.14)   35.4 (2.7)    0.79 (0.02)    0.7 (0.1)         58.2 (6.8)
             MMHC      0.42 (0.10)   15.8 (3.3)    0.91 (0.02)    0.4 (0.0)         20.2 (4.8)

CMLE, constrained maximum likelihood estimation; PC, PC-algorithm; GES, greedy equivalent search; HC, hill climbing; MMHC, max-min hill climbing; FPR, false positive rate; FDR, false discovery rate; MCC, Matthews correlation coefficient; KL, Kullback–Leibler loss; SHD, structural Hamming distance. Small values of FPR, FDR, KL and SHD indicate good results, whereas large values of MCC indicate good results.


 

Example 3.

This example is modified from Example 2 to generate a directed acyclic graph of 100 nodes with a hub node. Instead of generating the edge set by Bernoulli sampling, the only directed edges are those from the first node, i.e., the hub node, to the next 49 nodes. The other 50 nodes are independent variables. Such highly connected hub nodes are of special interest because they are the backbones of the network architecture. The neighbourhood of the first node is no longer sparse, but the overall graph remains sparse. Table 2 shows the results.

Table 2.

Results from Example 3: average false positive rate, false discovery rate, Matthews correlation coefficient, Kullback–Leibler loss and structural Hamming distance, together with their standard errors in parentheses, for five competing methods based on 100 replications

 n     p     Method   FPR (%)       FDR (%)       MCC           KL                 SHD
 50    100   CMLE      0.29 (0.26)   70.4 (11)     0.21 (0.13)   11.4 (1.7)          56.4 (8.7)
             PC        0.03 (0.02)   85   (32.8)   0.03 (0.07)   16.4 (1.2)          49.7 (1.3)
             GES       5.26 (0.12)   97.7 (0.2)    0.01 (0.01)   50.5 (3.3)         300.9 (5.9)
             HC       20.2  (2.23)   98.1 (0.5)    0.05 (0.03)   5.6E13 (7.6E13)   1014.2 (110.3)
             MMHC      1.48 (0.14)   94.2 (1.0)    0.06 (0.01)   25.9 (1)           116.9 (6.3)
 100   100   CMLE      0.53 (0.14)   49.6 (12)     0.52 (0.13)    5.7 (1.5)          47.5 (12.3)
             PC        0.01 (0.02)   34   (41.1)   0.16 (0.10)   15.4 (0.7)          47.4 (1.6)
             GES       4.49 (0.15)   97.3 (0.2)    0.02 (0.01)   20.9 (0.8)         262.9 (7.1)
             HC        5.15 (0.27)   86.4 (0.9)    0.32 (0.03)   27.7 (3.0)         259.6 (12)
             MMHC      1.77 (0.10)   92.2 (1.6)    0.10 (0.02)   15.9 (1.0)         128.1 (5.8)
 1000  100   CMLE      0.01 (0.01)    1.0 (1.1)    0.99 (0.01)    0.2 (0.05)          0.5 (0.5)
             PC*       N/A           N/A           N/A           N/A                N/A
             GES       3.20 (0.07)   96.3 (0.1)    0.04 (0.001)  10.6 (0.1)         199.8 (3.5)
             HC        0.94 (0.13)   48.4 (3.1)    0.72 (0.02)    0.6 (0.1)          46.2 (6.2)
             MMHC*     N/A           N/A           N/A           N/A                N/A

CMLE, constrained maximum likelihood estimation; PC, PC-algorithm; GES, greedy equivalent search; HC, hill climbing; MMHC, max-min hill climbing; FPR, false positive rate; FDR, false discovery rate; MCC, Matthews correlation coefficient; KL, Kullback–Leibler loss; SHD, structural Hamming distance. Small values of FPR, FDR, KL and SHD indicate good results, whereas large values of MCC indicate good results. An asterisk indicates that the method failed to converge within 24 hours.


With regard to structure estimation, in Example 2 the proposed method outperforms its competitors when |$n$| is large but performs slightly worse than the best performer when |$n$| is small, in terms of the Matthews correlation coefficient and the structural Hamming distance. In Example 3, it continues to be the best performer. However, the greedy equivalent search, hill climbing, and max-min hill climbing algorithms perform well in Example 2 but deteriorate in Example 3, where their performance suffers because of the nonsparse neighbourhood of one hub node.

Moreover, the PC-algorithm and the max-min hill climbing algorithm failed to converge within 24 hours when |$n=1000$| in Example 3, as indicated by the asterisks in Table 2. The hill climbing and greedy equivalent search algorithms perform poorly in almost all situations. The poor performance of hill climbing can be partly attributed to inappropriate use of the Bayesian information criterion for model selection in a high-dimensional setting.

With regard to parameter estimation, the proposed method performs best in most situations, as measured by the Kullback–Leibler loss.

Overall, our constrained maximum likelihood method compares favourably against the other methods.

5.3. Analysis of cell signalling data

We use our constrained maximum likelihood estimation method to analyse the multivariate flow cytometry data in Sachs et al. (2005). Data were collected after a series of stimulatory cues and inhibitory interventions, with cell reactions terminated by fixation 15 minutes after stimulation, to profile the effects of each condition on the intracellular signalling networks of human primary naive CD4+ T-cells. Flow cytometry measurements were taken from 11 phosphorylated proteins and phospholipids under nine experimental conditions. These simultaneous measurements allow researchers to infer causal influences in cellular signalling. Among these nine experimental conditions, we are particularly interested in one condition, where anti-cluster of differentiation 3/cluster of differentiation 28 and intercellular adhesion molecule-2 were used as general perturbations, because they tend to activate cell signalling.

This is a well-studied example and we use the representation of the consensus network in Sachs et al. (2005) as our true network, shown in Fig. 1(a). An arrow from one node to another is interpreted as a directional influence between the corresponding proteins.

Fig. 1.

The true network and networks reconstructed by using the proposed constrained maximum likelihood method and four other methods, with abbreviations as given in the captions of Tables 1 and 2; correct discoveries are represented by solid black lines, false discoveries are displayed as black dotted lines, and incorrectly identified directions with correct discoveries of the skeleton of the true graph are marked as solid grey lines.


There are 902 samples. We fit all five methods, using one-third of the samples for training and two-thirds for tuning. The learned networks are shown in Figs. 1(b)–(f).
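The competing fits can be reproduced in outline with the R packages pcalg (Kalisch et al., 2012) and bnlearn (Scutari, 2010). The sketch below is illustrative only: sachs is a hypothetical data frame holding the 902 observations on the 11 proteins, and the significance level and score are placeholders rather than the tuned values used in our analysis.

# A minimal R sketch (hypothetical data frame 'sachs', placeholder tuning).
library(pcalg)    # PC-algorithm and greedy equivalent search
library(bnlearn)  # hill climbing and max-min hill climbing

set.seed(1)
idx   <- sample(nrow(sachs), floor(nrow(sachs) / 3))
train <- sachs[idx, ]   # one-third for training; the rest is held out

# PC-algorithm: Gaussian conditional-independence tests on correlations.
suff   <- list(C = cor(train), n = nrow(train))
fit_pc <- pc(suffStat = suff, indepTest = gaussCItest,
             alpha = 0.01, labels = colnames(train))

# Greedy equivalent search with a Gaussian l0-type score
# (interface may vary across pcalg versions).
fit_ges <- ges(new("GaussL0penObsScore", data = as.matrix(train)))

# Hill climbing and max-min hill climbing from bnlearn.
fit_hc   <- hc(train)
fit_mmhc <- mmhc(train)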

Our constrained maximum likelihood method, the PC-algorithm, greedy equivalent search, the hill climbing method, and the max-min hill climbing method respectively identified 4, 1, 0, 7, and 6 directed edges correctly, in addition to identifying 5, 6, 8, 1, and 1 edges correctly but with the wrong directions. In other words, they discovered 9, 7, 8, 8, and 7 edges correctly. With regard to wrong discoveries, all five methods identified one edge from P38 to pjnk, which is not in the true graph. Finally, all the methods missed several known edges, perhaps because many directional relations in protein signalling networks are nonlinear and so cannot be captured by a linear model. An analysis based on discretized data (Sachs et al., 2005; Schmidt & Murphy, 2009) tends to capture such nonlinearity. Another possible explanation is that not all transcription activities are activated under the general perturbations. Therefore, recovering more edges may require analysing inhomogeneous data collected under different experimental conditions, as in Sachs et al. (2005).
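The counts above can be tabulated mechanically. A short R sketch, with hypothetical adjacency-matrix inputs, classifies each discovered edge as correctly directed, correct in the skeleton but reversed, or a false discovery, and also counts true edges missed entirely.

# A minimal R sketch (hypothetical inputs): A_true and A_hat are 11 x 11
# adjacency matrices with A[i, j] = 1 for a directed edge i -> j.
classify_edges <- function(A_true, A_hat) {
  correct  <- sum(A_hat == 1 & A_true == 1)                  # right direction
  reversed <- sum(A_hat == 1 & A_true == 0 & t(A_true) == 1) # skeleton right
  false_d  <- sum(A_hat == 1 & A_true == 0 & t(A_true) == 0) # not in truth
  missed   <- sum(A_true == 1 & A_hat == 0 & t(A_hat) == 0)  # absent entirely
  c(correct = correct, reversed = reversed,
    false_discoveries = false_d, missed = missed)
}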

Acknowledgement

This research was supported in part by the U.S. National Science Foundation and National Institutes of Health. The authors thank the editors and reviewers for helpful comments and suggestions.

Supplementary material

Supplementary material available at Biometrika online includes an extension to the case of nonequal error variance, additional simulation results, technical details and proofs, and computational code.

References

Andersson, S. A., Madigan, D. & Perlman, M. D. (1997). A characterization of Markov equivalence classes for acyclic digraphs. Ann. Statist. 25, 505–41.

Bahadur, R. R., Zabell, S. L. & Gupta, J. C. (1980). Large deviations, tests and estimates. In Asymptotic Theory of Statistical Tests and Estimation, I. M. Chakravarti, ed. New York: Academic Press, pp. 33–64.

Boyd, S., Parikh, N., Chu, E., Peleato, B. & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundat. Trends Mach. Learn. 3, 1–122.

Breiman, L. & Cutler, A. (1993). A deterministic algorithm for global optimization. Math. Programm. 58, 179–99.

Brown, L. E., Tsamardinos, I. & Aliferis, C. F. (2005). A comparison of novel and state-of-the-art polynomial Bayesian network learning algorithms. In Proc. Nat. Conf. Artif. Intel. (AAAI-05, Pittsburgh, Pennsylvania). AAAI Press, pp. 739–45.

Bühlmann, P., Peters, J. & Ernest, J. (2014). CAM: Causal additive models, high-dimensional order search and penalized regression. Ann. Statist. 42, 2526–56.

Cheng, J., Greiner, R., Kelly, J., Bell, D. & Liu, W. (2002). Learning Bayesian networks from data: An information-theory based approach. Artif. Intel. 137, 43–90.

Chickering, D. M. (2003). Optimal structure identification with greedy search. J. Mach. Learn. Res. 3, 507–54.

Chickering, D. M., Heckerman, D. & Meek, C. (2004). Large-sample learning of Bayesian networks is NP-hard. J. Mach. Learn. Res. 5, 1287–330.

Cooper, G. F. & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Mach. Learn. 9, 309–47.

Cussens, J., Bartlett, M., Jones, E. M. & Sheehan, N. A. (2013). Maximum likelihood pedigree reconstruction using integer linear programming. Genet. Epidemiol. 37, 69–83.

De Campos, C. P. & Ji, Q. (2011). Efficient structure learning of Bayesian networks using constraints. J. Mach. Learn. Res. 12, 663–89.

de Campos, L. M. (1998). Independency relationships and learning algorithms for singly connected networks. J. Exp. Theor. Artif. Intel. 10, 511–49.

de Campos, L. M. (2006). A scoring function for learning Bayesian networks based on mutual information and conditional independence tests. J. Mach. Learn. Res. 7, 2149–87.

Edwards, D. (2012). Introduction to Graphical Modelling. New York: Springer.

Faloutsos, M., Faloutsos, P. & Faloutsos, C. (1999). On power-law relationships of the internet topology. ACM SIGCOMM Comp. Comm. Rev. 29, 251–62.

Friedman, J. H., Hastie, T. J. & Tibshirani, R. J. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–41.

Fu, F. & Zhou, Q. (2013). Learning sparse causal Gaussian networks with experimental intervention: Regularization and coordinate descent. J. Am. Statist. Assoc. 108, 288–300.

Gabay, D. & Mercier, B. (1976). A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comp. Math. Appl. 2, 17–40.

Grötschel, M., Jünger, M. & Reinelt, G. (1985). On the acyclic subgraph polytope. Math. Programm. 33, 28–42.

Jaakkola, T., Sontag, D., Globerson, A. & Meila, M. (2010). Learning Bayesian network structure using LP relaxations. J. Mach. Learn. Res. 9, 358–65.

Jensen, F. V. (2004). Bayesian artificial intelligence. Pat. Anal. Appl. 7, 221–3.

Kalisch, M. & Bühlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J. Mach. Learn. Res. 8, 613–36.

Kalisch, M., Mächler, M., Colombo, D., Maathuis, M. H. & Bühlmann, P. (2012). Causal inference using graphical models with the R package pcalg. J. Statist. Software 47, 1–26.

Koivisto, M. & Sood, K. (2004). Exact Bayesian structure discovery in Bayesian networks. J. Mach. Learn. Res. 5, 549–73.

Liu, H., Lafferty, J. & Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. J. Mach. Learn. Res. 10, 2295–328.

Malone, B., Kangas, K., Järvisalo, M., Koivisto, M. & Myllymäki, P. (2014). Predicting the hardness of learning Bayesian networks. In Proc. 28th AAAI Conf. Artif. Intel. (AAAI-14, Québec City, Canada). AAAI Press, pp. 2460–6.

Peters, J. & Bühlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error variances. Biometrika 101, 219–28.

R Development Core Team (2019). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0, http://www.R-project.org.

Raskutti, G. & Uhler, C. (2014). Learning directed acyclic graphs based on sparsest permutations. arXiv: 1307.0366v3.

Sachs, K., Perez, O., Pe'er, D., Lauffenburger, D. A. & Nolan, G. P. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science 308, 523–9.

Schmidt, M. & Murphy, K. (2009). Modeling discrete interventional data using directed cyclic graphical models. In Proc. 25th Conf. Uncert. Artif. Intel. (Montreal, Québec, Canada). AUAI Press, pp. 487–95.

Scutari, M. (2010). Learning Bayesian networks with the bnlearn R package. J. Statist. Software 35, 1–22.

Shen, X., Pan, W. & Zhu, Y. (2012). Likelihood-based selection and sharp parameter estimation. J. Am. Statist. Assoc. 107, 223–32.

Shen, X., Pan, W., Zhu, Y. & Zhou, H. (2013). On constrained and regularized high-dimensional regression. Ann. Inst. Statist. Math. 65, 807–32.

Shojaie, A. & Michailidis, G. (2010). Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika 97, 519–38.

Spirtes, P., Glymour, C. & Scheines, R. (2000). Causation, Prediction, and Search. Cambridge, Massachusetts: MIT Press.

Tsamardinos, I., Brown, L. E. & Aliferis, C. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Mach. Learn. 65, 31–78.

Uhler, C., Raskutti, G., Bühlmann, P. & Yu, B. (2013). Geometry of the faithfulness assumption in causal inference. Ann. Statist. 41, 436–63.

van de Geer, S. & Bühlmann, P. (2013). |$l_0$|-penalized maximum likelihood for sparse directed acyclic graphs. Ann. Statist. 41, 536–67.

Zhang, J. & Spirtes, P. (2003). Strong faithfulness and uniform consistency in causal inference. In Proc. 19th Conf. Uncert. Artif. Intel. (Acapulco, Mexico). Morgan Kaufmann Publishers, pp. 632–9.
