A structural Markov property for decomposable graph laws that allows control of clique intersections

We present a new kind of structural Markov property for probabilistic laws on decomposable graphs, which allows the explicit control of interactions between cliques, so is capable of encoding some interesting structure. We prove the equivalence of this property to an exponential family assumption, and discuss identifiability, modelling, inferential and computational implications.


Introduction
The conditional independence properties among components of a multivariate distribution are key to understanding its structure, and precisely describe the qualitative manner in which information flows among the variables. Further, these properties are well-represented by a graphical model, in which nodes, representing variables in the model, are connected by undirected edges, encoding the conditional independence properties of the distribution (Lauritzen, 1996). Inference about the underlying graph from observed data is therefore an important task, sometimes known as structural learning.
Bayesian structural learning requires specification of a prior distribution on graphs, and there is a need for a flexible but tractable family of such priors, capable of representing a variety of prior beliefs about the conditional independence structure. In the interests of tractability and scalability, there has been a strong focus on the case where the true graph may be assumed to be decomposable.
Just as this underlying graph localises the pattern of dependence among variables, it is appealing that the prior on the graph itself should exhibit dependence locally, in the same graphical sense. Informally, the presence or absence of two edges should be independent when they are sufficiently separated by other edges in the graph. The first class of graph priors demonstrating such a structural Markov property was presented in a 2012 Cambridge University PhD thesis by Simon Byrne, and later published in Byrne and Dawid (2015).
That priors with this property are also tractable arises from an equivalence demonstrated by Byrne and Dawid (2015), between their structural Markov property for decomposable graphs and the assumption that the graph law follows a clique exponential family.
This important result is yet another example of a theme in the literature, making a connection between systems of conditional independence statements among random variables, often encoded graphically, and factorisations of the joint probability distribution of these variables. Examples include the global Markov property for undirected graphs, which is necessary, and under an additional condition sufficient, for the joint distribution to factorise as a product of potentials over cliques; the Markov property for directed acyclic graphs, which is equivalent to the existence of a factorisation of the joint distributions into child-given-parents conditional distributions; and the existence of a factorisation into clique and separator marginal distributions for undirected decomposable graphs.
All of these results are now well-known, and for these and other essentials of graphical models, the reader is referred to Lauritzen (1996).
In this note, we introduce a weaker version of this structural Markov property, and show that it is nevertheless sufficient for equivalence to a certain exponential family, and therefore to a factorisation of the graph law. This gives us a more flexible family of graph priors for use in modelling data. We show that the advantages of conjugacy, and its favourable computational implications, remain true in this broader class, and illustrate the richer structures that are generated by such priors. Efficient prior and posterior sampling from decomposable graphical models can be performed with the junction tree sampler of Green and Thomas (2013).

The weak structural Markov property

Notation and terminology
We follow the terminology for graphs and graphical models of Lauritzen (1996), with a few exceptions and additions, noted here. Many of these are also used by Byrne and Dawid (2015). We use the term graph law for the distribution of a random graph, but do not introduce a separate symbol for a random graph. For any graph $G$ on a vertex set $V$, and any subset $A \subseteq V$, $G_A$ is the subgraph induced on vertex set $A$; its edges are those of $G$ joining vertices that are both in $A$. A complete subgraph is one where all pairs of vertices are joined. If $G_A$ is complete and maximal, in the sense that $G_B$ is not complete for any superset $B \supset A$, then $A$ is a clique. Here and throughout the paper, the symbols $\supset$ and $\subset$ refer to strict inclusion. A junction tree based on a decomposable graph $G$ on vertex set $V$ is any graph whose vertices are the cliques of $G$, joined by edges in such a way that for any $A \subseteq V$, those vertices of the junction tree containing $A$ form a connected subtree. A separator is the intersection of two adjacent cliques in any junction tree. As in Green and Thomas (2013), we adopt the convention that separators may be empty, with the effect that every junction tree is connected. A covering pair is any pair $(A, B)$ of subsets of $V$ such that $A \cup B = V$; $(A, B)$ is a decomposition if $A \cap B$ is complete, and separates $A \setminus B$ and $B \setminus A$. Figure 1 illustrates the idea of a decomposition.

Definitions
We begin with the definition of the structural Markov property from Byrne and Dawid (2015).
Definition 1. (Structural Markov property) A graph law $\mathfrak{G}(G)$ over the set $\mathfrak{U}$ of undirected decomposable graphs on $V$ is structurally Markov if for any covering pair $(A, B)$, we have
$$G_A \perp\!\!\!\perp G_B \mid \{G \in \mathfrak{U}(A, B)\},$$
where $\mathfrak{U}(A, B)$ is the set of decomposable graphs for which $(A, B)$ is a decomposition.
The various conditional independence statements each restrict the graph law, so we can weaken the definition by reducing the number of such statements, for example by replacing the conditioning set by a smaller one. This motivates our definition.
Definition 2. (Weak structural Markov property) A graph law $\mathfrak{G}(G)$ over the set $\mathfrak{U}$ of undirected decomposable graphs on $V$ is weakly structurally Markov if for any covering pair $(A, B)$, we have
$$G_A \perp\!\!\!\perp G_B \mid \{G \in \mathfrak{U}^{\star}(A, B)\},$$
where $\mathfrak{U}^{\star}(A, B)$ is the set of decomposable graphs for which $(A, B)$ is a decomposition, and $A \cap B$ is a clique, that is, a maximal complete subgraph, in $G_A$.
The only difference from the structural Markov property is that we condition on the event $\mathfrak{U}^{\star}(A, B)$, not $\mathfrak{U}(A, B)$, so we only require independence when $A \cap B$ is a clique in $G_A$, that is, is maximal in $G_A$; it is already complete because $(A, B)$ is a decomposition. Obviously, by symmetry, $\mathfrak{U}^{\star}(A, B)$ could be defined with $A$ and $B$ interchanged without changing the meaning, but it is not the same as conditioning on the set of decomposable graphs for which $(A, B)$ is a decomposition, and $A \cap B$ is a clique in at least one of $G_A$ and $G_B$, since in the conditional independence statement, it is $G$ that is random, not $(A, B)$.
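The two conditioning events can be told apart mechanically on small graphs. The following is a minimal sketch, not part of the original development: graphs are represented as sets of two-element frozensets, the function names are our own, and decomposability of the graph itself is assumed rather than checked.

```python
from itertools import combinations

def induced_edges(edges, A):
    """Edges of the induced subgraph G_A."""
    return {e for e in edges if e <= A}

def is_complete(edges, S):
    """True if every pair of vertices in S is joined by an edge."""
    return all(frozenset(p) in edges for p in combinations(S, 2))

def separates(edges, S, X, Y):
    """True if every path from X to Y passes through S."""
    reached, stack = set(X), list(X)
    while stack:  # graph search from X that never enters S
        u = stack.pop()
        for e in edges:
            if u in e:
                (w,) = e - {u}
                if w not in S and w not in reached:
                    reached.add(w)
                    stack.append(w)
    return not (reached & set(Y))

def is_decomposition(edges, A, B):
    """(A, B) is a decomposition: A ∩ B complete and separating."""
    S = A & B
    return is_complete(edges, S) and separates(edges, S, A - B, B - A)

def in_U_star(edges, A, B):
    """Decomposition, and A ∩ B is maximal complete (a clique) in G_A."""
    S = A & B
    GA = induced_edges(edges, A)
    maximal = not any(is_complete(GA, S | {v}) for v in A - S)
    return is_decomposition(edges, A, B) and maximal

# Two decomposable graphs on {1, 2, 3, 4}: a path, and the path plus edge 1-3.
path = {frozenset(e) for e in [(1, 2), (2, 3), (3, 4)]}
with_triangle = path | {frozenset((1, 3))}
A, B = {1, 2, 3}, {2, 3, 4}
```

For the path, $(A, B)$ is a decomposition and $\{2, 3\}$ is maximal in $G_A$, so the weak property constrains the law there. Adding the edge 1-3 keeps $(A, B)$ a decomposition but makes $G_A$ complete, so $\{2, 3\}$ is no longer a clique in $G_A$ and only the stronger property of Byrne and Dawid (2015) applies.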
The weak structural Markov property is illustrated in Figure 2.

Clique-separator exponential family
We now define an important family of graph laws, by an algebraic specification. This family has previously been described, though not named, by Bornn and Caron (2011). These authors do not examine any Markov properties of the family, but advocate it for flexible prior specification.
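Definition 3. (Clique-separator exponential family and clique-separator factorisation laws) The clique-separator exponential family is the exponential family of graph laws over $\mathfrak{F} \subseteq \mathfrak{U}$, with $(t^+, t^-)$ as natural statistic with respect to the uniform measure on $\mathfrak{U}$, where $t^+_A = \max(t_A, 0)$ and $t^-_A = \min(t_A, 0)$, and
$$t_A(G) = \begin{cases} 1 & \text{if } A \text{ is a clique in } G, \\ -\nu_A(G) & \text{if } A \text{ is a separator in } G, \\ 0 & \text{otherwise,} \end{cases}$$
where $\nu_A(G)$ is the multiplicity of separator $A$ in $G$. That is, laws in the family have densities of the form
$$\pi(G) \propto \exp\{\omega^+ t^+(G) + \omega^- t^-(G)\},$$
where $\omega^+ = \{\omega^+_A : A \subseteq V\}$ and $\omega^- = \{\omega^-_A : A \subseteq V\}$ are real-valued parameters. Here all vectors indexed by subsets of $V$ are listed in a fixed but arbitrary order, and the product of two such vectors is the scalar product.
Note that $t^+_A(G)$ is 1 if $A$ is a clique in $G$, otherwise 0, while $t^-_A(G) = -\nu_A(G)$ if $A$ is a separator in $G$, again otherwise 0. This density $\pi$ can be equivalently written as a clique-separator factorisation law
$$\pi(G) \propto \frac{\prod_{C \in \mathcal{C}} \phi_C}{\prod_{S \in \mathcal{S}} \psi_S}, \tag{1}$$
where $\mathcal{C}$ is the set of cliques and $\mathcal{S}$ the multiset of separators of $G$, and $\phi_C = \exp(\omega^+_C)$ and $\psi_S = \exp(\omega^-_S)$; this is the form we prefer to use hereafter. This definition is an immediate generalisation of that of the clique exponential family of Byrne and Dawid (2015), in which $t = t^+ + t^-$ is the natural statistic, so $\omega^+_A$ and $\omega^-_A$ coincide, as do $\phi_A$ and $\psi_A$. Byrne and Dawid (2015) show that for any fixed vertex set, the structurally Markov laws are precisely those in a clique exponential family. In the next section we show an analogous alignment between the weak structural Markov property and clique-separator factorisation laws.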
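To make the factorised form concrete, here is a minimal sketch that evaluates the unnormalised density of a graph from its cliques and its multiset of separators. The function names, the example graph and the particular potentials are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter

def t_statistic(cliques, separators):
    """Return {A: t_A(G)} for the sets A with t_A(G) != 0."""
    t = {frozenset(C): 1 for C in cliques}
    for S, mult in Counter(frozenset(S) for S in separators).items():
        t[S] = -mult  # minus the multiplicity nu_S(G)
    return t

def csf_weight(cliques, separators, phi, psi):
    """Unnormalised density: product of phi_C over the cliques divided by
    the product of psi_S over the multiset of separators."""
    num = math.prod(phi(frozenset(C)) for C in cliques)
    den = math.prod(psi(frozenset(S)) for S in separators)
    return num / den

# Illustrative decomposable graph on {1,...,5}, with junction-tree separators.
cliques = [{1, 2, 3}, {2, 3, 4}, {4, 5}]
separators = [{2, 3}, {4}]
weight = csf_weight(
    cliques, separators,
    phi=lambda C: 2.0 ** len(C),  # hypothetical clique potential
    psi=lambda S: 3.0 ** len(S),  # hypothetical separator potential
)
```

Setting `psi` equal to `phi` recovers a clique exponential family law, in which a set receives the same potential whether it occurs as a clique or as a separator.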

Main result
Theorem 1. A graph law $\mathfrak{G}$ over the set $\mathfrak{U}$ of undirected decomposable graphs on $V$, whose support is all of $\mathfrak{U}$, is weakly structurally Markov if and only if it is a clique-separator factorisation law.
Remark 1. Exactly as in Byrne and Dawid (2015, Theorem 3.15), it is possible to weaken the condition of full support, that is, positivity of the density $\pi$. It is enough that if $G$ is in the support, so is $G^{(C)}$ for any clique $C$ of $G$.
Our proof makes use of a compact notation for decomposable graphs, and a kind of ordering of cliques that is more stringent than perfect ordering/enumeration.
A decomposable graph is determined by its cliques. We write $G^{(C_1, C_2, \ldots)}$ for the decomposable graph with cliques $C_1, C_2, \ldots$. Without ambiguity we can omit singleton cliques from the list. In case the vertex set $V$ of the graph is not clear from the context, we emphasise it thus: $G_V^{(\ldots)}$. In particular, $G^{(A)}$ is the graph on $V$ that is complete in $A$ and empty otherwise, and $G^{(A,B)}$ is the graph on $V$ that is complete on both $A$ and $B$ and empty otherwise.
Recall that, starting from a list of the cliques, we can place these in a perfect sequence and simultaneously construct a junction tree by maintaining two complementary subsets: those cliques visited and those unvisited. We initialise the process by placing an arbitrary clique in the visited set and all others in the unvisited set. At each successive stage, we move one unvisited clique into the visited set, choosing arbitrarily from those that are available, that is, are adjacent to a visited clique in the junction tree; at the same time a new link is added to the junction tree.
Definition 4. If at each step $j$ we select an available clique, numbering it $C_j$, such that the separator $S_j = C_j \cap \bigcup_{i<j} C_i$ is not a proper subset of any other separator that would arise by choosing a different available clique, then we call the ordering pluperfect.
Clearly, it is computationally convenient and sufficient, but not necessary, to choose the available clique that creates one of the largest of the separators, a construction closely related to the maximum cardinality search of Tarjan and Yannakakis (1984). This shows that a pluperfect ordering always exists and that any clique can be chosen as the first.
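The largest-separator rule just described can be sketched as follows. This is an illustrative implementation under our own assumptions: the input is taken to be the full list of cliques of a decomposable graph, every unvisited clique is treated as a candidate at each step, and the function and variable names are hypothetical. Because the chosen separator has maximal cardinality among the candidates, it cannot be a proper subset of any other candidate separator.

```python
def pluperfect_order(cliques, first=0):
    """Greedy ordering: repeatedly add the unvisited clique whose
    intersection with the union of visited cliques is largest.

    Returns (order, separators, parents), where separators[k] and
    parents[k] belong to the clique order[k + 1]; parents holds the
    index h(j) of an earlier clique containing the separator.
    Assumes `cliques` lists the cliques of a decomposable graph.
    """
    remaining = [frozenset(C) for C in cliques]
    order = [remaining.pop(first)]
    separators, parents = [], []
    while remaining:
        union = frozenset().union(*order)
        best = max(remaining, key=lambda C: len(C & union))
        S = best & union
        # For a decomposable graph some earlier clique contains S;
        # otherwise next() raises StopIteration.
        h = next(i for i, C in enumerate(order) if S <= C)
        remaining.remove(best)
        order.append(best)
        separators.append(S)
        parents.append(h)
    return order, separators, parents

order, seps, parents = pluperfect_order([{1, 2, 3}, {2, 3, 4}, {4, 5}], first=2)
```

Starting from the clique $\{4, 5\}$, the sketch visits $\{2, 3, 4\}$ next (separator $\{4\}$) and then $\{1, 2, 3\}$ (separator $\{2, 3\}$), illustrating that any clique can serve as the first.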
Lemma 1. Let $\pi$ be the density of a weakly structurally Markov graph law on $V$, and let $G$ be a decomposable graph on $V$. Consider a particular pluperfect ordering $C_1, \ldots, C_J$ of the cliques of $G$, and a junction tree in which the links connect $C_j$ and $C_{h(j)}$ via separator $S_j$ for each $j = 2, \ldots, J$, where $h(j) \le j - 1$. For each such $j$, let $R_j$ be any subset of $C_{h(j)}$ that is a proper superset of $S_j$. Then for any choice of such $\{R_j\}$, we have
$$\pi(G) = \pi(G^{(C_1)}) \prod_{j=2}^{J} \frac{\pi(G^{(R_j, C_j)})}{\pi(G^{(R_j)})}.$$
Proof. ... $(A, B)$ is a decomposition, and $A \cap B = R_j$. This intersection $R_j$ is a clique in $G_A$. For, suppose for a contradiction that $R_j$ is not a clique in $G_A$, i.e., it is not maximal. Then there exists a vertex $v$ in $A \setminus R_j$, such that $R = R_j \cup \{v\}$ is complete. So $R$ is a subset of a clique in the original graph $G$. Either all the cliques containing $R$ are among $\{C_i, i < j\}$, so that $v$ is not in $A$, a contradiction, or one of them, say $C'$, is among $\{C_i, i \ge j\}$, in which case there is a path in the junction tree between $C_{h(j)}$ and $C'$, with every clique along the path containing $R_j$; so there must be a separator that is a superset of $R_j$ (so a strict superset of $S_j$), connects to $C_{h(j)}$, and is among $\{S_{j+1}, S_{j+2}, \ldots, S_J\}$. This contradicts the assumption that the ordering is pluperfect.
This choice of $(A, B)$ forms a covering pair and $G \in \mathfrak{U}^{\star}(A, B)$, so under the weak structural Markov property, we know that $G_A$ and $G_B$ are independent under $\pi_{A,B}$, their joint distribution given that $A \cap B$ is complete in $G^{(C_1, C_2, \ldots, C_j)}$. Thus we have the cross-over identity
$$\pi_{A,B}(G^{(R_j)})\,\pi_{A,B}(G^{(C_1, \ldots, C_j)}) = \pi_{A,B}(G^{(C_1, \ldots, C_{j-1})})\,\pi_{A,B}(G^{(R_j, C_j)}),$$
or equivalently,
$$\pi(G^{(R_j)})\,\pi(G^{(C_1, \ldots, C_j)}) = \pi(G^{(C_1, \ldots, C_{j-1})})\,\pi(G^{(R_j, C_j)}).$$
We can therefore write
$$\pi(G) = \pi(G^{(C_1)}) \prod_{j=2}^{J} \frac{\pi(G^{(C_1, \ldots, C_j)})}{\pi(G^{(C_1, \ldots, C_{j-1})})} = \pi(G^{(C_1)}) \prod_{j=2}^{J} \frac{\pi(G^{(R_j, C_j)})}{\pi(G^{(R_j)})}.$$
Lemma 2. Let $\pi$ be the density of a weakly structurally Markov graph law on $V$, and let $S$ be any subset of the vertices $V$ with $|S| \le |V| - 2$. Then $\pi(G^{(R_1, R_2)})/\{\pi(G^{(R_1)})\pi(G^{(R_2)})\}$ depends only on $S$, for all sets of vertices $R_1$, $R_2$ for which $R_1 \cup R_2 \subseteq V$, $R_1 \cap R_2 = S$, and where both $R_1$ and $R_2$ are strict supersets of $S$.
Proof. $G^{(R_1, R_2)}$ is a decomposable graph whose unique junction tree has cliques $R_1$ and $R_2$, and separator $S$. Applying Lemma 1 to this graph, we have
$$\pi(G^{(R_1, R_2)}) = \pi(G^{(R_1)})\,\frac{\pi(G^{(R, R_2)})}{\pi(G^{(R)})},$$
that is,
$$\frac{\pi(G^{(R_1, R_2)})}{\pi(G^{(R_1)})\,\pi(G^{(R_2)})} = \frac{\pi(G^{(R, R_2)})}{\pi(G^{(R)})\,\pi(G^{(R_2)})},$$
for any $R$ with $S \subset R \subseteq R_1$. This means that any vertices may be added to or removed from $R_1$, or by symmetry to or from $R_2$, without changing the value of $\pi(G^{(R_1, R_2)})/\{\pi(G^{(R_1)})\pi(G^{(R_2)})\}$, provided it remains true that $R_1 \cup R_2 \subseteq V$, $R_1 \cap R_2 = S$, $R_1 \supset S$ and $R_2 \supset S$. But any unordered pair of subsets $R_1$, $R_2$ of $V$ with $R_1 \cup R_2 \subseteq V$, $R_1 \cap R_2 = S$, $R_1 \supset S$ and $R_2 \supset S$ can be transformed stepwise to any other such pair by successively adding or removing vertices to or from one or other of the subsets. Thus $\pi(G^{(R_1, R_2)})/\{\pi(G^{(R_1)})\pi(G^{(R_2)})\}$ can depend only on $S$: we will denote it by $1/\psi_S$.
Proof of Theorem 1. Suppose that $\pi$ is the density of a weakly structurally Markov graph law on $V$. For each $A \subseteq V$, let $\phi_A = \pi(G^{(A)})$. Then by Lemmas 1 and 2,
$$\pi(G) = \phi_{C_1} \prod_{j=2}^{J} \frac{\phi_{R_j}\,\phi_{C_j}/\psi_{S_j}}{\phi_{R_j}} = \frac{\prod_{j=1}^{J} \phi_{C_j}}{\prod_{j=2}^{J} \psi_{S_j}}.$$
Since $G^{(\emptyset)}$, $G^{(\{v\})}$ and $G^{(\{v\},\{w\})}$, for distinct vertices $v, w \in V$, all denote the same graph, we must have $\phi_{\{v\}} = \pi(G^{(\emptyset)})$ for all $v$, and also $\psi_{\emptyset} = \pi(G^{(\emptyset)})$. Under these conditions, the constant of proportionality in (1) is evidently 1. Conversely, it is trivial to show that if the clique-separator factorisation property (1) applies to $\pi$, then $\pi$ is the density of a weakly structurally Markov graph law.

Identifiability of parameters
Byrne and Dawid (2015, Proposition 3.14) point out that their $\{t_A(G)\}$ values are subject to $|V| + 1$ linear constraints, $\sum_{A \subseteq V} t_A(G) = 1$ and $\sum_{A \ni v} t_A(G) = 1$ for all $v \in V$, so that their parameters $\omega_A$, or equivalently $\phi_A$, are not all identifiable. They obtain identifiability by proposing a standardised vector, with $|V| + 1$ entries that are necessarily 0, that is a linear transform of $\omega$. By the same token, the $|V| + 1$ constraints on $t_A(G)$ are linear constraints on $t^+_A(G)$ and $t^-_A(G)$, and so $\{\phi_A\}$ and $\{\psi_A\}$ are not all identifiable. We could obtain identifiable parameters by, for example, choosing $\psi_{\emptyset} = 1$ and $\phi_{\{v\}} = 1$ for all $v \in V$, or, as above, by setting $\psi_{\emptyset} = \pi(G^{(\emptyset)})$ and $\phi_{\{v\}} = \pi(G^{(\emptyset)})$ for all $v$, or in other ways.
Note in addition that $\emptyset$ cannot be a clique, and neither $A = V$ nor any subset $A$ of $V$ with $|A| = |V| - 1$ can be a separator, so the corresponding $\phi_A$ and $\psi_A$ are never used. The dimension of the space of clique-separator factorisation laws is therefore $2 \times 2^{|V|} - 2|V| - 3$, nearly twice that of clique exponential family laws, $2^{|V|} - |V| - 1$.
For example, when $|V| = 3$, all graphs are decomposable, and all graph laws are clique-separator factorisation laws, while clique exponential family laws have dimension 4; when $|V| = 4$, 61 out of 64 graphs are decomposable, and the dimensions of the two spaces of laws are 21 and 11; when $|V| = 7$, only 617675 out of the $2^{21}$ graphs are decomposable, and the dimensions are 239 and 120.
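These figures are easy to check numerically for small vertex sets. The sketch below is our own illustration rather than anything from the paper: it counts decomposable graphs by brute force, testing chordality by repeatedly deleting a simplicial vertex, and evaluates the two dimension formulas. The enumeration is only practical for very small $|V|$.

```python
from itertools import combinations

def is_decomposable(n, edges):
    """Chordality test: a graph is decomposable iff its vertices can be
    deleted one at a time, each being simplicial when deleted."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    alive = set(range(n))
    while alive:
        simplicial = next(
            (v for v in alive
             if all(b in adj[a] for a, b in combinations(adj[v] & alive, 2))),
            None)
        if simplicial is None:
            return False  # no simplicial vertex: a chordless cycle remains
        alive.remove(simplicial)
    return True

def count_decomposable(n):
    """Number of decomposable graphs on n labelled vertices (brute force)."""
    pairs = list(combinations(range(n), 2))
    return sum(
        is_decomposable(n, [e for k, e in enumerate(pairs) if mask >> k & 1])
        for mask in range(2 ** len(pairs)))

def csf_dimension(n):
    """Dimension of the clique-separator factorisation laws."""
    return 2 * 2 ** n - 2 * n - 3

def cef_dimension(n):
    """Dimension of the clique exponential family laws."""
    return 2 ** n - n - 1
```

With three vertices, `csf_dimension(3)` is 7, one less than the number of graphs, which agrees with the statement that every graph law on three vertices is a clique-separator factorisation law.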

Conjugacy and posterior updating
As priors for the graph underlying a model $P(X \mid G)$ for data $X$, clique-separator factorisation laws are conjugate for decomposable likelihoods, in the case where there are no unknown parameters in the distribution: given $X$ from the model
$$P(X \mid G) = \frac{\prod_{C \in \mathcal{C}} \lambda_C(X_C)}{\prod_{S \in \mathcal{S}} \lambda_S(X_S)},$$
where $\lambda_A(X_A)$ denotes the marginal distribution of $X_A$, the posterior for $G$ is
$$\pi(G \mid X) \propto \frac{\prod_{C \in \mathcal{C}} \phi_C\,\lambda_C(X_C)}{\prod_{S \in \mathcal{S}} \psi_S\,\lambda_S(X_S)}.$$
More generally, when there are parameters in the graph-specific likelihoods, the notions of compatibility and hyper-compatibility (Byrne and Dawid, 2015) allow the extension of the idea of structural Markovianity to the joint Markovianity of the graph and the parameters, and give the form of the corresponding posterior.
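In the no-parameter case, conjugacy amounts to multiplying each potential by the matching marginal likelihood term, which is simplest to carry out on the log scale. The sketch below is only illustrative: the function names are ours, and `toy_log_marginal` is a placeholder standing in for $\log \lambda_A(X_A)$, not a real marginal likelihood.

```python
def log_csf_weight(cliques, separators, log_phi, log_psi):
    """Log of the unnormalised clique-separator factorisation density."""
    return (sum(log_phi(C) for C in cliques)
            - sum(log_psi(S) for S in separators))

def posterior_log_potentials(log_phi, log_psi, log_marginal):
    """Conjugate update: add log lambda_A(X_A) to each log potential."""
    return (lambda C: log_phi(C) + log_marginal(C),
            lambda S: log_psi(S) + log_marginal(S))

# Hypothetical prior potentials and a placeholder for log lambda_A(X_A).
log_phi = lambda C: -4.0 * len(C)
log_psi = lambda S: -0.5 * len(S)
toy_log_marginal = lambda A: -1.0 * len(A)  # assumption, for illustration only

cliques = [{1, 2, 3}, {2, 3, 4}, {4, 5}]
separators = [{2, 3}, {4}]
post_phi, post_psi = posterior_log_potentials(log_phi, log_psi, toy_log_marginal)
log_prior = log_csf_weight(cliques, separators, log_phi, log_psi)
log_post = log_csf_weight(cliques, separators, post_phi, post_psi)
```

The difference `log_post - log_prior` equals the log likelihood of the graph under the placeholder marginals, so a sampler written for the prior can target the posterior simply by swapping in the updated potentials.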

Computational implications
Computing posterior distributions of graphs on a large scale remains problematic, with Markov chain Monte Carlo methods seemingly the only option except for toy problems, and these methods having notoriously poor mixing. However, the junction tree sampler of Green and Thomas (2013) seems to give acceptable performance for moderate-sized problems of up to a few hundred variables. Posteriors induced by clique-separator factorisation law priors are ideal material for these samplers, which explicitly use a clique-separator representation of all graphs and distributions.
In Bornn and Caron (2011), a different Markov chain Monte Carlo sampler for clique-separator factorisation laws is introduced. We have evidence that the examples shown in their figures are not representative samples from the particular models claimed, due to poor mixing.

Modelling
Here we briefly discuss the way in which choice of particular forms for the parameters φ A and ψ A govern the qualitative and even quantitative aspects of the graph law. These choices are important in designing a graph law for a particular purpose, whether or not this is prior modelling in Bayesian structural learning.
A limitation of clique exponential family models is that because large clique potentials count in favour of a graph, and large separator potentials count against, it is difficult for these laws to encourage the same features in both cliques and separators. For instance, if we choose clique potentials to favour large cliques, we seem to be forced to favour small separators.
A popular choice for a graph prior in past work on Bayesian structural learning is the well-known Erdős-Rényi random graph model, in which each of the $|V|(|V| - 1)/2$ possible edges on the vertex set $V$ is present independently, with probability $p$. This model is amenable to theoretical study, but realisations of this model typically exhibit no discernible structure. When restricted to decomposable graphs, the Erdős-Rényi model is a rather extreme example of a clique exponential family law, arising by taking $\phi_A = (p/(1 - p))^{|A|(|A|-1)/2}$. Again realisations appear unstructured, essentially because of the quadratic dependence on clique or separator size in the exponent of the potentials $\phi_A$.
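The claim that these potentials reproduce the Erdős-Rényi weights can be checked on a small example: for a decomposable graph, the number of edges equals the sum of $|C|(|C|-1)/2$ over cliques minus the same sum over separators. A minimal sketch, with an illustrative graph and function names of our own choosing:

```python
from itertools import combinations
from math import comb, isclose

def er_clique_weight(cliques, separators, p):
    """Clique exponential family weight with
    phi_A = (p / (1 - p)) ** (|A| (|A| - 1) / 2)."""
    odds = p / (1 - p)
    phi = lambda A: odds ** comb(len(A), 2)
    weight = 1.0
    for C in cliques:
        weight *= phi(C)
    for S in separators:
        weight /= phi(S)
    return weight

def edge_count(cliques):
    """Number of distinct edges in the graph with the given cliques."""
    return len({frozenset(e) for C in cliques
                for e in combinations(sorted(C), 2)})

cliques = [{1, 2, 3}, {2, 3, 4}, {4, 5}]
separators = [{2, 3}, {4}]
p = 0.3
same = isclose(er_clique_weight(cliques, separators, p),
               (p / (1 - p)) ** edge_count(cliques))
```

Here the graph has six edges, and the weight agrees with $(p/(1-p))^6$, which is the Erdős-Rényi probability up to the constant factor $(1-p)^{|V|(|V|-1)/2}$.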
For a concrete example of a model with much more structure, suppose that our decomposable graph represents a communication network. There are two types of vertices, hubs and non-hubs. Adjacent vertices can all communicate with each other, but only hubs will relay messages. So, for a non-hub to communicate with a non-adjacent non-hub, there must be a path in the graph from one to the other where all intermediate nodes are hubs. This example has the interesting feature that using only local properties, it enforces a global property, universal communication. A necessary and sufficient condition for universal communication is that every separator contains a hub. This implies that either the graph is a single clique, or every clique must also contain a hub. To model this with a clique-separator factorisation law, we can set the separator potential to be $\psi_S = \infty$ if $S$ does not contain a hub. We are free to set the remaining values of $\psi_S$, and the values of the clique potentials $\phi_C$ for all cliques $C$, as we wish. In this example, these parameters are chosen to control the sizes of cliques and separators; specifically, $\phi_C = \exp(-4|C|)$ and $\psi_S = \exp(-0.5|S|)$ when $S$ contains a hub, which discourages both large cliques, and separators containing only hubs. The graph probability $\pi(G)$ will be zero for all decomposable graphs that fail to allow universal communication, and otherwise will follow the distribution implied by the potentials. Note that this example requires the slight generalisation of Theorem 1 mentioned in Remark 1. Figure 3 shows a sample from this model, generated using a junction tree sampler.
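The hub potentials can be written down directly. The following sketch uses the parameter values quoted above; the function name, the example graph and the choice of hub labels are illustrative assumptions. An infinite separator potential is represented by `math.inf`, so that dividing by it gives weight zero.

```python
import math

def hub_weight(cliques, separators, hubs):
    """Unnormalised weight under the hub model:
    phi_C = exp(-4|C|); psi_S = exp(-0.5|S|) if S contains a hub,
    and psi_S = infinity otherwise."""
    num = math.prod(math.exp(-4.0 * len(C)) for C in cliques)
    den = 1.0
    for S in separators:
        den *= math.exp(-0.5 * len(S)) if set(S) & hubs else math.inf
    return num / den

cliques = [{1, 2, 3}, {2, 3, 4}, {4, 5}]
separators = [{2, 3}, {4}]
allowed = hub_weight(cliques, separators, hubs={2, 4})   # every separator has a hub
forbidden = hub_weight(cliques, separators, hubs={2})    # separator {4} has no hub
```

With hubs $\{2, 4\}$ both separators contain a hub and the weight is positive; with vertex 2 as the only hub, the separator $\{4\}$ blocks relaying and the weight is exactly zero. An empty separator never contains a hub, so disconnected graphs also receive weight zero.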

Significance for statistical analysis
This short paper is not the place for a comprehensive investigation of the practical implications of adopting prior models from the clique-separator factorisation family in statistical analysis, something we intend to explore in later work. Instead, we extend the discussion of the example of the previous section to draw some lessons about inference. First we make the simple but important observation that the support of the posterior distribution of the graph cannot be greater than that of the prior. So, in the example of the hub model, the posterior will be concentrated on decomposable graphs where every separator contains a hub, and realisations will have some of the character of Figure 3.
There has been considerable interest recently in learning graphical models using methods that implicitly or explicitly favour hubs, defined in various ways with some affinity to our use of the term; see for example, Mohan et al. (2014), Tan et al. (2014) and Zhang et al. (2017). These are often motivated by genetic applications in which hubs may be believed to correspond to genes of special significance in gene regulation. These methods usually assume that the labelling of nodes as hubs is unknown, but it is straightforward to extend our hub model to put a probability model on this labelling, and to augment the Monte Carlo posterior sampler with a move that reallocates the hub labels, using any process that maintains the presence of at least one hub in every separator. This is a strong hint of the possibility of a fully Bayesian procedure that learns graphical models with hubs.

The cost of assuming the graph is decomposable when it is not
The assumption of a decomposable graph law as a prior in Bayesian structural learning is of course a profound restriction. There is no reason why nature should have been kind enough to generate data from graphical models that are decomposable. However, the computational advantages of such an assumption are tremendous; see the experiments and thorough review in Jones et al. (2005). The position has not changed much since this paper was written, so far as computation of exact posteriors is concerned.
However, an optimistic perspective on this conflict between prior reasonableness and computational tractability can be justified by work of Fitch et al. (2014). For the zero-mean Gaussian case, with a hyper-inverse-Wishart prior on the concentration matrix, they conclude that asymptotically the posterior will converge to graphical structures that are minimal triangulations of the true graph, the marginal log likelihood ratio comparing different minimal triangulations is stochastically bounded and appears to remain data dependent regardless of the sample size, and the covariance matrices corresponding to the different minimal triangulations are essentially equivalent, so model averaging is of minimal benefit. Informally, restriction to decomposable graphs doesn't matter really, with the right parameter priors; we can still fit essentially the right model, though perhaps inference on the graph itself should not be over-interpreted.

An even weaker structural Markov property
It is tempting to wonder if clique-separator factorisation is equivalent to a simpler definition of weak structural Markovianity, one that places yet fewer conditional independence constraints on $G$; the existence of Theorem 1 makes this possibility implausible, but it remains conceivably possible that a smaller collection of conditional independences could be equivalent. The following counter-example rules out the possibility of requiring only that
$$G_A \perp\!\!\!\perp G_B \mid \{G \in \mathfrak{U}^{+}(A, B)\}, \tag{2}$$
where $\mathfrak{U}^{+}(A, B)$ is the set of decomposable graphs for which $(A, B)$ is a decomposition, and $A \cap B$ is a clique in $G$.
Example. Consider graphs on vertices $\{1, 2, 3, 4\}$. The only non-trivial conditional independence statements implied by property (2) ... These two choices are independent, by (2), and this imposes 4 equality constraints on the graph law. There are 6 different choices for the two-vertex clique $A \cap B$, so not more than 24 constraints overall (they may not all be independent). There are 61 decomposable graphs on four vertices, so the set of graph laws satisfying (2) has dimension at least $60 - 24 = 36$. But as we saw in the section on identifiability of parameters, the set of clique-separator factorisation laws has dimension $2 \times 2^{|V|} - 2|V| - 3 = 21$.
Essentially, assumption (2) does not constrain the graph law sufficiently to obtain the explicit clique-separator factorisation. In fact, it is easy to show that (2) places no constraints on π(G) for any connected G consisting of one or two cliques.