François Caron, Emily B. Fox, Sparse Graphs Using Exchangeable Random Measures, Journal of the Royal Statistical Society Series B: Statistical Methodology, Volume 79, Issue 5, November 2017, Pages 1295–1366, https://doi.org/10.1111/rssb.12233
Summary
Statistical network modelling has focused on representing the graph as a discrete structure, namely the adjacency matrix. When assuming exchangeability of this array—which can aid in modelling, computations and theoretical analysis—the Aldous–Hoover theorem informs us that the graph is necessarily either dense or empty. We instead consider representing the graph as an exchangeable random measure and appeal to the Kallenberg representation theorem for this object. We explore using completely random measures (CRMs) to define the exchangeable random measure, and we show how our CRM construction enables us to achieve sparse graphs while maintaining the attractive properties of exchangeability. We relate the sparsity of the graph to the Lévy measure defining the CRM. For a specific choice of CRM, our graphs can be tuned from dense to sparse on the basis of a single parameter. We present a scalable Hamiltonian Monte Carlo algorithm for posterior inference, which we use to analyse network properties in a range of real data sets, including networks with hundreds of thousands of nodes and millions of edges.
1. Introduction
The rapid increase in the availability and importance of network data has been a driving force behind the significant recent attention on random-graph models. In devising such models, there are several competing forces:
flexibility to capture network features like sparsity of connections between nodes, heavy-tailed node degree distributions, dense spots or block structure;
interpretability of the network model and associated parameters;
theoretical tractability of analysis of network properties;
computational tractability of inference with the ability to scale analyses to large collections of nodes.
A plethora of network models have been proposed in recent decades, each with different trade-offs made between flexibility, interpretability and theoretical and computational tractability; we refer the interested reader to overviews of such models provided by Newman (2003, 2010), Bollobás (2001), Durrett (2007), Goldenberg et al. (2010) and Fienberg (2012). In this paper, our focus is on providing a new framework in which to make these trade-offs. We demonstrate the ability to make gains in multiple directions using this framework through a specific example where the goal is to capture
- i. sparsity, tunable from sparse to dense via interpretable parameters;
- ii. heavy-tailed degree distributions, again controlled via interpretable parameters; and
- iii. computational tractability of Bayesian inference, scaling to networks with hundreds of thousands of nodes and millions of edges.
Classically, the graph being modelled has been represented by a discrete structure, the adjacency matrix Z, where $Z_{ij}$ is a binary variable with $Z_{ij} = 1$ indicating an edge from node i to node j. In the case of undirected graphs, we furthermore restrict $Z_{ij} = Z_{ji}$. The statistical network model is then devised with this structure representing the observable quantity.
From a modelling, computational and theoretical standpoint, making an assumption of exchangeability is attractive. Under the adjacency matrix graph representation, such a statement informally equates with an invariance in distribution to permutations of node orderings. One reason why this assumption is attractive can be seen from applying the celebrated Aldous–Hoover theorem (Aldous, 1981; Hoover, 1979) to the adjacency matrix: infinite exchangeability of the binary matrix implies a mixture model representation involving transformations of uniform random variables (see theorem 7 in Appendix A.1). For undirected graphs, this transformation is specified by the graphon (Borgs et al. 2008; Lovász, 2013), which was originally studied as the limit object of dense graph sequences (Lovász and Szegedy, 2006; Borgs et al. 2010). The connection with the Aldous–Hoover theorem was made by Diaconis and Janson (2008). The graphon provides an object by which to study the theoretical properties of the statistical network process and to devise new estimators, as has been studied extensively in recent years (Bickel and Chen, 2009; Bickel et al., 2011; Rohe, et al., 2011; Zhao et al., 2012; Airoldi et al., 2013; Choi and Wolfe, 2014). Furthermore, the mixture model is a cornerstone of Bayesian modelling and provides a framework in which computational strategies are straightforwardly devised. Indeed, the Aldous–Hoover constructive definition has motivated new models (Lloyd et al., 2012) and many popular existing models can be recast in this framework, including the stochastic block model (Nowicki and Snijders, 2001; Airoldi et al., 2008) and latent space model (Hoff et al., 2002).
One consequence of the Aldous–Hoover theorem is that graphs represented by an exchangeable random array are either empty or dense, i.e. the number of edges grows quadratically with the number of nodes n (see Lovász (2013) and Orbanz and Roy (2015)). However, empirical analyses suggest that many real world networks are sparse (Newman, 2010). Formally, sparsity is an asymptotic property of a graph. Following Bollobás and Riordan (2009), we refer to graphs with $\Theta(n^2)$ edges as dense and graphs with $o(n^2)$ edges as sparse (for notation, see Appendix C). The conclusion appears to be that we cannot have both exchangeability, with the associated benefits described above, and sparse graphs. Although network models can often adapt parameters to fit finite graphs, it is appealing to have a modelling framework with theoretically provable properties that are consistent with observed network attributes.
There are two main approaches to handling this apparent conflict. One is to give up on exchangeability to obtain sparse graphs, as in the popular preferential attachment model (Barabási and Albert, 1999; Berger et al., 2014) or the configuration model (Bollobás, 1980; Newman, 2010); indeed, the networks literature is dominated by sparse non-exchangeable models. Alternatively, there is a body of literature that examines rescaling graph properties with the network size n, leading to sparse graph sequences in which each graph is finitely exchangeable (Bollobás et al., 2007; Bollobás and Riordan, 2009; Olhede and Wolfe, 2012; Wolfe and Olhede, 2013; Borgs et al., 2014a, 2014b). Convergence of sparse graph sequences, analogous to the study of limiting objects for dense graph sequences, has also been studied (e.g. Borgs et al. (2017)). However, any method building on a rescaling approach provides a graph distribution $\pi_n$ that lacks projectivity: marginalizing out node n does not yield $\pi_{n-1}$, the distribution on graphs of size n−1.

Point process representation of a random graph: each node i is embedded in $\mathbb{R}_+$ at some location $\theta_i$ and is associated with a sociability parameter $w_i$; an edge between nodes $\theta_i$ and $\theta_j$ is represented by points at locations $(\theta_i, \theta_j)$ and $(\theta_j, \theta_i)$ in $\mathbb{R}_+^2$
We note that exchangeability of the point process representation does not imply exchangeability of the associated adjacency matrix; however, the same modelling, computational and theoretical advantages remain. Interestingly, we arrive at a direct analogue of the constructive representation of the Aldous–Hoover theorem for exchangeable arrays and the associated graphon. Appealing to Kallenberg (1990, 2005), chapter 9, a point process on $\mathbb{R}_+^2$ is exchangeable if and only if it can be represented as a transformation of unit rate Poisson processes and uniform random variables (see theorem 1 in Section 2).
This link function has been previously used by several others to build network models (Aldous, 1997; Norros and Reittu, 2006; Bollobás et al., 2007; van der Hofstad, 2014). Note that, under this specification, the ‘time indices’ $\theta_i$ and $\theta_j$ of nodes i and j do not influence the probability that these two nodes form a link. This is in contrast with, for example, standard latent space models (Hoff et al., 2002). See Section 4 for further discussion.
To define the set of parameters $(w_i, \theta_i)_{i=1,2,\ldots}$ underlying this statistical network model, we explore the use of completely random measures (CRMs) (Kingman, 1967). The $(w_i)_{i=1,2,\ldots}$ are the jumps of the CRM and the $(\theta_i)_{i=1,2,\ldots}$ the locations of its atoms. We show that, by carefully choosing the Lévy measure characterizing this CRM, we can construct graphs ranging from sparse to dense. In particular, any Lévy measure yielding an infinite activity CRM leads to sparse graphs; alternatively, finite activity CRMs yield dense graphs. For the class of infinite activity regularly varying CRMs, we can sharpen the results to obtain graphs where the number of edges increases at a rate below $n^a$, where $1 < a < 2$ depends on the Lévy measure. We focus on the flexible generalized gamma process (GGP) CRM and show that one can tune the graph from dense to sparse via a single parameter.
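The finite versus infinite activity dichotomy just described rests on whether the Lévy measure has finite or infinite total mass. As an illustrative numerical sketch (ours, not part of the original exposition; all function names are our own), the following Python snippet integrates the GGP Lévy density $\rho(w) = w^{-1-\sigma}\,e^{-\tau w}/\Gamma(1-\sigma)$ over $[\varepsilon, \infty)$, showing the mass converging for σ < 0 (finite activity, dense graphs) but diverging as ε→0 for σ ∈ (0, 1) (infinite activity, sparse graphs).

```python
import math

def ggp_density(w, sigma, tau):
    """GGP Levy density rho(w) = w^(-1-sigma) * exp(-tau*w) / Gamma(1-sigma)."""
    return w ** (-1.0 - sigma) * math.exp(-tau * w) / math.gamma(1.0 - sigma)

def mass_above(eps, sigma, tau, upper=50.0, n=20000):
    """Trapezoidal estimate of int_eps^upper rho(w) dw on a log grid
    (substituting w = e^u, dw = e^u du, to handle the steep behaviour near 0)."""
    lo, hi = math.log(eps), math.log(upper)
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        w0, w1 = math.exp(lo + k * h), math.exp(lo + (k + 1) * h)
        total += 0.5 * (ggp_density(w0, sigma, tau) * w0 +
                        ggp_density(w1, sigma, tau) * w1) * h
    return total

finite = [mass_above(eps, -0.5, 1.0) for eps in (1e-2, 1e-4, 1e-6)]
infinite = [mass_above(eps, 0.5, 1.0) for eps in (1e-2, 1e-4, 1e-6)]
print(finite)    # converges to a finite limit (finite activity)
print(infinite)  # grows roughly like eps^(-1/2) (infinite activity)
```

The same computation applies to any candidate Lévy density, so it can serve as a quick diagnostic for which regime a particular specification falls into.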
Building on the framework of CRMs leads to other desirable properties as well. One is that our CRM-based exchangeable point process leads to an analytic representation for the graphon analogue in the Kallenberg framework (see Section 5.1). Another is that, by drawing on the considerable theory of CRMs that has been well studied in the Bayesian non-parametric community, we can derive network simulation techniques and develop a principled statistical estimation procedure. For the latter, in Section 7 we devise a scalable Hamiltonian Monte Carlo (HMC) sampler that can automatically handle a range of graphs from dense to sparse. We empirically show in Section 8 that our methods scale to graphs with hundreds of thousands of nodes and millions of edges. Importantly, exchangeability of the random measure underlies the efficiency of the sampler.
In summary, the CRM-based formulation combined with the specific link model of equation (2) serves as a proof of concept that moving to the point process representation of equation (1) can yield models with desirable attributes that are different from what can be obtained by using the discrete adjacency matrix representation. More generally, the notion of modelling the graph as an exchangeable random measure and appealing to a Kallenberg representation for such exchangeable random measures serves as an important new framework for devising and studying random-graph models, just as the original graphon concept stimulated considerable work in the network community in the past decade.
Our paper is organized as follows. In Section 2, we provide background on exchangeability and CRMs. Our statistical network models for directed multigraphs, undirected graphs and bipartite graphs are in Section 3. A discussion of our framework compared with related network models is provided in Section 4. Properties, such as exchangeability and sparsity, and methods for simulation are presented in Section 5. Specific cases of our formulation leading to dense and sparse graphs are considered in Section 6, including an empirical analysis of network properties of our proposed formulation. Our Markov chain Monte Carlo (MCMC) posterior computations are in Section 7. Finally, Section 8 provides a simulation study and an analysis of a variety of large, real world graphs.
The programs that were used to analyse the data can be obtained from
http://wileyonlinelibrary.com/journal/rssdatasets
2. Background on exchangeability
Our focus is on exchangeable random structures that can represent networks. We first briefly review exchangeability for random sequences, continuous time processes and discrete network arrays. Thorough and accessible overviews of exchangeability of random structures have been presented in the surveys of Aldous (1985) and Orbanz and Roy (2015). Here, we simply abstract away the notions that are relevant to placing our network formulation in context, as summarized in Table 1.
The classical representation theorem arising from a notion of exchangeability for discrete sequences of random variables is due to de Finetti (1931). The theorem states that a sequence $Z_1, Z_2, \ldots$ is exchangeable if and only if there is a random probability measure $\mu$ on $\mathcal{Z}$ with law ν such that the $Z_i$ are conditionally independent and identically distributed (IID) given $\mu$; i.e. all exchangeable infinite sequences can be represented as a mixture with mixing measure ν. If examining continuous time processes instead of sequences, the representation that is associated with exchangeable increments was given by Bühlmann (1960) (see also Freedman (1996)), in terms of mixtures of Lévy processes.
In words, result (4) states that the point process Z is exchangeable if, for any arbitrary regular square grid on the plane, the associated infinite array of increments (edge counts between nodes in a square) is exchangeable (Fig. 2). This provides a notion of exchangeability akin to that of the Aldous–Hoover theorem, but fundamentally different as the array being considered here is formed on the basis of an underlying continuous process. This array is not equivalent to an adjacency matrix, regardless of how fine a grid is considered.

Illustration of the notion of exchangeability for point processes on the plane: for any regular square grid on the plane (a), the associated infinite array counting the number of points in each square (b) is exchangeable in the sense of result (3)
Kallenberg (1990) derived de-Finetti-style representation theorems for separately and jointly exchangeable random measures on $\mathbb{R}_+^2$, which we present for the jointly exchangeable case in theorem 1. In what follows, λ denotes the Lebesgue measure on $\mathbb{R}_+$ and $\lambda_D$ the Lebesgue measure on the diagonal $D = \{(x, x) : x \in \mathbb{R}_+\}$. We also define a U-array to be an array of independent uniform random variables.
Theorem 1 (representation theorem for jointly exchangeable random measures on $\mathbb{R}_+^2$; Kallenberg (1990) and Kallenberg (2005), theorem 9.24). A random measure ξ on $\mathbb{R}_+^2$ is jointly exchangeable if and only if, almost surely, it admits the representation in equation (5) for some measurable functions f, g and h, h′, l, l′. Here, $(\zeta_{\{i,j\}})$ is a U-array, the point sets $\{(\theta_i, \vartheta_i)\}$ and $\{(\sigma_{ij}, \chi_{ij})\}$ are independent, unit rate Poisson processes, and $\alpha_0, \beta_0, \gamma_0 \geq 0$ are an independent set of random variables.
We can think of the $\theta_i$ as random time indices, and the remaining Poisson processes as forming vertical and horizontal lines. The representation (5) is slightly more involved than the representation theorem for exchangeable arrays (see Appendix A.1). The first component of ξ is, however, similar to the representation for exchangeable arrays, the sequence of fixed indices i = 1, 2,… and uniform random variables $(U_i)_{i=1,2,\ldots}$ in equation (46) in Appendix A.1 being replaced by a unit rate Poisson process on $\mathbb{R}_+^2$. We place our proposed network model of Section 3 within this Kallenberg representation in Section 5.1, yielding direct analogues to the classical graphon representation of graphs based on exchangeability of the adjacency matrix.
3. Proposed statistical network model
There are many options for defining a statistical model for the point process graph representation Z. In this paper we consider one in particular, based on a specific choice of
- a. a link probability model and
- b. a prior on the model parameters.
As mentioned in Section 1, this link probability model is not new to the statistical networks community and is a straightforward method for achieving node heterogeneity (see Aldous (1997) and Norros and Reittu (2006)).
3.1. Defining node parameters by using completely random measures
The model parameters consist of a collection of node-specific sociability parameters $w_i > 0$ and continuous-valued node indices $\theta_i \in \mathbb{R}_+$.
CRMs have been used extensively in the Bayesian non-parametric literature for proposing flexible classes of priors over functional space (Regazzini et al., 2003; Lijoi and Prünster, 2010). We briefly review a few important properties of CRMs that are relevant to our construction; the reader can refer to Kingman (1993) or Daley and Vere-Jones (2008) for an exhaustive coverage.
A CRM W on $\mathbb{R}_+$ is a random measure such that, for all finite families of disjoint, bounded measurable sets $(A_1, \ldots, A_n)$ of $\mathbb{R}_+$, the random variables $W(A_1), \ldots, W(A_n)$ are mutually independent.
We shall focus here on CRMs with no deterministic component and stationary increments (i.e. the distribution of $W([s, t])$ depends only on t − s). In this case, the CRM takes the form (7), with $\{(w_i, \theta_i)\}$ the points of a Poisson point process on $(0, \infty) \times \mathbb{R}_+$ with mean measure $\nu(\mathrm{d}w, \mathrm{d}\theta) = \rho(\mathrm{d}w)\,\lambda(\mathrm{d}\theta)$, where λ is the Lebesgue measure and ρ is a Lévy measure on $(0, \infty)$.
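For a finite activity CRM, the factorization of the mean measure suggests a direct sampling recipe on [0, α]: draw a Poisson number of atoms with mean $\alpha \int \rho(\mathrm{d}w)$, then draw weights IID from the normalized ρ and locations uniformly on [0, α]. The following Python sketch (ours, not from the paper; the exponential weight density is a toy choice and the function names are our own) implements this recipe.

```python
import math
import random

def sample_finite_activity_crm(alpha, rate, weight_sampler, rng):
    """Sample the atoms {(w_i, theta_i)} of a finite-activity CRM on [0, alpha].

    The Poisson mean measure factorises as rho(dw) * lambda(dtheta), so the
    number of atoms is Poisson(alpha * rho_total); given that count, weights
    are iid from rho / rho_total and locations are uniform on [0, alpha].
    Here `rate` = rho_total = int rho(dw), assumed finite, and
    `weight_sampler` draws from the normalised rho (both user supplied).
    """
    mean = alpha * rate
    # Poisson draw via inversion (adequate for moderate means)
    n, p, c, u = 0, math.exp(-mean), math.exp(-mean), rng.random()
    while u > c:
        n += 1
        p *= mean / n
        c += p
    return [(weight_sampler(rng), rng.uniform(0.0, alpha)) for _ in range(n)]

rng = random.Random(0)
# toy choice: rho(dw) = 2 * Exp(1) density, so rho_total = 2
atoms = sample_finite_activity_crm(alpha=10.0, rate=2.0,
                                   weight_sampler=lambda r: r.expovariate(1.0),
                                   rng=rng)
print(len(atoms))  # around alpha * rho_total = 20 atoms
```

Infinite activity CRMs cannot be sampled this way, since the atom count is almost surely infinite; truncation-based approaches for that case are discussed in Section 5.5.3.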
In Section 6, we consider special cases including the (compound) Poisson process and generalized gamma process (GGP) (Brix, 1999; Lijoi et al., 2007).
3.2. Directed multigraphs
Formally, our undirected graph model is viewed as a transformation of a directed integer-weighted graph, or multigraph, as detailed in Section 3.3. We now specify this directed multigraph. Although our primary focus is on undirected network models, in some applications the directed multigraph might itself be the quantity of interest. For example, in social networks, interactions are often not only directed (‘person i messages person j’) but also have an associated count. Additionally, interactions might be typed (‘message’, ‘SMS’, ‘like’, ‘tag’). Our proposed framework could be directly extended to model such data.
Example of (a) an atomic measure D as in equation (10) restricted to [0, 1]2, (b) the corresponding directed multigraph and (c) the corresponding undirected graph
3.3. Undirected graphs via transformations of directed graphs
This process is depicted graphically in Fig. 4.

Example of (a) the product measure for CRM W, (b) a draw of the directed multigraph measure D|W∼PP (W × W) and (c) the corresponding undirected measure
To see the equivalence between this formulation and that specified in equation (6), note that, for $i \neq j$, $z_{ij} = \min(n_{ij} + n_{ji}, 1)$. By properties of the Poisson process, $n_{ij}$ and $n_{ji}$ are independent random variables conditioned on W. The sum of two independent Poisson random variables, each with rate $w_i w_j$, is again Poisson, with rate $2 w_i w_j$. Result (6) then arises from the fact that $P(n_{ij} + n_{ji} \geq 1 \mid W) = 1 - \exp(-2 w_i w_j)$. Likewise, the i = j case arises by similar reasoning for $n_{ii}$, which is Poisson with rate $w_i^2$.
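The Poisson superposition argument above is easy to check by simulation. The following Python sketch (ours, purely illustrative) verifies empirically that drawing two independent Poisson counts with rate $w_i w_j$ and declaring an edge when their sum is positive reproduces the edge probability $1 - \exp(-2 w_i w_j)$.

```python
import math
import random

rng = random.Random(1)

def poisson(mean, rng):
    """Simple Poisson sampler via inversion."""
    n, p, c, u = 0, math.exp(-mean), math.exp(-mean), rng.random()
    while u > c:
        n += 1
        p *= mean / n
        c += p
    return n

w_i, w_j = 0.7, 1.3
trials = 20000
# edge present iff at least one directed edge in either direction
edge = sum(1 for _ in range(trials)
           if poisson(w_i * w_j, rng) + poisson(w_i * w_j, rng) > 0)
empirical = edge / trials
theoretical = 1.0 - math.exp(-2.0 * w_i * w_j)
print(round(empirical, 3), round(theoretical, 3))  # close agreement
```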
We note that our computational strategy of Section 7 relies on this interpretation of our model for undirected graphs as a transformation of a directed multigraph. In particular, we introduce the directed edge counts as latent variables and do inference over these counts.
3.4. Bipartite graphs
The above construction can also be extended to bipartite graphs. Let $V = (\theta_1, \theta_2, \ldots)$ and $V' = (\theta'_1, \theta'_2, \ldots)$ be two countably infinite sets of nodes, with $\theta_i, \theta'_j \in \mathbb{R}_+$. We assume that only connections between nodes of different sets are allowed.
Model (14) has been proposed by Caron (2012) in a slightly different formulation. In this paper, we recast this model within our general framework, enabling new theoretical and practical insights.
3.5. Interpretation of $\theta_i$
We can think of the positive, continuous-valued node index $\theta_i$ as representing the time at which a potential node i enters the network and has the opportunity to link with other existing nodes j with $\theta_j < \theta_i$. We use the terminology ‘potential node’ here to clarify that this node need not form any observed connections with other nodes existing before time $\theta_i$. We emphasize that an observed link between i and some other node will eventually occur almost surely as time progresses. This could represent, for example, signing onto a social networking service before your friends do and only forming a link once they join. On the basis of our CRM specification, we have almost surely an infinite number of potential nodes as time goes to ∞. For infinite activity CRMs, we have almost surely an infinite set of potential nodes even at any finite time.
In Section 5, we examine properties of the network process across time, and we describe methods for simulating networks at any finite time. There, our focus is on the observed link process from this set of potential nodes. For example, sparsity is examined with respect to the set of nodes with degree at least 1, not with respect to the set of potential nodes. Since we need not think of $\theta_i$ as a time index, but rather just as a general construct of our formulation, we also generically refer to $\theta_i$ as the node location in the remainder of the paper.
4. Related work
There has been extensive work over recent years on flexible Bayesian non-parametric models for networks, allowing complex latent structures of unknown dimension to be uncovered from real world networks (Kemp et al., 2006; Miller et al., 2009; Lloyd et al., 2012; Palla et al., 2012; Herlau et al., 2014). However, as mentioned in the unifying overview of Orbanz and Roy (2015), these methods all fit in the Aldous–Hoover framework and as such produce dense graphs.
Norros and Reittu (2006) proposed a conditionally Poissonian multigraph process with similarities to be drawn to our multigraph process. In their formulation, each node has a given sociability parameter and the number of edges between two nodes i and j is drawn from a Poisson distribution with rate the product of the sociability parameters, normalized by the sum of the sociability parameters of all the nodes. The normalization makes this model similar to models based on rescaling of the graphon and, as such, does not define a projective model, as explained in Section 1. See van der Hofstad (2014) for a review of this model and Britton et al. (2006) for a similar model.
As pointed out by Jacobs and Clauset (2014) in their discussion of an earlier version of this paper, another related model is the degree-corrected random-graph model (Karrer and Newman, 2011), where edges of the multigraph are drawn from a Poisson distribution whose rate is the product of node-specific sociability parameters and a parameter tuning the interaction between the latent communities to which these nodes belong. When the sociability parameters are assumed to be IID from some distribution, this model yields an exchangeable adjacency matrix and thus a dense graph.
Additionally, there are similarities to be drawn with the extensive literature on latent space modelling (e.g. Hoff et al. (2002), Penrose (2003) and Hoff (2009)). In such models, nodes are embedded in a low dimensional, continuous latent space and the probability of an edge is determined by a distance or similarity metric of the node-specific latent factors. In our case, the continuous node index is of no importance in forming edge probabilities. It would, however, be possible to extend our approach to time- or location-dependent connections by considering inhomogeneous CRMs.
Finally, as we shall detail in Section 5.5, our model admits a construction with connections to the configuration model (Bollobás, 1980; Newman, 2010), which is a popular model for generating simple graphs with a given degree sequence.
The connections with this broad set of past work place our proposed network model within the context of existing literature. Importantly, however, to the best of our knowledge this work represents the first fully generative and projective approach to sparse graph modelling (see Section 5), and with a notion of exchangeability that is essential for devising our scalable statistical estimation procedure, as shown in Section 7.
5. General properties and simulation
We provide general properties of our network model depending on the properties of the Lévy measure ρ.
5.1. Exchangeability under the Kallenberg framework
Proposition 1(joint exchangeability of undirected graph measure). For any CRM W∼CRM(ρ,λ), the point process Z defined by equation (12), or equivalently by equation (6), is jointly exchangeable.
The proof is given in Appendix B. In the adjacency matrix representation, we think of exchangeability as invariance to node orderings. Here, we have invariance to the times of arrival of the nodes, thinking of $\theta_i$ as a time index.
In Section 6, we provide explicit forms for M depending on our choice of Lévy measure ρ. Expression (16) represents a direct analogue of that arising from the Aldous–Hoover framework. In particular, M here is akin to the graphon ω of expression (47) in Appendix A.1, and thus allows us to connect our CRM-based formulation with the extensive literature on graphons. An illustration of the network construction from the Kallenberg representation, including the function M, is given in Fig. 5. Note that, if we had started from the Kallenberg representation and selected an f (or M) arbitrarily, we would probably not have obtained a network model with the normalized CRM interpretation that enables both interpretability and analysis of network properties.
Illustration of the model construction based on the Kallenberg representation: (a) a unit rate Poisson process $(\theta_i, \vartheta_i)$, $i \in \mathbb{N}$, on $[0, \alpha] \times \mathbb{R}_+$; (b) for each pair $\{i, j\}$, set $z_{ij} = z_{ji} = 1$ with probability $M(\vartheta_i, \vartheta_j)$ (here, M is indicated by the blue shading (darker shading indicates a higher value) for a stable process (GGP with τ = 0); in this case there is an analytic expression for $\bar{\rho}^{-1}$ and therefore for M)
For the bipartite graph, Kallenberg's representation theorem for separate exchangeability (Kallenberg (1990) and Kallenberg (2005), theorem 9.23) can likewise be applied.
5.2. Interactions between groups
5.3. Graph restrictions
Let us consider the restriction of our process to the square $[0, \alpha]^2$. For finite activity CRMs, there will be a finite number of potential nodes (jumps) in the interval [0, α]. For infinite activity CRMs, we shall have an infinite number of potential nodes. We are interested in the properties of the process as α grows, where we can think of α as representing time, observing the process as new potential nodes and any resulting edges enter the network. We note that, in the limit α→∞, the number of edges approaches ∞, since the total mass $W([0, \alpha]) \to \infty$ almost surely.
Let $D_\alpha$ and $Z_\alpha$ be the restrictions of D and Z respectively to the square $[0, \alpha]^2$. Then, $(D_\alpha)_{\alpha \geq 0}$ and $(Z_\alpha)_{\alpha \geq 0}$ are measure-valued stochastic processes, indexed by α. We also denote by $W_\alpha$ and $\lambda_\alpha$ the corresponding CRM and Lebesgue measure on [0, α]. In what follows, our interest is in studying how the following quantities vary with α:
- a. $N_\alpha$, the number of nodes with degree at least 1 in the network, and
- b. $N_\alpha^{(e)}$, the number of edges in the undirected network.

Example of point process Z and, above it, the associated integer-valued stochastic processes for the number of observed nodes ($N_\alpha$) and edges ($N_\alpha^{(e)}$)
5.4. Sparsity
In this section we state the sparsity properties of our graph model, which relate to the properties of the Lévy measure ρ. In particular, we are interested in the relative asymptotic behaviour of the number of edges with respect to the number of observed nodes as α→∞. Henceforth, we consider non-trivial Lévy measures ρ ≠ 0, since the case ρ = 0 trivially gives an empty graph almost surely.
In theorem 2 we characterize the sparsity of the graph with respect to the properties of its Lévy measure: graphs obtained from infinite activity CRMs are sparse, whereas graphs obtained from finite activity CRMs are dense. The rate of growth can be further specified when ρ is a regularly varying Lévy measure (Feller, 1971; Karlin, 1967; Gnedin et al., 2006, 2007), as defined in Appendix A.2. We follow the notation of Janson (2011) for probability asymptotics (see Appendix C.1 for details).
Theorem 2. Consider a point process Z representing an undirected graph. Let $N_\alpha^{(e)}$ be the number of edges and $N_\alpha$ the number of observed nodes in the point process restriction $Z_\alpha$ (see equations (17) and (18)), and assume that the defining Lévy measure ρ is non-trivial. If the CRM W is finite activity, i.e. $\int_0^\infty \rho(\mathrm{d}w) < \infty$, then the number of edges scales quadratically with the number of observed nodes, $N_\alpha^{(e)} = \Theta(N_\alpha^2)$, almost surely as α→∞, implying that the graph is dense (19). If the CRM is infinite activity, i.e. $\int_0^\infty \rho(\mathrm{d}w) = \infty$, then the number of edges scales subquadratically with the number of observed nodes, $N_\alpha^{(e)} = o(N_\alpha^2)$, almost surely as α→∞, implying that the graph is sparse (20). Furthermore, if the Lévy measure ρ is regularly varying (see definition 1 in Appendix A.2), with exponent σ ∈ (0, 1) and slowly varying function l, then the sharper rate in equation (21) holds.
Theorem 2 is a direct consequence of two theorems that we state now and prove in Appendix C. The first theorem states that the number of edges grows quadratically with α, whereas the second states that the number of nodes scales superlinearly with α for infinite activity CRMs, and linearly otherwise.
Theorem 3. Consider the point process Z. If ρ ≠ 0, then the number of edges $N_\alpha^{(e)}$ in $Z_\alpha$ grows quadratically with α, as in equation (22), almost surely; otherwise, $N_\alpha^{(e)} = 0$ almost surely.
Theorem 4. Consider the point process Z. Then equation (23) holds almost surely as α→∞. In words, the number of nodes with degree at least 1 in $Z_\alpha$ scales linearly with α for finite activity CRMs and superlinearly with α for infinite activity CRMs. Furthermore, for a regularly varying Lévy measure with slowly varying function l, the rate in equation (24) holds.
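As a numerical illustration of the dense regime (ours, not from the paper; the toy specification is a unit rate compound Poisson CRM with Exp(1) weights, and all function names are our own), the following Python sketch estimates the ratio of edge count to squared observed node count at increasing α. Theorems 2–4 predict that this ratio stabilizes at a positive constant.

```python
import math
import random

def dense_ratio(alpha, rng, reps=50):
    """Average N_edges / N_nodes^2 for a compound-Poisson (finite-activity)
    CRM with unit rate and Exp(1) weights; the dense regime predicts this
    ratio tends to a positive constant as alpha grows."""
    acc = 0.0
    for _ in range(reps):
        # number of potential nodes ~ Poisson(alpha), via inversion
        n, p, c, u = 0, math.exp(-alpha), math.exp(-alpha), rng.random()
        while u > c:
            n += 1
            p *= alpha / n
            c += p
        w = [rng.expovariate(1.0) for _ in range(n)]
        deg = [0] * n
        edges = 0
        for i in range(n):
            for j in range(i, n):
                # edge {i,j} with prob 1 - exp(-2 w_i w_j); self-loop rate w_i^2
                rate = 2.0 * w[i] * w[j] if i != j else w[i] * w[i]
                if rng.random() < 1.0 - math.exp(-rate):
                    edges += 1
                    deg[i] += 1
                    deg[j] += 1
        nodes = sum(1 for d in deg if d > 0)
        if nodes > 0:
            acc += edges / nodes ** 2
    return acc / reps

rng = random.Random(11)
ratios = [dense_ratio(a, rng) for a in (10.0, 20.0, 40.0)]
print([round(r, 3) for r in ratios])  # roughly constant: quadratic growth
```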
We finally give the expressions of the expectations for the number of edges and nodes in the model. The proof is given in Appendix C.4. (Equations (26) and (27) could alternatively be derived as particular cases of the results in Veitch and Roy (2015).)
Theorem 5. The expected number of edges in the multigraph, edges in the undirected graph and observed nodes are given in equations (25) and (26), where $\psi(t) = \int_0^\infty \{1 - \exp(-tw)\}\,\rho(\mathrm{d}w)$ is the Laplace exponent. Additionally, if ρ is a regularly varying Lévy measure with exponent σ ∈ [0, 1) and slowly varying function l, then equations (27) and (28) hold.
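For the generalized gamma process (Brix, 1999) with Lévy density $w^{-1-\sigma}\,e^{-\tau w}/\Gamma(1-\sigma)$, σ ∈ (0, 1), τ > 0, the Laplace exponent has the well-known closed form $\psi(t) = \{(t+\tau)^\sigma - \tau^\sigma\}/\sigma$ per unit length of λ. The following Python sketch (ours, not part of the paper) checks this closed form against direct numerical integration of the defining integral.

```python
import math

def ggp_laplace_numeric(t, sigma, tau, lo=1e-12, hi=60.0, n=200000):
    """Numerical psi(t) = int_0^inf (1 - e^{-t w}) rho(dw) for the GGP density
    rho(w) = w^(-1-sigma) e^(-tau w) / Gamma(1-sigma), via a trapezoidal
    rule on a log grid (substituting w = e^u, dw = e^u du)."""
    gam = math.gamma(1.0 - sigma)
    a, b = math.log(lo), math.log(hi)
    h = (b - a) / n
    total, prev = 0.0, None
    for k in range(n + 1):
        w = math.exp(a + k * h)
        # integrand times the Jacobian w; expm1 keeps precision for small t*w
        val = -math.expm1(-t * w) * w ** (-sigma) * math.exp(-tau * w) / gam
        if prev is not None:
            total += 0.5 * (prev + val) * h
        prev = val
    return total

def ggp_laplace_closed(t, sigma, tau):
    """Closed form psi(t) = ((t + tau)^sigma - tau^sigma) / sigma."""
    return ((t + tau) ** sigma - tau ** sigma) / sigma

num = ggp_laplace_numeric(2.0, 0.5, 1.0)
closed = ggp_laplace_closed(2.0, 0.5, 1.0)
print(round(num, 4), round(closed, 4))  # agreement to several decimals
```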
5.5. Simulation
5.5.1. Direct simulation of graph restrictions
Because of the discreteness of $W_\alpha$, there will be ties among the pairs $(U_{k1}, U_{k2})$, and the number of such ties corresponds to the multiplicity of that edge. In particular, although many nodes $U_{kj}$ are drawn in total, they result in some smaller number $N_\alpha$ of distinct values. We overload the notation here because this quantity also corresponds to the number of nodes with degree at least 1 in the resulting undirected network. Recall that the undirected network construction simply forms an undirected edge between a pair of nodes if there is at least one directed edge between them. If we consider unordered pairs $\{U_{k1}, U_{k2}\}$, the number of distinct such pairs, $N_\alpha^{(e)}$, corresponds to the number of edges in the undirected network.
The construction above enables us to re-express our Cox process model in terms of normalized CRMs (Regazzini et al., 2003). This is very attractive both practically and theoretically. As we show in Section 6 for special cases of CRMs, one can use the results surrounding normalized CRMs to derive an exact simulation technique for our directed and undirected graphs.
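The normalized CRM view yields a simple two-stage simulator for the restriction to $[0, \alpha]^2$: given the jumps of $W_\alpha$, the total number of directed edges is Poisson with mean equal to the squared total mass, and each endpoint of each edge is then drawn IID from the normalized weights. The following Python sketch (ours, not the paper's implementation; a fixed toy weight vector stands in for an actual CRM draw) illustrates this, together with the reduction to the undirected simple graph.

```python
import math
import random

def simulate_multigraph(weights, rng):
    """Given CRM jumps (w_1, ..., w_K) on [0, alpha], sample the directed
    multigraph: total edge count ~ Poisson(W*^2) with W* = sum of jumps,
    then each endpoint of each edge drawn iid from the normalised weights."""
    W = sum(weights)
    mean = W * W
    # Poisson draw via inversion (adequate for moderate means)
    n, p, c, u = 0, math.exp(-mean), math.exp(-mean), rng.random()
    while u > c:
        n += 1
        p *= mean / n
        c += p
    # cumulative distribution of the normalised weights
    cum, s = [], 0.0
    for w in weights:
        s += w
        cum.append(s / W)
    def draw():
        u = rng.random()
        for i, q in enumerate(cum):
            if u <= q:
                return i
        return len(cum) - 1
    edges = [(draw(), draw()) for _ in range(n)]
    # undirected simple graph: edge {i,j} present iff some directed edge exists
    undirected = {frozenset(e) for e in edges}
    return edges, undirected

rng = random.Random(7)
edges, und = simulate_multigraph([1.5, 0.8, 0.4, 0.1], rng)
print(len(edges), len(und))
```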
Remark 1. The construction above enables us to draw connections with the configuration model (Bollobás, 1980; Newman, 2010), which proceeds as follows. First, the degree k i of each node i = 1,…,n is specified such that the sum of the k i is an even number. Each node i is given a total of k i stubs, or demi-edges. Then, we repeatedly choose pairs of stubs uniformly at random, without replacement, and connect the selected pairs to form edges. A simple graph is obtained either by discarding the multiple edges and self-loops (an erased configuration model) or by repeating the sampling until a simple graph is obtained. In our case, we have an infinite set of (potential) nodes and do not prespecify the node degrees. Furthermore, each node in the pair (U k1,U k2) is drawn from a normalized CRM rather than the pair being selected uniformly at random. However, at a high level, there is a similar flavour to our construction.
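For comparison, the erased configuration model described in the remark can be sketched as follows (a minimal illustration with a hypothetical function name, not the authors' code; it assumes the degree sum is even):

```python
import numpy as np

def erased_configuration_model(degrees, seed=0):
    """Sketch of the erased configuration model: node i receives degrees[i]
    stubs, stubs are paired uniformly at random, and self-loops and multiple
    edges are then discarded to leave a simple graph."""
    rng = np.random.default_rng(seed)
    stubs = np.repeat(np.arange(len(degrees)), degrees)
    rng.shuffle(stubs)               # a uniform random pairing of the stubs
    pairs = stubs.reshape(-1, 2)     # consecutive shuffled stubs form edges
    # erase self-loops and collapse multi-edges into simple edges
    return {tuple(sorted(p)) for p in pairs if p[0] != p[1]}
```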
5.5.2. Urn-based simulation of graph restrictions
We now describe an urn formulation that allows us to obtain a finite dimensional generative process. Recall that, in practice, we cannot sample W α∼CRM(ρ,λ α) if the CRM is infinite activity since there will be an infinite number of jumps.
5.5.3. Approximate simulation of graph restrictions
If we cannot sample from the distribution in expression (33) and evaluate the EPPF in expression (32), we resort to approximate simulation methods. In particular, we harness the directed multigraph representation and approximate the draw of W α. For our undirected graphs, we simply transform the (approximate) draw of a directed multigraph as described in Section 3.3.
One approach to approximate simulation of W α, which is possible for some Lévy measures ρ, is to resort to adaptive thinning (Lewis and Shedler, 1979; Ogata, 1981; Favaro and Teh, 2013). A related alternative approximate approach, applicable to any Lévy measure ρ satisfying condition (9), is the inverse Lévy method. This method first defines a threshold ɛ and then samples the weights by using a Poisson measure on [ɛ,∞). One then simulates D α using these truncated weights Ω.
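A minimal sketch of such threshold-based approximate simulation is given below, assuming the GGP Lévy density ρ(w) = α w^(−1−σ) e^(−τw)/Γ(1−σ) with σ ∈ (0,1) and τ > 0. The function name, the trapezoidal evaluation of the tail mass and the Pareto-proposal rejection step are illustrative choices, not the authors' implementation.

```python
import numpy as np
from math import gamma

def sample_ggp_jumps(alpha, sigma, tau, eps, seed=0):
    """Keep only the jumps above a threshold eps.  These form a Poisson
    process whose expected number is the tail Levy mass; each jump is then
    drawn by rejection, proposing from a Pareto law on (eps, inf) and
    thinning by the exponential factor exp(-tau (w - eps))."""
    rng = np.random.default_rng(seed)
    levy = lambda w: alpha / gamma(1.0 - sigma) * w ** (-1.0 - sigma) * np.exp(-tau * w)
    # tail mass int_eps^inf levy(w) dw via the trapezoidal rule on a log grid
    w_grid = np.exp(np.linspace(np.log(eps), np.log(eps + 60.0 / tau), 8000))
    y = levy(w_grid)
    mass = 0.5 * ((y[1:] + y[:-1]) * np.diff(w_grid)).sum()
    n = rng.poisson(mass)                      # number of jumps above eps
    jumps = []
    while len(jumps) < n:
        u = 1.0 - rng.uniform()                # in (0, 1], avoids u = 0
        w = eps * u ** (-1.0 / sigma)          # Pareto proposal on (eps, inf)
        if rng.uniform() < np.exp(-tau * (w - eps)):
            jumps.append(w)
    return np.asarray(jumps)
```

The rejection step is valid because the target density on (ɛ,∞) is proportional to w^(−1−σ) e^(−τw), the Pareto proposal is proportional to w^(−1−σ), and their ratio is bounded by e^(−τɛ).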
6. Special cases
In this section, we examine the properties of various models and their link to classical random-graph models depending on the Lévy measure ρ. We show that, in the GGP case, the resulting graph can be either dense or sparse, with the sparsity tuned by a single hyperparameter. Furthermore, exact simulation is possible via expression (33). We focus on the undirected graph case, but similar results can be obtained for directed multigraphs and bipartite graphs.
6.1. Poisson process
6.2. Compound Poisson process
6.3. Generalized gamma process
When σ⩾0, the GGP has an infinite number of jumps over any interval [s,t]. It includes as special cases the gamma process (σ = 0, τ>0), the stable process (σ ∈ (0, 1), τ = 0) and the inverse Gaussian process (σ = 1/2, τ>0).
Sample graphs: (a) Erdős–Rényi graph G(n,p) with n = 1000 and p = 0.05; (b)–(d) GGP graphs GGP(α,τ,σ) with α = 100, τ = 2 and (b) σ = 0, (c) σ = 0.5, (d) σ = 0.8 (the size of a node is proportional to its degree; the graphs were generated with the software Gephi (Bastian et al., 2009))
6.3.1. Exact sampling via an urn approach
6.3.2. Sparsity
Appealing to theorem 2, we use the following facts about the GGP to characterize the sparsity properties of this special case.
- a.
For σ<0, the CRM is finite activity; thus theorem 2 implies that the graph is dense.
- b.
When σ⩾0 the CRM is infinite activity; moreover, for τ>0, theorem 2 implies that the graph is sparse.
- c. For σ>0, the tail Lévy intensity has asymptotic behaviour such that it is regularly varying with exponent σ and constant slowly varying function.
Remark 2. The proof technique of theorem 2 requires a condition that excludes the stable process (τ = 0, σ ∈ (0, 1)), although we conjecture that the graph is also sparse in that case.
6.3.3. Empirical analysis of graph properties
For the GGP-based formulation, we provide an empirical analysis of our network properties in Fig. 8 by simulating undirected graphs by using the approach that was described in Section 5.5 for various values of σ and τ. We compare with an Erdős–Rényi random graph, preferential attachment (Barabási and Albert, 1999) and the Bayesian non-parametric network model of Lloyd et al. (2012). The particular features that we explore are as follows.
Fig. 8. Examination of the GGP undirected network properties (averaging over graphs with various α) in comparison with an Erdős–Rényi G(n,p) model with p = 0.05, the preferential attachment model of Barabási and Albert (1999) and the non-parametric formulation of Lloyd et al. (2012): (a) degree distribution on a log–log-scale for various values of σ (σ = 0.2, 0.5, 0.8; τ = 10^−2); (b) degree distribution on a log–log-scale for various values of τ (τ = 10^−1, 1, 5; σ = 0.5); (c) number of nodes with degree 1 versus number of nodes on a log–log-scale (σ = 0.2, 0.5, 0.8; the Lloyd method leads to dense graphs in which no node has only degree 1); (d) number of edges versus number of nodes (σ = 0.2, 0.5, 0.8; growth is at a rate o(n^2) for the GGP graph models and Θ(n^2) for the Erdős–Rényi and Lloyd models (dense graphs))
- a.
Degree distribution: Fig. 8(a) suggests empirically that the model can exhibit power law behaviour, providing a heavy-tailed degree distribution. As shown in Fig. 8(b), the model can also handle an exponential cut-off in the tails of the degree distribution, which is an attractive property (Clauset et al., 2009; Olhede and Wolfe, 2012).
- b.
Number of degree 1 nodes: Fig. 8(c) examines the fraction of degree 1 nodes versus the number of nodes.
- c.
Sparsity: Fig. 8(d) plots the number of edges versus the number of nodes. The larger σ is, the sparser the graph. In particular, for the GGP random-graph model, we have network growth at a rate O(n^a) for 1<a<2, whereas the Erdős–Rényi (dense) graph grows as Θ(n^2).
6.3.4. Interpretation of hyperparameters
On the basis of the properties derived and illustrated empirically in this section, we see that our hyperparameters have the following interpretations.
- a.
σ—from Figs 8(a) and 8(d), σ relates to the slope of the degree distribution in its power law regime and the overall network sparsity. Increasing σ leads to higher power law exponent and sparser networks.
- b.
α—from theorem 5, α provides an overall scale that affects the number of nodes and directed interactions, with larger α leading to larger networks.
- c.
τ—from Fig. 8(b), τ determines the exponential decay of the tails of the degree distribution, with small τ yielding behaviour close to a pure power law. This is intuitive from the form of ρ(dw) in equation (35), where we see that τ affects large weights more than small weights.
7. Posterior characterization and inference
In this section, we consider the posterior characterization and MCMC inference of parameters and hyperparameters in our statistical network models.
Assume that we have observed a set of undirected or directed connections among the nodes with at least one connection. Without loss of generality, we assume that the locations of these nodes are ordered, and we denote their associated sociability parameters accordingly. For simplicity, we overload notation here relative to the unordered nodes of equation (7).
We aim to infer the sociability parameters w i, , for each of the observed nodes. We also aim to infer the sociability parameters of the nodes with no connections (the difference between the set of potential nodes and those with observed interactions). We refer to these as unobserved nodes. Under our framework, the number of such nodes is either finite but unknown or infinite. The observed connections, however, provide information about only the sum of their sociabilities, denoted w *. The node locations of both observed and unobserved nodes are also not likelihood identifiable and are thus ignored. We additionally aim to estimate α and the hyperparameters of the Lévy intensity ρ of the CRM; we write ϕ for the set of hyperparameters. We therefore aim to approximate the posterior for an observed undirected graph and for an observed directed graph. (Formally, this density is with respect to a product measure that has a Dirac mass at 0 for w *, as detailed in Appendix F.)
7.1. Directed multigraph posterior
In theorem 6, we characterize the posterior in the directed multigraph case. This plays a key role in the undirected case that is explored in Section 7.2 as well.
Theorem 6. Let the set of support points of the measure D α be defined as above. Then the posterior is given by the displayed expression, in which the node degrees of the multigraph appear and in which the remaining mass has a probability distribution with Laplace transform (40). Additionally, conditionally on observing an empty graph, equations (41) and (42) hold.
The proof builds on posterior characterizations for normalized CRMs (James, 2002, 2005; Prünster, 2002; Pitman, 2003; James et al., 2009) using the hierarchical construction of expression (29)–(30). See Appendix E.
The conditional distribution of the weights given the observed multigraph does not depend on the locations because we considered a homogeneous CRM. This fact is important since the locations are typically not observed, and our algorithm outlined below does not consider these terms in the inference.
7.2. Markov chain Monte Carlo sampling for generalized gamma process based directed and undirected graphs
For the update of the total mass w * and hyperparameters ϕ, we use a Metropolis–Hastings step. Except in special cases (such as σ = 0), the distribution of w * does not admit a tractable analytical expression. We therefore use a specific proposal for w * based on an exponential tilting of this distribution, which alleviates the need to evaluate its probability density function in the Metropolis–Hastings ratio (see the details in Appendix F). To summarize, the MCMC sampler is defined as follows.
Step 1: update the weights given the rest by using an HMC update.
Step 2: update the total mass w * and hyperparameters ϕ = (α,σ,τ) given the rest by using a Metropolis–Hastings update.
Step 3: (undirected graph) update the latent counts given the rest by using the conditional distribution (44) or a Metropolis–Hastings update.
The computational bottlenecks lie in steps 1 and 3, which scale roughly linearly in the number of nodes and edges respectively, although one can parallelize step 3 over edges. If L is the number of leapfrog steps in the HMC algorithm and n iter the number of MCMC iterations, the overall complexity scales linearly in L and n iter. We show in Section 8 that the algorithm scales well to large networks with hundreds of thousands of nodes and edges. To scale the HMC algorithm to even larger collections of nodes and edges, one could explore the methods of Chen et al. (2014).
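Step 1 can be illustrated with a generic HMC update. This is a sketch only: `log_post` and `grad_log_post` stand in for the model's conditional log-posterior of the log-weights and its gradient, which are not reproduced here.

```python
import numpy as np

def hmc_step(theta, log_post, grad_log_post, eps, L, rng):
    """One Hamiltonian Monte Carlo update with L leapfrog steps of size eps,
    followed by a Metropolis accept/reject correcting the discretization."""
    p = rng.standard_normal(theta.shape)
    h0 = -log_post(theta) + 0.5 * p @ p            # initial Hamiltonian
    th = theta.copy()
    p = p + 0.5 * eps * grad_log_post(th)          # half step for momentum
    for step in range(L):
        th = th + eps * p                          # full step for position
        if step < L - 1:
            p = p + eps * grad_log_post(th)
    p = p + 0.5 * eps * grad_log_post(th)          # final momentum half step
    h1 = -log_post(th) + 0.5 * p @ p
    return th if rng.uniform() < np.exp(h0 - h1) else theta
```

As a sanity check, running this step repeatedly on a standard normal target (log-density −½θᵀθ) produces draws with mean near 0 and variance near 1.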
8. Experiments
8.1. Simulated data
We first study the convergence of the MCMC algorithm on simulated data where the graph is simulated from our model. We simulated a GGP undirected graph with parameters α = 300, σ = 0.5 and τ = 1, which places us in the sparse regime. The sampled graph resulted in 13995 nodes and 76605 edges. We ran three MCMC chains each with 40000 iterations and with different initial values and L = 10 leapfrog steps; the step size of the leapfrog algorithm was adapted during the first 10000 iterations to obtain an acceptance rate of 0.6. Standard deviations of the random-walk Metropolis–Hastings steps for log (τ) and log (1 − σ) were set to 0.02. The computing time for running the three chains successively was 10 min using a MATLAB implementation on a standard computer (central processor unit at 3.10 GHz; four cores). Trace plots of the parameters α, σ, τ and w * are given in Fig. 9. We computed the potential scale reduction factor (Brooks and Gelman, 1998; Gelman et al., 2014) for all 13999 parameters (the 13995 log-sociability parameters together with w *, α, σ and τ) and found a maximum value of 1.01, suggesting convergence of the algorithm. This is quite remarkable as the MCMC sampler actually samples from a target distribution of dimension 13995 + 76605 + 4 = 90604. Posterior credible intervals of the sociability parameters w i of the nodes with highest degrees and log-sociability parameters log (w i) of the nodes with lowest degrees are displayed in Figs 10(a) and 10(b) respectively, showing the ability of the method to recover sociability parameters of both low and high degree nodes accurately.
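The potential scale reduction factor used here can be computed from the chains as follows (a standard Gelman–Rubin sketch for a single scalar parameter, not the authors' exact implementation):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for one scalar parameter; `chains`
    is an (m, n) array of m chains with n post-burn-in draws each."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()       # within-chain variance
    var_hat = (n - 1) / n * W + B / n           # pooled variance estimate
    return np.sqrt(var_hat / W)
```

Values near 1 for every monitored parameter, as reported above, indicate that the chains are mixing over the same distribution.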
Fig. 9. MCMC trace plots of parameters (a) α, (b) σ, (c) τ and (d) w * for a graph generated from a GGP model with parameters α = 300, σ = 0.5 and τ = 1 (three chains, with the true values overlaid)
Fig. 10. 95% posterior intervals of (a) the sociability parameters w i of the 50 nodes with highest degree and (b) the log-sociability parameters log (w i) of the 50 nodes with lowest degree, for a graph generated from a GGP model with parameters α = 300, σ = 0.5 and τ = 1 (true values overlaid)
To show the versatility of the GGP graph model, we now examine our approach when the observed graph is actually generated from an Erdős–Rényi model with n = 1000 and p = 0.01. The generated graph had 1000 nodes and 5058 edges. We ran three MCMC chains with the same specifications as above. In this dense graph regime, the following transformation of our parameters α, σ and τ is more informative: ς1 = −(α/σ)τ^σ, ς2 = −σ/τ and ς3 = −σ/τ^2. When σ<0, ς1 corresponds to the expected number of nodes, ς2 to the mean of the sociability parameters and ς3 to their variance (see Section 6.3). In contrast, the parameters σ and τ are only weakly identifiable in this case. The potential scale reduction factor was computed on these transformed parameters, and its maximum value was 1.01, suggesting convergence.
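The transformation can be applied directly to each posterior sample (a small illustrative helper whose name is hypothetical):

```python
def dense_regime_transform(alpha, sigma, tau):
    """Transformation used in the dense regime (sigma < 0):
    varsigma1 = -(alpha/sigma) * tau**sigma   (expected number of nodes),
    varsigma2 = -sigma/tau                    (mean sociability),
    varsigma3 = -sigma/tau**2                 (sociability variance)."""
    return (-(alpha / sigma) * tau ** sigma, -sigma / tau, -sigma / tau ** 2)
```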
The value of ς1 converges around the true number of nodes and ς2 to the true sociability parameter (constant across nodes for the Erdős–Rényi model), whereas ς3 is close to 0 as the variance over the sociability parameters is very small. The total mass w * is also very close to 0, indicating that there are no nodes with degree 0.
Posterior credible intervals for the nodes with highest and lowest degrees are in Fig. 11, showing that the model can accurately recover sociability parameters of both low and high degree nodes in the dense regime as well.
Fig. 11. 95% posterior intervals of (a) sociability parameters w i of the 50 nodes with highest degree and (b) log-sociability parameters log (w i) of the 50 nodes with lowest degree, for a graph generated from an Erdős–Rényi model with parameters n = 1000 and p = 0.01 (in this case, all nodes have the same true sociability parameter)
8.2. Assessing properties of real world graphs
We now turn to using our methods to assess properties of a collection of real world graphs, including their degree distributions and aspects of sparsity. For the latter, evaluation based on a single finite graph is notoriously challenging as sparsity relates to the asymptotic behaviour of the graph. Measures of sparsity from finite graphs exist but can be costly to implement (Nešetřil and Ossona de Mendez, 2012). On the basis of our GGP-based formulation and associated theoretical results described in Section 6, we consider Pr(σ⩾0|z) as informative of the connectivity structure of the graph since the GGP graph model yields dense graphs for σ<0, and sparse graphs for σ ∈ [0, 1) (see equation (39)). For our analyses, we consider improper priors on the unknown parameters (α,σ,τ). We report Pr(σ⩾0|z) based on the set of observed connections z, which can be directly approximated from the MCMC output. We consider 12 different data sets:
- a.
facebook107—social circles in Facebook (https://snap.stanford.edu/data/egonetsFacebook.html ) (McAuley and Leskovec, 2012);
- b.
polblogs—political blogosphere (February 2005) (http://www.cise.ufl.edu/research/sparse/matrices/Newman/polblogs ) (Adamic and Glance, 2005);
- c.
USairport—US airport connection network in 2010 (http://toreopsahl.com/datasets/ ) (Colizza et al., 2007);
- d.
UCirvine—social network of students at the University of California, Irvine (http://toreopsahl.com/datasets/ ) (Opsahl and Panzarasa, 2009);
- e.
yeast—yeast protein interaction network (http://www.cise.ufl.edu/research/sparse/matrices/Pajek/yeast.html ) (Bu et al., 2003);
- f.
USpower—network of the high-voltage power grid in the western states of the USA (https://snap.stanford.edu/data/emailEnron.html ) (Watts and Strogatz, 1998);
- g.
IMDB—actor collaboration network based on acting in the same movie (http://www.cise.ufl.edu/research/sparse/matrices/Pajek/IMDB.html );
- h.
cond-mat1—co-authorship network (https://snap.stanford.edu/data/emailEnron.html ) (Newman, 2001), based on preprints posted to condensed matter of arXiv between 1995 and 1999, obtained from the bipartite preprints–authors network using a one-mode projection;
- i.
cond-mat2—as in cond-mat1, but using Newman's projection method;
- j.
Enron—Enron collaboration network from a multigraph e-mail network (https://snap.stanford.edu/data/emailEnron.html );
- k.
internet—connectivity of Internet routers (http://www.cise.ufl.edu/research/sparse/matrices/Pajek/internet.html );
- l.
www—linked World Wide Web pages in the nd.edu domain (http://lisgi1.engr.ccny.cuny.edu/~makse/soft_data.html ).
The sizes of the various data sets are given in Table 2 and range from a few hundred nodes or edges to a million. The adjacency matrices for these networks are plotted in Fig. 12 and empirical degree distributions in Fig. 13 (red).
| Data set | Number of nodes | Number of edges | Time (min) | Pr(σ⩾0\|z) | 99% credible interval for σ |
|---|---|---|---|---|---|
| facebook107 | 1034 | 26749 | 1 | 0.00 | [−1.06, −0.82] |
| polblogs | 1224 | 16715 | 1 | 0.00 | [−0.35, −0.20] |
| USairport | 1574 | 17215 | 1 | 1.00 | [0.10, 0.18] |
| UCirvine | 1899 | 13838 | 1 | 0.00 | [−0.14, −0.02] |
| yeast | 2284 | 6646 | 1 | 0.28 | [−0.09, 0.05] |
| USpower | 4941 | 6594 | 1 | 0.00 | [−4.84, −3.19] |
| IMDB | 14752 | 38369 | 2 | 0.00 | [−0.24, −0.17] |
| cond-mat1 | 16264 | 47594 | 2 | 0.00 | [−0.95, −0.84] |
| cond-mat2 | 7883 | 8586 | 1 | 0.00 | [−0.18, −0.02] |
| Enron | 36692 | 183831 | 7 | 1.00 | [0.20, 0.22] |
| internet | 124651 | 193620 | 15 | 0.00 | [−0.20, −0.17] |
| www | 325729 | 1090108 | 132 | 1.00 | [0.26, 0.30] |
Fig. 12. Adjacency matrices for various real world networks: (a) facebook107; (b) polblogs; (c) USairport; (d) UCirvine; (e) yeast; (f) USpower; (g) IMDB; (h) cond-mat1; (i) cond-mat2; (j) Enron; (k) internet; (l) www
Fig. 13. Empirical degree distribution and posterior predictive for various real world networks: (a) facebook107; (b) polblogs; (c) USairport; (d) UCirvine; (e) yeast; (f) USpower; (g) IMDB; (h) cond-mat1; (i) cond-mat2; (j) Enron; (k) internet; (l) www
We ran three MCMC chains for 40000 iterations with the same specifications as above and report the estimate of Pr(σ⩾0|z) and 99% posterior credible intervals of σ in Table 2; we additionally provide run times. MCMC trace plots suggested rapid convergence of the sampler. Since sparsity is an asymptotic property of a graph, and we are analysing finite graphs, our inference of σ here simply provides insight into some structure of the graph and is not formally a test of sparsity. From Table 2, we note that we infer negative σ-values for many of the smaller networks. This might indicate that these graphs have dense connectivity; for example, our facebook107 data set represents a small social circle that is probably highly interconnected and the polblogs data set represents two tightly connected political parties. We infer positive σ-values for three of the data sets (USairport, Enron and www); note that two of these data sets are in the top three largest networks considered, where sparse connectivity is more commonplace. In the remaining large network, internet, a question is why the inferred σ is negative. This may be due to dense subgraphs or spots (for example, spatially proximate routers may be highly interconnected, but sparsely connected outside the group) (Borgs et al., 2014b). This relates to the idea of community structure, though not every node need be associated with a community. As in many sparse network models that assume no dense spots (Bollobás and Riordan, 2009; Wolfe and Olhede, 2013), our approach does not explicitly model such effects. Capturing such structure remains a direction of future research that is likely to be feasible within our generative framework. However, our current method has the benefit of simplicity with three hyperparameters tuning the network properties. 
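The reported Pr(σ⩾0|z) is simply the fraction of posterior draws of σ that are non-negative (an illustrative helper for this Monte Carlo estimate):

```python
import numpy as np

def prob_sigma_nonneg(sigma_samples):
    """Monte Carlo estimate of Pr(sigma >= 0 | z) from posterior draws of
    sigma (post burn-in, possibly pooled across chains)."""
    sigma_samples = np.asarray(sigma_samples, dtype=float)
    return float((sigma_samples >= 0.0).mean())
```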
Finally, we note in Table 2 that our analyses finish in a remarkably short time, even though the code base was implemented in MATLAB on a standard desktop machine, without leveraging possible opportunities for parallelization and other mechanisms for scaling the sampler (see Section 7 for a discussion).
To assess our fit to the empirical degree distributions, we used the methods that were described in Section 5.5 to simulate 5000 graphs from the posterior predictive distribution in each scenario. Fig. 13 provides a comparison between the empirical degree distributions and those based on the simulated graphs. In all cases, we see a reasonably good fit. For the largest networks, Figs 13(j)–13(l), we see a slight underestimate of the tail of the distribution, i.e. we do not capture as many high degree nodes as truly present. This may be because these graphs exhibit a power law behaviour, but only after a certain node degree (Clauset et al., 2009), which is not an effect that is explicitly modelled by our framework. Instead, our model averages the error in the low and high degree nodes. Another reason for underestimating the tails might be dense spots, which we also do not explicitly model. However, our model does capture power law behaviour with possible exponential cut-off in the tail. We see a similar trend for cond-mat1, but not cond-mat2. Based on the bipartite articles–authors graph, cond-mat1 uses the standard one-mode projection and sets a connection between two authors who have co-authored a paper; this projection clearly creates dense spots in the graph. In contrast, cond-mat2 uses Newman's projection method (Newman et al., 2001). This method constructs a weighted undirected graph by counting the number of papers that were co-authored by two scientists, where each count is normalized by the number of authors on the paper. To construct the undirected graph, an edge is created if the weight is equal to or greater than 1; cond-mat1 and cond-mat2 thus have a different number of edges and nodes, as only nodes with at least one connection are considered. 
It is interesting that the projection method that was used for the cond-mat data set has a clear influence on the sparsity of the resulting graph, cond-mat2 being less dense than cond-mat1 (see Figs 13(h) and 13(i)). The degree distribution for cond-mat1 is similar to that of internet, thus inheriting the same issues as previously discussed. Overall, it appears that our model better captures homogeneous power law behaviour with possible exponential cut-off in the tails than it does a graph with perhaps structured dense spots or power law after a point behaviour.
9. Conclusion
We proposed a class of statistical network models building on exchangeable random measures. Using this representation, we showed how it is possible to specify models with properties that are different from those of models based on exchangeable adjacency matrices. As an example, we considered a model building on the framework of CRMs that can yield sparse graphs while maintaining attractive exchangeability properties. For a choice of CRMs, our fully generative formulation can yield networks ranging from dense to sparse, as tuned by a single hyperparameter.
In this paper, exchangeability is in the context of random measures for which we appealed to the Kallenberg representation in place of the Aldous–Hoover theorem for exchangeable arrays. Using this framework, we arrived at a structure that is analogous to the graphon, which opens up new modelling and theoretical analysis possibilities beyond those of the special case that is considered herein. Importantly, through the exchangeability of the underlying random measures and leveraging HMC sampling, we devised a scalable algorithm for posterior computations. This scheme enables inference of the graph parameters, including the parameter determining the sparsity of the graph. We examined our methods on a range of real world networks, demonstrating that our model yields a practically useful statistical tool for network analysis.
We believe that the foundational modelling tools and theoretical results that we presented represent an important building block for future developments. Such developments can be divided along two dimensions:
- a.
modelling advances, such as incorporating notions of community structure and node attributes, within this framework and
- b.
theoretical analyses looking at the properties of the corresponding class of networks.
For the latter, we considered just one simplified version of the Kallenberg representation; examining a more general form could yield graphs with additional structure. Building on an initial version of this paper (Caron and Fox, 2014), initial forays into advances on the modelling side can be found in Herlau et al. (2016) and Todeschini and Caron (2016) and theoretical analyses in Veitch and Roy (2015) and Borgs et al. (2016).
Discussion on the paper by Caron and Fox
Ginestra Bianconi (Queen Mary University of London)
I am delighted and honoured to open the discussion on this paper which provides an ideal starting point to reflect on the benefits provided by the interdisciplinary character of network science.
From brain research to social networks, ‘big data’ in the form of networks are permeating the sciences and our everyday life. There is therefore an urgent need to develop new methods and techniques to extract information from network data. Network science has developed as the fast-growing interdisciplinary field that addresses this problem, with the aim of obtaining predictive models for social, physical and biological phenomena occurring in networks.
The success of network science (Barabási, 2016; Newman, 2010; Dorogovtsev and Mendes, 2002) is rooted in the following two characteristics of the field:
- a.
the ubiquitous presence of networks describing complex interacting systems in social, technological and biological contexts;
- b.
the ability of the field to adopt methods and techniques coming from different theoretical disciplines such as statistical mechanics, graph theory and statistical network modelling.
Although the first aspect was shown, already at the beginning of the field, to be essential for the characterization of the universal properties of networks (Barabási and Albert, 1999; Watts and Strogatz, 1998), more recently it has become clear that network science can lead to a comprehensive understanding of network phenomena only if the methods and techniques that are used to study networks reflect different theoretical perspectives.
Initially, statistical mechanics was the most prosperous and successful approach to network modelling. In this framework we distinguish between a non-equilibrium approach of growing network models evolving through the preferential attachment rule (Dorogovtsev and Mendes, 2002; Barabási and Albert, 1999) and equilibrium approaches characterizing network ensembles enforcing either hard constraints (such as the configuration model) or soft constraints (such as exponential random graphs) (Park and Newman, 2004; Anand and Bianconi, 2009, 2010).
Most statistical mechanics models describe the regime in which the average degree 〈k〉 does not depend on the number of nodes N, i.e. 〈k〉 = O(1). Particular focus has been given to scale-free networks in this regime, having degree distribution P(k) ≃ Ck^−γ for k≫1 and a power law exponent in the range γ ∈ (2, 3]. The particular emphasis given to the regime in which the average degree is constant with increasing network size is justified by the fact that the vast majority of network data, from the Internet to actor collaboration networks, belong to this class of networks (Barabási, 2016).
However, there is evidence that several network data sets, from social on-line networks (Seyed-Allaei et al., 2006) to neuroscience (Bonifazi et al., 2009), have a power law degree distribution P(k) ≃ Ck^−γ with γ ∈ (1, 2], implying very heterogeneous degrees, very significant hubs and a diverging average degree. Despite the recent interest in these classes of networks (Seyed-Allaei et al., 2006; Lambiotte et al., 2016; Timár et al., 2016), historically these networks have been disregarded or neglected in the statistical mechanics community (Barabási, 2016; Del Genio et al., 2011).
From the statistical perspective, we have recently been witnessing a renaissance of network modelling (Goldenberg et al., 2010) emphasizing the relevance of having exchangeable and projective network models. Exchangeable models guarantee that the order in which nodes are sampled is irrelevant. Projective models guarantee that a network inference performed on a sample of N nodes can be used to infer properties of a sample obtained by including more nodes. Interestingly, the statistical mechanics models, including non-equilibrium growing network models and network ensembles, have neither of these two properties.
Exchangeable projective models generated by a jointly exchangeable adjacency matrix are described by the graphon and generate dense networks in which the average degree increases linearly with the network size, i.e. 〈k〉 = O(N). Given a jointly exchangeable adjacency matrix, the Aldous–Hoover representation implies the existence of a graphon, and hence that the network is dense, 〈k〉 = O(N). Since the vast majority of real network data sets are sparse, with an average degree increasing sublinearly with the number of nodes, one of the major problems of the field was to overcome this limitation of the graphon.
In their paper Caron and Fox formulate for the first time a generative, exchangeable and projective model for sparse networks.
The major step in overcoming the limitation of the graphon has been to consider a point process on the plane instead of an exchangeable matrix. Thanks to the Kallenberg representation theorem for random measurable functions, this model admits a representation as a mixture of random functions that naturally extends the graphon to the sparse regime.
Caron and Fox's model, based on the assumption of the existence of a latent space (the sociability of a node), achieves the following three major results.
- a. The model generates either dense networks with 〈k〉 = O(N) or sparse networks with 〈k〉 = O(N^θ) and θ ∈ (0, 1). Therefore the model constitutes a significant advance in exchangeable and projective statistical network modelling.
- b. The model generates scale-free networks with diverging average degree and power law exponent γ ∈ (1, 2]. Therefore the model shows that, in the framework of statistical network modelling, these networks emerge naturally, enabling us to describe networks that have been historically neglected in the statistical mechanics approach to networks.
- c. The model enables an efficient method for inference on network data. The analysed network data include data sets of sizes up to N = 300000 nodes. It should be noted that the code is freely available from the authors' Web page.
In conclusion, Caron and Fox's paper opens a new scenario in statistical network modelling, enabling the treatment of exchangeable and projective sparse network models. Additionally, this work is a beautiful example of the benefits that can be obtained from an interdisciplinary approach to network science.
Many prospects can be envisaged for future research. These include the extension to more general latent spaces (Todeschini and Caron, 2016), the proposal of new network exchangeable models to preserve privacy (Borgs et al., 2015) and the investigation of the regime 〈k〉 = O(1) for exchangeable network models that would finally clarify whether it is possible to have a global framework to study both networks with constant average degree and diverging average degree.
It therefore gives me great pleasure to propose the vote of thanks.
Karthik Bharath (University of Nottingham)
I applaud Caron and Fox's efforts in the paper for developing a novel approach to represent and model networks based on point processes, which since its first version in 2014 has inspired several new forays into the literature on graph limits. In a certain sense the authors' approach is reminiscent of the objective method (Aldous and Steele, 2004) for combinatorial objects, and continuum limits of discrete objects, wherein the chief interest is the study of local properties of the graphs rather than global properties like sparsity. However, the utility of such probabilistically interesting continuous objects as statistical models for network or graph data is suspect; this was recently explored in Bharath et al. (2017) in the context of tree-structured data (acyclic graphs) using the continuum random tree (Aldous, 1993). My own view is that statistical models that aim to capture and quantify variability in characteristics of real world networks need necessarily be compatible with their inherently discrete nature, and the fundamental statistical notion of sampling from a population network (e.g. adding or deleting vertices).
I am hence sceptical of the conceptual underpinning of their approach from a purely modelling perspective: the appropriateness of the notion of exchangeability, which is the leitmotiv of the paper, of point processes (within the Kallenberg framework) for network models is unclear. Depending on the question of interest, the basic statistical datum on a network is identified (vertex, edge, triangle or other motifs), and exchangeability is then defined as (probabilistic) invariance to their labelling, which is usually arbitrary. The authors' continuous space representation of a network as a point process on the plane implies that the distribution of the number of points that fall within rectangles A_i × A_j, where A_i := [h(i−1), hi], h > 0, is invariant to relabelling of the rectangles formed by intervals. This begs the question: what exactly is exchangeable in an observed network with vertices and edges represented by the measure-valued process {Z_α, α > 0}? The answer becomes trickier when isolated vertices are deleted, as is done in the paper. In fact the sequence of networks {Z_α, α > 0} is unlabelled for the observed atoms! A similar issue also arises for the desirable notion of projectivity under subsampling: the projectivity of the restriction of the law to [0, α_2]^2 is not compatible with operations such as deletion and addition of vertices or edges, and I am unsure whether a sensible analogous interpretation is available within the point process setting. In this context, recently developed notions of edge and relational exchangeability appear particularly cogent (Crane and Dempsey, 2015, 2016).
Posterior samples of w_i could conceivably be used in estimating the stationary distribution, which then allows for probabilistic statements about such extreme events. Assessing power law behaviour through the double-logarithm plot is shown to be dubious in Clauset et al. (2009); surprisingly, the authors use this method of assessment despite referring to Clauset et al. (2009). Can the overall model fit be assessed through a posterior predictive check?
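Clauset et al. (2009) recommend estimating the exponent by maximum likelihood rather than by regression on the double-logarithm plot. A minimal sketch of their continuous-case estimator; the synthetic Pareto data and the choice x_min = 1 are illustrative:

```python
import numpy as np

def powerlaw_mle(x, x_min):
    """Continuous power law maximum likelihood estimator (Clauset et al., 2009):
    gamma_hat = 1 + n / sum(log(x_i / x_min)), over observations x_i >= x_min."""
    tail = np.asarray(x, dtype=float)
    tail = tail[tail >= x_min]
    return 1.0 + len(tail) / np.sum(np.log(tail / x_min))

# Sanity check on synthetic Pareto data with known exponent gamma = 2.5
rng = np.random.default_rng(0)
u = rng.uniform(size=20000)
x = (1.0 - u) ** (-1.0 / (2.5 - 1.0))  # inverse-CDF sampling with x_min = 1
print(powerlaw_mle(x, 1.0))            # close to 2.5
```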
The interpretation (and identifiability) of α in Z_α within the context of an observed network is not clear, since α is the upper bound for the ; this is exacerbated in the sparse regime when α → ∞. The interpretation of the credible intervals for the σ-parameter (and hence the question of sparsity) is also unclear, since the parameter α is unobserved and is estimated. It might be interesting to explore the structural properties of the graphs if the completely random measure W is allowed to be inhomogeneous, such that ν(dw, dθ) = ρ(dw, θ)λ(dθ); this, however, brings additional computational and interpretability issues. Despite the elegant theoretical foundation, the point process model for a finite network is perhaps ‘too rich’: whereas the Lévy measure ρ and the jumps w_i are profitably used, the latent parameters (whose natural ordering on is discarded in the interests of exchangeability) and α throw up some uncomfortable questions on model interpretability and identifiability.
The vote of thanks was passed by acclamation.
Wilfrid S. Kendall (University of Warwick, Coventry)
Caron and Fox are to be congratulated on producing an inspiring and elegant blend of the creative probabilistic theory of exchangeability and the practical demands of statistical network analysis. I am impressed by the way in which practical considerations of tunability guide them towards exchangeable and completely random measures and theorems of Kallenberg and Aldous–Hoover.
Here I share recent work on rather different random networks, having a pronounced spatial element. Aldous (Aldous and Ganesan, 2013; Aldous, 2014) proposed an axiomatization of what one might call a scale invariant random spatial network (SIRSN). Motivated by on-line maps, one postulates a random entity producing a unique set of connections N(x_1,…,x_n) between any prescribed set of points x_1,…,x_n on the plane: N(x_1,…,x_n) is composed of routes r(x_i, x_j) between each pair of points x_i and x_j. It is required that the network be statistically invariant under translation, rotation and scaling, that the mean distance of a route between two specified points be finite and, finally, that, if one considers a network connecting all points of an (independent) Poisson point process of intensity λ, then the ‘long distance’ parts of the routes (say, r(x_i, x_j)∖ball(x_i, 1)∖ball(x_j, 1)) should form a fibre process of finite mean length per unit area. (This is the ‘weak SIRSN property’; the ‘strong SIRSN property’ requires the bound on mean length per unit area to hold uniformly in λ.) In effect, long distance routes are reused by many different point pairs.
It is a highly non-trivial matter to prove that SIRSNs can exist, but we now have two different examples. Aldous (2014) produced a construction based on randomized grids. Kahn (2016) and Kendall (2017) put together arguments which showed that an SIRSN can be constructed using Poisson line processes—improper scale invariant line processes marked by speeds, with routes formed using fastest possible paths in the network. Remarkably, the Poisson line process construction also works in 3-space and beyond.
It is salutary to compare this with the work of Caron and Fox. SIRSNs provide toy models for real world spatial transportation networks, and a theoretical justification for the helpful notion of ‘transit nodes’ (Bast et al., 2007). But, although there is typically a single scalar parameter γ expressing the scale invariance, current SIRSN models cannot be said to be very tunable, nor to accommodate possibilities for data modelling; a clear challenge for future work.
Benjamin Bloem-Reddy (Columbia University, New York)
Research in this area is progressing rapidly, and there are connections to some concurrent work. See also Janson (2017a), section 5.1, for related ideas. For every model based on an exchangeable random measure (ERM) (see also Veitch and Roy (2015, 2016), Borgs et al. (2016) and Janson (2016)), we can construct a so-called edge exchangeable model (Crane and Dempsey, 2015, 2016; Williamson, 2016; Cai et al., 2016; Janson, 2017a) that coincides with a finite restriction of the ERM model. The converse is true only in some cases (see Cai et al. (2016) for an example). I consider the completely random measure (CRM) case for concreteness.
The inequality reflects a fundamental difference in how the two model classes encode the notion of growth and offers guidance for choosing an appropriate model class. Edge exchangeable models posit that graphs grow one edge at a time; growth in ERM models is by a random number of edges as α increases. As a consequence, in an edge exchangeable graph growing to arbitrary size, an edge may occur between two vertices that were previously not connected. This is not true for ERM models, which require that new edges form only between new vertices, inducing a growing population of edges from which to sample. Conversely, edge exchangeable models sample from a fixed (possibly infinite) population.
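The edge-by-edge growth that edge exchangeable models posit can be sketched with a fixed population of vertex weights; the finite gamma-distributed weights and the sizes below are purely illustrative, not any of the specific models discussed here:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed (finite, for illustration) population of vertex weights; an edge
# exchangeable graph draws every edge IID from this fixed population.
w = rng.gamma(shape=0.5, scale=1.0, size=1000)
p = w / w.sum()

def grow_edge_by_edge(m):
    """Sample m edges IID, each endpoint drawn with probability
    proportional to w; a vertex 'appears' the first time it is drawn."""
    edges = [tuple(rng.choice(len(p), size=2, p=p)) for _ in range(m)]
    vertices = {v for e in edges for v in e}
    return edges, vertices

edges, vertices = grow_edge_by_edge(200)
print(len(edges), len(vertices))  # 200 edges; at most 400 distinct vertices
```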
Patrick Rubin-Delanchy (University of Bristol)
This is a very creative and thought-provoking paper on network modelling, opening many avenues for future research. With this second point in mind, it would be useful to set down formally what this new definition of graph exchangeability should be in general: is an arbitrary model for an infinite graph (z ij)i,j = 1, 2… exchangeable in this sense if there is a coupling so that, marginally, z follows the specified model and () is a unit rate Poisson process, and, jointly, the point process is exchangeable?
Some popular approaches to network modelling rely on latent positions (X_i), i = 1, 2,…, for each of the nodes, where, conditionally on X_i, we have z_ij ∼ ind. Bernoulli{f(X_i, X_j)} for each i and j, for some function f. Under the usual infinite exchangeability assumption, the Aldous–Hoover theorem suggests a Bayesian treatment of f, with the X_i independent uniform positions on [0, 1]. However, Hoff et al. (2002) assumed that f(X_i, X_j) is logistic in the distance ‖X_i − X_j‖ (ignoring covariates), with the X_i independent and identically distributed (IID) on (this being part of the prior rather than the model), and the random dot product graph (Nickel, 2006; Young and Scheinerman, 2007; Athreya et al., 2016) assumes that , with the X_i IID on a convex set . The stochastic block model (Holland et al., 1983) and the mixed membership stochastic block model (Airoldi et al., 2008) can also be written as latent position models. In the present paper, such models were mentioned as possible extensions where, if I understand correctly, the could represent these latent positions. Instead, I think of the w_i as latent positions, and indeed in a follow-up paper these parameters live in higher dimensions (Todeschini and Caron, 2016). What I expect in general is that the assumption of IID latent positions, present in each of the earlier models, must be dispensed with; instead the X_i must now form a more complicated point process on (or whichever space is appropriate), just as the w_i are not IID but instead form a more complex point process on the positive line.
It is then interesting to ask how estimates of these positions are affected, starting with the simplest question of whether even the w_i in this paper are estimated consistently. But, to give a more interesting example, for the random dot product graph, Athreya et al. (2016) have shown that the estimation error under adjacency spectral embedding goes at the usual rate n^−1/2 and is asymptotically Gaussian, but here the IID assumption on the X_i is important. If the graph is ‘sparsified’ using the ideas of this paper, how are these spectral estimates affected?
The following contributions were received in writing after the meeting.
Julyan Arbel (Inria Grenoble Rhône-Alpes)
The paper by Caron and Fox is a very fine methodological work which illustrates once again the huge modelling flexibility and versatility of discrete Bayesian non-parametric priors. They target here sparsity in graphs, the level of which can be neatly assessed according to the stability parameter σ of the discrete process under consideration.
In recent research (Arbel et al., 2017), we introduce conditions on the Lévy intensity of the completely random measure to ensure consistent estimation in species sampling problems. Though the setting is quite different, our conditions are similar to the tail assumptions made by Caron and Fox in the derivation of and asymptotic behaviours. Admittedly, the assumption of a true generative model with a given fixed level of sparsity is an idealized one, which cannot account for real world graph oddities such as the local effects underlined by the authors: dense subgraphs (spots) and community structure. However, we believe that consistency properties could shed some light on why σ was estimated to be negative for most of the real world applications in the paper, thus indicating dense graphs.
Sayantan Banerjee (Indian Institute of Management Indore) and Subhashis Ghosal (North Carolina State University, Raleigh)
Caron and Fox are to be congratulated for an interesting construction of random graphs through completely random measures on the real line, and also for extending the ideas to directed multigraphs through Poisson processes with intensity measures given by the product of completely random measures on the real line. Sociability parameters are used to generate edges, which are further generated through completely random measures. The most interesting aspect of such constructions is the dichotomy of density or sparsity, depending on finite or infinite mass of the intensity measure of the completely random measure driving the construction. In particular, the example of the generalized gamma process is fascinating since the whole spectrum of activity can be captured through a single tuning parameter. In this context, it will be interesting to characterize the normalized graph Laplacian and its eigenvalues, which give valuable information on the number of connected components. Using some appropriate graph partitioning algorithm (see, for example, Von Luxburg (2007)), it is possible to detect different communities in the graph. It will be interesting to study the probabilistic behaviour of the communities thus obtained from a random graph constructed by the authors. This may facilitate scale-dependent community detection in graphs, where the different scales refer to the local or global neighbourhood of the individual nodes of the graph. For example, in the case of a Swiss roll manifold generated from Gaussian mixtures, some scales can recover the underlying Gaussian components and some scales can detect communities which respect the underlying geometry of the manifold.
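The link between Laplacian eigenvalues and connected components can be illustrated directly: the number of (near-)zero eigenvalues equals the number of components. For simplicity this sketch uses the combinatorial Laplacian L = D − A, which shares this property with the normalized Laplacian; the example graph is illustrative:

```python
import numpy as np

def components_via_laplacian(A):
    """Count connected components as the number of (near-)zero eigenvalues
    of the combinatorial graph Laplacian L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    eigenvalues = np.linalg.eigvalsh(L)
    return int(np.sum(eigenvalues < 1e-9))

# Two disjoint triangles: exactly 2 connected components
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0
print(components_via_laplacian(A))  # 2
```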
From a more general perspective, the model proposed by the authors seems to be an interesting way of selecting a particular graphon model P(z_ij = 1) = g(U_i, U_j) for the probability of an edge through latent variables U_1, U_2,…. Setting a particular point as a reference point, tying the graphon to zero at it and making the latent variables concentrate near this point should induce sparsity. Can this general graphon model be obtained through a transformation on the plane from the model proposed by the authors? If so, can this connection be used to characterize the sparsity-generating mechanism in the general graphon model?
Another question potentially of interest and topic of further study is the limiting distribution of normalized numbers of edges. A normal approximation in the dense case and a Poisson approximation in the sparse case are expected.
Marco Battiston (Oxford University) and Stefano Favaro (University of Turin and Collegio Carlo Alberto, Turin)
To our knowledge, the only attempt to study an ε-node private mechanism is in Chayes et al. (2015). They considered sparse graphs obtained by rescaling a dense graphon with the network size, and they proposed a mechanism that releases as output a step graphon that satisfies condition (78). We believe that an interesting line of research would be to study how privacy constraints could be handled within the sparse graphs setting proposed by Caron and Fox. Specifically, is ε-node privacy a good measure of disclosure for graphs, or are better notions needed in the sparse regime? How do we construct mechanisms satisfying these privacy notions? Will the privatized network data set that is obtained by this mechanism preserve enough statistical utility? As pointed out in Narayanan and Shmatikov (2009), sparsity facilitates disclosure and at the same time makes statistical inference more difficult. Therefore, on the one hand, we might need quite a stringent notion of privacy for sparse graphs, but on the other hand this requirement may drastically affect the statistical utility of the released network data set. As a consequence, a clear trade-off between privacy guarantees and statistical utility arises, particularly in sparse settings. Natural questions are how to formalize this trade-off mathematically and then how to solve it.
C. Borgs and J. T. Chayes (Microsoft Research)
This paper is an important contribution to statistical network modelling, giving a framework to formulate exchangeability for sparse graphs by embedding them in random measures on . It has already inspired several other works, including ours with H. Cohn and N. Holden (Borgs et al., 2016).
We put the Caron–Fox model into a broader context of graphon processes as follows. First, rather than viewing w_i as a ‘sociability’ parameter of node i, we think of it as a feature in some σ-finite space Ω (e.g. R^d with a locally finite measure, a discrete but not necessarily finite space, or a combination of the two). Second, rather than a product function, we consider an arbitrary two-variable function (graphon) W: Ω × Ω → [0, 1] over the feature space.
We construct a time-dependent family (G t)t⩾0 of graphs by first generating a Poisson point process with intensity tρ, where ρ is an arbitrary σ-finite measure on the feature space, then connecting two points with features w i and w j with probability W(w i,w j) and finally deleting points unconnected up to time t.
If we choose the feature space to be R_+ and W(w, w′) = 1 − exp(−2ww′), we obtain the Caron–Fox model. If we take the space to be R_+ × [k], and W the product of the Caron–Fox graphon with a k × k matrix, we obtain the block model version of Herlau et al. (2016). Taking it to be R_+ times the k-simplex, we obtain a sparse, degree-corrected mixed membership model, etc. Hence, our work represents a substantial extension of the modelling aspect of the authors' work.
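The graphon-process construction described above can be sketched directly. Truncating the σ-finite feature space to a bounded interval with Lebesgue intensity is an approximation assumed purely for simulation (the actual Caron–Fox model uses the Lévy intensity of a CRM), and the parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def caron_fox_graphon(w1, w2):
    """The Caron-Fox choice W(w, w') = 1 - exp(-2 w w')."""
    return 1.0 - np.exp(-2.0 * w1 * w2)

def sample_graphon_process(t, w_max=5.0):
    """Graphon process at time t: Poisson points on a truncated feature
    space, Bernoulli edges via the graphon, isolated points deleted."""
    n = rng.poisson(t * w_max)           # Poisson(t * rho([0, w_max]))
    w = rng.uniform(0.0, w_max, size=n)  # features of the generated points
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if rng.uniform() < caron_fox_graphon(w[i], w[j]):
                A[i, j] = A[j, i] = 1.0
    connected = A.sum(axis=1) > 0        # delete points unconnected up to time t
    return A[np.ix_(connected, connected)]

G = sample_graphon_process(t=10.0)
print(G.shape[0], int(G.sum() // 2))  # vertices and edges after deletion
```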
Our work also sheds light on the theoretical questions alluded to by the authors. First, it generalizes both their model and the standard, dense exchangeable model. Second, it completes the Caron–Fox picture by giving an if and only if characterization of exchangeability, whereas previous work (except for the simultaneous independent work of Veitch and Roy (2015)) gave only an if statement. Explicitly, labelling edges by the birth times of their end points, we obtain an exchangeable random measure on from an arbitrary graphon process. More significantly, we prove the only if statement: under mild decay conditions, all such measures can be obtained from graphon processes, extending the classic Aldous–Hoover theory for dense graphs to graphon processes.
Alexandre Bouchard-Côté and Creagh Briercliffe (University of British Columbia, Vancouver)
First, we congratulate Caron and Fox for this impressive contribution, which is already starting to have significant influence in the field. In what follows, we discuss some computational aspects of posterior inference.
In the general case, the authors describe that the computational bottlenecks of posterior inference lie in updating the weights w i and the latent counts . Although there are more latent count variables than weight variables, sampling the former can be trivially parallelized, so, for sufficiently sparse graphs, we expect the weight updates to dominate.
The authors used Hamiltonian Monte Carlo within Gibbs sampling to perform the joint sampling of w 1:Nα given the rest of the variables, which seems a reasonable approach validated by solid experimental results. At the same time, given the prevalence of graph data of increasing size, we believe it is productive also to keep the door open to alternatives. In particular, the target log-density has a simple structure in the generalized gamma process case, so it is tempting to ask whether this structure can be leveraged.
We investigate one potential direction to do so by using two variants of the bouncy particle sampler (BPS) algorithm (Bouchard-Côté et al., 2017). Since the local variant of the BPS has been shown to outperform state of the art Hamiltonian Monte Carlo algorithms in sparse factor graphs, this appears a priori an appropriate tool for inference in a sparse network model. Surprisingly, however, the factor graph induced by sampling the w_i's is in fact fully connected. We therefore resort to the global variant of the BPS algorithm.
We use the same reparameterization as the authors and compute collision times via superposition and adaptive thinning. We run the global BPS algorithm and Stan (Stan Development Team, 2016), each on a simulated graph of nodes, with the other parameters fixed to the values used to generate the data (w* = 6.5427, τ = 1 and σ = 0.5). We observe that the latter outperformed the former (3.2 and 5.8 effective samples per second respectively). This is consistent with previous observations that BPS algorithms are most attractive when they can exploit sparsity (Bouchard-Côté et al., 2017).
Perhaps sparsity of the inference problem might reappear in more complex models that build on the work of Caron and Fox. For example, it might be interesting to modify the follow-up work of Todeschini and Caron (2016) to allow some community-specific weights to be 0. Alternative constructions designed to accommodate covariates or node attributes might also potentially create additional sparsity structure.
Trevor Campbell and Tamara Broderick (Massachusetts Institute of Technology, Cambridge)
We congratulate Caron and Fox on their paper, which has already inspired further theory (Veitch and Roy, 2015; Borgs et al., 2016; Palla et al., 2016), an alternative class of models using independent and identically distributed data sampling of edges from a fixed random measure (Cai et al., 2016; Crane and Dempsey, 2015, 2016; Campbell et al., 2016a; Janson, 2017a), and models for dynamic networks (Palla et al., 2016), link prediction (Williamson, 2016), and block structure learning (Herlau et al., 2016).
The authors focus on the case where ‘vertex sociabilities’ are generated by the jumps of a Poisson process on . They marginalize the latent Poisson process to devise an urn scheme and thereby a tractable Markov chain Monte Carlo sampler. However, in previous work, practitioners have often found it useful to represent the latent process explicitly: it is required by modern methods such as variational inference and Hamiltonian Monte Carlo sampling (Blei and Jordan, 2006; Neal, 2011); it can make inference simpler or tractable for hierarchical models; and it often facilitates parallel computing. The authors’ results, though, imply that there is a challenge to instantiating the Poisson process in the case of sparse graphs. In particular, they show that, if a graph sequence is sparse, the Poisson process must have a countable infinity of atoms in any restriction window [0,α] (theorem 2). But we cannot store infinitely many values in memory, or update infinitely many values in finite time, and therefore we must use an approximation.
One approximation involves replacing the full Poisson process with only finitely many jumps. The authors suggest weight-based thresholding (Muliere and Tardella, 1998; Argiento et al., 2016) where jumps x ∈ [0,ε) are removed, but this requires dealing with non-standard truncated probability distributions on . We highlight another option: to truncate a sequential representation of the Poisson process (Campbell et al., 2016b). This technique has the advantage that it typically involves only well-known exponential family distributions and thereby allows variational algorithms with simple closed form updates (Blei and Jordan, 2006; Doshi-Velez et al., 2009). The approximation error of truncated sequential representations has recently been thoroughly characterized (Campbell et al., 2016b). However, these results do not immediately extend to the present network model, as it involves the product of a Poisson process with itself. Nonetheless, we conjecture that variational inference based on a truncated sequential representation would enjoy similar benefits for the network model as for previous applications in Bayesian non-parametrics.
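The weight-based thresholding option can be sketched for a GGP-type Lévy intensity ρ(w) = w^(−1−σ) e^(−τw)/Γ(1−σ): jumps below ε are discarded, leaving a Poisson number of jumps with finite mean. The grid-based inverse-CDF sampling, the upper cut-off w_max and the parameter values below are all illustrative assumptions, not the truncation schemes cited in the text:

```python
import numpy as np
from math import gamma as gamma_fn

rng = np.random.default_rng(3)

def ggp_jumps_above(eps, alpha, sigma=0.5, tau=1.0, w_max=50.0, grid=100000):
    """Simulate GGP jumps of size >= eps on [0, alpha]: a Poisson number of
    jumps with mean alpha * integral of rho over [eps, w_max], with jump
    sizes drawn by a discretized inverse-CDF of the normalized tail of rho."""
    w = np.linspace(eps, w_max, grid)
    rho = w ** (-1.0 - sigma) * np.exp(-tau * w) / gamma_fn(1.0 - sigma)
    cdf = np.cumsum(rho)
    mass = cdf[-1] * (w[1] - w[0])   # Riemann approximation of the integral
    n = rng.poisson(alpha * mass)    # number of jumps above the threshold
    u = rng.uniform(size=n)
    return np.interp(u, cdf / cdf[-1], w)  # inverse-CDF on the grid

jumps = ggp_jumps_above(eps=0.01, alpha=100.0)
print(len(jumps), jumps.min())  # finitely many jumps, all of size >= eps
```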
Roberto Casarin (University Ca’ Foscari of Venice), Matteo Iacopini (University Ca’ Foscari of Venice and Université Paris 1—Panthéon-Sorbonne) and Luca Rossini (University Ca’ Foscari of Venice and Free University of Bozen-Bolzano)
Caron and Fox are to be congratulated on their excellent research, which has culminated in the development of a new class of random-graph models. The node degree and the degree distribution fail to give a unique characterization of network complexity (Estrada, 2010). For this reason, global connectivity measures, such as communicability (Estrada and Hatano, 2008, 2009) and centrality (Borgatti and Everett, 2006), are used to analyse a graph. In this discussion we contribute to the analysis of the generalized gamma process (GGP) model compared with the Erdős–Rényi and the preferential attachment (Barabási and Albert, 1999) models. Our analysis is far from exhaustive but shows that more theoretical aspects of the GGP model remain to be investigated.
Finally, given the partition of the network into two non-overlapping subgraphs (core and periphery) that maximizes the number or weight of within-core-group edges, we compute the share of nodes in the core.
According to Fig. 15(a), the GGP matches the preferential attachment model and performs slightly worse than the Erdős–Rényi random graph in terms of the number of connected components. Figs 15(b) and 15(c) highlight that the clustering structure of the GGP does not vary much with σ. The clustering coefficient is in line with the two benchmarks, while the assortativity of the Erdős–Rényi model is not attained. For σ = 0.5, 0.8, the GGP exhibits a lower share of nodes in the core (Fig. 15(d)) than the benchmarks, and mimics the preferential attachment model for σ = 0.
Fig. 15. Network statistics versus number of nodes for the GGP undirected network (), the Erdős–Rényi model (▴) and the preferential attachment model of Barabási and Albert (1999) (▾): (a) number of connected components; (b) clustering coefficient; (c) assortativity coefficient; (d) share of nodes in the core
Overall, the GGP can replicate typical behaviours of real world sparse networks and some fundamental features of random graphs generated from the preferential attachment model, making it suitable for a variety of applications in different fields.
We are very pleased to thank the authors for their work.
I. Castillo and T. Rebafka (Université Pierre et Marie Curie—Paris 6)
The random-graph model proposed by Caron and Fox has some remarkable properties, among others the possibility to combine sparsity and exchangeability and the scalability of computations within the associated Bayesian framework. In our view, their work raises some interesting questions on statistical inference. Adopting a frequentist approach, what can be said about posterior convergence assuming the data have been generated under a ‘true’ parameter?
Consider the behaviour of Π[·|Z] under two frequentist settings:
- a. well specified, where for an unknown fixed , and
- b. possibly misspecified, where for an arbitrary graph distribution Q.
Stochastic block model with equiproportions and compatible link function, true ψ 1 = 0.3938 and ψ 2 = 0.1026: mean lengths of 95% credible or confidence intervals and bias of posterior mean and frequentist estimates of ψ 1 and ψ 2 over 120, 120, 90 and 60 simulated graphs
Number of nodes n | ψ_1: credible interval length | ψ_1: posterior mean bias | ψ_1: confidence interval length | ψ_1: frequentist bias | ψ_2: credible interval length | ψ_2: posterior mean bias | ψ_2: confidence interval length | ψ_2: frequentist bias
---|---|---|---|---|---|---|---|---
30 | 0.0794 | −0.0025 | 0.0816 | −0.0012 | 0.0502 | −0.0058 | 0.0485 | −0.0053
50 | 0.0470 | −0.0011 | 0.0486 | −0.0010 | 0.0302 | −0.0043 | 0.0322 | −0.0037
100 | 0.0234 | 0.0003 | 0.0234 | 0.0001 | 0.0153 | −0.0018 | 0.0149 | −0.0013
300 | 0.0078 | −0.0003 | 0.0078 | −0.0004 | 0.0051 | −0.0007 | 0.0054 | −0.0005

95% credible intervals (―) for log(w_i) and true values (•) in a graph with 100 nodes (nodes are ordered by increasing values of log(w_i))
It would be interesting to determine which aspects of w (e.g. real-valued functionals, or the complete vector w) can be estimated at minimax rate by using priors as in the paper. Another question would be to extend the model and priors to cover arbitrary link functions, and possibly more general forms of the law of w.
Suman Chakraborty (University of North Carolina at Chapel Hill)
It seems that theorems 2 and 3 are the main advantage of this modelling approach. Roughly speaking, they enable us to produce graphs with the desired level of sparsity by tuning a single parameter. This property can be appealing in applications. Also, the sparsity parameter might be useful for obtaining scalings of important functions of the graph, such as the number of triangles. Projectivity is also a salient feature of the model, as it can be used to draw inference on the unobserved part of the population. It would be interesting to analyse how much information is necessary to make valid inference about the whole network.
The form of exchangeability of the model is cleverly devised and enabled the authors to use known representations of Kallenberg (1990), which is a crucial component of the paper. It would be helpful for the reader to have a few more comments from the authors on the statistical interpretation and scope of the notion of exchangeability used here. Finally, as suggested by Fig. 8(a), the power-law-like tail behaviour of the empirical degree distribution, the projectivity and the tunable sparsity parameter make the model attractive for both theoretical and practical applications.
Harry Crane (Rutgers University, Piscataway)
Caron and Fox tout their proposal as
‘the first fully generative and projective approach to sparse graph modelling [ …] with a notion of exchangeability that is essential for devising our scalable statistical estimation procedure’
(at the end of Section 4; emphasis added). In calling theirs the first such approach, they brush aside prior work of Barabási and Albert (1999), whose model is also generative and projective, and produces sparse graphs. The Barabási–Albert model is not exchangeable, but nor is the authors’. And, although the Barabási–Albert model is inadequate for most statistical purposes, the model proposed is not obviously superior, especially with respect to the criteria highlighted above.
Generative
Though the model is amenable to simulation, the obscure connection between Kallenberg's theory of exchangeable completely random measures and the manner in which real networks form makes it difficult to glean much practical insight from this model. At least Barabási and Albert's preferential attachment mechanism offers a clear explanation for how sparsity and power laws might arise in nature. I elicit no such clarity from the Caron–Fox model.
Projective
Projectivity is important for relating the observed network to the unobserved population and is therefore crucial in applications for which inferences extend beyond the sample. Without a credible sampling interpretation, however, the statistical salience of projectivity is moot. Here projectivity involves restricting point processes on ℝ₊² to bounded rectangles [0,α]², whose best-known interpretation via p-sampling (Veitch and Roy, 2016) seems unnatural for most conceivable networks applications, including those in Section 8.
Exchangeability and sparsity
A certain nonchalance about whether and how this model actually models real networks betrays an attitude that sees sparsity as an end in itself and exchangeability as a means to that end. Even the authors acknowledge that ‘exchangeability of the point process [ …] does not imply exchangeability of the associated adjacency matrix’ (on the third page). So why is there all the fuss about exchangeability if its primary role here is purely utilitarian? To me, the pervasiveness of ‘exchangeability’ throughout the paper is but a head fake for unsuspecting statisticians who, unlike many modern machine learners, understand that exchangeability is far more than a computational contrivance.
Final comment
The Society's Research Section is all too familiar with the Crane–Dempsey edge exchangeable framework, which meets the above four criteria while staying true to its intended application of interaction networks. For brevity I refer the reader to Crane and Dempsey (2015, 2016) for further discussion.
Daniele Durante (University of Padova)
There is increasing interest in modelling network data under a Bayesian non-parametric perspective, and the paper from Caron and Fox provides an appealing contribution in this direction. In fact, as they effectively demonstrate, many Bayesian non-parametric representations not explicitly motivated by network data can indeed provide powerful methods for studying relevant patterns underlying real world connectivity structures.
Consistent with this focus, the growing demand for more flexible and realistic statistical models for networks is a main motivation underlying the stochastic point process representation of Caron and Fox, implying less restrictive exchangeability assumptions. Their model is carefully parameterized to provide flexible inference on sparsity and the degree distribution, which represent the quantities of main interest. However, a natural question when combining Bayesian non-parametric representations and complex data is whether the resulting Bayesian formulation induces large support priors for a wider set of expected network features—e.g. clustering coefficient and average shortest path length. These are relevant measures in characterizing other realistic properties, such as small world architectures, and an improved characterization of prior support may provide insights on which properties the model can flexibly incorporate.
Looking at this problem from a broader perspective, Caron and Fox consider settings in which a single network is observed. In such cases the information is clearly not sufficient to learn highly flexible representations, and although restricting the focus to a specific class of distributions—by incorporating reasonable structure—is a clever approach, it does not protect from misspecification issues. For instance the novel exchangeability assumption proposed by the authors effectively improves flexibility in characterizing sparsity but could substantially restrict the support on other relevant classes of distributions. Therefore, looking at the model from an applied perspective, it is fundamental to understand the extent to which their Bayesian non-parametric formulation can be viewed as truly non-parametric. Indeed, without prior support about the true distribution of the random graph, the posterior can never concentrate around the truth, thereby affecting quality of inference.
A possible answer to this issue is to generalize the model from Caron and Fox to obtain a fully flexible characterization of the entire distribution for random graphs. Motivated by the availability of replicated networks, Durante et al. (2017) addressed this goal with a focus on latent space models (e.g. Hoff et al. (2002)), and I am excited to see how the inspiring contribution from Caron and Fox can open new avenues also in this direction.
Chao Gao (University of Chicago)
I congratulate Caron and Fox for their results that give a solid foundation for exchangeable network analysis. This paper has already inspired some important subsequent works (Veitch and Roy, 2015; Borgs et al., 2016; Crane and Dempsey, 2016; Broderick and Cai, 2015). My discussion will focus on the potential influence on the development of methods and algorithms under this new framework.
Within-sample inference
This is the same likelihood function as in Karrer and Newman (2011) under the node exchangeable framework (or its non-exchangeable sparse version with a scaling parameter (Bickel and Chen, 2009)). Another important problem is community detection. An analogous block model was proposed by Herlau et al. (2016) under the framework of exchangeable random measures. Again, given (θ_[1:n], w_[1:n], n), the conditional likelihood of Herlau et al. (2016) is the same as that of Karrer and Newman (2011) for degree-corrected block models. The two examples show that, for within-sample inference, existing methods, algorithms and even theorems in the framework of Bickel and Chen (2009) can generally be applied to the new framework of Caron and Fox without change, at least from a frequentist perspective.
Out-of-sample inference
Given the discussion in the previous paragraph, I believe that the development of new methods and algorithms should be emphasized on out-of-sample inference. The exchangeability framework of Caron and Fox makes rigorous inference on the population or future observations possible for the first time in sparse networks, for which they should be congratulated again. After all, the models of Caron and Fox and later extensions (Herlau et al., 2016; Todeschini and Caron, 2016) are defined on the entire population, whereas the degree-corrected block model (Karrer and Newman, 2011) is defined only on a finite set of nodes.
Subgraph densities
Recently, it has been shown by Gao and Lafferty (2017) that statistics based on small subgraph frequencies can lead to optimal testing procedures between Erdős–Rényi models and stochastic block models. Empirical studies also illustrate that structures of social networks can be reflected by statistics of small subgraph frequencies (Ugander et al., 2013). It is not very clear to me how analogous ideas can be developed in the framework of exchangeable random measures. For example, let T be the total number of triangles in a given network Z_α. To define the empirical triangle frequency, should T be normalized by , by or by α³? What is a proper definition of the population triangle density? Inference for population subgraph densities is out of sample and thus deserves future investigation under the framework of Caron and Fox.
Jim E. Griffin and Fabrizio Leisen (University of Kent, Canterbury)
We believe that this paper will become a cornerstone of network modelling as well as a natural application of Bayesian non-parametric techniques.
Lancelot F. James (Hong Kong University of Science and Technology) and Creighton K. Heaukulani (Hong Kong)
Svante Janson (Uppsala University)
I find the random-graph model introduced here by Caron and Fox very interesting. Apart from its potential use in applications, it has novel and interesting mathematical properties. Moreover, it has been an inspiration for important generalizations developed after the first version of the present paper, in particular by Borgs et al. (2016) and Veitch and Roy (2015).
The relationship with Kallenberg's characterization of exchangeable random measures is interesting, and presumably useful in further developments of the theory, but I would like to stress that, for the content of the present paper, Kallenberg's highly technical theorem may serve as a (possibly important) inspiration for the model, but it is not needed for the formal construction of the model and the study of its properties.
Furthermore, the basic construction can be stated in several, equivalent, ways. I prefer to see the basic construction in the paper as follows, including generalizations by Borgs et al. (2016) and Veitch and Roy (2015). Let (S, μ) be a σ-finite measure space, and let F(x, y) be a fixed symmetric measurable function S × S → [0, 1]. Generate a random countable set of points (w_i, θ_i) in S × ℝ₊ by taking a Poisson point process in S × ℝ₊ with intensity μ × λ, where λ is Lebesgue measure. Regard the θ_i as (labels of) vertices, and add an edge θ_iθ_j with probability F(w_i, w_j), independently for all pairs with i⩽j. (Finally, eliminate isolated vertices.) The version in the present paper constructs the w_i by a completely random measure, which is equivalent to choosing S = ℝ₊ with μ the Lévy measure; furthermore, F is chosen as F(x, y) = 1 − exp(−2xy) (for x≠y). Kallenberg's theorem yields the same random graphs by a canonical choice S = ℝ₊ with Lebesgue measure, but a different F; see Section 5.1. Other choices of F yield generalizations of the model. Other choices of (S, μ) yield the same random graphs but are sometimes useful, so it seems convenient to allow an arbitrary choice and not to fix it in advance.
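Janson's construction above is easy to simulate once the Lévy measure is truncated to finite activity. The sketch below (our illustration, not code from the paper; it substitutes the finite-activity measure ρ(dw) = c·e^(−w) dw for the paper's infinite-activity generalized gamma process, and all function names are ours) draws the Poisson point process restricted to [0, α], links pairs with the paper's F(x, y) = 1 − exp(−2xy) and eliminates isolated vertices:

```python
import math
import random

def poisson_sample(mu, rng):
    # Poisson(mu) draw via exponential interarrival times
    t, k = 0.0, 0
    while True:
        t += rng.expovariate(1.0)
        if t > mu:
            return k
        k += 1

def sample_graph(alpha, c=2.0, seed=0):
    """Sketch of the construction restricted to [0, alpha].

    Assumption: the Levy measure is the finite-activity
    rho(dw) = c * exp(-w) dw, so the Poisson process on
    [0, alpha] x R+ has Poisson(alpha * c) atoms, each with
    weight w_i ~ Exp(1).  The paper's GGP is infinite activity
    and would need a truncation or an exact sampler instead.
    """
    rng = random.Random(seed)
    n = poisson_sample(alpha * c, rng)
    w = [rng.expovariate(1.0) for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i, n):  # pairs with i <= j, loops included
            # link function used in the paper: F(x, y) = 1 - exp(-2xy)
            if rng.random() < 1.0 - math.exp(-2.0 * w[i] * w[j]):
                edges.add((i, j))
    # eliminate isolated vertices: keep only endpoints of edges
    active = sorted({v for e in edges for v in e})
    return active, edges
```

Increasing α grows the graph by random increments of both vertices and edges, in line with the projectivity discussed throughout this section.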
Finally, in connection with theorems 3 and 5, note that, if E{N_α^(e)} → ∞, then N_α^(e)/E{N_α^(e)} → 1 almost surely as α→∞. This follows easily because the loops can be ignored and, if Ñ_α denotes the number of non-loop edges and the edges are defined by the events U_ij ⩽ F(w_i, w_j) for an independent and identically distributed array (U_ij)_{i⩽j}, then Ñ_α/E(Ñ_α) is a reverse martingale with respect to the σ-fields generated by (Ñ_β)_{β⩾α}.
Kuldeep Kumar (Bond University, Gold Coast)
One of the important contributions of this paper is the development of a scalable algorithm which enables inference of the graph parameters determining the sparsity of the graph. For a specific choice of completely random measure, the graphs can be tuned from dense to sparse on the basis of a single parameter. One problem that is inherent in these kinds of algorithm is the choice of the parameters (α, δ, γ). There have been many developments in machine learning algorithms, specifically tree-based methods such as random forests and stochastic gradient boosting (TreeNet). Have Caron and Fox compared their algorithm with approaches that rely on tree decomposition, or with the pebble game algorithm developed by Lee and Streinu (2008)?
Antonio Lijoi (Bocconi University, Milan), Ramsés H. Mena (Universidad Nacional Autónoma de México, Mexico City) and Igor Prünster (Bocconi University, Milan)
We congratulate Caron and Fox for proposing a clever construction of random graphs for networks, which enables us to achieve sparsity in an effective way. This represents another successful instance of Bayesian non-parametric modelling based on completely random measures (CRMs).
Among several potential developments, of particular interest is the extension to a multisample context with data recorded from two or more networks. Assuming that the data are still conditionally independent but not identically distributed, a natural problem is the derivation of testing procedures to verify whether the probability distributions, or some of their features, are shared across samples. In the generalized gamma process (GGP) case, the parameter σ plays a pivotal role: it not only controls the sparsity of the graph, but it also influences posterior clustering of the data. See Lijoi et al. (2007). Hence, homogeneity between two networks, directed by independent GGPs, may be assessed by testing for equality of their respective σ-parameters. Along these lines, Lijoi et al. (2008) defined a Bayesian test on the discount parameter of the Pitman–Yor process, which plays the same role as σ for the GGP.
Alternatively, network comparisons can be approached by assuming a richer model with dependent CRMs (W_{1,α}, W_{2,α}) accommodating a wide spectrum of dependence structures across networks, ranging from exchangeability (i.e. distributional homogeneity) to unconditional independence. In this framework, we may test whether the two distributions themselves are equal. Recently, unrelated to network applications, Bacallado et al. (2015) have addressed the issue within a parametric model and provided an insightful discussion of the notion of approximate exchangeability. A natural model that may serve the purpose is based on ideas in Rodríguez et al. (2008), where the nested Dirichlet process is introduced for clustering probability curves. Similarly, this approach may be useful for clustering networks on the basis of distributional similarity. Here, however, we need a model that can handle CRMs, and a potentially fruitful approach is proposed in Camerlenghi (2015). Let q̃ be a discrete random probability measure on the space of boundedly finite measures on ℝ₊, with q_0 the probability distribution of a CRM on ℝ₊, and let W_1 and W_2 be independent draws from q̃. Discreteness of q̃ implies that the two draws coincide with positive probability, which in turn yields W_1 = W_2. This corresponds to similarity of the networks as they have the same distribution.
Jorge Mateu (University Jaume I, Castellón) and Matthias Eckardt (Humboldt-Universität zu Berlin)
Caron and Fox are to be congratulated on a valuable contribution and thought-provoking paper on sparse random graphs which examines the exchangeability for a continuous space representation of networks. In particular, they propose a new framework to study random-graph models based on the notion of modelling the graph as an exchangeable random measure, and appealing to a Kallenberg representation for such exchangeable random measures in connection with planar point processes. For a regularly shaped square lattice, the authors define a point process to be exchangeable if the counted number of edges per grid site is exchangeable for any arbitrary square lattice. This is a very timely topic bridging random-graph models to the field of spatial point processes. Our discussion focuses on the linkage of random-graph structures to the spatial domain.
For spatial point processes, we consider a realization of a random sequence of locations on a complete separable metric space equipped with a suitable Borel σ-algebra such that the cardinality of locations that fall in a given area contained in the observation window can be expressed by means of a random measure. This counting measure can refer to a purely spatial point process or, in a more sophisticated form, to a marked spatial point process, e.g. a multivariate spatial point pattern. For such data, one is interested in the structural exploration and the detection or extraction of the characteristics and features within and between distinct sets of events. Although a large body of literature on the analysis of spatial point patterns exists, applications of random-graph models still remain very limited. One example of such random graphs for spatial point patterns is neighbouring networks, which include random structures in the calculation of planar point pattern characteristics (see, for example, Marchette (2004)). See also some recent approaches in Eckardt and Mateu (2017a, b, c).
We note that the concept of exchangeability is quite close to quadrat counting and the definition of complete spatial randomness in point processes. The authors comment on counting the number of edges per grid site, whether they are directed or not (see Section 3.2). We have several versions of counting or sampling in point processes that do have a clear counterpart in the corresponding edge counting process in graphs. Network data are becoming increasingly available and connections between graph models and planar point processes are a welcome line of research, together with extensions to spatiotemporal planar processes.
Matthew Moores and David Firth (University of Warwick, Coventry, and Alan Turing Institute, London)
Caron and Fox have achieved a major breakthrough in computationally tractable inference for random graphs with hundreds of thousands of nodes. This will enable simulation and Bayesian analysis of data sets that were previously infeasible. The scalability of their approach is due to the representation of the discrete graph structure using a latent continuous model, the generalized gamma process. The sociability parameters w_i of each node can be estimated by using Hamiltonian Monte Carlo sampling, since the gradient of the conditional log-posterior is available in closed form. Further improvements in scalability might be achieved by taking advantage of parallelism, as well as by rewriting portions of the code in a compiled language.
The pair potentials assume independence between in-degree and out-degree, even in the case of directed multigraphs. This implies a very different generative process from models of adversarial networks, where nodes compete with each other for edges. Examples include the Bradley–Terry model of citation networks (Varin et al., 2016) or the Plackett–Luce model of ranking data (Gormley and Murphy, 2006). A non-parametric Plackett–Luce model has recently been proposed by Caron et al. (2014). The contour plot in Fig. 17 compares the distribution of w_i for the World Wide Web data set (Albert et al., 1999) with the PageRank of each node (Brin and Page, 1998). PageRank was computed by using the igraph package (Csárdi and Nepusz, 2006; Kolaczyk and Csárdi, 2017) for R (R Core Team, 2017). The difference in emphasis between these two approaches might explain why the tail of the degree distribution was underestimated for this data set, due to violation of the model assumptions. Nevertheless, the authors have demonstrated close agreement between the posterior predictive and the empirical degree distributions of a variety of real world graphs, both dense and sparse.

Logarithmic contour plot of PageRank against the sociability parameters w_i for the World Wide Web data set, with marginal densities
A curious aspect of the generalized gamma process is the parameter w_*, which represents the left-over sociability once all of the nodes with degree 1 or greater have been accounted for. This parameter is only partially identified, since it conflates degree 0 nodes with the potentially infinite number of nodes that might enter the network at a future time. Nodes with degree 0 do not contribute to the likelihood in any way, but it seems feasible to sample sociability parameters for these nodes from the prior, as with finite network models (Caron and Doucet, 2012). The parameter w_* could be viewed as an upper bound on the sociability of these nodes, particularly in the case where the network is assumed to be a closed system.
Yang Ni and Peter Müller (University of Texas at Austin)
We congratulate Caron and Fox on a very interesting paper. Our discussion highlights a particular use of the proposed models that we felt was missing in the paper. Implicit in the paper is an assumption that (part of) the random graph D (or a derived undirected graph Z) is observable. Although this is common for social network data, it is less common in biomedical inference where the goal is often to infer an unknown latent network structure.
We illustrate our suggestion with a small simulation study and an application. Both are based on directed cyclical graphs (Ni et al., 2016), a special case of reciprocal graphical models (Koster, 1996). The directed cyclic graph allows inference on a directed graph G, possibly including cycles, by setting up a simultaneous equation model and interpreting a directed edge (l,i) in the graph G as an indicator for a non-zero coefficient of y_l in the equation for y_i. In this context we explore the use of a generalized gamma process (GGP) prior p(D|ϕ), including a mapping of the multigraph D to a directed graph G.
Table 4 reports summaries for a simulation study with four alternative priors p(D|ϕ). The GGP wins. Fig. 18 shows the estimated graphs in inference for a gene network reported in Ni et al. (2016) under the sparsity prior used there (thresholded prior) versus the new GGP prior. Also under the realistic conditions of this data analysis the choice of prior matters. Importantly, implementation of posterior inference was quite straightforward, as described in the paper.
Simulation study: true positive rate TPR and false discovery rate FDR under four prior models: Erdős–Rényi graph with p = 0.5 (ER); Erdős–Rényi graph with p ∼ Be(0.5, 0.5) (ER+beta); GGP; thresholded prior (TP)

 | ER | ER+beta | GGP | TP
---|---|---|---|---
TPR | 1.00 | 1.00 | 0.87 | 0.87
FDR | 0.58 | 0.72 | 0.07 | 0.18

Estimated gene networks for TCGA colon cancer data under (a) the GGP prior and (b) the thresholded prior (Ni et al., 2016): the posterior expected false discovery rate is controlled to be less than 10% for both estimations
Peter Orbanz (Columbia University, New York)
Caron and Fox have set out to address misspecification problems of graphon models, described in Orbanz and Roy (2015). Are they solved? Exchangeable graphs are dense and ‘amorphous’ collections of edges. Caron and Fox's non-compact modification makes them sparse, but still amorphous, and I would hence argue it addresses one symptom of a deeper problem. Nonetheless, a complete solution would be much to ask for, and I believe that Caron and Fox have made a very valuable contribution.
From this perspective, misspecification of exchangeable graphs is a consequence of selecting vertices independently of the edge set. To obtain more structure, sampling can ‘follow edges’ (e.g. extract neighbourhoods), but finding concisely represented models then becomes much harder. The subsampling implicit in Caron and Fox's model is site percolation (Veitch and Roy, 2016): each vertex is selected or deselected independently with fixed probability, which can yield sparsity, but still selects vertices ignoring edges.
If it remains to be seen how widely applicable the proposed models are in actual network analysis problems, I see the merit of this work elsewhere: Veitch and Roy (2016) is a part of a beautiful body of work (Veitch and Roy, 2015; Borgs et al., 2016; Janson, 2016) expanding Caron and Fox's non-compact modification into a fully fledged generalization of graphon theory. Caron and Fox have introduced an idea that has enriched the landscape of network theory, and I wholeheartedly congratulate them.
Konstantina Palla and Xenia Miscouridou (University of Oxford)
We congratulate Caron and Fox for this excellent work on sparse graphs. It is of great interest to the whole community in statistical machine learning (and beyond)!
The authors have proposed a novel statistical model for networks that builds on exchangeable random measures. Their construction accounts for sparsity, an extremely important property in real world scenarios. In a graph context, a graph is defined as dense if the number of edges grows quadratically with the number of nodes, and sparse if it scales subquadratically. In the sparse regime, two nodes chosen at random are then unlikely to be connected. Whereas many real world networks are believed to be sparse (Newman, 2009), most of the popular Bayesian models used in network analysis account for dense graphs, i.e. the number of edges grows quadratically in the number of nodes. The fundamental reason for this misspecification is the classical representation of the graph as a random exchangeable array, i.e. the adjacency matrix (Orbanz and Roy, 2015). Exchangeability in the graphs domain has historically been defined as distributional invariance to permutation of the order in which the nodes appear, i.e. relabelling the nodes does not change the distribution of the graph; this is known as vertex exchangeability. However, as a corollary of the Aldous–Hoover theorem (Aldous, 1981; Hoover, 1979), exchangeable random arrays are dense or empty and thus not appropriate for most real applications. In an attempt to account for sparse graphs, several models have been proposed, but with undesirable properties: these models give up either exchangeability (Barabási and Albert, 1999) or projectivity (Bollobás et al., 2007; Bollobás and O'Riordan, 2009; Wolfe and Olhede, 2013; Borgs et al., 2014). In this excellent work, the authors propose a model that represents graphs as infinite point processes on ℝ₊², giving rise to a class of random graphs ranging from dense to sparse, as tuned by a single parameter.
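The dense/sparse dichotomy described above can be checked numerically on a growing graph sequence. The helper below (illustrative only; the function name is ours) estimates the exponent b in e ∝ n^b by a least-squares fit on the log-log scale, so that b ≈ 2 flags a dense sequence and b < 2 a sparse one:

```python
import math

def edge_growth_exponent(sizes, edge_counts):
    """Least-squares slope of log(edges) against log(nodes).

    A slope near 2 indicates a dense graph sequence (edges
    growing quadratically in the number of nodes); a slope
    below 2 indicates sparsity.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in edge_counts]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

For complete graphs (e = n(n−1)/2) the estimated slope is close to 2, whereas a sequence with e = n^1.5 gives exactly 1.5.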
For the associated notion of exchangeability of point processes, Kallenberg (2005), chapter 9, provides a representation theorem as the continuous space counterpart of the Aldous–Hoover theorem, enabling a strong theoretical foundation.
The work proposed by François Caron and Emily Fox has stimulated exciting research: the framework is being used as the building block for a series of interesting extensions. One example is incorporating structure in the network in terms of community affiliations of the nodes (Todeschini and Caron, 2016) or accounting for node attributes. Also of interest is a dynamic version of the model, where the node parameters might evolve over time (Palla et al., 2016).
Daniel M. Roy and Victor Veitch (University of Toronto)
Caron and Fox have made several fundamental contributions, leading to significant progress in our understanding of statistical network modelling. In particular, their work has inspired the development of a complete generalization of the dense exchangeable (graphon) framework to a framework that can also capture sparse graphs (Veitch and Roy, 2015; Borgs et al., 2016; Veitch and Roy, 2016; Janson, 2017).
The new framework makes use of Caron and Fox's identification of random graphs and random measures, and the associated notion of exchangeability. An application of Kallenberg's representation theorem for exchangeable measures on the plane implies that the distribution of every extremal sparse exchangeable graph admits a simple representation in terms of a structure we call a graphex, by analogy with the graphon underlying dense exchangeable graphs.
The essential character of the dense exchangeable model is preserved: edges are formed independently of one another, conditionally on latent variables associated with each vertex. Whereas graphons are associated with probability spaces (and independent and identically distributed sequences of latent representations), graphexes are associated with σ-finite measure spaces (and Poisson point processes of latent representations). Dense (graphon) models can be seen to correspond to graphexes where the measure space is finite. Another key difference is the appropriate notion of size: nested dense graph sequences can be indexed by the integer number of vertices, whereas the number of vertices in nested sparse graph sequences grows by random increments.
The analysis by Caron and Fox and in follow-up work by other researchers shows that the sparse exchangeable model can capture a wide range of the empirical phenomena observed in real world networks but outside the purview of the dense exchangeable model. As dense graphon models are in routine use in graph limit theory, statistical network analysis and machine learning, sparse exchangeable models yield novel extensions and generalizations of practical interest. Further, the completely random-measure approach introduced by Caron and Fox gives a practical method for defining sparse exchangeable models, which has already been used to extend important model classes (Herlau et al., 2016; Todeschini and Caron, 2016).
Victor Veitch and Daniel M. Roy (University of Toronto)
Caron and Fox identify random graphs with random measures and use this construction to define sparse exchangeability. This notion of exchangeability is central both to their model and to the generalizations it has inspired (Veitch and Roy, 2015; Borgs et al., 2016). A different perspective on the random-measure construction and the associated notion of sparse exchangeability helps to clarify the strengths and limitations of the new modelling approach, and its relationship to exchangeable dense graphs.
Probabilistic symmetries are closely connected to sampling design (Orbanz, 2017). Sparse exchangeability can be understood in terms of p-sampling, a subgraph sampling scheme whereby one produces a subgraph from a larger graph by including each vertex independently with probability p>0, and dropping any isolated vertices in the induced subgraph (Veitch and Roy, 2016). For every β>α>0, if ξ_β is a size-β graph generated by a graphex W, then an α/β-sampling of ξ_β is equal in distribution to a size-α graph ξ_α generated by the same graphex W. By contrast, for dense exchangeable graphs, if G_n is an n-vertex random graph generated by a graphon W, then a random subgraph induced by sampling k vertices at random from G_n is equal in distribution to a k-vertex graph generated by W.
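A minimal sketch of p-sampling for a finite graph stored as an edge list might look as follows (our naming and representation, not Veitch and Roy's code):

```python
import random

def p_sample(edges, p, rng=None):
    """p-sampling of a graph given as a list of (u, v) edges:
    keep each vertex independently with probability p, take the
    induced subgraph, and drop isolated vertices (which here
    happens implicitly, since only the surviving edges are
    returned)."""
    rng = rng or random.Random()
    vertices = {v for edge in edges for v in edge}
    kept = {v for v in vertices if rng.random() < p}
    return [(u, v) for (u, v) in edges if u in kept and v in kept]
```

With p = 1 the whole graph survives and with p = 0 nothing does; intermediate values return a random induced subgraph of non-isolated vertices.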
The sampling perspective highlights a key distinction between the sparse and dense exchangeable models: the former excludes isolated vertices. This is often not a problem in practice, so the flexibility of the Caron–Fox model and its descendants seems to come at little cost.
Exchangeability has considerable utility: p-sampling yields a graph analogue of the empirical measure, which has been shown to be a consistent non-parametric estimator for the graphex parameter underlying the sparse exchangeable graphs. This estimator is a powerful tool, and its construction relies critically on sparse exchangeability.
The sampling perspective also highlights a weakness of the sparse exchangeable approach. Independent vertex sampling schemes, including p-sampling, are rarely good models for the data collection process underlying real world network data. The sparse exchangeable models thus retain a key pathology of the dense exchangeable models and, accordingly, seem unlikely to resolve fully the practical problems that motivated Caron and Fox. However, the dense exchangeable models have proven to be very useful in practice, so there is reason to believe that the applicability of the sparse exchangeable model goes beyond its literal remit. The sparse exchangeable model constitutes a significant advancement, and we expect that its development will point the way towards further novel attacks on the core problems.
Priyantha Wijayatunga (Umea University)
This can be generalized for continuous variables; however, with the Hellinger distance the required normalizing constant is 1. One can therefore use a measure similar to that proposed in Granger et al. (2004), where f represents the respective probability density functions. In these cases the aim is to measure the strength of dependences accurately; it would therefore be interesting to see how network models constructed from such measures differ from those obtained with traditional linear measures of dependence.
Mingyuan Zhou (University of Texas at Austin)
In addition to controlling, for sparse graphs, how the number of edges increases as a function of the number of nodes, another topic worth further investigation is how to introduce structured sparsity patterns into the graph adjacency matrices: for example, dense on-diagonal but sparse off-diagonal blocks, sparse on-diagonal but dense off-diagonal blocks, or a mixture of both. It would be interesting to find out whether the new notion of exchangeability can be maintained while achieving these network modelling goals.
The authors replied later, in writing, as follows.
We thank the discussants for the very interesting and thought-provoking comments. As recalled by Bianconi, one of the successes of network science—besides the ubiquitous presence of network data—lies in ‘the ability of the field to adopt methods and techniques coming from different theoretical disciplines’. We certainly agree with this statement. The very diverse panel of discussants for this paper, with research interests in statistical mechanics, graph theory, applied probability, mathematical statistics, Bayesian statistics and machine learning, demonstrates this and provides a wide range of insights on our paper.
Follow-up theoretical work and generalizations of the graphon
Borgs et al. (2016) considered a framework where M: S × S → [0, 1] with S some potentially high dimensional feature space—not necessarily ℝ+—and (θ_i, ϑ_i) the points of a Poisson point process on ℝ+ × S with mean measure dθ μ(dϑ). Although both constructions lead to the same family of random graphs, it may be more natural, as noted by Janson, and Borgs and Chayes, to work with the higher dimensional feature space, as this may lead to more interpretable representations.
More structured models based on exchangeable random measures
In this paper, we have introduced the general framework of representing random graphs by exchangeable random measures. Whereas the graphon framework involves a transformation of uniform random variables, hence leading to models with random vectors as building blocks, our framework involves transformations of unit rate Poisson processes, hence leading to models with multivariate point processes or completely random measures as building blocks. We focused on a particular class of models within this framework, defined by equations (6)–(8). We showed that it is possible to capture both sparse and dense graphs with interpretable parameters and scalable inference algorithms.
The class of models that we considered is quite simple, with a single parameter w_i tuning the sociability of a node i. Although this model can capture sparsity and heavy-tailed degree distributions, it cannot capture more local structure, such as latent communities. As shown in Section 5.1, our considered model is just a particular example within Kallenberg's representation theorem, which can be used as a recipe to construct more structured graphs, multigraphs or weighted graphs based on exchangeable random measures. As discussed by Griffin and Leisen, Palla and Miscouridou, Roy and Veitch, and Rubin-Delanchy, more structured models have already been proposed within this framework, including a sparse stochastic block model (Herlau et al., 2016), a mixed membership stochastic block model (Borgs et al., 2016), and a generalization of the model defined by equations (6)–(8) in this paper to capture overlapping communities (Todeschini and Caron, 2016). Zhou also suggests exploring other stochastic block models with structured sparsity properties.
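As a rough illustration of the simple model discussed above, the following sketch links nodes i<j with probability 1 − exp(−2 w_i w_j) given their sociability weights. In the actual model the weights are the jumps of a completely random measure; here an arbitrary finite list of weights stands in for a truncated simulation (an assumption for illustration only).

```python
import math
import random

def sample_graph(weights, seed=None):
    """Sample an undirected simple graph from per-node sociability
    weights w_i, linking each pair i<j independently with probability
    1 - exp(-2 * w_i * w_j).

    In the full model the weights are the jumps of a CRM on [0, alpha];
    a fixed list stands in for a truncated simulation here.
    """
    rng = random.Random(seed)
    n = len(weights)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < 1.0 - math.exp(-2.0 * weights[i] * weights[j]):
                edges.append((i, j))
    return edges
```

Nodes with large weights connect to almost everything, whereas weights near zero yield near-isolated nodes, which is how a heavy-tailed weight distribution translates into a heavy-tailed degree distribution.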
Besides the above work on more structured models for simple graphs, a few discussants mentioned other possible extensions. Durante, and Lijoi, Mena and Prünster suggest models for collections of networks, building on dependent completely random measures. James and Heaukulani suggest an extension of the model to deal with hypergraphs, building on the Indian buffet process. Palla and Miscouridou consider extensions to dynamic networks. Finally, Bharath and Rubin-Delanchy suggest the use of inhomogeneous completely random measures to deal with covariates. These ideas represent interesting directions to explore and demonstrate how our framework provides a foundation on which it is possible to expand in many ways.
Sampling and projectivity
Many discussants commented on the projectivity properties and sampling mechanism associated with our network process. In our setting, each node i is associated with an arrival time θ_i, and the model is invariant to any measure preserving transformation of time. Our construction is therefore exchangeable in this sense. For some α>0, we consider only interactions between nodes that arrived before time α; therefore, α tunes the sample size of the graph.
The notion of sampling that is naturally associated with this class of graph processes has been defined in the follow-up work of Veitch and Roy (2016). As recalled in the comments of Veitch and Roy, and Orbanz: for a given graph, sample a subgraph by choosing each vertex with probability p ∈ (0, 1); then discard the isolated nodes. If we consider an exchangeable graph of size α, then p-sampling gives a subgraph with the same distribution, but of size pα. As mentioned by Bharath, Veitch and Roy, Orbanz and Crane, this sampling scheme highlights a structural weakness of the exchangeable graph process approach: independently sampling vertices may appear unrealistic as a network sampling strategy in some situations.
Nonetheless, as recalled by Veitch and Roy, the (dense) graphon models share the same limitations but have proven to be extremely useful in a large number of applications. An important advantage of the approach, which we emphasize in this paper, is that the exchangeability property enables the design of scalable inference algorithms. We believe that this is a key aspect for the method to be widely applied.
Applications and extensions
We provide a simple illustration of our approach in Section 8.2 by assessing the sparsity of the graph. As pointed out by Bharath, this applied analysis is rather limited, notably for brevity, and because the focus of the paper is on the theoretical developments of the exchangeable graph process framework and its associated Markov chain Monte Carlo (MCMC) sampling. The point of this section is primarily to demonstrate that the algorithms developed can be applied to quite large real world graphs. We investigate the fit of the model by looking at posterior predictive degree distributions with the goal of showing that the model provides a reasonable fit, not to test whether or not the real networks have power law degree distributions as in Clauset et al. (2009).
As mentioned earlier in this rejoinder, more structured models have been used to uncover latent structure and for link prediction (Herlau et al., 2016; Todeschini and Caron, 2016). A few discussants mentioned other interesting areas in which the framework proposed can be applied. Ni and Müller suggest using the model as a prior for structure learning in probabilistic graphical models, i.e. our graphs encode the latent conditional independence statements that one aims to infer from a set of observations. Bharath suggests an application to extreme event detection, whereas Battiston and Favaro discuss an application to privacy disclosure. We believe that there is a vast set of possible applications where our defined statistical network model—and extensions thereof—will prove useful.
Related approaches
A few discussants have mentioned connections with various related network models. Bianconi and Crane discuss the connection of our approach with the Barabási–Albert (BA) preferential attachment model, which is a generative model that can capture sparsity and power law degree distributions. As mentioned by Bianconi, the BA model can achieve asymptotic power law degree distributions with exponent γ ∈ (2, 3]. Follow-up work (Caron and Rousseau, 2017) showed that one can achieve asymptotic power law degree distributions with exponent γ ∈ (1, 2) in our framework. Crane questions the usefulness of the proposed approach compared with the BA model.
- a. As mentioned earlier, this paper is part of a line of work that generalizes graphon models to sparse graphs and therefore offers a rich framework for building sparse graphs. Many references have already built on our work to construct more structured models (Herlau et al., 2016; Borgs et al., 2016; Todeschini and Caron, 2016). Although some extensions of the BA model to capture community structure exist, they are not straightforward to define and do not fit in a unified framework. Therefore, as mentioned by Crane, this has limited their use as statistical models.
- b. In the (non-exchangeable) BA model, one needs to know the ordering of the nodes to make inference on the parameters of the model. If this ordering is not known, it needs to be imputed as a latent variable, severely limiting the scalability of the resulting algorithm. Because of the exchangeability property of our approach, we avoid this issue: the labelling of the nodes is irrelevant, and we can additionally exploit the hierarchical structure to design scalable algorithms, as demonstrated in Section 7.
Another related line of work, mentioned by several discussants (Bharath, Crane, Campbell and Broderick, and Bloem-Reddy), is that of edge exchangeable models (Crane and Dempsey, 2015, 2016; Cai et al., 2016; Williamson, 2016; Janson, 2017a). This line of work—which first appeared on arXiv (Crane and Dempsey, 2015) nearly 2 years after our initial arXiv posting—considers a different notion of exchangeability for 'interaction data' defining multi(hyper)graphs (Crane and Dempsey, 2016); as with our method, this approach can also produce sparse graphs.
We believe that both approaches have merit, and that neither subsumes nor replaces the other. Both serve as important building blocks for statistical network models. As has been noted by several researchers (Cai et al., 2016; Janson, 2017a; Williamson, 2016) and in the discussion of Bloem-Reddy, there are some similarities between the two approaches, especially as they relate to our construction of directed multigraphs in Section 3.2. Although we present a construction for multigraphs, the focus of our paper is on simple graphs. However, the general framework still provides a notion of exchangeability that applies to both multigraphs and simple graphs. Interestingly, this is not so for edge exchangeable models: simple graphs are obtained from multigraphs by discarding multiple edges (Crane and Dempsey (2016), section 5.4), and through this process the edge exchangeability property is lost.
Alternative ideas for constructing non-projective sparse multigraphs are presented by Kumar. Kendall, and Mateu and Eckardt discuss other related approaches based on stochastic geometry. Mateu and Eckardt mention the similarities with geometric or neighbour graphs, which also use (spatial) point processes for random graphs. Finally, Kendall discusses the class of scale invariant random spatial networks, which has a different notion of (spatial) invariance and uses (improper) Poisson line processes. Though these constructions are quite different from our own, they offer another illustration of the usefulness of stochastic processes for building network models.
Further properties of the class of random graphs
In this paper, we focused on the sparsity of the resulting networks defined by equations (6)–(8) when ρ has finite mean. The sparsity properties are not tied to the specific choice of the link function (see the comment by Zhou) and follow-up work studied the sparsity properties of general Kallenberg graphs. Veitch and Roy (2015) studied the expected number of edges and nodes for the general model, showing that sparsity is achieved whenever the function M is integrable with unbounded support; Caron and Rousseau (2017) provide strong laws for the asymptotic number of edges and nodes and degree distributions when M is integrable and satisfies some regular variation assumptions. Bianconi asks whether it is possible to achieve the extremely sparse regime in this framework, almost surely. Caron and Rousseau (2017) showed that it is possible to obtain rates of where l * is a slowly varying function with . If ρ has finite mean, or more generally if M is integrable, ; hence one cannot achieve the extremely sparse regime.
It would be of interest to characterize other properties of the special class of graphs defined by equations (6)–(8), and more generally of graphs based on Kallenberg's representation. Casarin, Iacopini and Rossini investigate empirically the distribution of the assortativity coefficient and clustering coefficient (also discussed by Durante and Chakraborty) for the generalized gamma process model. Banerjee and Ghosal suggest investigating the characterization of the graph Laplacian and the limit distribution of the normalized number of edges. These represent some important possible directions of exploration.
Consistency and rates of convergence
Questions regarding the convergence of posterior distributions of the parameters, hyperparameters or functionals have been highlighted by a few discussants. Arbel suggests an approach to show posterior consistency for the sparsity parameter σ. Such results, in well-specified and misspecified settings, would be useful, as they would justify the use of the method described in Section 8.1 as a sparsity test. Castillo and Rebafka also study empirically the asymptotic behaviour of the posterior distribution of σ in a misspecified scenario, simulating from a sparse, but non-projective, model. They also investigate empirically the convergence of the posterior of the edge density and the density of triangles.
Although we provided some important analyses of the properties of our modelling framework, there are many other interesting aspects to study relating to the asymptotic properties of inference and estimation.
Computational aspects
This paper presented an MCMC algorithm with Hamiltonian Monte Carlo (HMC) and Metropolis–Hastings updates. The key step underlying our scheme is a data augmentation that allows a very simple expression for the gradient in the HMC update and tailored proposals for the hyperparameters. There is, however, plenty of room for improvement in the efficiency of our MCMC algorithm. Our HMC implementation is very simple, with a fixed number of leapfrog steps and the step size adapted during part of the burn-in period; more elaborate HMC strategies could be used here. As mentioned by Moores and Firth and briefly discussed in Section 7.2 of our paper, the algorithm is very amenable to parallelism. Alternatives to HMC sampling have been implemented by Bouchard-Côté and Briercliffe, showing that additional gains could be obtained through such modifications as well. Alternatively, Campbell and Broderick suggest using variational Bayesian methods instead of MCMC algorithms, applying a truncation scheme for the completely random measure. This too represents a promising alternative to our MCMC algorithm.
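As background, the generic leapfrog integrator at the heart of an HMC update can be sketched as follows. This is a textbook sketch, not the paper's implementation: the `grad_logpost` callable is a placeholder for the model-specific gradient of the log-posterior (over, say, the log-weights), and step-size adaptation is omitted.

```python
def leapfrog(q, p, grad_logpost, step, n_steps):
    """Integrate one Hamiltonian trajectory with the leapfrog scheme.

    q: position (list of floats), p: momentum (same length),
    grad_logpost: callable returning the gradient of the log-posterior
    at q. Returns the proposed (q, p) at the end of the trajectory.
    """
    q, p = list(q), list(p)
    g = grad_logpost(q)
    # Initial half step for the momentum.
    p = [pi + 0.5 * step * gi for pi, gi in zip(p, g)]
    for s in range(n_steps):
        # Full step for the position.
        q = [qi + step * pi for qi, pi in zip(q, p)]
        g = grad_logpost(q)
        # Full momentum steps in between, half step at the end.
        scale = step if s < n_steps - 1 else 0.5 * step
        p = [pi + scale * gi for pi, gi in zip(p, g)]
    return q, p
```

On a standard Gaussian target (gradient −q), the trajectory approximately conserves the Hamiltonian 0.5 q² + 0.5 p², which is what makes the subsequent Metropolis accept–reject step efficient.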
Acknowledgements
The authors thank Bernard Bercu, Arnaud Doucet, Yee Whye Teh, Stefano Favaro, Dan Roy, Lancelot James, Jennifer Chayes, Christian Borgs, Judith Rousseau, George Deligiannidis and Konstantina Palla for helpful discussions and feedback on earlier versions of this paper.
FC acknowledges the support of the European Commission under the Marie Curie intra-European fellowship programme. This work was partially supported by the Alan Turing Institute under Engineering and Physical Sciences Research Council grant EP/N510129/1 and by the BNPSI Agence Nationale de la Recherche project ANR-13-BS-03-0006-01.
EBF was supported in part by Defense Advanced Research Projects Agency grant FA9550-12-1-0406 negotiated by the Air Force Office of Scientific Research and Air Force Office of Scientific Research grant FA9550-12-1-0453.
References
References in the discussion
Appendix A: Further background
A.1. Aldous–Hoover theorem and graphons
In theorem 7, we present the Aldous–Hoover theorem for the case of joint exchangeability—i.e. symmetric permutations of rows and columns—which is applicable to matrices Z where both rows and columns index the same set of nodes. Separate exchangeability allows for different row and column permutations, making it applicable to scenarios where one has distinct node identities on rows and columns, such as in the bipartite graphs that we considered in Section 3.4. Extensions to higher dimensional arrays are likewise straightforward. For a more general statement of the Aldous–Hoover theorem, which holds for separate exchangeability and higher dimensional arrays, see Orbanz and Roy (2015).
Theorem 7 (Aldous–Hoover representation of jointly exchangeable matrices (Aldous, 1981; Hoover, 1979)). A random matrix Z = (Z_ij) is jointly exchangeable if and only if there is a random measurable function f: [0, 1]^3 → Z such that

Z_ij = f(U_i, U_j, U_ij),     (46)

where (U_i) and (U_ij), with U_ij = U_ji, are a sequence and matrix respectively of IID uniform[0, 1] random variables.
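For the special case of a {0, 1}-valued matrix, the representation reduces to sampling a dense exchangeable graph from a graphon W: the function f thresholds U_ij against W(U_i, U_j). A minimal sketch (function names are ours):

```python
import random

def sample_graphon_graph(n, W, seed=None):
    """Sample an n-vertex dense exchangeable graph from a symmetric
    graphon W: draw U_i ~ uniform[0, 1] per vertex and connect each
    pair i<j independently with probability W(U_i, U_j).

    This instantiates Z_ij = f(U_i, U_j, U_ij) with f thresholding
    U_ij against W(U_i, U_j).
    """
    rng = random.Random(seed)
    u = [rng.random() for _ in range(n)]
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < W(u[i], u[j]):  # the U_ij comparison
                adj[i][j] = adj[j][i] = 1
    return adj
```

With W bounded away from 0 on a set of positive measure, the expected number of edges grows like n^2, which is the density dichotomy that motivates the paper's departure from the Aldous–Hoover setting.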
A.2. Regularly varying Lévy measures
Here we provide a formal definition of a regularly varying Lévy measure ρ that is referred to in theorems 2 and 5.
Definition 1 (regular variation). A Lévy measure ρ on (0, ∞) is said to be regularly varying if its tail Lévy intensity is a regularly varying function at 0 (Feller, 1971), i.e. it satisfies

ρ̄(x) ∼ ℓ(1/x) x^(−σ) as x → 0,     (48)

for σ ∈ [0, 1) and ℓ a slowly varying function satisfying ℓ(ax)/ℓ(x) → 1 as x → ∞ for any a > 0.
For example, constant and logarithmic functions are slowly varying.
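A quick numerical check illustrates the definition: for a slowly varying ℓ such as the logarithm, the ratio ℓ(ax)/ℓ(x) tends to 1 as x grows, whereas for a power function it does not. This snippet is illustrative only.

```python
import math

def ratio(l, a, x):
    """Compute l(a*x)/l(x); this tends to 1 as x grows if and only if
    l is slowly varying (for every fixed a > 0)."""
    return l(a * x) / l(x)

# Logarithm: slowly varying, ratio approaches 1 for large x.
log_ratio = ratio(math.log, 2.0, 1e12)
# Square root: regularly varying with exponent 1/2, ratio stays at sqrt(2).
power_ratio = ratio(math.sqrt, 2.0, 1e12)
```

The contrast shows why the slowly varying factor ℓ only perturbs, and never changes, the polynomial exponent σ governing sparsity.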
Appendix B: Proof of proposition 1
Appendix C: Proofs of results on the sparsity
C.1. Probability asymptotics notation
We first describe the asymptotic notation that is used in the remainder of this section, which follows the notation of Janson (2011). All unspecified limits are as α→∞.
The equivalence notation f(x) ∼ g(x) is used for lim_{x→0} f(x)/g(x) = 1 (not to be confused with the notation '∼' alone for 'distributed from').
C.2. Proof of theorem 3
C.3. Proof of theorem 4
C.3.1. Finite activity case
We first consider the case of a finite activity CRM. In this case, the number of nodes is bounded below by the square root of the number of edges, and bounded above by the (finite) number of jumps of the CRM. Let , 0<T<∞. Let J_α denote the number of points such that . (J_α)_{α⩾0} is a homogeneous Poisson process of rate T, and thus J_α/α → T almost surely. Since , it follows from result (56) that .
C.3.2. Infinite activity case
We now consider the infinite activity case where . First note that, by monotone convergence, .

[Figure: illustration of the proof of theorem 4 in the infinite activity case (support points of the measure D). The coloured regions indicate the intersection of A_{2n−1} with the support for n = 1, 2, 3; from this, X_1 = 0 (the count in blue), X_2 = 2 (the count in red) and X_3 = 1 (the count in green).]
Proof of result (59)
C.4. Proof of theorem 5
Appendix D: Technical lemmas
The following lemma is a corollary of theorem 3, page 239, in Feller (1971).
Lemma 1. Let (X_n)_{n=1,2,…} be a sequence of mutually independent random variables with arbitrary distributions and such that . Let . If … then …
Proof. Assume for simplicity that for all n (otherwise, consider the subsequence of random variables with strictly positive mean). We have, by Riemann integration, . The result then follows from theorem 3, page 239, in Feller (1971) with .
Lemma 2 (relating tail Lévy intensity and Laplace exponent) (Gnedin et al. (2007), propositions 17 and 19). Let ρ be a Lévy measure, with tail Lévy intensity and Laplace exponent as above. Then conditions (64) and (65) are equivalent, where 0 ⩽ σ < 1 and ℓ is a function slowly varying at ∞, i.e. satisfying ℓ(cy)/ℓ(y) → 1 as y → ∞ for every c > 0.
Lemma 3 (Resnick (1987), chapter 0, proposition 0.8). If U is a regularly varying function at 0 with exponent σ, and f is a positive function such that f(t) ∼ ct as t ↓ 0, for some constant 0 < c < ∞, then U{f(t)} ∼ c^σ U(t) as t ↓ 0.
Proposition 2 (Tauberian theorem) (Feller (1971), chapter XIII, section 5, theorems 3 and 4). Let U(dw) be a measure on (0, ∞) with density u monotone in some interval (0, x_0). Assume that its Laplace transform exists for t > 0. If ℓ is slowly varying at ∞ and 0 < a < ∞, then relationships (66) and (67) are equivalent.
Appendix E: Proof of theorem 6
The proof of theorem 6 relies on results on posterior characterization with models involving normalized CRMs. We first state a corollary of lemma 5 by Pitman (2003) and theorem 8.1 by James (2002). Similar results appear in Prünster (2002), James (2005) and James et al. (2009). The corollary involves the introduction of a discrete random variable R, conditional on which the CRM has strictly positive mass.
Corollary 1. Let W_α be a (finite or infinite) CRM on [0, α], with neither fixed atoms nor a deterministic component, and with mean measure ρ(dw)dθ. Denote , with probability distribution . Let R ∈ {0, 1, 2, …} be a discrete random variable such that, for r ⩾ 0, with ζ_0(0) = 1. The condition ensures that, conditionally on R > 0, almost surely, and the normalized CRM below is properly defined.
Appendix F: Details on the Markov chain Monte Carlo algorithms
The undirected graph sampler that was outlined in Section 7.2 iterates as follows.
Step 1: update given the rest with HMC sampling.
Step 2: update (α,σ,τ,w *) given the rest by using a Metropolis–Hastings step.
Step 3: update the latent counts given the rest by using either the full conditional or a Metropolis–Hastings step.
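The three steps can be organized into a Gibbs-like outer loop. In this skeleton the update callables are placeholders for the conditional updates detailed in Sections F.1–F.3; the function and argument names are ours.

```python
def run_mcmc(state, data, n_iter, update_weights, update_hyper, update_latent):
    """Skeleton of the blocked sampler: each iteration applies the three
    conditional updates in turn and records the resulting state.

    update_weights: step 1, HMC update of the log-weights;
    update_hyper:   step 2, Metropolis-Hastings update of (alpha, sigma, tau, w*);
    update_latent:  step 3, update of the latent counts.
    """
    trace = []
    for _ in range(n_iter):
        state = update_weights(state, data)
        state = update_hyper(state, data)
        state = update_latent(state, data)
        trace.append(state)
    return trace
```

Because each block conditions on the others, any mix of exact conditional draws and Metropolis–Hastings steps within the blocks leaves the target posterior invariant.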
F.1. Step 1: update of
F.2. Step 2: update of w *,α,σ and τ
F.3. Step 3: update of the latent variables
Concerning the latent counts, the conditional distribution is a truncated Poisson distribution (44) from which we can sample directly. An alternative strategy, which may be more efficient for a large number of edges, is to use a Metropolis–Hastings random-walk proposal.
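For reference, a direct draw from a zero-truncated Poisson conditional can be obtained by CDF inversion, as sketched below. This sketch is ours and is not the paper's implementation; it simply illustrates the "sample directly" option.

```python
import math
import random

def sample_zero_truncated_poisson(lam, rng=random):
    """Draw from Poisson(lam) conditioned on being >= 1, by inversion:
    map a uniform U into the CDF mass above zero, i.e. into
    (exp(-lam), 1), then walk the Poisson CDF until it is exceeded.
    """
    u = rng.random()
    t = math.exp(-lam) + u * (1.0 - math.exp(-lam))
    k, cdf, pmf = 0, math.exp(-lam), math.exp(-lam)
    while cdf < t:
        k += 1
        pmf *= lam / k  # Poisson pmf recursion p(k) = p(k-1) * lam / k
        cdf += pmf
    return max(k, 1)

draws = [sample_zero_truncated_poisson(2.0) for _ in range(5)]
```

The walk terminates quickly for moderate lam since the CDF approaches 1 geometrically; for very large lam a rejection sampler on the untruncated Poisson would also work, as rejections (draws of 0) become rare.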