A generating-function approach to modelling complex contagion on clustered networks with multi-type branching processes

Understanding cascading processes on complex network topologies is paramount for modelling how diseases, information, fake news and other media spread. In this paper, we extend the multi-type branching process method developed in Keating et al., 2022, which relies on homogenous network properties, to a more general class of clustered networks. Using a model of socially-inspired complex contagion we obtain results, not just for the average behaviour of the cascades but for full distributions of the cascade properties. We introduce a new method for the inversion of probability generating functions to recover their underlying probability distributions; this derivation naturally extends to higher dimensions. This inversion technique is used along with the multi-type branching process to obtain univariate and bivariate distributions of cascade properties. Finally, using clique cover methods, we apply the methodology to synthetic and real-world networks and compare the theoretical distribution of cascade sizes with the results of extensive numerical simulations.

1. Introduction.Recent developments in technology have lead to social network data becoming more available and more analysed than ever before.For example, the Twitter API V2 allows academic researchers to access up to 10 million tweets per month.With this wealth of social network data, we need to develop appropriate tools for its analysis.In this paper, we focus on learning about the interplay between the network structure and the dynamics on the network, such as the spread of behaviour, information or a disease.Social networks are highly clustered [30,38,37,26], this means that they contain a higher number of triangles than random networks, reflecting the fact that "a friend of my friend is likely to be a friend of mine"; however, most models used to model diffusion in online social networks assume that the network is locally tree-like and thus ignoring the clustering [12,9,30,21].It is important that we develop tools for analysing network dynamics which account for this clustering.
In an online experiment, Centola [7] observed that repeated exposures to a health behaviour from network neighbours made its adoption more likely, given that the individual had not already adopted, this adoption mechanism is known as a complex contagion.Complex contagion dynamics have also been observed in the use of Skype add-ons [16], the spread of information and politically-controversial hashtags on Twitter [23,31] and the spread of online fads [33].Clustering in the underlying network has been shown to inhibit simple contagion dynamics [32], this is because the clustering leads to redundancy in the links for transmission; however, clustering drives the reinforcement mechanism of the complex contagion [7,28] and therefore larger outbreaks can occur in clustered networks for a complex contagion.Most commonly, complex contagions are modelled using threshold models [36].In a such models, each node is assigned a threshold and when the number of neighbours who have adopted exceeds the threshold, the node adopts with certainty.In Keating et al. [17], we introduced an alternative model for complex contagion which is an extension of the independent cascade model (ICM) [18].We continue to use this model here as it can be easily incorporated into the branching process framework, it allows us to control the effects of social reinforcement through a single parameter, α ∈ [0, 1], and if we set α = 0 we get the ICM of [18] which is a simple contagion; i.e, we can study both simple and complex contagions through this model.
Branching processes are discrete-time stochastic processes that have been used to model diffusion cascades [10,13,14,11,2,20], the spread of infectious diseases [34,35] and population dynamics in ecology [5].Simple branching processes have proven a fruitful means of capturing important network properties, Gleeson et al. [12] analytically derived cascade properties from data including the cascade size distribution, cascade lifetime distribution, expected average tree depth (EATD) and structural virality [15].The authors focused on the case where the network is assumed to be locally tree-like; i.e., unclustered, and considered simple-contagion dynamics only.The branching process theory for modelling cascades on networks assumes an infinitely large network; however, as has been shown in numerous previous papers [12,17,27] and in this paper, branching processes can give very accurate results even when the network size is clearly finite.In previous work [17], we developed a method for modelling complex contagion on clustered networks using multi-type branching processes (MTBPs).We concentrated on average measures of the cascades and analytically calculated the cascade condition and the expected cascade size for a very specific class of networks.In this paper, for clustered networks, we analytically derive results including the complete distribution of cascade sizes, the distribution of cascade lifetimes and the EATD.We also extend the MTBP theory to more general network distributions, making the method more applicable to real-world networks.In particular, we use a family of network distributions proposed independently by Newman [25] and Miller [22] in 2009; described by the distribution of triangle and single-link membership of the nodes, we refer to these networks as Newman-Miller networks.By incorporating clustering into the model through the Newman-Miller networks, we can analytically study both simpleand complex-contagion spreading processes.
Within the branching process framework, we use probability generating functions (pgfs) to derive distributions of the quantities of interest.When we have a pgf for a quantity such as cascade size, we are interested in recovering its full probability distribution from the pgf.The most common method of numerically finding the probability distribution in the network dynamics literature is that of Cavers [6,27,14], this involves using the z-transform inversion which has the form of a contour integral and approximating the integral using the trapezoidal rule.For bivariate pgfs Brummitt et al. [3] use three different methods of finding the probability distribution in a two-type branching process, but these methods use computer algebra systems and are restricted in the number of terms that they can retrieve.While Cavers' derivation has been extended to bivariate pgfs [1], it is not clear how it extends to higher dimensions.In this paper, we introduce an alternative derivation of the pgf inversion method.Our approach differs to that of Cavers [6] and Antal [1]; while they focus on the z-transform inversion, we look at the equivalence between the pgf evaluated at specific points and the discrete Fourier transform of the probability distribution.Our method has the same computational implementation as Cavers' method in one dimension but has the added advantage of naturally extending to the inversion of pgfs of any dimension.One of the contributions of this paper is to describe and show how to implement the inversion of univariate and bivariate pgfs, the same method may be straightforwardly extended to higher-dimensional distributions.
We are interested in understanding how the structure of a network and the dynamics interact with each other.It is important that we can apply these methods to real-world networks.Burgio et al. [4] introduced a clique cover method which allows us to approximate the (maximal) clique-membership distribution of a network while constraining that the cliques are edge-disjoint -cliques do not share edges -called the edge-disjoint edge clique cover (EECC).In section 6 we apply the EECC to synthetic and real-world networks and find the cascade size distribution according to the branching process theory, for comparison with Monte-Carlo simulations.Applying the MTBP to these empirical networks allows us to see where the theory gives accurate results and to identify areas for future work on the branching process methodology.
The rest of this paper is structured as follows; in section 2 we describe the complex contagion adoption dynamics, in section 3 we introduce the class of networks that we use in our calculations, in section 4 we show how the MTBP framework can be leveraged to derive distributions of cascade properties.Our examples in section 4 include the distributions of cascade size and cascade lifetimes and the joint distribution of cascade size and cumulative depth.In section 5 we derive a method for the inversion of pgfs to recover the probability distribution, in section 6 we discuss and show results for applying this method to real-world networks and in section 7 we discuss the capabilities and limitations of our approach.

Complex Contagion Adoption
Dynamics.When we study a diffusion process on a network, both the spreading mechanism and the network structure are required to accurately model the process.To model complex contagion using the MTBP methodology, we use the discrete-time model of [17], which is an extension of the ICM [18].Nodes in the network are assumed to be in one of three states; active, inactive or removed.The spreading process is initiated with a small number of active seed nodes that are randomly selected -in examples here we choose to select just one seed node, all other nodes are initially inactive.At each time step, active nodes have one chance to activate each of their inactive neighbours, activation occurs with probability p k , where k is the number of times that that inactive neighbour has been exposed.If the activation is successful, the inactive neighbour becomes active in the next time step, active nodes always become removed in the following time step and once a node enters the removed state, it never leaves that state.The dynamics are governed by two parameters; p 1 and α, where p 1 ∈ [0, 1] is the probability of adoption after a single exposure and α ∈ [0, 1] is a measure of the strength of social reinforcement.Higher values of α represent higher levels of social reinforcement and in the limiting case of α = 0 we recover exactly the (simple-contagion) independent cascade model.The probability that a node will become active after its k th exposure, given that it is inactive, is given by the following equations, (2.1) where q k is the probability that a node does not adopt directly after the k th exposure, given that it has not already adopted.This model was studied in [17] for a small number of networks with homogeneous Newman-Miller distributions.Some of the results included the cascade condition and the expected cascade size.
3. Newman-Miller Distributions.Along with the complex contagion adoption dynamics that we described in the previous section, to fully describe the MTBP we need a way of describing the network so that it can be incorporated into the MTBP.We model diffusion on networks where the clique membership of the nodes follows a joint probability distribution π st , which is defined as the probability that a node, chosen at random, is in s single links and t triangles (is in s 2-cliques and t 3-cliques).We refer to s as the link degree and t as the triangle degree.This family of networks was independently proposed by Newman [25] and Miller [22] in 2009 and we have elected to use these network models because they are commonly used to analytically study spreading processes on clustered networks [19,20].We use pgfs to describe the joint triangle-link (or Newman-Miller) degree distribution of the network.The pgf is defined as Here, we use the example of the doubly-Poisson distribution [25] which has probability mass function (pmf) where µ is the expected number of single links per node and ν is the expected number of triangles per node.We can express this distribution in the form of a pgf using Eq.(3.1), e −µ µ s s! e −ν ν t t! x s y t = e µ(x−1)+ν(y−1) .
When we explore the dynamics of the system, we often need to know the excess degree distributions from traversing a link or a triangle, to allow us to calculate the cascade size distribution, for example.If we traverse a single link then the joint excess degree is the number of triangles and single links that the node at the other end is a part of, not including the single link traversed itself.If we traverse a triangle, the joint excess degree is the number of single links and triangles that the node at the other end is part of, not including the triangle traversed.As described by Newman [25], we can find the joint excess triangle-link distributions from traversing a single link f q and from traversing a triangle f r as follows; (3.4) In the following sections, we use these network-structure distributions in our calculations of the distributions and values of quantities of interest.
4. Derivations of cascade properties.In this section, using pgfs, we theoretically derive distributions of cascade size, cascade lifetimes and the joint distribution of cascade size and cumulative depth.We also use the moments of the joint distribution to calculate the Pearson's correlation coefficient between cascade size and cumulative depth and the expected average tree depth (EATD).
4.1.Cascade Size Distribution.Here, we analytically derive the cascade-size distribution for complex contagion dynamics on a network constructed using the Newman-Miller approach described in section 3. Previously, we have calculated the average cascade behaviour using MTBPs [17].The types in the MTBP are clique motifs, as shown in Figure 4.1 (a), a clique motif is a clique with a specific number of active, inactive and removed nodes.We are only concerned with those motifs with at least one active node.In a Newman-Miller network there are 6 possible clique motifs, we denote by z i a type-i motif.In the paper [17], we showed how to calculate the expected cascade size; however, often it is the distribution of cascade sizes that is of interest, not just the expected value.Other works analytically derive this distribution for locally tree-like networks [12,13,14] but not for clustered networks.Here, we derive the distribution of cascade sizes for the complex contagion dynamics that we described in section 2 on (clustered) Newman-Miller networks.
We consider the size of a full cascade after n generations, Xn , in terms of its subtree sizes, where the size of a type-i subtree after n generations, X n,i , is the number of nodes in the tree seeded by a type-i motif that were once or are currently active or removed, excluding those active or removed nodes that were in the seed motif; this is illustrated in Figure 4.1 (b).Not including these active or removed nodes in the seed motif avoids counting a node in the full cascade more than once.
To find the cascade-size distribution, we start by finding the pgf for cascade size.By definition, the pgf for cascade size for a full tree that has grown from its seed for n generations, Kn (z), is where P ( Xn = l) is the probability that the size of the full cascade Xn is l.The full cascade size, Xn , can be expressed in terms of its subtree sizes, where s is the number of subtrees seeded with a type-5 motif and t is the number of subtrees seeded with a type-1 motif -the seed node is in t triangles and has s single links (see In this figure we assume that the number of generations that we allow the subtree to grow for n → ∞.We let n → ∞ because, for a sub-critical branching process, this assures that the sub-trees will be fully grown; i.e., there will be no further offspring in future generations.
n,1 seed node has s links and t triangles = z s,t To find Kn (z), we need to know the pgfs for the subtree sizes K n,1 (z), K n,2 (z) and K n,5 (z).
The pgf for a type-1 subtree, after n generations is given by, where X n,1 is a subtree seeded by a type-1 motif.The three terms in Eq. ( 4.4) come from the fact that a type-1 motif, in the next generation, either gives rise to no new active nodes, one new active node or two new active nodes respectively.These new active nodes must be counted in the cascade size along with the generation n − 1 subtrees that they seed.For example, if in generation 1 a type-1 motif has 1 offspring; i.e, produces a type-2 motif, the subtree size can be written in the form, 1 , where the 1 counts the new active node, X n−1,2 is for the size of the type-2 subtree produced and the i X n−1,5 is for the generation n − 1 type-1 and type-5 subtrees produced, where the number of type-1 subtrees is the excess triangle degree from traversing a triangle and the number of type-5 subtrees is the excess link degree from traversing a triangle.We can rewrite Eq. (4.4) as which becomes when we notice that these expectations and sums are equivalent to the pgfs for the offspring distribution from traversing a triangle, f r (x, y), and the pgfs for subtree size after n − 1 generations.Similarly, by using the offspring distributions from a type-2 and a type-5 motif we can get the subtree size pgfs for type-2 and type-5 subtrees: Initially (in generation n = 0) the subtree size is zero; i.e., X 0,1 = X 0,2 = X 0,5 = 0, giving us initial conditions for their respective pgfs of E z X 0,i = E z 0 = 1.We iterate through Eqns.(4.3) to (4.7) with initial conditions; K 0,1 (z) = K 0,2 (z) = K 0,5 (z) = 1, to find Kn (z).With this, we have a system of equations that describe the pgf for cascade size; in section 5 we show how to recover the probability distribution from the pgf, allowing us to plot the cascadesize distribution for a given Newman-Miller network with given values for p 1 and α in the adoption dynamics.In Figure 4.2 we show an example of the cascade size distribution derived from the pgf compared to simulations on a network with a doubly-Poisson distribution.In [17] we calculated the expected cascade size for sub-critical cascades, here we calculate the distribution of cascade sizes for sub-critical cascades in the limit as the number of generations n → ∞.The cascades are sub-critical if an infinitesimally small seed fraction of active nodes will not generate cascades of non-finite mean size, which is a non-vanishing fraction of the network as the network size tends to the large network limit.To find the pgf in the limit as n → ∞, we set a tolerance of 10 −5 such that we stop iterating once the pgf evaluated at generation n is within 10 −5 of the pgf evaluated at generation n − 1.

Distribution of cascade lifetimes.
Here, we show how to find the distribution of cascade lifetimes.The lifetime of a cascade is the number of generations it lives for before dying out; i.e., the lifetime is the maximum of the generations of all nodes present in the cascade.If a cascade has particles -by particle we mean an active node -in generation n − 1 but no particles in generation n, the lifetime of the cascade is n.To calculate the probability Ω n that the lifetime of a cascade is exactly n generations, we therefore find the probability that the number of particles in generation n is zero and subtract the probability that there are zero particles in generation n − 1.Let the number of particles in generation n be given by Zn , and define the corresponding pgf to be We write the pgf in terms the number of particles in generation n of its subtrees n,1 seed node has s links and t triangles (4.9) = s,t for n ≥ 1, where F n,5 (x) and F n,1 (x) are the pgfs for the number of particles in generation n for type-5 and type-1 subtrees respectively.For completeness, if n = 0 then F0 (x) = E x Z0 = E x 1 = x.To fully compute F n (x), we need F n,1 (x), F n,2 (x) and F n,5 (x), which we calculate in a similar way to the pgfs for subtree size in subsection 4.1, and (4.12) with the initial conditions: The initial conditions in Eqs.(4.13), (4.14) and (4.15) are derived from the probability distribution of offspring in the next generation for type-1, type-2 and type-5 motifs respectively.For example, we get Eq.(4.13) because there are no offspring with probability (1 − p 1 ) 2 , one offspring with probability 2p 1 (1 − p 1 ) and two offspring with probability p 2 1 .In the initial condition, each of these probabilities is multiplied by x 0 , x 1 and x 2 respectively, giving us E X Z 1,1 which is the pgf for the number of particles in generation 1 of a type-1 subtree F 1,1 (x).To find Ω k , the probability that a cascade has lifetime k, we use the property of pgfs that the probability that there are zero particles in generation n is Fn (0) and thus (4.16) Ω n = Fn (0) − Fn−1 (0).
In Figure 4.2 (b), we show the distribution of cascade lifetimes for a doubly-Poisson distributed network comparing this theory to simulations.

Joint Distributions.
Many interesting and practically-relevant properties of cascades depend on joint distributions of several quantities, and so it is important to move beyond univariate pgfs (as in subsection 4.1) to develop methods for inverting multivariate pgfs.For example, we may wish to calculate the expected average tree depth (EATD), which we can find from the joint distribution of cascade size and cumulative depth.The cumulative depth of a cascade is the sum of the depths of all of the nodes in the cascade.The average tree depth is the average depth of a node in the tree and the EATD is the expectation of this quantity over the ensemble of cascades.In a cascade, the average tree depth is given by the relationship (4.17We can write cascade size Xn and cumulative depth Ỹn after n generations in terms of the size and cumulative depth of the types 1 and 5 subtrees.A type-i subtree is a subtree seeded by a type-i motif, all six motif types are shown in Figure 4.1 (a).The sizes are denoted by X n,1 and X n,5 respectively and the depths are denoted by Y n,1 and Y n,5 respectively.In terms of subtree sizes, the cascade size is where the seed node is in t triangles and has s single links and the cumulative depth is This allows us to rewrite the pgf in terms of the joint pgfs of its type-1 and type-5 subtrees; = s,t where f (x, y) is the pgf for the joint Newman-Miller distribution of the network and H n,k (x, y) is the pgf for the joint distribution of cascade size and cumulative depth for a subtree of type k allowed to grow for n generations.When calculating the cascade size of a type-k subtree, X n,k , similarly to in subsection 4.1, we do not count the active or removed nodes in the seed motif because this would lead to counting some nodes multiple times when we consider ensembles of subtrees in the overall size.To calculate the full pgf, we will need expressions for H n,1 (x, y), H n,2 (x, y) and H n,5 (x, y), using a method similar to that for the subtree size pgfs in subsection 4.1 we get, where f r and f q are the pgfs for the excess Newman-Miller distribution from a triangle and a single link respectively, these are given in Eqs.(3.5) and (3.4).We can find the full joint pgf of cascade size and cumulative depth by iterating through these equations from n = 0 with initial conditions H 0,1 (x, y) = H 0,2 (x, y) = H 0,5 (x, y) = 1.To find the joint probability distribution from the joint pgf we use an inverse fast Fourier transform method which we describe next in section 5.While methods that use computer algebra systems have been used in [3], we are not aware of other work on cascade dynamics that uses this method to recover the joint probability distribution of cascade properties.The main issue with using computer algebra systems to calculate the probabilities from pgfs is that they do not allow us to symbolically calculate the pgf for a large number of generations.This limits us in the number of sizes that we can calculate the probabilities for and also reducing the accuracy in the approximation in the limit as the number of generations n → ∞.This has motivated us to propose the inverse fast Fourier method of recovering the probability distribution which we describe next in section 5, which does not have the same shortcomings as the aforementioned methods.

5.
Inverting multivariate pgfs to recover probabilities.Given a pgf (or at least an iterative method for calculating it), in many cases, we would like to access the probability distribution of the quantity that it describes.For one-dimensional pgfs, in the network dynamics literature, the most common inversion method used is that of Cavers [6].In Cavers' method, a link between the pgf and the z transform is utilised and the inversion is expressed in the form of an inverse z transform.The inverse z transform requires the computation of a contour integral in the complex plane, which is approximated using the trapezoidal rule.The trapezoidal-rule approximation is equivalent to taking the inverse discrete Fourier transform of the pgf evaluated at K evenly spaced points around the unit circle centred at zero on the complex plane.In practice, we use the inverse fast Fourier transform, which is a fast, efficient algorithm for computing inverses of discrete Fourier transforms, in the place of the inverse discrete Fourier transform.Here, we introduce an alternative derivation which more naturally extends to the inversion of higher-dimensional pgfs.Our method has the same computational implementation as Cavers' method in one dimension.Given the pgf (5.1) we aim to find the probabilities {p i }, i ∈ {0, 1, 2, ...}, where p i is the probability that the random variable, with probability distribution described by the pgf B(z), is equal to i.The discrete Fourier transform of the first K probabilities is denoted by the set {P l }, l ∈ {0, 1, ..., K − 1}, and is given by the relationship (5.2) which is approximately the pgf B(z) evaluated at e −2πi K l for K sufficiently large and l ∈ {0, 1, 2, ..., K − 1} if we assume that p k → 0 as k → ∞.In our examples, we let K = 100.For example, to evaluate the pgf for cascade size, we iterate through Equations (4.3) to (4.7) setting x = e −2πi K l and do this for each l, this gives us a vector holding values of the approximation of the discrete Fourier transform of the cascade size probabilities.Once we have evaluated the pgf at each of these points, we use the inverse fast Fourier transform on that set of transformed points to obtain the an approximation for the first K probabilities from the pgf, the larger a value for K that we choose, the closer the approximation will be to the true values.In Figure 4.2 (a) we show the theoretical distribution of cascade size which is found using this method to invert the pgf for cascade size that we described in subsection 4.1.This method naturally extends to two dimensions (and beyond), allowing us to retrieve, for example, the joint probability distribution of cascade size and cumulative depth from subsection 4.3.The joint probability distribution of random variables X 1 and X 2 is given by the set of probabilities {p k 1 ,k 2 } where p k 1 ,k 2 is the probability that X 1 = k 1 and X 2 = k 2 .The general joint pgf has the form and similar to the univariate case, if we evaluate the pgf for all combinations of x ∈ {1, e − 2πi /K 1 , e − 4πi /K 1 , ..., e − 2(K 1 −1)πi /K 1 } and y ∈ {1, e − 2πi /K 2 , e − 4πi /K 2 , ..., e − 2(K 2 −1)πi /K 2 }, we get the 2-dimensional discrete Fourier transform of the joint probabilities of all combinations of the first K 1 values for X 1 and the first K 2 values for X 2 ; i.e., k 1 varying from 0 to K 1 − 1 and k 2 varying from 0 to K 2 − 1; (5.4) for K 1 and K 2 sufficiently large.In the case of the joint pgf for cascade size and cumulative depth, this means iterating through Equations (4.21) to (4.24) with x = e − 2πil /K 1 and y = e − 2πim /K 2 for all l, m combinations, l ∈ {0, 1, ..., K 1 − 1} and m ∈ {0, 1, ..., K 2 − 1}.From Eq. ( 5.4) if we evaluate the pgf for all combinations of x ∈ {1, e − 2πi /K 1 , e − 4πi /K 1 , ...,e − 2(K 1 −1)πi /K 1 } and y ∈ {1, e − 2πi /K 2 , e − 4πi /K 2 , ..., e − 2(K 2 −1)πi /K 2 } and take the 2-dimensional inverse fast Fourier transform, we will recover an approximation for the joint probabilities p k 1 ,k 2 for all combinations of the first K 1 values for k 1 and the first K 2 values for k 2 .As for the uni-variate case, the accuracy in the approximation increases as K 1 and K 2 increase.If we use this method to recover the joint distribution for cascade size and cumulative depth from the joint pgf derived in subsection 4.3, for example, we can calculate theoretical properties of the cascades including, but not limited to, the EATD and the correlation between cascade size and cumulative depth that we will describe in subsection 5.1.
In N dimensions the pgf would take the form, (5.5) which, when evaluated for each combination of , ..., N } becomes approximately the N -dimensional discrete Fourier transform of the probabilities p k 1 ,k 2 ,...,k N .As for the univariate and bivariate cases, taking the inverse fast Fourier transform gives us an approximation to the N -dimensional joint distribution.

Pearson correlation coefficient and EATD.
In this subsection, we show how the Pearson correlation coefficient between cascade size and cumulative depth and the EATD are calculated from the joint probability distribution that we described in subsection 4.3.The joint distribution is recovered from its pgf in the way that we have just shown.The Pearson correlation coefficient can be expressed in terms of the moments of the distribution these moments can be calculated from the probability distribution.For unclustered networks, Gleeson et al. [12] described a method for calculating the EATD using an integral formula; however, their method does not extend to clustered networks.An alternative method of calculating the EATD is to use the joint probability distribution of cascade size and cumulative depth, this method is not restricted to unclustered networks.We compute the EATD using the formula In Table 5.1, we show the EATD and Pearson's correlation found using this method and compare to Monte-Carlo simulations on two synthetic networks and three real-world networks.We show results for a doubly-Poisson Newman-Miller network (µ = 1, ν = 4) with 10,000 nodes and for two empirical networks; the powergrid network [37], the largest connected component of the science co-authorship network [24] and the C. elegans metabolic network [8].For the empirical networks, in the calculation of the theoretical EATD and ρ, we estimated the Newman-Miller distribution (π st from section 3) using the edge-disjoint edge clique cover (EECC) described in Appendix B. 6. Application to real-world networks.To apply these methods to empirical networks, we need to recover the joint distribution of triangles and single links from the network, this allows us to find the pgf for the distribution f (x, y) and for the excess distributions f r (x, y) and f q (x, y), as given in section 3, for a Newman-Miller network with the same joint distribution as the empirical network.An assumption of the MTBP method for modelling complex contagion on clustered networks is that the cliques do not share edges -they are edge-disjoint or, equivalently, are connected in a locally tree-like manner -in practice, many real-world networks have cliques which are not edge-disjoint.In Burgio et al. [4] they propose the Edge-Disjoint Edge Clique Cover (EECC), a cover which imposes, as the name suggests, the constraint that the cliques are edge-disjoint.With the EECC, we can approximate f , and thus f q and f r .
In our examples here, we assume that the network is composed of triangles and links but not higher-order cliques.Since the cliques in real-world networks can have more than three nodes; i.e., a triangle, we approximate any higher-order cliques as a combination of triangles and links, as shown in Figure 6.1 (b). Figure 6.1 (a) shows the decomposition of the network into edge-disjoint cliques where there is no restriction on the size of the cliques, this is the EECC introduced by Burgio et al. [4].In Appendix B we present our alteration of the EECC algorithm introduced by Burgio et al. [4], which imposes that the cliques in the EECC have maximum size 3; i.e., are at most triangles, this is the algorithm that we use here.This decision was due to the fact that the equations that we set up in previous sections assume that the network is fully composed of triangles and single links; in theory, if these systems of equations are extended to network distributions with higher-order cliques then it would be natural to extend the algorithm to capture these larger cliques.However, it is possible that extending the systems of equations beyond relatively small clique sizes would be prohibitively tedious by hand and an extension to higher order cliques is likely to require the derivation of the equations to be automated.The EECC of the same network when we restrict the maximum clique size to be 3.

Implementation of the EECC.
In this subsection, we apply the EECC to a synthetic network generated from the doubly-Poisson distribution and compare the results to what would be attained from a single-edge decomposition (SED), which assumes that the network is fully composed of single edges and thus the spreading dynamics constitute a simple branching process.In Figure 6.2 we show the results of using the EECC on a synthetically generated network with a doubly-Poisson distribution with parameters µ = 1 and ν = 4 (mean degree 9).In this network, the assumptions of the MTBP theory are very close to being met; i.e., the cliques do not overlap and the network is very large (5000 nodes).We applied the EECC to the network to approximate its joint triangle-link distribution and use this to find the distribution of cascade sizes according to the method described in subsection 4.1.We compare the results of this approach to using the SED and the method for calculating the cascade size distribution given in [12], for completeness, this method is summarised in Appendix A.
In Figure 6.2 (a) and (b) we examine the distribution of cascade sizes for simple contagion dynamics where p 1 = 0.05 and α = 0.For a simple contagion on this network, we see that the theoretical distribution of cascade size matches well to the simulations when we account for the presence of triangles in the network.However, when we compare the MTBP which accounts for the the presence of triangles to a simple branching process approximation (using the SED) which does not, there is minimal difference in the probability distributions, both curves match the simulations very well -this result is not particularly surprising, Melnik et al. [21] found that tree-based theory for dynamical processes on various highly clustered networks yielded accurate results despite the locally tree-like assumption being clearly violated.In Figure 6.2 (c) and (d) we examine the cascade size distribution for complex-contagion dynamics where p 1 = 0.02 and α = 0.2, we compare simulations on the network to the theoretical distribution of sizes and find very strong agreement between the theory and simulations.In the figures we include the simple branching process approximation of the same dynamics and see that accounting for the clustering in the network gives a more accurate theoretical result.When we compare the result for the theory that accounts for the presence of triangles to the theory that does not, we see that for these parameters not accounting for the presence of triangles leads to smaller cascades than are observed in simulations, again showing the importance of accounting for clustering when modelling complex contagion on clustered networks.6.2.Application of the EECC to real-world networks.So far, we have shown that we can derive interesting statistics about complex contagion dynamics on clustered networksor at least networks which are structurally composed of cliques which are edge disjoint.While we remain in the (arguably unrealistic) regime where the maximum clique size is 3, we would like to know how these methods work when we apply them to real-world networks which may have correlations and clique size distributions which are not present in the synthetic distributions that we have worked with so far.We apply these methods to three well-studied real-world networks; the power-grid network [37] (4941 nodes, 6594 edges, C ∆ = 0.101 ), the largest connected component of the science co-authorship network [24] (379 nodes, 914 edges, C ∆ = 0.43) and the C. elegans metabolic network [8] (453 nodes, 4596 edges, C ∆ = 0.12).We chose these networks because they are relatively large -reducing the impact of finite-size effects -but small enough that the EECC algorithm finds a cover in a reasonable run time.In Figure 6.3 (a) and Figure 6.4 (a) we show the probability distribution and ccdf respectively of cascade size for the power-grid network with complex contagion parameters p 1 = 0.04 and α = 0.15, we compared 1 million simulations on the network to the theoretical distributions accounting for triangles with the MTBP and also not accounting for triangles using the SED.While the theory matches quite well to the simulated cascade size probability distribution, in Figure 6.3 (a) and Figure 6.4 (a), for smaller cascade sizes, the difference in what is predicted by the SED and by the MTBP with the EECC is very small.The tree-like approximation made in the SED does quite well despite the power-grid network being clustered.In the tails of the cascade size distribution both of the theoretical distributions do not match the simulations very well, we speculate that this is due to the larger cliques present in the network which are not accounted for in the theoretical approach.Similarly, we applied the methods to the largest connected component of the science co-authorship network [24], using complex contagion parameters p 1 = 0.01 and α = 0.1, we show the probability distribution and ccdf in Figure 6.3(b) and Figure 6.4 (b).The EECC was also applied to this network in Ref. [29].We found that while the simulations match the theory extremely well for cascade sizes up to size three, which accounts for 99.2% of all cascades, our theory with the EECC under-predicts the number of large cascades that we observe in simulations.We anticipate that this is because there are larger cliques in the empirical network that are not accounted for when we use the EECC with maximum clique size 3 to approximate the clique distribution of the network.Given that this network is not very large (379 nodes) and that branching processes assume an infinitely large network, it is important to check that the network size is not causing the discrepancy between the theoretical distributions and the simulated cascades.In Appendix C, we explore the impact of finite-size effects and find that these are minimal, again reinforcing our hypothesis that the difference between the theory and the simulations is caused by the theory not accounting for the higher-order cliques (with more than three nodes).Finally, we applied this method to the C. elegans metabolic network [8] for complex contagion with parameters p 1 = 0.005 and α = 0.005.For this network, the simple branching process theory poorly approximates the cascade-size distribution (probability distribution and ccdf) as shown by the red dashed line in Figure 6.3 (c) and Figure 6.4 (c) respectively.When we account for the presence of triangles (as shown by the black solid line) using the MTBP with the EECC to approximate the Newman-Miller distribution, the theory comes much closer to the simulations on the empirical network which suggests to us that it is the presence of higher-order structures in the C. elegans metabolic network that leads the simple branching process to poorly describe the dynamics.By accounting for links and triangles only, the MTBP substantially improves the theoretical distribution of cascade size, when we compare to simulations on the C. elegans metabolic network.
7. Discussion and Conclusions.In this paper, we have extended the multi-type branching process (MTBP) theory originally proposed in Ref. [17] in three main directions; first, the clustered networks studied in [17] were highly stylised and we showed how the more general Newman-Miller class of networks can be used to incorporate clustering into the branching process framework; then, we used the MTBP to derive more results, the focus here was on looking at full distributions of cascade sizes, cascade lifetimes and the joint distribution of cascade size and cumulative depth which enabled us to calculate the expected average tree depth and the correlation between cascade size and cumulative depth, this contrasts with a focus on the average behaviour of the cascades in our previous paper [17].Finally, we showed how to apply the theory to real-world networks.To apply the theory to real-world networks, we approximated the real-world network structure as a Newman-Miller network using the edge-disjoint edge clique cover (EECC) proposed by Burgio et al. [4].A second contribution of this paper is our derivation of the numerical inversion method for multi-variate probability generating functions (pgfs).While Cavers' method [6] has been extended to bivariate pgfs [1], our alternative derivation has the advantage of extending naturally to pgfs of any dimension.
In Keating et al. [17], we focused on a very stylised model of network clustering for ease of explanation.These networks were infinitely large and we assumed that the local structure of the network was invariant throughout the network; e.g., we looked at networks where every node was in two 4-cliques and a network where every node was in four 2-cliques and one 3-clique.Here, we showed how the method can be extended to incorporate any Newman-Miller network distribution describing the distribution of triangle and single-link membership of the nodes in a network.In our examples, we use the doubly-Poisson distribution as used by Newman [25].While we did not delve into it here, using the probability generating functions for the Newman-Miller network distributions, all of the results on the average behaviour of the cascades from Ref. [17] can be extended to this more general class of networks.For example, we can calculate the expectations from the offspring distribution pgfs and use this to find the elements of the mean matrix, allowing us to find the cascade condition and the expected cascade size using the methods described in Ref. [17].Furthermore, while we focused here on Newman-Miller networks, which have a maximum clique size of three, in principle the method can be extended beyond this restriction to multivariate distributions of cliques of any size, this would mean that we would need to incorporate different types into the MTBP for which the transition probabilities may be calculated either by hand or using a computer algorithm.To incorporate very large cliques into the MTBP, we expect that it will not be feasible to do this by hand and it would be necessary to write a computer algorithm to do this for us -we did not attempt to do this as part of this work.
Gleeson et al. [12] successfully used simple branching processes to model information cascades on Twitter.They analytically derived full distributions of cascade lifetimes and cascade size, and calculated the expected average tree depth (EATD) and structural virality.For complex-contagion dynamics on clustered networks, we here showed how, using pgfs, we can find the cascade size distribution, distribution of cascade lifetimes and the EATD.Our method for calculating the cascade size distribution, as in Ref. [12], involves finding the pgf for the cascade size distribution and recovering the probability distribution using an inverse fast Fourier transform method.In section 5, we explained how inverse fast Fourier transforms can be used to recover an accurate approximation of the probabilities, this naturally extends to higher-dimensional pgfs.We use inverse fast Fourier transforms to invert a 2-dimensional pgf for the joint distribution of cascade size and cumulative depth, allowing us to find the EATD.The inverse fast Fourier transform method is crucial for finding accurate distributions under the MTBP theory, other studies use computer algebra systems [3], which are limited by the number of terms that can be computed in the probability distribution and by the number of generations that can be accounted for in the branching process.The inability for computer algebra systems to recover cascade-size probabilities for branching processes that are not truncated after a short number of generations means that they can not accurately model many real-world processes that might survive beyond a small number of generations.
In section 6 of this paper, we applied the MTBP method to synthetic and real-world networks.Using the edge-disjoint edge clique cover (EECC) method from Burgio et al. [4], we recovered the Newman-Miller distribution of both a synthetically generated network (Figure 6.2) and a Newman-Miller approximation for the clique distribution of three real-world of size 3 to the EECC.Remove all edges present in the EECC from C HO , recalculate the maximal cliques and ρ considering all remaining elements.Appendix C. Finite-size effects.In Figure 6.3 and Figure 6.4,we see differences between the cascade-size distribution for the Monte-Carlo simulations on the empirical networks and the cascade-size distributions from the MTBP theory, especially in the tails of the distributions.One potential cause for this is that, since the networks are not very large (the science co-authorship network has 379 nodes), these could be finite-size effects.When we model cascades on networks using branching processes, we make the assumption that the network is infinitely large; however, this assumption is clearly violated in the case of these real-world networks.Another possible cause of this differences between the theory and the simulations is that the real-world networks contain higher-order cliques and highly connected structures that are lost in the MTBP model when we approximate the network structure using the EECC with maximum clique size of three.
All cliques with more than three nodes are decomposed into a series of edge-disjoint triangles and single links.To determine whether these discrepancies are predominantly due to finite-size effects, we generate a synthetic network of approximately the same size as the empirical networks from the Newman-Miller distributions calculated from their EECC and run 1 million Monte-Carlo simulations on each of these synthetic networks.These synthetic networks meet the assumption that is made in the theory that the network is fully composed of triangles and single links.Therefore, we can disentangle the impact of finite-size effects from other structural features present in the empirical network which are not within the assumptions of the MTBP theory.In Figure C.1, we show the cascade-size distributions for the powergrid network [37], the largest connected component of the science co-authorship network [24] and the C. elegans network [8] for both the MTBP theory (black solid line), the simulations on the empirical network (purple points) and simulations on a synthetic network for a similar size generated from the Newman-Miller distribution obtained from the EECC on that network (orange points).
We find that the differences between the theory and the simulations on the synthetic network are minimal in comparison to the differences between the theory and the simulations on empirical networks.This confirms that the impact of finite-size effects is minimal for the simulations and that the discrepancies between the distributions from the MTBP theory and the Monte-Carlo simulations on real-world networks are most likely due to the fact that there are differences other than size in the network structure between the empirical network and in the distribution that we use in the MTBP from the EECC.

Figure 4 . 1 :
Figure 4.1: (a) The 6 possible clique motifs in a Newman-Miller network, blue nodes are active, white nodes are inactive and red nodes are removed.(b)A visualisation of the node counting in subtrees for cascade size, we do not count active or removed nodes in the seed motif (dark blue) and only count all other nodes in the subtree (in white).In this figure we assume that the number of generations that we allow the subtree to grow for n → ∞.We let n → ∞ because, for a sub-critical branching process, this assures that the sub-trees will be fully grown; i.e., there will be no further offspring in future generations.

Figure 4 .
Figure 4.1 (a) for the motif types).The pgf for the full cascade size after n generations, Kn (z), is given by

Figure 4 . 2 :
Figure 4.2: (a) The theoretical complimentary cumulative distribution function (CCDF) of cascade size from the MTBP (solid line) and for a simple branching process (dashed line) and 1 million simulations on a network of size 10,000 (blue points).The CCDF for a random variable Y is defined as P (Y > y), the probability that the random variable Y is greater than a specific y.(b) The probability distribution of cascade lifetimes where Ω n is the probability that the cascade survives to generation n.The solid line represents the theoretical distribution and the orange points are the values from simulations.The results shown are for doubly-Poisson Newman-Miller networks with µ = 2 and ν = 2 (average degree 6) and for parameters p 1 = 0.05 and α = 0.2.
) average tree depth = i∈V depth of node i |V | = cumulative depth cascade size , where V is the set of all nodes in the cascade and |V | is the size of the set V ; i.e., the number of nodes in the cascade.To calculate the EATD, we need the joint distribution of cumulative depth and cascade size.We denote by Xn the size of a cascade after n generations and by Ỹn the cumulative depth of a cascade after n generations.The joint pgf of cascade size and cumulative depth is (4.18)Hn (x, y) = E x Xn y Ỹn .

Figure 6 . 1 :
Figure 6.1:(a) An edge-disjoint edge clique cover (EECC) of a network (black nodes and edges) showing the cliques included in the cover, the 3-cliques shaded in blue and 4-clique shaded in pink are included in the EECC, the edges that do not form part of these cliques are included in the EECC as 2-cliques.The dashed edges represent edges in a higher-order clique and the solid edges connect two nodes in a 2-clique.(b) The EECC of the same network when we restrict the maximum clique size to be 3.

Table 5 .
1: Table showing both the EATD and Pearson correlation, ρ, from the MTBP theory and from 1 million Monte-Carlo simulations.