Branching process descriptions of information cascades on Twitter

A detailed analysis of Twitter-based information cascades is performed, and it is demonstrated that branching process hypotheses are approximately satisfied. Using a branching process framework, models of agent-to-agent transmission are compared to conclude that a limited attention model better reproduces the relevant characteristics of the data than the more common independent cascade model. Existing and new analytical results for branching processes are shown to match well to the important statistical characteristics of the empirical information cascades, thus demonstrating the power of branching process descriptions for understanding social information spreading.


Introduction
The transmission of information via online social networks is increasingly ubiquitous. The volume of freely available data offers unprecedented opportunities for data-driven mathematical modelling of human behaviour. Twitter, for example, is a directed social network wherein users "follow" other users in order to receive their broadcast transmissions, called "tweets". All tweets are public, making analysis of Twitter data particularly popular among data scientists. Twitter users may retweet messages they receive from the users they follow, and in this way cascades of information may stem from a single tweet event that we call the "seed". In the search for mathematical models to describe such structures, branching processes [1,2] are an appealing option. As stochastic processes, they can potentially capture the wide variability observed in tweeting patterns and human behaviour, while offering a wealth of theoretical results that can be tested against data from online social networks.
Branching processes have already been applied in several studies of Twitter and other online fora. The recent review by Aragón et al. [3] surveys models of discussion threads, including Twitter reply cascades. Several of the generative models cited in [3] are based on branching processes, but most find it necessary to modify classical branching processes with some novel features in order to match to data. For example, Nishi et al. [4] studied reply cascades in Twitter (as distinct from the retweet cascades that we examine here) that were seeded by celebrities and found that these could not be fitted by classical Galton-Watson processes, so they introduced a modified version of the branching process. On the other hand, Galton-Watson processes (albeit with special seed offspring distributions) were successfully applied to discussion trees from Reddit by Medvedev et al. [5], and time-dependent continuous-time branching processes were fitted to a viral marketing campaign in [6,7]. Golub and Jackson [8] reanalysed data from [9] to show that although standard branching processes did not appear to reproduce the features of the email cascades studied in [9], when selection bias was added (to model the fact that large, viral chains are more likely to be observed than small chains) the biased Galton-Watson process fitted quite well.
Although branching processes have been fitted to data to form the basis for simulation and prediction in several studies, the application of analytical results from branching process theory has been mostly limited to a small selection of features. The most common [3] is the cascade size, i.e., the accumulated number of tweets or replies to a single seeding post, for which the well-known Galton-Watson result for the expected total number of progeny has been used for prediction, e.g., [10]. However, determining the entire distribution of cascade sizes, not just its mean, is quite feasible [11,12], as are analytical (or semi-analytical) methods for calculating the length and depth of cascade trees, as well as other measures. One such measure that we examine in Sec. 4.3 is the "structural virality" of a cascade, as introduced by Goel et al. [13]. Using this and other measures of cascade trees, Goel et al. performed large-scale numerical simulations of a simple transmission model on networks to fit to data from Twitter. The transmission model of [13] is a discrete-time version of the susceptible-infected-recovered disease-spread model, also known as the "independent cascade model" (ICM) [14]. Other network-based simulation models [15,16] use variations of such dynamics (such as susceptible-infected-susceptible disease-spread models [17]) to understand the effects of network structure upon spreading.
In this paper we focus on three aspects of branching process models for Twitter retweet cascades, using a reanalysis of two previously studied datasets [18,19]. First, in Section 2, we extract the empirical offspring distributions from the tree structures and show that these remain approximately stable across a range of generations. This is a necessary condition for classical branching processes to provide accurate models for detailed features of cascade trees, and the simplicity of this result contrasts with models where explicit time-decay of novelty [20,21,22] or generation-dependent branching numbers [23] are required.
Secondly, we consider in Section 3 how the structure of the underlying social network and the modelling of user-to-user transmission mechanisms can affect the offspring distribution for cascade trees. By comparing with the empirical results of Sec. 2, we examine whether the offspring distribution is better modelled by the independent cascade model, or by an alternative model that accounts for limited attention of users of social media [24].
Finally, in Section 4, we use the branching process framework to derive predictions for features of cascades, focusing on both the distribution of metrics of interest across the entire dataset and on analytical results for expected values. For completeness, we first derive the well-known results for cascade sizes and durations and then build on this approach to derive results for the expected value of tree depths and the structural virality measure of Goel et al. [13], and apparent novelty decay factors [20,21]. These results include integral expressions for expected structural virality and tree depth that we have not been able to locate in existing literature, and which are amenable to asymptotic analysis. We conclude the paper with a discussion of the results, limitations, and potential extensions in Section 5.

Data sources
As a motivation and test for branching process hypotheses, we reanalyse two independent Twitter datasets. Both datasets have been previously analysed and described in Refs. [18] and [19,25], but here we use the identified cascade structures to focus on the accuracy of branching process models.
The first dataset, which we call "Marref", consists of tweets related to the 2015 Irish same-sex marriage referendum, collected between May 8 and May 23, 2015. As described in [18], all tweets containing either of the hashtags #marref or #marriageref were gathered and analysed. Our focus in this paper is on the tree structures of retweeting behaviour, and the extent to which these can be accurately modelled by branching processes. A "particle" or "node" in a tree (see Figure 1 for examples) represents a retweet event, which may cause further retweets, i.e., a next generation of particles in the tree, called the "children" particles of the "parent" node. The children of a given node are identified using the tree-reconstruction methodology described in Goel et al. [13], as implemented in [26]. This algorithm aims to identify a single parent node (tweet) for each retweet event, by using the text of the tweet and (in cases of multiple possible parents) the timestamps of the tweeting events. The output of the Goel et al. algorithm is a tree structure corresponding to each cascade of retweets, as described in Sec. 2.2 below. The data collection procedure is restricted to cascades of size greater than one, so that seeds that generate no children are not recorded.
The second dataset, called "URL", consists of tweets containing Uniform Resource Locators (URLs) for internet addresses that were posted on Twitter during October 2010 [27]. The collection and processing of the data is described in detail in [25] and [19]. The URLs chosen for tracking were discovered by sampling tweets gathered using Twitter's Gardenhose API [19] and then searching for all appearances of these URLs. As for the Marref dataset, trees were reconstructed using the algorithm of Goel et al. [13]. We note that there is some selection bias in this data, as URLs with larger cascade sizes were more likely to be selected by the initial sampling than relatively unpopular URLs. Also, as in the Marref data, the URL dataset only contains those cascades whose size (total number of tweets) is greater than one.
In order to ensure that each of the studied trees has had sufficient time to develop all its generations within the finite-length time window of data collection, we select only those trees where the "birth" of the tree (i.e., the timestamp of the seed node) occurs during the first half of the time window for data collection. In the case of the Marref data, this avoids trees that are born during the increased activity around the referendum day and its aftermath. Although the original tweets occur in continuous time, we confine our attention in this paper to the discrete generations of the tree structures. This choice simplifies the analysis to the discrete-time case of Galton-Watson processes, but it means that we do not investigate important aspects such as the time between tweet arrival and retweet, which would require an extension of our approach to continuous-time branching processes [6,7]. The extracted tree structures for both datasets are made available in anonymised form [28].

Tree structures
In this section we analyse the characteristics of the trees extracted from the data as described in Sec. 2.1. For each dataset, we consider an ensemble of $M$ trees, with each tree made up of particles (or nodes) in multiple generations, see Fig. 1. We define $Z_{m,n}$ to be the number of particles in generation $n$ of tree $m$ (where $m = 1, 2, \ldots, M$ and $n = 0, 1, \ldots$). The individual trees have very heterogeneous characteristics (size, number of generations, etc.), so we first consider the ensemble as a whole. Defining $z_n$ as the total number of generation-$n$ particles observed across all trees, i.e.,
$$z_n = \sum_{m=1}^{M} Z_{m,n}, \tag{1}$$
we plot in the top panels of Fig. 2 the dependence of $z_n$ on the generation $n$ using log-linear scales. Figure 2(a) is from the Marref dataset ($M = 7{,}736$) and Fig. 2(b) is from the URL dataset ($M = 39{,}547$). An approximately exponential dependence of $z_n$ upon $n$ is shown by the nearly linear shape of the function on log-linear axes; such a dependence is consistent with a subcritical branching process. Note that small-number fluctuations occur when $z_n$ is relatively small: we choose $10^3$ as a threshold level (shown by the black dashed line in Figs. 2(a) and (b)) and focus on $z_n$ values which are above this threshold. In Fig. 2(c) and Fig. 2(d) (for Marref and URL, respectively) we show the effective branching number [23]
$$\xi_n = \frac{z_{n+1}}{z_n}, \tag{2}$$
which gives the average number of children particles for a particle of the $n$th generation. Observe that $\xi_n$ is approximately constant for the range of generations in which $z_n$ is sufficiently large (i.e., above the threshold marked in Figs. 2(a) and (b)). Figure 2(d) shows that the early generations of the URL dataset exhibit large fluctuations in the $\xi_n$ values, consistent with the possible biasing of the data towards larger trees (see Sec. 2.1). In both cases, the branching number $\xi_0$ of the seed generation appears to be anomalously high; this is partly due to the biasing introduced by the fact that no trees of size less than two are recorded (but see also the discussion leading to Eq. (10) below). The dashed green lines in Fig. 2 highlight the range of generations over which the branching number $\xi_n$ appears to be approximately constant. For each dataset we calculate an average value $\bar{\xi}$ of the branching number over the range shown by the dashed green line. The URL dataset, with a value $\bar{\xi} = 0.90$, has a high virality (recall the critical branching number of 1 separates the regime of subcritical cascades from that of supercritical cascades), while the Marref dataset has lower virality ($\bar{\xi} = 0.46$).

Next, we make a stronger test of the branching process hypothesis, by examining the empirical offspring distribution at each generation. For each particle $i$ in generation $n$ we record the number $Z^{(i)}_{n+1}$ of its offspring particles, i.e., the number of users in generation $n + 1$ that are identified as children of particle $i$. Gathering the ensemble of $Z^{(i)}_{n+1}$ values across all trees, we calculate the empirical offspring distribution of generation $n$ as
$$q_{\ell,n} = \mathrm{Prob}\left(Z^{(i)}_{n+1} = \ell\right), \tag{3}$$
i.e., $q_{\ell,n}$ is the probability that a particle in generation $n$ spawns $\ell$ children particles in generation $n + 1$, and we have used the fact that the maximum-likelihood estimate of the probability of having $\ell$ children is given by the fraction of nodes in the data with $\ell$ children [8].
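The ensemble quantities described above (generation totals, effective branching numbers and per-generation offspring distributions) can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes a hypothetical tree representation in which each tree is a list `parents`, where `parents[i]` is the index of node `i`'s parent, the seed has parent `None`, and every parent appears before its children.

```python
from collections import Counter

def ensemble_statistics(trees):
    """Return (z, xi, q) for an ensemble of trees.

    z[n]    : total number of generation-n particles across all trees
    xi[n]   : z[n+1] / z[n], the effective branching number of generation n
    q[n][l] : fraction of generation-n particles that have exactly l children
    Each tree is a parent-index list (hypothetical format, see lead-in).
    """
    z = Counter()
    offspring = {}
    for parents in trees:
        depth = []
        n_children = [0] * len(parents)
        for p in parents:
            # the seed sits in generation 0; a child is one generation deeper
            depth.append(0 if p is None else depth[p] + 1)
            if p is not None:
                n_children[p] += 1
        for d, c in zip(depth, n_children):
            z[d] += 1
            offspring.setdefault(d, Counter())[c] += 1
    n_max = max(z)
    z_list = [z[n] for n in range(n_max + 1)]
    xi = [z_list[n + 1] / z_list[n] for n in range(n_max)]
    q = {n: {l: count / z_list[n] for l, count in sorted(cnt.items())}
         for n, cnt in offspring.items()}
    return z_list, xi, q
```

For example, `ensemble_statistics([[None, 0, 0, 1, 1], [None, 0]])` pools a five-node tree and a two-node tree; in practice one would restrict attention to generations where `z[n]` exceeds the chosen threshold before interpreting `xi[n]`.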
In Figs. 3(a) and (b) we plot the empirical offspring distributions for several generations. Because of the data collection restriction to cascades of size exceeding one (and also because of the network structure, see Sec. 3 below), the seed generation offspring distribution $q_{\ell,0}$ differs substantially from the other generations. However, for the Marref data set (Fig. 3(a)), observe that the $q_{\ell,n}$ distributions for $n = 1$ through $n = 4$ (which is the range of generations giving $z_n$ values above threshold in Fig. 2(a)) are very similar to each other: the curves in Fig. 3(a) are almost indistinguishable. This collapse of the empirical offspring distributions is consistent with a branching process model in which the offspring distributions are identical for all generations with $n \geq 1$, see Sec. 3.
In the URL data set, the low-generation distributions $q_{\ell,n}$ do not show as clean a collapse as seen in the Marref case, see the inset of Fig. 3(b). However, this may be due to the selection bias in the data collection, which means that small trees (those with fewer generations) are likely to have been omitted from the collected set of trees. Larger trees are more likely to be properly represented in the dataset, and these trees are also likely to consist of a large number of generations. Accordingly, we also plot the $q_{\ell,n}$ curves for $n = 10, 15, 20, 25, 30, 35, 40$ in Fig. 3(b) (note the range of generations chosen matches the green dashed line in Fig. 2(b) and (d)), and we observe a good collapse of these distributions, which is again consistent with a branching process model.
From the evidence of Figs. 2, 3(a) and 3(b), we conclude that a branching process model may give a good approximation to the heterogeneous cascades represented by the trees extracted from the data sets. In the next section we will derive a mathematical model that explicitly links the network structure and various hypotheses on the information-spreading mechanism to predict offspring distributions, which we then compare with the empirical results of Fig. 3(a) and 3(b).

Modelling information spread as a branching process
We consider a directed network whose structure is minimally described (in the configuration-model sense [29]) by the joint distribution $p_{jk}$ of nodes' in-degree $j$ and out-degree $k$: in other words, $p_{jk}$ is the probability that a randomly chosen node has $j$ friends and $k$ followers$^1$. We also model the dynamics of information spreading at the level of $(j, k)$ classes, defining the vulnerability $v_{jk}$ as the probability that a $(j, k)$-class node will retweet a message that it has received from one of its $j$ friends [30].
Consider a message that is tweeted by a node to its followers. Under the configuration-model assumption, the probability that a follower is in the $(j, k)$ class is given by $\frac{j}{\langle j \rangle} p_{jk}$, where $\langle j \rangle = \sum_{j,k} j\, p_{jk}$ is the mean in-degree (mean number of friends) over the network. This follower will retweet the message if he is vulnerable, which occurs with probability $v_{jk}$, and in doing so, he will expose all $k$ of his followers to the message. Thus, the probability that a randomly chosen follower will retweet a message he receives is given by
$$\rho = \sum_{j,k} \frac{j}{\langle j \rangle}\, p_{jk}\, v_{jk}. \tag{4}$$
If we know that a follower has retweeted the message (i.e., if we condition on retweeting) then the probability that he is in the $(j, k)$ class is
$$\frac{j}{\langle j \rangle \rho}\, p_{jk}\, v_{jk}. \tag{5}$$
In particular, the probability that a retweeter has $k$ followers is given by summing over all possible $j$ values:
$$\sum_{j} \frac{j}{\langle j \rangle \rho}\, p_{jk}\, v_{jk}. \tag{6}$$
Assuming each of the $k$ followers to be independently vulnerable with probability $\rho$, the number $\ell$ of followers who themselves retweet has the binomial distribution
$$\binom{k}{\ell} \rho^{\ell} (1 - \rho)^{k - \ell}. \tag{7}$$
Combining these probabilities, we have derived the offspring distribution $q_\ell$, which gives the probability that a retweeting by a node will lead to $\ell$ further retweets by followers of that node, as
$$q_\ell = \sum_{j,k} \frac{j}{\langle j \rangle \rho}\, p_{jk}\, v_{jk} \binom{k}{\ell} \rho^{\ell} (1 - \rho)^{k - \ell}. \tag{8}$$
The corresponding pgf for the offspring distribution, $f(x) = \sum_\ell q_\ell x^\ell$, is
$$f(x) = \sum_{j,k} \frac{j}{\langle j \rangle \rho}\, p_{jk}\, v_{jk}\, (1 - \rho + \rho x)^{k}. \tag{9}$$
In the derivation of Eq. (9), we began by considering a node that receives the message from one of its friends. However, the initial source (or seed) of the cascade has a different dynamic, meaning that the seed generation of the branching process has an offspring distribution different from Eq. (8). We assume that the seed node for a cascade is chosen uniformly at random from all the nodes. This means that the seed node is in the $(j, k)$ class with probability $p_{jk}$. As above, the number $\ell$ of its $k$ followers who will retweet the message is given by Eq. (7), and so the seed-generation offspring distribution is$^2$
$$\tilde{q}_\ell = \sum_{j,k} p_{jk} \binom{k}{\ell} \rho^{\ell} (1 - \rho)^{k - \ell}, \tag{10}$$
with corresponding pgf
$$\tilde{f}(x) = \sum_{j,k} p_{jk}\, (1 - \rho + \rho x)^{k}. \tag{11}$$
Having defined how the offspring distribution of the branching process is determined by the network structure ($p_{jk}$) and the dynamics (via the vulnerability $v_{jk}$), we next examine two possible models for contagion dynamics.
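As a quick consistency check on the derivation above, the mean number of offspring (the branching number) follows by differentiating each pgf at $x = 1$. The sketch below assumes the binomial-mixture form of the later-generation and seed pgfs implied by the preceding steps; the symbols $\xi$ and $\tilde{\xi}$, the tilde marking seed quantities, and $\langle k \rangle = \sum_{j,k} k\, p_{jk}$ are notation chosen here for illustration.

```latex
\begin{align*}
f(x) &= \sum_{j,k} \frac{j}{\langle j\rangle \rho}\, p_{jk}\, v_{jk}\, (1-\rho+\rho x)^{k}
  &&\Rightarrow\quad
  \xi = f'(1) = \sum_{j,k} \frac{j}{\langle j\rangle}\, p_{jk}\, v_{jk}\, k, \\
\tilde{f}(x) &= \sum_{j,k} p_{jk}\, (1-\rho+\rho x)^{k}
  &&\Rightarrow\quad
  \tilde{\xi} = \tilde{f}'(1) = \rho \sum_{j,k} k\, p_{jk} = \rho\,\langle k\rangle .
\end{align*}
```

Under these assumptions the two means generally differ: with a constant vulnerability $v_{jk} = C$, for instance, $\xi = C \langle jk \rangle / \langle j \rangle$ while $\tilde{\xi} = C \langle k \rangle$, and the two agree only when $\langle jk \rangle = \langle j \rangle \langle k \rangle$.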

The independent cascade model
In the independent cascade model (ICM) [14] each "infected" node (i.e., a node that tweets or retweets the message of interest) gets one attempt to infect each of its out-neighbours; the infection attempt is successful (meaning that the follower also retweets the message) with probability $C$, where $C$ is the single parameter of the model. In our modelling framework, this implies that the ICM vulnerability of every node is equal to $C$, regardless of the node's $(j, k)$ class:
$$v_{jk} = C. \tag{12}$$
Note that in this case, the retweet probability $\rho$ is determined from Eq. (4) to be $\rho = C$. Moreover, in the special case of uncorrelated in- and out-degrees (i.e., if the number of friends $j$ and the number of followers $k$ of a node are uncorrelated), the joint distribution $p_{jk}$ factorises into the product $p^{\mathrm{in}}_j p^{\mathrm{out}}_k$ and the offspring distributions $q_\ell$ and $\tilde{q}_\ell$ are identical$^3$. However, in the more realistic case where the in- and out-degrees of nodes (the numbers of friends and followers of users) are correlated (see, for example, Fig. 2 of [31]), the offspring distribution of the seed generation differs from the offspring distributions of subsequent generations.

The limited attention model
A number of researchers have pointed out that the limitations of human cognition impose an effective limit on how much information can be absorbed and shared by an individual. For a user on Twitter, having a larger number of friends $j$ leads to a faster influx of information into the user's stream, with a consequent dividing of attention among the many tweets. Empirical analyses [19,24] and models of information-sharing dynamics [11,12,32] both indicate that the probability that a user retweets a particular piece of information she has received can be modelled as being approximately inversely proportional to the number $j$ of her friends. In our notation, the vulnerability $v_{jk}$ of a $(j, k)$-class user in the limited attention model (LAM) is inversely proportional to $j$:
$$v_{jk} = \frac{B}{j}, \tag{13}$$
where $B$ is a parameter of the model, and we assume no nodes have $j = 0$. In the LAM, the probability $\rho$ of a random follower retweeting is given in terms of $B$ by Eq. (4):
$$\rho = \frac{B}{\langle j \rangle}. \tag{14}$$
Interestingly, under the assumption that the network has no nodes with $j = 0$, the LAM offspring distributions for the seed generation and for later generations are identical, even if the in- and out-degrees of nodes are correlated (unlike the ICM):
$$q_\ell = \tilde{q}_\ell = \sum_{j,k} p_{jk} \binom{k}{\ell} \rho^{\ell} (1 - \rho)^{k - \ell}. \tag{15}$$
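The cancellation behind this identity can be spelled out in two lines. The following is a sketch under the assumptions stated in this section: a follower is in the $(j, k)$ class with probability $j\, p_{jk} / \langle j \rangle$, the LAM vulnerability is $v_{jk} = B/j$, and no node has $j = 0$.

```latex
\begin{align*}
\rho &= \sum_{j,k} \frac{j}{\langle j\rangle}\, p_{jk}\, \frac{B}{j}
      = \frac{B}{\langle j\rangle} \sum_{j,k} p_{jk}
      = \frac{B}{\langle j\rangle}, \\
\mathrm{Prob}\big((j,k)\text{ class} \,\big|\, \text{retweet}\big)
     &= \frac{1}{\rho}\, \frac{j}{\langle j\rangle}\, p_{jk}\, \frac{B}{j}
      = \frac{\langle j\rangle}{B} \cdot \frac{B}{\langle j\rangle}\, p_{jk}
      = p_{jk} .
\end{align*}
```

In words, the factor $j$ that makes high-in-degree nodes more likely to be reached is exactly cancelled by the factor $1/j$ in their vulnerability, so a retweeter is distributed over $(j, k)$ classes in the same way as a uniformly chosen seed, whatever the correlation between $j$ and $k$.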

Comparing ICM and LAM with empirical offspring distributions
Using the empirical network structure for the Marref and URL datasets, specifically the in-degree $j_i$ and out-degree $k_i$ of each node $i$ in the network, we construct the offspring distribution predicted by the independent cascade model and by the limited attention model, using Eqs. (12) and (13), respectively, in Eqs. (4), (8) and (10). In each case, we fit the parameters $C$ and $B$ by matching the branching number to the average value $\bar{\xi}$ calculated in Sec. 2.2. The sums over $j$ and $k$ are replaced by sums over the $N$ nodes: Equation (4), for example, becomes
$$\rho = \frac{1}{N} \sum_{i=1}^{N} \frac{j_i}{\langle j \rangle}\, v_{j_i k_i}, \tag{16}$$
where $\langle j \rangle$ is the sample mean of the in-degrees: $\langle j \rangle = \frac{1}{N} \sum_{i=1}^{N} j_i$. (In effect, we replace $p_{jk}$ by $1/N$ and replace sums over $j$ and $k$ by a sum over all nodes.) The black (for ICM) and magenta (for LAM) curves in Fig. 3 show how these predictions compare with the empirical offspring distribution. Evidently, the LAM predictions are closer to the empirical offspring distributions than the ICM predictions, at least for the relatively low values of $\ell$ in Figs. 3(a) and (b). To examine the empirical offspring distributions at higher values of $\ell$ we reduce the low-number fluctuations by averaging the distributions over the generations marked with the green line in Fig. 2(a) and (b), i.e., those generations for which the effective branching number is approximately constant. This averaged offspring distribution is shown by the blue symbols in Figs. 3(c) and (d): note we plot $\ell + 1$ on the horizontal axis in order to make the $\ell = 0$ case visible on the logarithmic scale.
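The node-sum construction described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function names are hypothetical, the per-node binomial mixture follows the derivation of Sec. 3, and the closed-form choices of $C$ and $B$ come from equating the mean of the later-generation offspring distribution to a target branching number `xi_bar` (a step worked out here, not quoted from the text).

```python
import math
import numpy as np

def binomial_pmf(l, k, rho):
    """P(l successes in k trials), computed in log space so large k is safe."""
    if l > k:
        return 0.0
    log_p = (math.lgamma(k + 1) - math.lgamma(l + 1) - math.lgamma(k - l + 1)
             + l * math.log(rho) + (k - l) * math.log1p(-rho))
    return math.exp(log_p)

def offspring_distributions(j, k, v, l_max):
    """Return (q, q_seed, rho) from per-node degrees and vulnerabilities.

    j, k : in-degree (friends) and out-degree (followers) of each node
    v    : vulnerability of each node; the result must satisfy 0 < rho < 1
    """
    j = np.asarray(j, dtype=float)
    v = np.asarray(v, dtype=float)
    weights = j * v / j.mean()          # (j_i / <j>) * v_i for each node
    rho = weights.mean()                # node-sum form of the retweet probability
    q = np.zeros(l_max + 1)
    q_seed = np.zeros(l_max + 1)
    for w, k_i in zip(weights, k):
        pmf = np.array([binomial_pmf(l, int(k_i), rho) for l in range(l_max + 1)])
        q += (w / rho) * pmf            # retweeters are weighted by j_i * v_i
        q_seed += pmf                   # seeds are chosen uniformly at random
    n_nodes = len(j)
    return q / n_nodes, q_seed / n_nodes, rho

def icm_vulnerability(j, k, xi_bar):
    """Constant vulnerability C with C * <jk> / <j> = xi_bar."""
    j = np.asarray(j, dtype=float)
    k = np.asarray(k, dtype=float)
    C = xi_bar * j.mean() / (j * k).mean()
    return np.full(len(j), C)

def lam_vulnerability(j, k, xi_bar):
    """Vulnerability B / j_i with B * <k> / <j> = xi_bar (requires all j_i > 0)."""
    j = np.asarray(j, dtype=float)
    k = np.asarray(k, dtype=float)
    B = xi_bar * j.mean() / k.mean()
    return B / j
```

With real degree sequences the maximum out-degree can be very large, so `l_max` truncates the distribution at the range of interest; the truncated arrays then sum to slightly less than one.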
Noting the near-linear decay of the offspring distribution on the log-log plot, we fit the empirical averaged offspring distribution with a truncated power law, Eq. (17). This distribution is chosen for its good fit and analytical convenience$^4$; calculations with this distribution can be more easily reproduced than those using the full ICM or LAM distributions, which require knowledge of the full set of node degrees $(j_i, k_i)$. To fit the parameters $\beta$ and $\theta$ in Eq. (17), we match the first and second moments of the distribution with the corresponding moments of the averaged empirical distribution. The fitted parameters are given in Table 1, and the red curves in Figs. 3(c) and 3(d) show that the fitted offspring distribution is reasonably close to the empirical distribution. A similar procedure is used to fit a seed generation offspring distribution $\tilde{q}_\ell$, using the form of Eq. (17) with parameters $\beta$ and $\theta$ replaced by $\beta_0$ and $\theta_0$, and with the domain restricted to $\ell \geq 1$ because every seed node in the empirical dataset has at least one child$^5$.
To summarize this Section: we have derived a general formulation for the offspring distribution that results from cascades on a network with a given distribution $p_{jk}$ of in- and out-degrees. We used the vulnerability $v_{jk}$ to describe different models of information transmission, focussing on comparing the ICM with the LAM. In Fig. 3 we see that there are observable differences between the offspring distributions predicted by the two models, with the LAM case generally closer to the empirical observations. Finally, we fitted a standard distribution (Eq. (17)) to the empirical distribution to make our results in the next section more tractable and readily reproducible. Note, however, that in principle the data on the structure of the network (e.g., the $p_{jk}$ distribution) and the assumed vulnerability $v_{jk}$ suffice to determine the offspring distribution, and this opens the possibility of examining further hypotheses on the dependence of information spreading on the nodes' in- and out-degrees [33].

[Footnote 4: pgf of the distribution in Eq. (17), where $\mathrm{Li}_\beta$ is the polylogarithm function of order $\beta$. Footnote 5: pgf for the seed generation. Table 1: parameters of Eq. (17), fitted to the first and second moment of the averaged empirical distributions.]
It is also worth noting that the network structure, through the correlations in the $p_{jk}$ distribution, strongly affects the offspring distribution (and hence, as we show in the next Section, the predictions of the cascade structure); this point has recently been recognised by Ma et al. [31]. We point out that the $p_{jk}$ distribution of the network should therefore be included, when possible, in analysis of information spreading. This is not current practice: in Refs. [12,13], for example, large-scale simulations are performed on synthetic networks with specified out-degree distributions but without considering the correlation structure between the in- and out-degrees of nodes.

Application of branching process model
In this Section we focus on analytical predictions of branching process theory that can be compared to statistical features of the two datasets. We begin with a discrete-time branching process where, as in Sec. 3, the number of offspring of the seed particle is distributed according to the pgf $\tilde{f}(x) = \sum_{\ell=0}^{\infty} \tilde{q}_\ell x^\ell$ while all later generations of the tree have offspring numbers generated by $f(x) = \sum_{\ell=0}^{\infty} q_\ell x^\ell$. We consider the seed of the tree to be generation 0, and we are interested in various properties of the trees as observed a number of generations later. In Sections 4.1 and 4.2 we use a slightly unusual approach to derive known results on the distribution of cascade durations and sizes. We then extend this methodology to the calculation of other metrics in Secs. 4.3 and 4.4.

Number of particles in generation n; distribution of cascade lifetimes
As a first example, we define the random (non-negative integer) variable $\tilde{Z}_n$ to represent the number of particles in generation $n$ of the tree (the small nodes in Fig. 4). As schematically represented in Fig. 4, these particles are the descendants of the generation-0 seed node, observed $n$ generations after the seed. They can also be considered as the sum of the particles contained in all the subtrees that are seeded at generation 1 and which are observed $n - 1$ generations later. Conditioning on the number $k$ of particles in generation 1, we define $Z^{(i)}_{n-1}$ to be the number of particles in the subtree that is seeded by the $i$th particle in generation 1, as observed $n - 1$ generations after the subtree is born (i.e., at generation $n$ of the parent tree). Since all the subtrees are i.i.d., each of the $k$ random variables $Z^{(i)}_{n-1}$ is an independent copy of $Z_{n-1}$.

Figure 4: Schematic of a tree generated by a seed particle; note the number of children of the seed particle is generated by $\tilde{f}$.

We define the pgf $\tilde{F}_n$ for the random variable $\tilde{Z}_n$ as
$$\tilde{F}_n(s) = E\left[s^{\tilde{Z}_n}\right], \tag{18}$$
where $E$ denotes expectation over the ensemble of trees and $s$ is a dummy variable. If we condition on the number $k$ of particles in generation 1, we can write $\tilde{Z}_n$ as the sum of the $k$ subtree variables $Z^{(i)}_{n-1}$ (the superscript $i$ denotes the $i$th i.i.d. copy):
$$\tilde{Z}_n = \sum_{i=1}^{k} Z^{(i)}_{n-1}, \tag{19}$$
and so
$$E\left[s^{\tilde{Z}_n} \,\middle|\, k\right] = E\left[s^{\sum_{i=1}^{k} Z^{(i)}_{n-1}}\right] = \left(E\left[s^{Z_{n-1}}\right]\right)^{k}, \tag{20}$$
where we have used the independence of the subtrees and the i.i.d. nature of the $Z^{(i)}_{n-1}$ variables. Writing $F_{n-1}(s)$ for the pgf $E\left[s^{Z_{n-1}}\right]$ and summing over all possible values of $k$ (recall that $\tilde{q}_k$ is the probability that there are $k$ children of the seed particle, i.e., $k$ particles in generation 1) yields
$$\tilde{F}_n(s) = \sum_{k=0}^{\infty} \tilde{q}_k \left[F_{n-1}(s)\right]^{k} = \tilde{f}\left(F_{n-1}(s)\right). \tag{21}$$

Figure 5: Schematic of a subtree generated from a particle that is not a seed; note the number of children of the particle is generated by $f$.

This equation relates the pgf for $\tilde{Z}_n$ to the pgf for the subtree quantities $Z_{n-1}$. The next step is to derive an equation that recursively links $Z_n$ (the number of particles in a subtree $n$ generations after its birth) to $Z_{n-1}$. Figure 5 is a schematic view of this relationship. The main subtree in Fig. 5 is born with the first particle shown (left of the Figure) and we condition on the number $k$ of children particles of this first particle; recall that $k$ is a random variable with pgf $f(x)$. The number $Z_m$ of particles in the main subtree after $m$ generations is equal to the sum of the $k$ i.i.d. variables $Z^{(i)}_{m-1}$:
$$Z_m = \sum_{i=1}^{k} Z^{(i)}_{m-1}, \tag{22}$$
and so
$$E\left[s^{Z_m} \,\middle|\, k\right] = \left[F_{m-1}(s)\right]^{k}, \tag{23}$$
as in the steps leading to Eq. (21). Summing over the possible values of $k$ then yields
$$F_m(s) = f\left(F_{m-1}(s)\right). \tag{24}$$
Equation (24) gives a recursion relation for the pgf $F_m(s)$, starting from the initial condition $F_0(s) = s$, corresponding to the tree being seeded from a single particle. Using the result of the recursion Eq. (24) in Eq. (21) then gives the pgf for the number of nodes in generation $n$ of the tree. This characterization of the branching process is called the backward approach in [34], in analogy with the backward Chapman-Kolmogorov equation of Markov processes. An alternative forward approach (wherein the state of particles in generation $n + 1$ is predicted from the state of the process after $n$ generations) is often used to derive Eq. (24), but we will find the backward approach easily generalizable to other quantities of interest.

Figure 6: [...] Eq. (25), using the offspring distribution of Eq. (17).
The probability that the tree is terminated at or before generation $n$ is equal to the probability of the tree having zero nodes in generation $n$, which is $\tilde{F}_n(0)$. The probability that the tree terminates precisely at generation $n$ (i.e., that there are a nonzero number of particles in generation $n$ but each of these has zero offspring) is therefore
$$\Omega_n = \tilde{F}_{n+1}(0) - \tilde{F}_n(0), \tag{25}$$
where $\tilde{F}_n(0)$ is calculated by iteration from Eqs. (21) and (24) and the initial condition $F_0(0) = 0$. We call $\Omega_n$ the lifetime distribution of trees, as it gives the probability that the observed lifetime of a tree is $n$ generations. See Figure 6 for a comparison of the empirical lifetime distribution with the predictions of Eq. (25), using the offspring distribution fitted in Eq. (17).
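The iteration described above is short enough to sketch directly. The code below is an illustration under stated assumptions: offspring distributions are passed as finite arrays of probabilities (`q_seed` for the seed, `q` for all later generations), the function names are hypothetical, and the lifetime probability is taken to be the difference between the extinction probabilities at generations $n + 1$ and $n$, as the text describes.

```python
import numpy as np

def pgf(q, s):
    """Evaluate the pgf sum_l q[l] * s**l at the point s."""
    return np.polynomial.polynomial.polyval(s, q)

def lifetime_distribution(q_seed, q, n_max):
    """Return Omega_n for n = 0, ..., n_max.

    Omega_n is the probability that generation n is non-empty while
    generation n + 1 is empty.
    """
    extinct = [0.0]   # generation 0 always contains the seed
    F = 0.0           # subtree pgf at s = 0, initial condition F_0(0) = 0
    for _ in range(n_max + 1):
        extinct.append(pgf(q_seed, F))  # seed step: seed pgf applied to F_{n-1}(0)
        F = pgf(q, F)                   # subtree recursion F_n(0) = f(F_{n-1}(0))
    extinct = np.array(extinct, dtype=float)
    return extinct[1:] - extinct[:-1]
```

As a sanity check, if every particle has one child with probability 1/2 and none otherwise, `lifetime_distribution([0.5, 0.5], [0.5, 0.5], 5)` gives the geometric sequence 1/2, 1/4, 1/8, and so on.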

Cascade size
A similar approach can be applied to calculate the distribution of tree (cascade) sizes, i.e., the total number of particles that are in all generations of the tree, from the seed at generation 0 up to the last generation of the tree (this quantity is sometimes called the total progeny of the tree). We define the random variable $\tilde{X}_n$ to be the size of the tree observed $n$ generations after its seed particle is born. As before, $\tilde{X}_n$ can be decomposed into the sum of contributions from each of the subtrees born in generation 1. Conditioning on the seed node having $k$ children particles, we write
$$\tilde{X}_n = 1 + \sum_{i=1}^{k} X^{(i)}_{n-1}, \tag{26}$$
where $X^{(i)}_{n-1}$ represents the $i$th i.i.d. subtree size as observed after $n - 1$ generations, and the first term counts the seed node of the tree (see Fig. 4). Using identical arguments to those leading to Equations (20) and (21), we obtain
$$E\left[x^{\tilde{X}_n} \,\middle|\, k\right] = x \left(E\left[x^{X_{n-1}}\right]\right)^{k}, \tag{27}$$
and the pgf $\tilde{G}_n(x) = E\left[x^{\tilde{X}_n}\right] = \sum_{j=0}^{\infty} \mathrm{Prob}\left(\tilde{X}_n = j\right) x^{j}$ is then given by
$$\tilde{G}_n(x) = x\, \tilde{f}\left(G_{n-1}(x)\right), \tag{28}$$
where $G_{n-1}(x) = E\left[x^{X_{n-1}}\right]$ is the pgf for the size of a subtree after $n - 1$ generations. Referring now to Fig. 5, the recursion relation for the subtree sizes is derived by first assuming $k$ children in the first generation of the subtree:
$$X_m = 1 + \sum_{i=1}^{k} X^{(i)}_{m-1}, \tag{29}$$
and then proceeding as in Equations (24) and (28) to obtain the recursion relation
$$G_m(x) = x\, f\left(G_{m-1}(x)\right), \tag{30}$$
with initial condition $G_0(x) = x$. By iterating Eq. (30) for $m = 1, 2, \ldots, n - 1$ and then substituting $G_{n-1}(x)$ into Eq. (28), we obtain the desired pgf $\tilde{G}_n(x)$ describing the distribution of cascade sizes after $n$ generations. In order to invert the pgf to obtain the distribution of cascade sizes, we iterate Eqs. (30) and (28) for a set of $x$ values that are uniformly spaced around the unit circle in the complex $x$-plane, and use a fast Fourier transform to approximate the Cauchy integral as in section S2 of [11].
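The numerical inversion described above can be sketched as follows. This is a minimal illustration rather than the procedure of [11]: it assumes the subtree recursion takes the form "x times the offspring pgf of the previous iterate", started from $G_0(x) = x$, with the seed pgf applied in the final step, and it assumes offspring distributions given as finite probability arrays. The function name and the `n_points` parameter are hypothetical.

```python
import numpy as np

def cascade_size_distribution(q_seed, q, n_gen, n_points=1024):
    """Approximate P(cascade size = j) for j = 0, ..., n_points - 1.

    q_seed, q : offspring probabilities for the seed and for later generations
    n_gen     : number of generations after the seed (must be >= 1)
    n_points  : number of sample points on the unit circle; probability mass at
                sizes >= n_points is aliased onto smaller sizes
    """
    polyval = np.polynomial.polynomial.polyval
    # points uniformly spaced around the unit circle in the complex x-plane
    x = np.exp(2j * np.pi * np.arange(n_points) / n_points)
    G = x.copy()                          # subtree size pgf after 0 generations
    for _ in range(n_gen - 1):
        G = x * polyval(G, q)             # one more generation of every subtree
    G_seed = x * polyval(G, q_seed)       # the seed uses its own offspring pgf
    # the discrete Fourier transform of the sampled pgf recovers its coefficients
    return (np.fft.fft(G_seed) / n_points).real
```

Choosing `n_points` well above the largest cascade size of interest keeps the aliasing error small; for heavy-tailed offspring distributions near criticality this may require a large grid.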
Figure 7 shows the large-n limit of the cascade size distribution, and compares it with the empirical distribution.The good agreement between this theoretical prediction and the empirical results gives further support to the usage of branching process descriptions for such data.

Average tree depth and structural virality
In this subsection, we build on the approach used in Sec. 4.1 to derive results for measures of the shape of cascade trees, which are of considerable interest in analyses of Twitter [13,18]. We focus on the distribution (and expected value) of two quantities [13]: the average depth of a tree, and the structural virality of a tree.

Average tree depth
To calculate the average depth of a sample tree, we first sum the depths (generation numbers) of all particles in the tree to obtain the cumulative depth of the tree, and then divide this by the size of the tree (the total number of particles in the tree), see Fig. 8. In this subsection we generalize the methods used in Sec. 4.2 to calculate the joint distribution of tree size and cumulative depth, and hence to find a formula for the expected average tree depth (EATD). In the ensemble of trees generated by the branching process, each tree has its own average depth, and the EATD is the mean of the average depths over all trees in the ensemble. We believe the formula we derive (Eq. (44)) is novel.

Figure 8: This is a tree of size 5. Each of the 5 particles is labelled by its depth (its generation number). The cumulative depth of the tree is 0+1+1+2+2=6, and so the average depth of this tree is 6/5. Note that the cumulative depth of the top subtree is 0+1+1=2, and that each node in the subtree has a depth in the main tree that is one larger than its depth within the subtree: see Eq. (33).
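We extend the approach of Sec. 4.2 to consider the joint distribution of $\tilde{X}_n$ (the tree size after $n$ generations) and of $\tilde{Y}_n$, which is the random variable giving the cumulative depth of the tree after $n$ generations. We define the two-variable pgf $\tilde{H}_n(x, y)$ as
$$\tilde{H}_n(x, y) = E\left[x^{\tilde{X}_n} y^{\tilde{Y}_n}\right] = \sum_{j=0}^{\infty} \sum_{k=0}^{\infty} \mathrm{Prob}\left(\tilde{X}_n = j, \tilde{Y}_n = k\right) x^j y^k. \tag{32}$$
As in earlier sections, we relate the variables $\tilde{X}_n$ and $\tilde{Y}_n$ to subtree quantities, and begin by assuming that the seed node (in Fig. 4 for example) has $k$ children. Each of the $k$ children generates a subtree with (after $n-1$ generations of the subtree) respective random-variable pairs $\left(X^{(i)}_{n-1}, Y^{(i)}_{n-1}\right)$ for $i = 1, 2, \ldots, k$.
The relationship between $\tilde{X}_n$ and the $X^{(i)}_{n-1}$ is given by Eq. (26), but we must now also find an analogous expression for $\tilde{Y}_n$. We define $Y^{(i)}_{n-1}$ to be the cumulative depth of the $i$th i.i.d. subtree. Notice (see Fig. 8) that when we add the $Y^{(i)}_{n-1}$ values for all the subtrees, each node of a subtree has a depth that is one less than its depth in the main tree. Therefore, the $i$th subtree contributes to $\tilde{Y}_n$ a total of $Y^{(i)}_{n-1} + X^{(i)}_{n-1}$, where the second term adds one for each node in the subtree. From this relationship, we obtain
$$\tilde{Y}_n = \sum_{i=1}^{k} \left( Y^{(i)}_{n-1} + X^{(i)}_{n-1} \right), \tag{33}$$
and with Eq. (26) we find the pgf relations
$$\tilde{H}_n(x, y) = x\, \tilde{f}\left( E\left[ (xy)^{X_{n-1}} y^{Y_{n-1}} \right] \right) = x\, \tilde{f}\left( H_{n-1}(xy, y) \right), \tag{34}$$
where $H_{n-1}(x, y) = E\left[x^{X_{n-1}} y^{Y_{n-1}}\right]$ is the joint pgf for a subtree after $n-1$ generations. Addressing the recursion relation for the subtrees in a similar fashion leads (as in Sec. 4.2) to
$$H_m(x, y) = x\, f\left( H_{m-1}(xy, y) \right), \tag{35}$$
with initial condition $H_0(x, y) = x$ (since a single particle is a tree of size 1, with zero depth). Iterating Eq. (35) for $m = 1, 2, \ldots, n-1$ and substituting into Eq. (34) yields the pgf $\tilde{H}_n(x, y)$ for the joint distribution of tree size and cumulative depth after $n$ generations.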
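We can use this joint distribution to calculate the EATD for trees of $n$ generations as
$$E\left[\frac{\tilde{Y}_n}{\tilde{X}_n}\right] = \int_0^1 \frac{1}{x} \left.\frac{\partial \tilde{H}_n}{\partial y}\right|_{y=1} dx, \tag{36}$$
as can be verified by term-by-term differentiation and integration of the series in Eq. (32). Taking the $n \to \infty$ limit in order to include all trees, Eq. (35) gives a self-consistent equation for $H_\infty(x, y)$, and it can be differentiated with respect to $x$ to yield
$$\left.\frac{\partial H}{\partial x}\right|_{y=1} = \frac{f(H)}{1 - x f'(H)}, \tag{37}$$
where, for simplicity, we drop the subscript from $H_\infty$ for the remainder of this section, and $H$ on the right-hand side denotes $H(x, 1)$. Similarly, differentiation of Eq. (35) with respect to $y$ gives
$$\left.\frac{\partial H}{\partial y}\right|_{y=1} = x f'(H) \left[ x \frac{\partial H}{\partial x} + \frac{\partial H}{\partial y} \right]_{y=1}, \tag{38}$$
which can be solved for $\left.\frac{\partial H}{\partial y}\right|_{y=1}$, after substituting for $\left.\frac{\partial H}{\partial x}\right|_{y=1}$ from Eq. (37):
$$\left.\frac{\partial H}{\partial y}\right|_{y=1} = \frac{x^2 f(H) f'(H)}{\left(1 - x f'(H)\right)^2}. \tag{39}$$
Differentiating the $n \to \infty$ limit of Eq. (34) with respect to $y$ yields
$$\left.\frac{\partial \tilde{H}}{\partial y}\right|_{y=1} = x \tilde{f}'(H) \left[ x \frac{\partial H}{\partial x} + \frac{\partial H}{\partial y} \right]_{y=1}, \tag{40}$$
and substituting from Eqs. (37) and (39) gives
$$\left.\frac{\partial \tilde{H}}{\partial y}\right|_{y=1} = \frac{x^2 \tilde{f}'(H) f(H)}{\left(1 - x f'(H)\right)^2}. \tag{41}$$
Thus, the expected average tree depth over all trees is given by Eq. (36) as
$$\mathrm{EATD} = \int_0^1 \frac{x\, \tilde{f}'\left(H(x,1)\right) f\left(H(x,1)\right)}{\left(1 - x f'\left(H(x,1)\right)\right)^2}\, dx. \tag{42}$$
Noting that Eq. (35) relates $H(x, 1)$ to $x$ through the implicit relation
$$H(x, 1) = x f\left(H(x, 1)\right), \tag{43}$$
we make the change of integration variable $x \to h$ defined implicitly by $x = h/f(h)$ (with $dx = \left(f(h) - h f'(h)\right)/f^2(h)\, dh$; the limits are unchanged because $H(0,1) = 0$ and, for a subcritical process, $H(1,1) = 1$), to yield a simple integral formula for the EATD:
$$\mathrm{EATD} = \int_0^1 \frac{h\, \tilde{f}'(h)}{f(h) - h f'(h)}\, dh. \tag{44}$$
This remarkably simple formula is easily evaluated once the offspring pgfs $f$ and $\tilde{f}$ of the branching process are given. In Table 2 we show that it agrees with Monte Carlo simulations and also gives a reasonably accurate estimate of the values found from the empirical data.

Figure 9: Ccdfs of average tree depth (top panels) and structural virality (bottom panels) for Marref (left) and URL (right). Blue symbols are empirical distributions; red symbols are from Monte Carlo simulations of branching processes with offspring distribution $q_\ell$ given by Eq. (17), and by the corresponding distribution $\tilde{q}_\ell$ for the seed's offspring.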
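As a sanity check on the steps above, the following sketch evaluates the EATD integral numerically and compares it with a direct simulation. It rests on stated assumptions rather than on the paper's fitted model: the integral is taken in the form $\int_0^1 h \tilde{f}'(h) / \left(f(h) - h f'(h)\right) dh$ that the change of variable $x = h/f(h)$ produces, the seed and non-seed offspring distributions are both Poisson with branching number 0.5, and the function names are hypothetical. For this Poisson case the integral reduces to $-1 - \ln(1-\xi)/\xi$, which is used as a reference value.

```python
import numpy as np

def eatd_integral(f, f_prime, f_seed_prime, n_points=200_000):
    """Midpoint-rule estimate of int_0^1 h f_seed'(h) / (f(h) - h f'(h)) dh."""
    h = (np.arange(n_points) + 0.5) / n_points
    integrand = h * f_seed_prime(h) / (f(h) - h * f_prime(h))
    return integrand.mean()

def eatd_monte_carlo(xi, n_trees=200_000, seed=0):
    """Mean of (cumulative depth / size) over simulated Poisson(xi) trees."""
    rng = np.random.default_rng(seed)
    z = np.ones(n_trees, dtype=np.int64)       # particles in current generation
    size = z.copy()                            # running tree sizes
    cum_depth = np.zeros(n_trees, dtype=np.int64)
    depth = 0
    while z.any():
        depth += 1
        # The children of z parents are a sum of z i.i.d. Poisson(xi) counts,
        # which is itself Poisson with mean xi * z.
        z = rng.poisson(xi * z)
        size += z
        cum_depth += depth * z
    return (cum_depth / size).mean()

xi = 0.5  # illustrative subcritical branching number (an assumption)

def f(h):
    return np.exp(xi * (h - 1.0))              # Poisson pgf (assumed example)

def f_prime(h):
    return xi * np.exp(xi * (h - 1.0))

eatd_quad = eatd_integral(f, f_prime, f_prime)   # seed pgf taken equal to f
eatd_exact = -1.0 - np.log(1.0 - xi) / xi        # closed form for Poisson
eatd_mc = eatd_monte_carlo(xi)
```

The generation-by-generation simulation is enough here because the average depth needs only the number of particles in each generation, not the full tree shape.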

Structural virality
The structural virality of a tree with size $n > 1$ was introduced in Goel et al. [13] as
$$\nu = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}, \tag{45}$$
where $d_{ij}$ is the graph distance from node $i$ to node $j$. The distribution of this metric across an ensemble of trees was used to fit models to data in [13].
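Both shape measures are straightforward to compute for a single tree. The sketch below is an illustration under assumptions: the tree is stored as a hypothetical parent-index list, structural virality is taken to be the sum of $d_{ij}$ over ordered pairs divided by $n(n-1)$, and the example tree is the shape inferred from the Fig. 8 caption (a seed with two children, one of which has two children of its own). Rather than computing all pairwise distances, it uses the fact that each edge lies on (subtree size) times (remaining nodes) shortest paths.

```python
def tree_metrics(parent):
    """Return (size, average depth, structural virality) of a rooted tree.

    parent[i] is the index of node i's parent; the root is node 0 with
    parent -1, and every parent index is smaller than its child's index.
    """
    n = len(parent)
    depth = [0] * n
    for i in range(1, n):
        depth[i] = depth[parent[i]] + 1
    # Subtree sizes, accumulated from the leaves towards the root.
    sub = [1] * n
    for i in range(n - 1, 0, -1):
        sub[parent[i]] += sub[i]
    # The edge above node i lies on sub[i] * (n - sub[i]) unordered paths,
    # so the sum of d_ij over ordered pairs is twice the sum of these products.
    pair_sum = 2 * sum(sub[i] * (n - sub[i]) for i in range(1, n))
    virality = pair_sum / (n * (n - 1)) if n > 1 else float("nan")
    return n, sum(depth) / n, virality

# Assumed shape of the Fig. 8 tree: node 0 is the seed, nodes 1 and 2 are
# its children, and nodes 3 and 4 are children of node 1.
size, avg_depth, nu = tree_metrics([-1, 0, 0, 1, 1])
```

For this tree the cumulative depth is 6, so the average depth is 6/5, matching the value quoted in the Fig. 8 caption.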
As noted in [13], the structural virality of a tree is closely related to its Wiener index, defined by $\sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}$. If we consider the expected value of the Wiener index across the ensemble of trees generated by the branching process, we can usefully adapt the approach of Entringer et al. [35], with the aim of calculating the expected structural virality for the ensemble. Entringer et al. define a generating function $W(x)$ so that the coefficient of $x^n$ in the power series is the contribution of trees of size $n$ to the ensemble-averaged Wiener index (note that $W(x)$ is not a probability generating function). Their Eq. (3.5), which is our Eq. (46), expresses $W(x)$ in terms of $D(x)$, which is our $\left.\frac{\partial H}{\partial y}\right|_{y=1}$ from Eq. (39), and $G(x) = H(x, 1)$, which is the $m \to \infty$ cascade size pgf from Eq. (30). The first term in Eq. (46) comes from considering pairs of vertices $u$ and $v$ where one of $u$ or $v$ is the root of the tree. The second term arises from the case where $u$ and $v$ belong to the same subtree, and the third term stems from $u$ and $v$ belonging to different subtrees; see Sec. 3 of [35] for details. We extend the approach of Eq. (46) to the case where the seed node of the tree has offspring distribution with pgf $\tilde{f}$, to get an analogous equation, Eq. (47), for the tree-level generating function $\tilde{W}(x)$; this equation involves $\left.\frac{\partial \tilde{H}}{\partial y}\right|_{y=1}$ as given by Eq. (41).
Solving Eq. (46) for $W(x)$ and substituting into Eq. (47) enables us to determine $\tilde{W}(x)$. The expected structural virality for the ensemble of trees is then given by
$$s = \sum_{n=2}^{\infty} \frac{\tilde{W}_n}{n(n-1)},$$
where $\tilde{W}_n$ is the coefficient of $x^n$ in the power series of $\tilde{W}(x)$. The value of $s$ can be calculated from the generating function $\tilde{W}(x)$ by a double integration, which reduces to a single integral on changing the order of integration. Combining these results and then making the same change of variable as for Eq. (44) yields an integral formula for the expected structural virality, Eq. (51). Table 3 shows that this formula agrees with Monte Carlo simulations of the branching process, and also matches reasonably well to the average structural virality of the ensemble of empirical trees in both datasets.

Asymptotic analysis
The integral formulas derived for the expected average tree depth (Eq. (44)) and the expected structural virality (Eq. (51)) enable us to analytically study the impact of the spreading process upon these measures. Such understanding can assist in the fitting of information-spreading models to empirical data. In Figure 2 of Ref. [13], for example, large-scale numerical simulations are used to calculate the dependence of the expected structural virality on the branching number, and this information is then used to guide model parameter fitting.
We are therefore motivated to examine how the integrals in Eqs. (44) and (51) depend upon the form of the offspring distribution (through its pgf $f(x)$) and in particular on the branching number $\xi = f'(1)$. For simplicity we will restrict ourselves in this section to the case where $\tilde{f}(x) = f(x)$, i.e., assuming that the seed node's offspring distribution is the same as that of the later generations.
First we note that both integrals may be performed exactly in the special case of a binary fission process [1], where each parent has either zero or two children:
$$f(x) = 1 - \frac{\xi}{2} + \frac{\xi}{2} x^2.$$
The integrals for the EATD and the expected structural virality can both be evaluated in closed form in this case, and each shows a logarithmic divergence as the branching number $\xi$ approaches the critical value of 1 from below (see dashed curves in Fig. 10).
In fact, this logarithmic divergence as $\xi \to 1$ is not unique to the exactly-solvable binary fission example. Indeed, asymptotic analysis of the integrals shows that a similar divergence occurs for any offspring distribution that has a finite value of $f''(1)$, meaning that the second moment of the offspring distribution is finite. The integrands in Eqs. (44) and (51) are singular at $h = 1$, and the form of the singularity can be understood using the expansion of $f(h)$ about $h = 1$:
$$f(h) = 1 + \xi (h - 1) + \tfrac{1}{2} f''(1) (h - 1)^2 + \ldots$$
The integrand of Eq. (44), for example, has leading-order expansion
$$\frac{h f'(h)}{f(h) - h f'(h)} \sim \frac{\xi}{(1 - \xi) + f''(1)(1 - h)} \quad \text{as } h \to 1,$$
and so the integral diverges logarithmically as $\xi \to 1$; the same asymptotic behaviour is found for the integrand in Eq. (51). Hence the behaviour of the dashed curves in Fig. 10 is quite generic for offspring distributions with finite second moments.
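The logarithmic divergence can be seen directly in the binary fission case. The sketch below is a worked check under assumptions, not a result quoted from the paper: it takes the binary fission pgf to be $f(h) = 1 - \xi/2 + (\xi/2) h^2$, takes the EATD integrand to be $h f'(h) / \left(f(h) - h f'(h)\right)$ with the seed distribution equal to $f$, and uses the closed form $\sqrt{a}\, \ln\left[(\sqrt{a}+1)/(\sqrt{a}-1)\right] - 2$ with $a = (2-\xi)/\xi$, which follows from elementary integration of $2h^2/(a - h^2)$. Function names are hypothetical.

```python
import numpy as np

def eatd_binary_fission(xi):
    """Closed form of int_0^1 h f'(h) / (f(h) - h f'(h)) dh for
    f(h) = 1 - xi/2 + (xi/2) h^2 with 0 < xi < 1 (derived for this sketch)."""
    root_a = np.sqrt((2.0 - xi) / xi)
    return root_a * np.log((root_a + 1.0) / (root_a - 1.0)) - 2.0

def eatd_numeric(xi, n_points=400_000):
    """Midpoint-rule evaluation of the same integral, as a cross-check."""
    h = (np.arange(n_points) + 0.5) / n_points
    f = 1.0 - xi / 2.0 + (xi / 2.0) * h**2
    f_prime = xi * h
    return (h * f_prime / (f - h * f_prime)).mean()

# Compare the closed form with ln(1 / (1 - xi)) as xi approaches 1.
for xi in (0.5, 0.9, 0.99, 0.999999):
    print(xi, eatd_binary_fission(xi), np.log(1.0 / (1.0 - xi)))
```

As $\xi \to 1$ the closed form behaves like $\ln\left(1/(1-\xi)\right) + \ln 2 - 2$, so the gap between the two printed columns settles to a constant while both grow without bound.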
Offspring distributions with infinite second moments are also of interest, as they relate to heavy-tailed follower distributions in the Twitter network [12,36]. An important example is the case of a power-law tail, i.e., $q_\ell \sim D\, \ell^{-\gamma}$ as $\ell \to \infty$, for constant $D$ and for values of the exponent $\gamma$ between 2 and 3. The asymptotic series for $f(h)$ as $h \to 1^-$ is given in this case by [11,37]
$$f(h) \sim 1 - \xi (1 - h) + D\, \Gamma(1 - \gamma) (1 - h)^{\gamma - 1} + \ldots,$$
where $\Gamma(\cdot)$ is the Gamma function. Using this asymptotic series, the integrands in both Eqs. (44) and (51) have the leading order behaviour $\sim (1 - h)^{2 - \gamma}$ as $h \to 1^-$ at the critical value of $\xi = 1$.

and $K_0(x, z) = xz$.
We observe that if we modify the second argument of K as follows then we can write, analogous to Eq. (36), where $J_n(x, z) = K_n\left(x, \frac{z}{x}\right)$. The iteration equation for $J_n(x, z)$ is obtained from Eq. (63) as where and $J_0(x, z) = z$.
To evaluate the integral in Eq. (68), it is convenient to define the single-argument function Then we obtain from Eq. (69) that L n can be expressed as where L n (x), defined as obeys the iteration equation with $L_0(x) = 1/x$. Iterating Eqs. (72), (74) and (70) for values of $x$ that partition the interval $[0, 1]$ enables us to calculate the integral using the trapezoidal rule.
Thus, we have shown how a subcritical branching process model can give rise to an apparent novelty decay factor, even though the offspring distribution does not change from generation to generation. The "apparent" nature of the decay in the novelty factor does not reflect any change in the likelihood of retweeting by a user who receives the tweet; rather it is the mechanism needed in the multiplicative process model of Eq. (58) to deal with the finite lifetimes of cascades. At each generation of the branching process fewer trees remain alive, and so the growth rate of the total number of tweets must decline with $n$, and in the multiplicative process this is mediated by the decay of the novelty factor $r_n$.

Discussion
In Section 2 we demonstrated that two datasets from Twitter can be approximately described by branching processes, at least when we examine the discrete generation-by-generation structure. An examination of the details of a continuous-time branching process that could produce these structures is left for further work. In Section 3 we argued that the observed offspring distributions were better fitted by a model based on the assumption that Twitter users have limited attention (so that those who follow many others are less likely to notice and retweet any single message they receive) than by the more usual independent cascade model, with its assumption of equal transmission probability for each infection attempt.
Taking the fitted offspring distributions as inputs, in Section 4 we derived analytical and semi-analytical results using branching process theory. We began with well-established results on the distribution of cascade lifetimes and of cascade sizes, and then extended the arguments used to derive novel results for other measures. We derived integral formulas for the expected average tree depth (equation (44)) and for the expected structural virality (equation (51)) and showed that these provide a good match to the data. The integral formulas are also amenable to asymptotic analysis to understand the behaviour of the metrics as the branching number approaches the critical value. These results should assist in the fitting of transmission models to large-scale datasets, as was done (albeit using billions of numerical simulations rather than analytical methods) in Goel et al. [13]. Finally, we derived a formula that enables the calculation of the apparent novelty factor, as would be used in a multiplicative stochastic model for the cascades under study. In the branching process model, information does not decrease in its transmission likelihood over generations, but the fact that the processes are subcritical means that the number of users who receive a cascading tweet decreases over time (Figure 3). In a multiplicative model, the stochastic lifetimes of cascade trees must be imposed through the assumption of novelty decay, and our results in Sec. 4.4 show how the two modelling approaches can be directly compared. We believe that the insights of the branching process approach will help inform applications of the multiplicative model, while the formula linking the offspring distribution to the apparent novelty decay (equation (75)) will allow the application of branching process theory to datasets that previously were studied only via the multiplicative model.
Our study has, of course, several limitations. The nature of cascades on Twitter is that they are rather short-lived, so our observation of a stable offspring distribution might not generalize to cascades on other social media where the attention given to topics is longer-lived, and hence where novelty decay might be more likely. We have implicitly assumed that all cascade topics are equally attractive to Twitter users, and so the identification of cascade-specific "fitnesses" [38] has not been addressed here. As noted above, a study based on continuous-time branching processes could potentially extend our results to include age-dependent effects [7], but we expect that the results presented here would remain valid in the long-time limit where all cascades have reached their final state. In conclusion, we hope that the results and the methodology presented here will prove useful to researchers investigating those aspects of human behaviour that are mediated by online social networks.

Figure 1: Schematic of the ensemble of trees, indicating the $Z_{m,n}$ values for the first two trees in the ensemble.

Figure 2: Number of nodes (top panels) and effective branching number (bottom panels) in the data, as defined by Eqs. (1) and (2). Here, and in most subsequent figures, the left panels ((a) and (c)) show results for the Marref dataset while the right panels ((b) and (d)) are the results from the URL dataset.

Figure 3: Empirical offspring distributions. (a) Offspring distributions for generations 0 (black symbols) through to 4 (coloured symbols) of the Marref dataset. The magenta curve is the LAM prediction; the black curve is the ICM prediction. (b) Offspring distributions for generations 10 through to 40 in steps of 5 (with generations 0 to 5 in inset) from the URL dataset, with LAM and ICM theory curves in magenta and black, respectively. (c) CCDF of offspring distribution for Marref; blue symbols show the averaged empirical distribution (averaged over generations 1 through 4), curves are LAM (magenta) and ICM (black) predictions, with the fitted distribution of Eq. (17) in red. (d) As panel (c), but for the URL dataset, with the empirical distribution averaged over generations 10 through 40.
(i)n−1 has the same distribution.

Figure 4: Schematic of a tree generated by a seed particle; note the number of children of the seed particle is generated by $\tilde{f}$.

Figure 6: Lifetime distribution of cascades in Marref (left) and URL (right) datasets. Blue symbols are empirical values; red line shows the theoretical distribution from Eq. (25), using the offspring distribution of Eq. (17).

Figure 7: Cascade size distributions: pdfs (top panels) and ccdfs (bottom panels) for Marref (left) and URL (right). Blue symbols are empirical values; red line shows the theoretical distribution from Sec. 4.2, using the offspring distribution of Eq. (17).

Figure 8: This is a tree of size 5. Each of the 5 particles is labelled by its depth (its generation number). The cumulative depth of the tree is 0+1+1+2+2=6, and so the average depth of this tree is 6/5. Note that the cumulative depth of the top subtree is 0+1+1=2, and that each node in the subtree has a depth that is one less than its value when considered as part of the main tree: see Eq. (33).

Figure 10: Results of the integral formulas in Eqs. (44) and (51) for Expected Average Tree Depth (left) and Structural Virality (right) for trees with binary fission offspring distribution (dashed curves) and with power-law (tail exponent $\gamma = 2.5$) offspring distribution (solid curves).

Figure 11: Novelty function $r_n$ in Marref (left) and URL (right) datasets. Blue symbols are empirical values using Eq. (61); red lines are the predictions of the theoretical result (75), using the offspring distribution of Eq. (17).

Table 2: Expected average tree depth (EATD) from the integral formula of Eq. (44), compared with Monte Carlo simulations ($10^5$ realizations) and the data values. Bootstrap intervals given for the latter two cases show quantile 0.025 to quantile 0.975 (i.e., 95% of cases) for the expected average tree depth, using $10^3$ bootstrap samples.

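Table 3: Expected structural virality from the integral formula of Eq. (51), compared with Monte Carlo simulations ($10^5$ realizations) and the data values. Bootstrap intervals given for the latter two cases show quantile 0.025 to quantile 0.975 (i.e., 95% of cases) for the expected structural virality, using $10^3$ bootstrap samples.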