-
PDF
- Split View
-
Views
-
Cite
Cite
Fan Chen and others, Targeted Sampling from Massive Block Model Graphs with Personalized PageRank, Journal of the Royal Statistical Society Series B: Statistical Methodology, Volume 82, Issue 1, February 2020, Pages 99–126, https://doi.org/10.1111/rssb.12349
Close - Share Icon Share
Summary
The paper provides statistical theory and intuition for personalized PageRank (called ‘PPR’): a popular technique that samples a small community from a massive network. We study a setting where the entire network is expensive to obtain thoroughly or to maintain, but we can start from a seed node of interest and ‘crawl’ the network to find other nodes through their connections. By crawling the graph in a designed way, the PPR vector can be approximated without querying the entire massive graph, making it an alternative to snowball sampling. Using the degree-corrected stochastic block model, we study whether the PPR vector can select nodes that belong to the same block as the seed node. We provide a simple and interpretable form for the PPR vector, highlighting its biases towards high degree nodes outside the target block. We examine a simple adjustment based on node degrees and establish consistency results for PPR clustering that allows for directed graphs. These results are enabled by recent technical advances showing the elementwise convergence of eigenvectors. We illustrate the method with the massive Twitter friendship graph, which we crawl by using the Twitter application programming interface. We find that the adjusted and unadjusted PPR techniques are complementary approaches, where the adjustment makes the results particularly localized around the seed node, and that the bias adjustment greatly benefits from degree regularization.
1 Introduction
Much of the literature on graph sampling has treated the entire graph, or all of the people in it, as the target population. However, in many settings, the target population is a small community in the massive graph. For example, a key difficulty in studying social media is to gather data that are sufficiently relevant for the scientific objective. A motivating example for this paper is to sample the Twitter friendship graph for accounts that report and discuss current political events. (See our website http://murmuration.wisc.edu, which does this.) This corresponds to sampling and identifying multiple communities, each a potentially small part of the massive network. In such an application, the graph is useful for two primary reasons. First, via link tracing, we can find potential members of the target population. Second, the graph connections are informative for identifying community membership. Throughout, we presume that the sampling is initiated around a ‘seed node’ that belongs to the target community of interest.
A personalized PageRank (called ‘PPR’) can be thought of as an alternative to snowball sampling, which is a popular technique for gathering individuals close to the seed node. For some d ⩾ 0, snowball sampling gathers all individuals who are d friends away from the seed. This process has two competing flaws for our application which are addressed by PPR. First, snowball sampling fails to account for the density of common friendships. For example, perhaps i and j are both one friend removed from the seed, but i has 10 friends in common with the seed, whereas j has only one friend in common. It seems natural to suppose that i is closer than j to the seed. Hence, the metric for snowball sampling can be misleading. Second, the snowball sample size grows very quickly with d. For example, under the ‘six degrees of separation’ phenomenon (Watts and Strogatz, 1998; Newman et al., 2006), snowballing gathers the entire graph if d ⩾ 6.
PPR gives a sample that is more localized around the seed node. The PPR vector is defined as the stationary distribution of what we call a personalized random walk (Page et al., 1999). At each step of the personalized random walk, the random walker returns to the seed node with probability α, called the teleportation constant, and, with probability 1 − α, the random walker goes to an adjacent node that is chosen uniformly at random. Consider the stationary distribution of this process as giving the inclusion probability for a sample of size 1. This is the PPR vector. PPR naturally leads to a clustering algorithm, where the cluster is made up of the nodes with a large inclusion probability. To approximate the PPR vector quickly, Berkhin (2006) proposed an algorithm that examines only nodes with large inclusion probabilities (i.e. nodes near the seed). As such, PPR is particularly useful for its computational efficiency—the running time and the amount of data that it requires are nearly linear in the size of the output cluster, which is typically much smaller than that of the entire graph. Because of the local nature of the algorithm, it can be used to study large graphs such as Twitter where the entire graph is not available, but where one can query to find the connections to any small set of nodes.
One way to conduct local clustering is by exploring and ranking the nearby nodes of a seed node (Andersen and Lang, 2006; Andersen and Peres, 2009; Alamgir and Von Luxburg, 2010; Gharan and Trevisan, 2012). Spielman and Teng (2004) pioneered local clustering by defining nearness as the landing probability of a random walk starting from the seed node. Their algorithm’s guarantee was improved in follow-up work by Andersen et al. (2006) which proposed the use of an approximate PPR vector. Local algorithms can be applied recursively to solve more complicated problems such as graph partitions (k-way partitions) (Spielman and Teng, 1996; Karypis and Kumar, 1998) and have many fruitful applications (Jeh and Widom, 2003; Macropol et al., 2009; Liao et al., 2009; Gupta et al., 2013; Gleich, 2015), particularly when it comes to sampling and studying massive graphs.
Along with the widespread use of PPR, there has been recent work to study its statistical estimation properties under a statistical model with latent community structure. Beyond the scope of local clustering, Kloumann et al. (2017) showed that the PPR vector is asymptotically equivalent to optimal linear discriminant analysis under the stochastic block model (SBM) (Holland et al., 1983), assuming a symmetry condition on the block structure. We add to this statistical understanding of PPR by providing a simple and more general representation for PPR vectors that allows for different block sizes, more than two blocks, degree heterogeneity and directed edges. To understand the effects of heterogeneous node degrees, this paper uses the degree-corrected stochastic block model (DCSBM) (Karrer and Newman, 2011) and examines when the PPR clustering recovers nodes within the same block as the seed node (local cluster). Breaking the symmetry that was imposed by Kloumann et al. (2017) reveals additional insight. In particular, given a seed node in the first block, we show that PPR is likely to contain high degree nodes outside that block. We study an adjustment that was previously proposed in Andersen et al. (2006). We show how this adjustment can correct for the bias. We illustrate these ideas with examples from the Twitter friendship graph.
1.1 An illustrative example in social media
Local clustering using PPR is particularly well suited to studying current political events on Twitter because
- (a)
the accounts that discuss politics or current events are a small part of the entire Twitter graph,
- (b)
it is reasonable to believe that the accounts in our target population are well connected to one another in the Twitter friendship graph and,
- (c)
although the entire Twitter graph is not publicly available, the way that PPR (algorithms 1 and 3 in Sections 2.1 and 3.2 respectively) queries the graph matches the Twitter application programming interface protocol which is the primary mode of access for researchers.
Although we do not suppose that the Twitter friendship graph is sampled from a DCSBM, Twitter does have all of the heterogeneities that our results identify as important. The Twitter friendship graph is composed of users who can freely follow others but will not necessarily be followed back, or friended. Such asymmetry between following and friending forms a directed graph where follower count indicates status—some popular or high status nodes command millions of followers whereas the majority of nodes are followed by far fewer.
The theoretical results in this paper suggest that such degree heterogeneities will make the PPR vector biased for detecting block memberships (theorem 1 in Section 3.1). We propose a way to adjust for this bias (algorithm 2 in Section 2.1) and show that it is a consistent estimator (corollary 1 in Section 4). Not surprisingly, this section demonstrates that PPRs with and without the bias adjustment give fundamentally different results on the Twitter graph. However, depending on the application, the biases in the PPR vector might be advantageous. In this way, PPRs with and without the bias adjustment are complementary, not competing, approaches.
To illustrate, Table 1 displays the top 15 handles ranked by the PPR vector (without adjustment) for three different seed nodes: @CNN, @BreitbartNews and @dailykos, which are the Twitter accounts of three types of media outlets that exhibit distinct political leanings (legacy broadcast news, on-line right wing and on-line left wing). For @CNN, all top 15 handles ranked by the PPR vector are its subsidiary accounts and its celebrity reporters and anchors (like Wolf Blitzer and Anderson Cooper), except for one account, Pope Francis, who enjoys an extremely larger following. The top 15 handles for @BreitbartNews are a mixed bag of influential conservatives (like Sean Hannity and Ann Coulter) and Breitbart’s editors or writers. However, the top 15 handles returned to @dailykos by the PPR vector are all famous liberal personalities who are not directly affiliated with Daily Kos, except one: its founder Markos Moulitsas. Those people range from democratic politicians to liberal media personalities and journalists, such as Hillary Clinton, Stephen Colbert and Rachel Maddow. All the handles align with the characteristics of their respective media outlets, attesting to the clustering effectiveness. However, it is worth noting that the top handles ranked by the PPR vector tend to be popular handles with millions of followers. This shows that the PPR vector’s preference is for high in-degree nodes.
Top 15 handles by PPR clustering†
| Rank . | @CNN . | @BreitbartNews . | @dailykos . |
|---|---|---|---|
| 1 | CNN Breaking News | Alex Marlow | Hillary Clinton |
| 2 | CNN International | AndrewBreitbart | Stephen Colbert |
| 3 | Wolf Blitzer | Big Hollywood | Rachel Maddow MSNBC |
| 4 | Anderson Cooper | Big Government | Jake Tapper |
| 5 | Christiane Amanpour | James O’Keefe | Joy Reid |
| 6 | Pope Francis | Sean Hannity | Chris Hayes |
| 7 | Dr Sanjay Gupta | Raheem | Emma Gonzlez |
| 8 | CNNMoney | Joel B. Pollak | Markos Moulitsas |
| 9 | Jake Tapper | Ann Coulter | Maggie Haberman |
| 10 | Brian Stelter | Allum Bokhari | Sarah Silverman |
| 11 | CNN Newsroom | Ben Kew | Lin-Manuel Miranda |
| 12 | Dana Bash | Brandon Darby | Elizabeth Warren |
| 13 | CNN Politics | Noah Dulis | Jon Favreau |
| 14 | BBC Breaking News | Michelle Malkin | Michelle Obama |
| 15 | Brooke Baldwin | Nate Church | Bill Clinton |
| Rank . | @CNN . | @BreitbartNews . | @dailykos . |
|---|---|---|---|
| 1 | CNN Breaking News | Alex Marlow | Hillary Clinton |
| 2 | CNN International | AndrewBreitbart | Stephen Colbert |
| 3 | Wolf Blitzer | Big Hollywood | Rachel Maddow MSNBC |
| 4 | Anderson Cooper | Big Government | Jake Tapper |
| 5 | Christiane Amanpour | James O’Keefe | Joy Reid |
| 6 | Pope Francis | Sean Hannity | Chris Hayes |
| 7 | Dr Sanjay Gupta | Raheem | Emma Gonzlez |
| 8 | CNNMoney | Joel B. Pollak | Markos Moulitsas |
| 9 | Jake Tapper | Ann Coulter | Maggie Haberman |
| 10 | Brian Stelter | Allum Bokhari | Sarah Silverman |
| 11 | CNN Newsroom | Ben Kew | Lin-Manuel Miranda |
| 12 | Dana Bash | Brandon Darby | Elizabeth Warren |
| 13 | CNN Politics | Noah Dulis | Jon Favreau |
| 14 | BBC Breaking News | Michelle Malkin | Michelle Obama |
| 15 | Brooke Baldwin | Nate Church | Bill Clinton |
Column names represent seed nodes, and the sampled nodes are ranked by PPR values, with teleportation constant α = 0.15 uniformly. Through the PPR vector, the top 15 handles returned to each of the three seed nodes fit well with the characteristics of the seed nodes. They are popular or high status handles either directly related to the seed nodes or align with their political leanings. This shows the effectiveness of clustering via the PPR vector. It also shows the PPR vector’s preference for highly connected nodes.
Top 15 handles by PPR clustering†
| Rank . | @CNN . | @BreitbartNews . | @dailykos . |
|---|---|---|---|
| 1 | CNN Breaking News | Alex Marlow | Hillary Clinton |
| 2 | CNN International | AndrewBreitbart | Stephen Colbert |
| 3 | Wolf Blitzer | Big Hollywood | Rachel Maddow MSNBC |
| 4 | Anderson Cooper | Big Government | Jake Tapper |
| 5 | Christiane Amanpour | James O’Keefe | Joy Reid |
| 6 | Pope Francis | Sean Hannity | Chris Hayes |
| 7 | Dr Sanjay Gupta | Raheem | Emma Gonzlez |
| 8 | CNNMoney | Joel B. Pollak | Markos Moulitsas |
| 9 | Jake Tapper | Ann Coulter | Maggie Haberman |
| 10 | Brian Stelter | Allum Bokhari | Sarah Silverman |
| 11 | CNN Newsroom | Ben Kew | Lin-Manuel Miranda |
| 12 | Dana Bash | Brandon Darby | Elizabeth Warren |
| 13 | CNN Politics | Noah Dulis | Jon Favreau |
| 14 | BBC Breaking News | Michelle Malkin | Michelle Obama |
| 15 | Brooke Baldwin | Nate Church | Bill Clinton |
| Rank . | @CNN . | @BreitbartNews . | @dailykos . |
|---|---|---|---|
| 1 | CNN Breaking News | Alex Marlow | Hillary Clinton |
| 2 | CNN International | AndrewBreitbart | Stephen Colbert |
| 3 | Wolf Blitzer | Big Hollywood | Rachel Maddow MSNBC |
| 4 | Anderson Cooper | Big Government | Jake Tapper |
| 5 | Christiane Amanpour | James O’Keefe | Joy Reid |
| 6 | Pope Francis | Sean Hannity | Chris Hayes |
| 7 | Dr Sanjay Gupta | Raheem | Emma Gonzlez |
| 8 | CNNMoney | Joel B. Pollak | Markos Moulitsas |
| 9 | Jake Tapper | Ann Coulter | Maggie Haberman |
| 10 | Brian Stelter | Allum Bokhari | Sarah Silverman |
| 11 | CNN Newsroom | Ben Kew | Lin-Manuel Miranda |
| 12 | Dana Bash | Brandon Darby | Elizabeth Warren |
| 13 | CNN Politics | Noah Dulis | Jon Favreau |
| 14 | BBC Breaking News | Michelle Malkin | Michelle Obama |
| 15 | Brooke Baldwin | Nate Church | Bill Clinton |
Column names represent seed nodes, and the sampled nodes are ranked by PPR values, with teleportation constant α = 0.15 uniformly. Through the PPR vector, the top 15 handles returned to each of the three seed nodes fit well with the characteristics of the seed nodes. They are popular or high status handles either directly related to the seed nodes or align with their political leanings. This shows the effectiveness of clustering via the PPR vector. It also shows the PPR vector’s preference for highly connected nodes.
In contrast, for each of the three seeds, adjusted PPR finds accounts that are more central to the internal functioning of these organizations. Table 2 lists those accounts. The bias adjustment also greatly benefits from a degree regularization (Qin and Rohe, 2013). For @CNN, those handles include primarily its own staff, producers or journalists (like Elissa Weldon, Chris_Dawson and Grace Bohnhoff) and a freelance journalist (Tess Eastment). The pattern is similar for @BreitbartNews and @dailykos, their top 15 handles including their own journalists and editors as well as related writers, campaigners and activists. The general pattern is that the adjustment returns editors, journalists and staff working within each media outlet. As such, the adjustment is useful for identifying a more localized cluster.
Top 15 handles by adjusted PPR (with regularization) sampling†
| Rank . | @CNN . | @BreitbartNews . | @dailykos . |
|---|---|---|---|
| 1 | PowerZ | Robert | Two Thanks |
| 2 | Elissa Weldon | Lee Peace | Catherine Daligga |
| 3 | Tess Eastment | Wynn Marlow | exmearden |
| 4 | Chris_Dawson | Logan Churchwell | Faith Gardner |
| 5 | carol kinstle | Peter Schweizer | Andrew Thornton |
| 6 | erinmclaughlin | Breitbart Sports | UnreasonableFridays |
| 7 | Taylor Ward | Jon Fleischman | DKos Top Comments |
| 8 | Jennifer Z. Deaton | Nate Church | 2016 relitigator |
| 9 | Pam Benson | Daniel Nussbaum | Daily Kos |
| 10 | amy entelis | Noah Dulis | Walter Einenkel |
| 11 | Grace Bohnhoff | Jon David Kahn | Candelaria Vargas |
| 12 | kate lazarus | Breitbart California | Mara Schechter |
| 13 | Newstron | Ken Klukowski | Emi Feldman |
| 14 | Becky Brittain | pam key | The Soulful Negress |
| 15 | CNN Ballot Bowl | Auntie Hollywood | Kim Soffen |
| Rank . | @CNN . | @BreitbartNews . | @dailykos . |
|---|---|---|---|
| 1 | PowerZ | Robert | Two Thanks |
| 2 | Elissa Weldon | Lee Peace | Catherine Daligga |
| 3 | Tess Eastment | Wynn Marlow | exmearden |
| 4 | Chris_Dawson | Logan Churchwell | Faith Gardner |
| 5 | carol kinstle | Peter Schweizer | Andrew Thornton |
| 6 | erinmclaughlin | Breitbart Sports | UnreasonableFridays |
| 7 | Taylor Ward | Jon Fleischman | DKos Top Comments |
| 8 | Jennifer Z. Deaton | Nate Church | 2016 relitigator |
| 9 | Pam Benson | Daniel Nussbaum | Daily Kos |
| 10 | amy entelis | Noah Dulis | Walter Einenkel |
| 11 | Grace Bohnhoff | Jon David Kahn | Candelaria Vargas |
| 12 | kate lazarus | Breitbart California | Mara Schechter |
| 13 | Newstron | Ken Klukowski | Emi Feldman |
| 14 | Becky Brittain | pam key | The Soulful Negress |
| 15 | CNN Ballot Bowl | Auntie Hollywood | Kim Soffen |
Column names represent seed nodes, and the sampled nodes are ranked by adjusted PPR values, with teleportation constant α = 0.15 uniformly. After adjustment, PPR returns a more localized cluster. Instead of the highly visible public faces of the three seed organizations, the individuals in this table serve a central role to the internal organization (e.g. editors and writers). Depending on the application, one might prefer the results in Table 1 or Table 2.
Top 15 handles by adjusted PPR (with regularization) sampling†
| Rank . | @CNN . | @BreitbartNews . | @dailykos . |
|---|---|---|---|
| 1 | PowerZ | Robert | Two Thanks |
| 2 | Elissa Weldon | Lee Peace | Catherine Daligga |
| 3 | Tess Eastment | Wynn Marlow | exmearden |
| 4 | Chris_Dawson | Logan Churchwell | Faith Gardner |
| 5 | carol kinstle | Peter Schweizer | Andrew Thornton |
| 6 | erinmclaughlin | Breitbart Sports | UnreasonableFridays |
| 7 | Taylor Ward | Jon Fleischman | DKos Top Comments |
| 8 | Jennifer Z. Deaton | Nate Church | 2016 relitigator |
| 9 | Pam Benson | Daniel Nussbaum | Daily Kos |
| 10 | amy entelis | Noah Dulis | Walter Einenkel |
| 11 | Grace Bohnhoff | Jon David Kahn | Candelaria Vargas |
| 12 | kate lazarus | Breitbart California | Mara Schechter |
| 13 | Newstron | Ken Klukowski | Emi Feldman |
| 14 | Becky Brittain | pam key | The Soulful Negress |
| 15 | CNN Ballot Bowl | Auntie Hollywood | Kim Soffen |
| Rank . | @CNN . | @BreitbartNews . | @dailykos . |
|---|---|---|---|
| 1 | PowerZ | Robert | Two Thanks |
| 2 | Elissa Weldon | Lee Peace | Catherine Daligga |
| 3 | Tess Eastment | Wynn Marlow | exmearden |
| 4 | Chris_Dawson | Logan Churchwell | Faith Gardner |
| 5 | carol kinstle | Peter Schweizer | Andrew Thornton |
| 6 | erinmclaughlin | Breitbart Sports | UnreasonableFridays |
| 7 | Taylor Ward | Jon Fleischman | DKos Top Comments |
| 8 | Jennifer Z. Deaton | Nate Church | 2016 relitigator |
| 9 | Pam Benson | Daniel Nussbaum | Daily Kos |
| 10 | amy entelis | Noah Dulis | Walter Einenkel |
| 11 | Grace Bohnhoff | Jon David Kahn | Candelaria Vargas |
| 12 | kate lazarus | Breitbart California | Mara Schechter |
| 13 | Newstron | Ken Klukowski | Emi Feldman |
| 14 | Becky Brittain | pam key | The Soulful Negress |
| 15 | CNN Ballot Bowl | Auntie Hollywood | Kim Soffen |
Column names represent seed nodes, and the sampled nodes are ranked by adjusted PPR values, with teleportation constant α = 0.15 uniformly. After adjustment, PPR returns a more localized cluster. Instead of the highly visible public faces of the three seed organizations, the individuals in this table serve a central role to the internal organization (e.g. editors and writers). Depending on the application, one might prefer the results in Table 1 or Table 2.
1.2 Main contributions
The main contributions of the paper are a simple and interpretable form for the PPR vector and a statistical guarantee for clustering with the adjusted PPR vector.
- (a)
This paper reveals a simple two-stage form of the PPR vector under the population (expectation) DCSBM. Consider the vth element of the PPR vector as the probability of sampling node v in a sample of size 1 from the stationary distribution of the personalized random walk. This inclusion probability is akin to stratified sampling:
The inclusion probability for node v is the product of two separate probabilities: first, the probability that the personalized random walk samples any node in v’s block; second, the probability that the personalized random walk selects node v, conditionally on sampling that block.
Both of these probabilities have simple expressions. If there are K blocks in the graph, then the blockwise probability comes from the PPR vector of a graph with K vertices, with edge weights specified by the ‘block connection matrix’ in the DCSBM. The second probability is proportional to the degree of node v. In addition to the population results, theorem 2 in Section 4 demonstrates that, when the graph is random, the PPR vector concentrates around its population (expectation) under certain conditions.
- (b)
This paper identifies two sources of bias of using a PPR vector for local clustering under the DCSBM—the ancillary effects of heterogeneous node degrees and block degrees. With this finding, the paper examines a simple bias adjustment that remedies the two biases simultaneously and suggests conditions when the adjusted PPR can be used to return the correct local cluster. In other words,
PPR clustering with the adjustment achieves the precise identification of the local cluster, provided that the graph is sufficiently dense.
These results establish statistical performance (consistency) of PPR clustering under the DCSBM, in the sparse regime where the minimum expected degree grows logarithmically with the number of nodes in the network. Our results provide an elementwise perturbation bound for PPR vectors, that allows the number of clusters to grow with the size of graphs, and generalize to a directed graph setting as Google PageRank does.
The rest of the paper proceeds as follows. Section 2 formally introduces the PPR method and some of the known results. Section 2 also introduces the DCSBM. Section 3 gives a population analysis of the PPR clustering under directed block model graphs. Section 4 provides concentration results for the PPR vector when the graph is random and provides a statistical guarantee on the PPR local clustering method. Section 5 presents several numerical results showing the effectiveness of the PPR clustering. Section 6 illustrates the PPR clustering through the massive Twitter friendship graph and demonstrates the benefits of a smoothing step in the PPR adjustment.
2 Preliminaries
2.1 Personalized PageRank and the local clustering algorithm
Proposition 1For any fixed α ∈ (0, 1], the PPR vector p is
- (a)
the left leading eigenvector of Q, associated with the simple eigenvalue 1, and
- (b)
the infinite sum of landing probability with weights ,(2)
Algorithm 1: approximate PPR vector (undirected) (Andersen et al., 2006)
| Require: undirected graph G, preference vector π, teleportation constant α and tolerance ε |
| Initialize p←0, r←π, α′←α/(2 − α) |
| while ∃ u ∈ V such that ru ⩾ εdudo |
| uniformly sample a vertex u satisfying ru ⩾ εdu |
| pu←pu + α′ru |
| for v:(u, v) ∈ E do |
| rv←rv + (1 − α′)ru/(2du) |
| end for |
| ru←(1 − α′)ru/2 |
| end while |
| Return: ε-approximate PPR vector p |
| Require: undirected graph G, preference vector π, teleportation constant α and tolerance ε |
| Initialize p←0, r←π, α′←α/(2 − α) |
| while ∃ u ∈ V such that ru ⩾ εdudo |
| uniformly sample a vertex u satisfying ru ⩾ εdu |
| pu←pu + α′ru |
| for v:(u, v) ∈ E do |
| rv←rv + (1 − α′)ru/(2du) |
| end for |
| ru←(1 − α′)ru/2 |
| end while |
| Return: ε-approximate PPR vector p |
Algorithm 1: approximate PPR vector (undirected) (Andersen et al., 2006)
| Require: undirected graph G, preference vector π, teleportation constant α and tolerance ε |
| Initialize p←0, r←π, α′←α/(2 − α) |
| while ∃ u ∈ V such that ru ⩾ εdudo |
| uniformly sample a vertex u satisfying ru ⩾ εdu |
| pu←pu + α′ru |
| for v:(u, v) ∈ E do |
| rv←rv + (1 − α′)ru/(2du) |
| end for |
| ru←(1 − α′)ru/2 |
| end while |
| Return: ε-approximate PPR vector p |
| Require: undirected graph G, preference vector π, teleportation constant α and tolerance ε |
| Initialize p←0, r←π, α′←α/(2 − α) |
| while ∃ u ∈ V such that ru ⩾ εdudo |
| uniformly sample a vertex u satisfying ru ⩾ εdu |
| pu←pu + α′ru |
| for v:(u, v) ∈ E do |
| rv←rv + (1 − α′)ru/(2du) |
| end for |
| ru←(1 − α′)ru/2 |
| end while |
| Return: ε-approximate PPR vector p |
Proposition 2(entrywise approximation error (Andersen et al., 2006))
Let p be a PPR vector, and let be an approximate PPR vector computed by algorithm 1 with a tolerance ε > 0. For any vertex u that is sampled in algorithm 1,
Algorithm 2: PPR clustering (undirected)
| Require: undirected graph G, seed node and the desired size of local cluster n |
| Step 1: calculate the approximate PPR vector p (algorithm 1) |
| Step 2: adjust the PPR vector p by node degrees, |
| Step 3: rank all vertices according to the adjusted PPR vector p* |
| Return: local cluster—n top ranking nodes |
| Require: undirected graph G, seed node and the desired size of local cluster n |
| Step 1: calculate the approximate PPR vector p (algorithm 1) |
| Step 2: adjust the PPR vector p by node degrees, |
| Step 3: rank all vertices according to the adjusted PPR vector p* |
| Return: local cluster—n top ranking nodes |
Algorithm 2: PPR clustering (undirected)
| Require: undirected graph G, seed node and the desired size of local cluster n |
| Step 1: calculate the approximate PPR vector p (algorithm 1) |
| Step 2: adjust the PPR vector p by node degrees, |
| Step 3: rank all vertices according to the adjusted PPR vector p* |
| Return: local cluster—n top ranking nodes |
| Require: undirected graph G, seed node and the desired size of local cluster n |
| Step 1: calculate the approximate PPR vector p (algorithm 1) |
| Step 2: adjust the PPR vector p by node degrees, |
| Step 3: rank all vertices according to the adjusted PPR vector p* |
| Return: local cluster—n top ranking nodes |
2.2 Stochastic block model
In an SBM, each node belongs to one of K blocks. The presence of each edge corresponds to an independent Bernoulli random variable, where the probability of an edge between any two nodes depends only on the block memberships of two nodes (Holland et al., 1983). The formal definition is as follows.
Definition 1For a vertex set V = {1, 2, … , N}, let z:{1, 2, … , N}→{1, 2, … , K} partition the N nodes into K blocks, so z(v) is the block membership of vertex v. Let B be a K × K matrix with all entries’ range in [0, 1]. Under the SBM, the probability of an edge between u and v is , i.e.
3 Population analysis of PageRank
In this section, we analyse the PPR vector of the expected adjacency matrix under the DCSBM. This provides a simple representation of the PPR vector that motivates
- (a)
the bias adjustment and
- (b)
the generalization of algorithm 1 and 2 to directed graphs.
We use three distinct typefaces to denote three classes of objects. A script typeface is given to the population version of any observable quantities in random graphs, such as graph adjacency matrix and node degrees (e.g. equation 3). The normal typeface is given to unobserved model parameters, such as block membership and degree parameters θi. Bold is given to all block level quantities and parameters like B and .
Define the population graph adjacency matrix,
to be the expectation of random adjacency matrix A. Let be the block membership matrix with Zvi = 1 if and only if vertex v belongs to block i, and define diagonal matrices Θin and Θout with entries θin and θout respectively. Then, under the directed DCSBM with K blocks and parameters , can be compactly expressed as
Accordingly, we define the population node degrees and the population transition matrix, , and , where and are the diagonal matrices of the population node in-degrees and out-degrees respectively. Let be the population PPR vector (i.e. the solution to the equation ) and let be the population APPR vector.
3.1 A representation of personalized PageRank vectors
Theorem 1(explicit form of PPR vectors)
Under the population directed DCSBM with K blocks and parameters ,
- (a)
the population PPR vector has elementswhere p is the blockwise PPR vector in equation (5), and
- (b)
the population APPR vector has elements
(6)where .
Theorem 1 demonstrates that the PPR vector decomposes into block-related information p and node-specific information Θ. Within each block, the PPR values are proportional to the node degree parameters θv and sum to the blockwise PPR value of the block. The proof of theorem 1 ( Appendix A) relies on a key observation that the powers of the population transition matrix, for s = 1, 2, … , have a similarly simple form and the node-specific information components (i.e. z(v) and θv) are invariant in s.
To justify the adjustment (step 2) in algorithm 2, we observe that the seed always has the highest population APPR score. This turns out to be a key feature that facilitates the APPR vector to recover a local cluster correctly, so we state it in the following lemma.
Lemma 1(the largest entry of APPR vector)
Under the population DCSBM, assume that the minimum expected degree is positive, i.e. . Then, for any fixed α > 0, the population APPR vector has the strictly largest entry corresponding to the seed node,In contrast, this is not generally true for a PPR vector.
When α = 0 (i.e. no teleportation), the PPR vector becomes the limiting distribution of a standard random walk and all entries of the APPR vector are equal ( Appendix A). Lemma 1 (applied to blockwise PPR vectors) and theorem 1 together identify two sources of bias for PPR vectors and suggest a justification for the degree adjustment, which we discuss in order.
- (a)
Both node degree heterogeneity Θ and block size imbalance D confound the identification of local clusters by the PPR vector. In particular, suppose that vertex v belongs to a block z(v) = i other than 1. The PPR vector assigns it a score , where is the blockwise PPR of block i, and θv is the parameter specifically controlling the degree of v. Then, node v may rank at the top, if θv is sufficiently large. Furthermore, lemma 1 implies that is not necessarily the largest because of block degree heterogeneity. Specifically, if block i has an exceedingly high block degree, it is likely that fails to downrank node v vis-à-vis those nodes of block 1.
- (b)
APPR removes the node and the block degree heterogeneity simultaneously and perfectly recovers the local cluster. To see this, note that is the adjusted version of the blockwise PPR vector. From lemma 1, is the largest entry of . From equation 6, the APPR vector assigns any vertex v a score . Hence, nodes with the highest value of belong to block 1, which is precisely the local cluster desired.
Note that the PPR vector can still be biased for local clustering even under the classical SBM. To see this, set the matrix Θ to the identity matrix in theorem 1. In this case, the heterogeneous block degrees still confound the PPR vector (Section 5.2); there is generally no guarantee for to appear on the top (because of lemma 1), unless there are further symmetry conditions. Kloumann et al. (2017) used such a scenario. As a by-product of our analysis, we extend their results under the DCSBM with the symmetric conditions (see the on-line supplementary materials section S3 to the paper).
3.2 Local clustering on directed graphs
In light of the clean form of PPR vectors under the DCSBM, one can modify algorithms 1 and 2 to operate on a directed graph accordingly. For this, note that the transition matrix of a directed graph requires node out-degrees; hence algorithm 1 examines only the edges leaving visited nodes. Consequently it suffices to replace the dus in algorithm 1 by s (algorithm 3: Table 5). Proposition 2 applies to algorithm 3 as well, and one can approximate the PPR vector provided that the out-degrees of visited nodes can be observed and the tolerance parameter ε > 0 is sufficiently small.
Algorithm 3: approximate PPR vector (directed)
| Require: directed graph G, preference vector π, teleportation constant α and tolerance ε |
| Initialize p←0, r←π, α′←α/(2 − α) |
| while ∃ u ∈ V such that do |
| sample a vertex u uniformly at random, satisfying |
| pu←pu + α′ru |
| for v:(u, v) ∈ E do |
| end for |
| ru←(1 − α′)ru/2 |
| end while |
| Return: ε-approximate PPR vector p |
| Require: directed graph G, preference vector π, teleportation constant α and tolerance ε |
| Initialize p←0, r←π, α′←α/(2 − α) |
| while ∃ u ∈ V such that do |
| sample a vertex u uniformly at random, satisfying |
| pu←pu + α′ru |
| for v:(u, v) ∈ E do |
| end for |
| ru←(1 − α′)ru/2 |
| end while |
| Return: ε-approximate PPR vector p |
Algorithm 3: approximate PPR vector (directed)
| Require: directed graph G, preference vector π, teleportation constant α and tolerance ε |
| Initialize p←0, r←π, α′←α/(2 − α) |
| while ∃ u ∈ V such that do |
| sample a vertex u uniformly at random, satisfying |
| pu←pu + α′ru |
| for v:(u, v) ∈ E do |
| end for |
| ru←(1 − α′)ru/2 |
| end while |
| Return: ε-approximate PPR vector p |
| Require: directed graph G, preference vector π, teleportation constant α and tolerance ε |
| Initialize p←0, r←π, α′←α/(2 − α) |
| while ∃ u ∈ V such that do |
| sample a vertex u uniformly at random, satisfying |
| pu←pu + α′ru |
| for v:(u, v) ∈ E do |
| end for |
| ru←(1 − α′)ru/2 |
| end while |
| Return: ε-approximate PPR vector p |
Algorithm 4: PPR clustering (directed)
| Require: directed graph G, seed node , the desired size of local cluster n and an optional regularization parameter τ |
| Step 1: calculate the approximate PPR vector p (algorithm 3) |
| Step 2: adjust the PPR vector p with two options (a) node in-degrees, , (b) regularized node in-degrees, |
| Step 3: rank all vertices according to the APPR vector p* or pτ |
| Return: local cluster—n top ranking nodes |
| Require: directed graph G, seed node , the desired size of local cluster n and an optional regularization parameter τ |
| Step 1: calculate the approximate PPR vector p (algorithm 3) |
| Step 2: adjust the PPR vector p with two options (a) node in-degrees, , (b) regularized node in-degrees, |
| Step 3: rank all vertices according to the APPR vector p* or pτ |
| Return: local cluster—n top ranking nodes |
Algorithm 4: PPR clustering (directed)
| Require: directed graph G, seed node , the desired size of local cluster n and an optional regularization parameter τ |
| Step 1: calculate the approximate PPR vector p (algorithm 3) |
| Step 2: adjust the PPR vector p with two options (a) node in-degrees, , (b) regularized node in-degrees, |
| Step 3: rank all vertices according to the APPR vector p* or pτ |
| Return: local cluster—n top ranking nodes |
| Require: directed graph G, seed node , the desired size of local cluster n and an optional regularization parameter τ |
| Step 1: calculate the approximate PPR vector p (algorithm 3) |
| Step 2: adjust the PPR vector p with two options (a) node in-degrees, , (b) regularized node in-degrees, |
| Step 3: rank all vertices according to the APPR vector p* or pτ |
| Return: local cluster—n top ranking nodes |
4 Personalized PageRank in random graphs
This section establishes several concentration results for the local clustering algorithm using the APPR vector (algorithms 2 and 4) under the DCSBM. The results show that, if the graph is generated from the DCSBM, then PPR clustering returns the desired local cluster with high probability. Since, in algorithm 4, the calculation for PPR vectors relies on only node out-degrees and the adjustment step solely utilizes node in-degrees, it is not difficult to distinguish din and dout. Thus, we state the results in undirected graphs for simplicity. One can draw the analogous conclusions for directed graphs by tracing the proof step by step.
We first present a useful tool that controls the entrywise errors of a PPR vector in random graphs. Recall that is the stationary distribution of probability transition matrix . For any vector , define the vector infinity norm as . The following theorem bounds the entrywise error of the stationary distribution of
Theorem 2(concentration of the PPR vectors)
Let G = (V, E) be a graph of N vertices generated from the DCSBM with K blocks and parameters {B, Z, Θ}. Let p and be the PPR vector corresponding to random transition matrix P and its population version respectively, with the same teleportation constant α. Let be the adjusted PPR vector of p and Let δ be the average expected node degrees, i.e. . Assume that is bounded by some finite constant and thatfor some sufficiently large constant c0 > 0. Then, with probability at least ,(7)and
for some sufficiently large constants c1, c2 > 0.
The proof of theorem 2 invokes the elementary eigenvector perturbation bound for asymmetric matrices, an analogue to the celebrated Davis–Kahan sin (Θ) theorem (Davis and Kahan, 1970), and the novel leave-one-out technique due to Chen et al. (2019). A detailed proof is given in the on-line supplementary materials section S1 to the paper.
Theorem 2 demonstrates that, if the expected average degree δ exceeds (1 − α)2 log (N) to some sufficiently large extent, then, with high probability, the random APPR vector concentrates around the population APPR vector in terms of all entries. In fact, the concentration statement holds for any valid preference vector π. Hence, the classic PageRank vector and some other variants also enjoy the entrywise error bounds, so long as they can be written as the solution to the linear system (1).
With theorem 2 and the separation measure, we then give the following corollary that bounds the accuracy of algorithm 2, in terms of graph edge density.
Corollary 1(exact recovery by APPR vector)
For any seed nodes, let C⊂V be the local cluster of n nodes returned by algorithm 2 with teleportation constant α and tolerance ε, and be the nodes in the seed node’s block. Assume that ρ < c0, , and thatfor some sufficiently large constants c0, c1, c2 > 0. If the desired size of the local cluster then, with probability at least we have(8)
A proof of corollary 1 is presented in Appendix A. We make a few remarks.
- (a)Corollary 1 demonstrates that algorithm 2 works under a sparse scenario, where the number of edges is exceedingly small in proportion to the number of possible edges in the network. To reach the entrywise control of the APPR vector and the sufficient separation of local cluster from others, theorem 2 calls for the expected node degree δ to grow with only a fraction (for any fixed teleportation constant α) of the logarithm of the size of the network, log (N). In other words, algorithm 2 requires a sample complexity (the number of edges) of order
- (b)The results show that α leverages between the sampling complexity and statistical performance of PPR clustering. To see this, rearrange condition (8),for some sufficiently small constant c′>0. As α increases, the left-hand side decreases to 0, thus making the condition more likely to hold. In contrast, as α increases, the tolerance ε must decrease at rate to guarantee an entrywise control of pε that is analogous to the form in theorem 2 ( Appendix A). More intuitively, if ε does not decrease, then, as α → 1, algorithm 1 may terminate early without reaching all vertices in the desired local cluster. In sum, algorithms 1 and 3 need at least queries (see the on-line supplementary materials section S2 for an example). This implies that we can approach the conditions in corollary 1 by setting the teleportation constant sufficiently large, whereas the computational burden can increase as α → 1.
5 Simulation studies
This section compares the PPR vector and the APPR vector. The results show the effectiveness and robustness of the APPR vector in detecting a local cluster. Experiment 1 utilizes the DCSBM with a power law degree distribution and investigates the effects of heterogeneous node degrees. Experiment 2 uses the SBM with unequal block sizes to study the influences of heterogeneous block degrees. Experiment 3 generates networks from the SBM with equal block sizes and varying edge density to examine the efficacy of PPR methods in sparse graphs.
In all simulations, we employ the block connectivity matrix B with homogeneous diagonal elements and homogeneous off-diagonal elements for any i ≠ j. Define the signal-to-noise ratio to be the expected number of in-block edges divided by the expected number of out-block edges, i.e. b1/{b2(K − 1)}, where K is the number of blocks. In particular, we set the signal-to-noise ratio to 1.5 and choose a teleportation constant of α = 0.15 throughout the section. Additional simulation results (illustrating theorem 2) are available from the on-line supplementary materials section S2.
5.1 Experiment 1
Fig. 1 plots PPR values (Figs 1(a) and 1(b)) and APPR values (Figs 1(c) and 1(d)) of a random graph generated from the DCSBM, excluding seed node(s). Figs 1(a) and 1(c) contrast the PPR and APPR when there is only one seed node and Figs 1(b) and 1(d) compare two vectors when 10 seed nodes are used. The vertices from the local block in the SBM are coloured in blue and the others are in yellow. The nodes are ordered first by block and then by node degree parameters θ (left is larger). A horizontal line is drawn for each block indicating the median of the APPR values within that block.
Comparison of (a), (b) PPR and (c), (d) APPR under the DCSBM with (a), (c) one seed node and (b), (d) 10 seed nodes:
, local cluster;
, other clusters;
, median of APPR values within each cluster
With one seed node (Figs 1(a) and 1(c)), the scatter plots have two clouds within each block. The upper cloud contains the immediate neighbours of the seed node. This separation disappears when multiple seed nodes are used (Figs 1(b) and 1(d)). To see the effect of node heterogeneity, the skewed distribution of PPR values in each block demonstrates its bias towards high degree nodes inside and outside the seed nodes block in the SBM. In contrast, APPR values are evenly distributed within blocks, verifying that the APPR vector removes the effects of node degree heterogeneity.
5.2 Experiment 2
Fig. 2(a) displays the PPR vector on an example network with b = 1.4, demonstrating its preference towards the high degree block (the third block) over local clusters. Fig. 2(b) depicts the APPR vector on the same network. Given the size of the first block, we measure the accuracy by the proportion of vertices belonging to the first block in the cluster returned. Fig. 2(c) shows the accuracy of PPR and APPR for six values of b (i.e. the geometric ratio in distribution (9)) where each point is the average of 100 sampled networks. The comparison demonstrates that the APPR vector corrects the bias of PPR caused by block heterogeneity. Moreover, block degree heterogeneity degrades the performance of both PPR and APPR. Note that APPR outperforms PPR even when b = 1; this is probably because, even when nodes have equal expected degrees in the SBM, the actual node degrees will be heterogeneous because of the randomness in the sampled graph. In a finite graph, this variability is enough to give APPR an advantage over PPR. Asymptotically, this advantage should fade away (Kloumann et al., 2017).
(a) Simulated network generated from the classic SBM of three blocks with block degree heterogeneity (
, median of PPR values within each cluster), (b) comparison of performance for PPR (
) and APPR (
) under the SBM with different levels of block degree heterogeneity and (c) comparison of performance for PPR (
) and APPR (
) under the four-parameters SBM with different sparsity: error bars are drawn by using the standard deviation
5.3 Experiment 3
Experiment 3 investigates the performance of PPR and APPR under the SBM where there is no heterogeneity in the expected node degrees or block degrees. A number of random networks were sampled from the four-parameter SBM (K = 3, N = 900, b1 = 0.6, b2 = 0.2) (Rohe et al., 2011). Under the four-parameter SBM, each of K blocks has equal size in expectation, N/K, and the probability of a connection between two nodes is b2 if they are in two separate blocks, or b1 if in the same block. In addition, the expected average degree varies: δ ∈ {15, 30, 45, 60, 75, 90}. For every setting, the results are averaged over 100 samples of the network. The PPR vector is calculated with one seed randomly chosen from block 1. Fig. 2(d) contrasts the accuracy of PPR and APPR against six values of expected average degree, showing that, when the sampled graph has minimal degree heterogeneity, the adjusted PPR vector has only slightly higher accuracy than the PPR vector.
6 A sample of Twitter
In this section, we provide a more detailed case-study to illustrate the properties of different PPR vectors. We obtain a local cluster of nodes around the seed node @NBCPolitics (NBC Politics) in the Twitter friendship graph. In the Twitter graph, the nodes are called handles or accounts (e.g. @NBCPolitics) and, if Twitter handle i follows Twitter handle j, then we define this as a directed edge (i, j) pointing from i to j. Affiliated with NBC News, NBC Politics specializes in political news coverage and had over 470000 followers on Twitter (in-degree) and follows 145 handles (out-degree) in December 2018. A brief look through @NBCPolitics’s following list reveals that it follows a wide range of accounts, from television programmes, reporters and editors affiliated with the National Broadcasting Company (NBC), to media accounts and journalists of other news outlets as well as politicians.
Data on following and handle profile information were collected through the standard Twitter search application programming interface. We queried the Twitter friendship graph starting from the seed node @NBCPolitics, using algorithm 3 with teleportation constant α = 0.15 and termination parameter ε = 10−7, ending up with 5840 surrounding handles. Through this exercise, we intend to illustrate the properties and applications of local clustering by using PPR, APPR and RPPR vectors, where we set the regularization parameter τ to 100.
We first present the results of PPR. As Table 7 shows, the top 30 handles (except @NBCPolitics) with the highest PPR values are a combination of
Top 30 handles of PPR with seed node @NBCPolitics and the teleportation constant α = 0.15 in December 2018†
| Rank . | Name . | Followers . | Description . |
|---|---|---|---|
| 1 | Melania Trump | 11242283 | This account is run by the Office of First Lady Melania … |
| 2 | The White House | 17625630 | Welcome to @WhiteHouse!: follow for the latest from … |
| 3 | Chuck Todd | 2032038 | Moderator of @meetthepress and @nbcnews political … |
| 4 | NBC News | 6280551 | The leading source of global news and info for more than … |
| 5 | NBC Nightly News | 962290 | Breaking news, in-depth reporting, context on news from … |
| 6 | Andrea Mitchell | 1737764 | NBC News Chief Foreign Affairs Correspondent/ … |
| 7 | Savannah Guthrie | 881669 | Mom to Vale & Charley, TODAY Co-Anchor, … |
| 8 | Joe Scarborough | 2521215 | With Malice Toward None |
| 9 | MSNBC | 2261911 | The place for in-depth analysis, political commentary … |
| 10 | Rachel Maddow MSNBC | 9498076 | I see political people … |
| 11 | Breaking News | 9223158 | |
| 12 | NBC News First Read | 53847 | The first place for news and analysis from the @NBC … |
| 13 | TODAY | 4276453 | America’s favorite morning show | Snapchat: todayshow |
| 14 | Meet the Press | 566713 | Meet the Press is the longest-running television show … |
| 15 | The Wall Street Journal | 16188842 | Breaking news and features from the WSJ |
| 16 | Pete Williams | 70062 | NBC News Justice Correspondent: covers US Supreme … |
| 17 | Mark Murray | 97571 | Mark Murray is the senior political editor for NBC … |
| 18 | POLITICO | 3695835 | Nobody knows politics like POLITICO: got a news tip … |
| 19 | Katy Tur | 587474 | MSNBC anchor @2pm, NBC News correspondent, … |
| 20 | Bill Clinton | 10697521 | Founder, Clinton Foundation and 42nd President of the … |
| 21 | Kasie Hunt | 381704 | @NBCNews Capitol Hill Correspondent: host, … |
| 22 | TIME | 15584815 | Breaking news and current events from around the globe: … |
| 23 | Kelly O’Donnell | 195765 | White House Correspondent @NBCNews Veteran of … |
| 24 | John McCain | 3181773 | Memorial account for U.S. Senator John McCain, … |
| 25 | Peter Alexander | 283522 | @NBCNews White House Correspondent/Weekend … |
| 26 | Hallie Jackson | 359099 | Chief White House Correspondent/@NBCNews/ … |
| 27 | Kristen Welker | 182244 | @NBCNews White House Correspondent: links and … |
| 28 | Carrie Dann | 37119 | @NBCNews / @NBCPolitics: RTs not endorsements |
| 29 | Willie Geist | 807536 | Host @NBC #SundayTODAY, Co-Host @Morning_Joe … |
| 30 | Morning Joe | 563650 | Live tweet during the show!: links to must-read op-eds … |
| Rank . | Name . | Followers . | Description . |
|---|---|---|---|
| 1 | Melania Trump | 11242283 | This account is run by the Office of First Lady Melania … |
| 2 | The White House | 17625630 | Welcome to @WhiteHouse!: follow for the latest from … |
| 3 | Chuck Todd | 2032038 | Moderator of @meetthepress and @nbcnews political … |
| 4 | NBC News | 6280551 | The leading source of global news and info for more than … |
| 5 | NBC Nightly News | 962290 | Breaking news, in-depth reporting, context on news from … |
| 6 | Andrea Mitchell | 1737764 | NBC News Chief Foreign Affairs Correspondent/ … |
| 7 | Savannah Guthrie | 881669 | Mom to Vale & Charley, TODAY Co-Anchor, … |
| 8 | Joe Scarborough | 2521215 | With Malice Toward None |
| 9 | MSNBC | 2261911 | The place for in-depth analysis, political commentary … |
| 10 | Rachel Maddow MSNBC | 9498076 | I see political people … |
| 11 | Breaking News | 9223158 | |
| 12 | NBC News First Read | 53847 | The first place for news and analysis from the @NBC … |
| 13 | TODAY | 4276453 | America’s favorite morning show | Snapchat: todayshow |
| 14 | Meet the Press | 566713 | Meet the Press is the longest-running television show … |
| 15 | The Wall Street Journal | 16188842 | Breaking news and features from the WSJ |
| 16 | Pete Williams | 70062 | NBC News Justice Correspondent: covers US Supreme … |
| 17 | Mark Murray | 97571 | Mark Murray is the senior political editor for NBC … |
| 18 | POLITICO | 3695835 | Nobody knows politics like POLITICO: got a news tip … |
| 19 | Katy Tur | 587474 | MSNBC anchor @2pm, NBC News correspondent, … |
| 20 | Bill Clinton | 10697521 | Founder, Clinton Foundation and 42nd President of the … |
| 21 | Kasie Hunt | 381704 | @NBCNews Capitol Hill Correspondent: host, … |
| 22 | TIME | 15584815 | Breaking news and current events from around the globe: … |
| 23 | Kelly O’Donnell | 195765 | White House Correspondent @NBCNews Veteran of … |
| 24 | John McCain | 3181773 | Memorial account for U.S. Senator John McCain, … |
| 25 | Peter Alexander | 283522 | @NBCNews White House Correspondent/Weekend … |
| 26 | Hallie Jackson | 359099 | Chief White House Correspondent/@NBCNews/ … |
| 27 | Kristen Welker | 182244 | @NBCNews White House Correspondent: links and … |
| 28 | Carrie Dann | 37119 | @NBCNews / @NBCPolitics: RTs not endorsements |
| 29 | Willie Geist | 807536 | Host @NBC #SundayTODAY, Co-Host @Morning_Joe … |
| 30 | Morning Joe | 563650 | Live tweet during the show!: links to must-read op-eds … |
Through the PPR vector, the top 30 handles returned to @NBCPolitics include NBC’s news-related programmes and celebrity reporters and comparable mainstream media outlets, as well as prominent political and public figures and institutions. Such results line up with its status as a mainstream political news source, demonstrating clustering effectiveness. Those Twitter handles tend to have millions of followers, showing the PPR vector’s bias towards high in-degree.
Top 30 handles of PPR with seed node @NBCPolitics and the teleportation constant α = 0.15 in December 2018†
| Rank . | Name . | Followers . | Description . |
|---|---|---|---|
| 1 | Melania Trump | 11242283 | This account is run by the Office of First Lady Melania … |
| 2 | The White House | 17625630 | Welcome to @WhiteHouse!: follow for the latest from … |
| 3 | Chuck Todd | 2032038 | Moderator of @meetthepress and @nbcnews political … |
| 4 | NBC News | 6280551 | The leading source of global news and info for more than … |
| 5 | NBC Nightly News | 962290 | Breaking news, in-depth reporting, context on news from … |
| 6 | Andrea Mitchell | 1737764 | NBC News Chief Foreign Affairs Correspondent/ … |
| 7 | Savannah Guthrie | 881669 | Mom to Vale & Charley, TODAY Co-Anchor, … |
| 8 | Joe Scarborough | 2521215 | With Malice Toward None |
| 9 | MSNBC | 2261911 | The place for in-depth analysis, political commentary … |
| 10 | Rachel Maddow MSNBC | 9498076 | I see political people … |
| 11 | Breaking News | 9223158 | |
| 12 | NBC News First Read | 53847 | The first place for news and analysis from the @NBC … |
| 13 | TODAY | 4276453 | America’s favorite morning show | Snapchat: todayshow |
| 14 | Meet the Press | 566713 | Meet the Press is the longest-running television show … |
| 15 | The Wall Street Journal | 16188842 | Breaking news and features from the WSJ |
| 16 | Pete Williams | 70062 | NBC News Justice Correspondent: covers US Supreme … |
| 17 | Mark Murray | 97571 | Mark Murray is the senior political editor for NBC … |
| 18 | POLITICO | 3695835 | Nobody knows politics like POLITICO: got a news tip … |
| 19 | Katy Tur | 587474 | MSNBC anchor @2pm, NBC News correspondent, … |
| 20 | Bill Clinton | 10697521 | Founder, Clinton Foundation and 42nd President of the … |
| 21 | Kasie Hunt | 381704 | @NBCNews Capitol Hill Correspondent: host, … |
| 22 | TIME | 15584815 | Breaking news and current events from around the globe: … |
| 23 | Kelly O’Donnell | 195765 | White House Correspondent @NBCNews Veteran of … |
| 24 | John McCain | 3181773 | Memorial account for U.S. Senator John McCain, … |
| 25 | Peter Alexander | 283522 | @NBCNews White House Correspondent/Weekend … |
| 26 | Hallie Jackson | 359099 | Chief White House Correspondent/@NBCNews/ … |
| 27 | Kristen Welker | 182244 | @NBCNews White House Correspondent: links and … |
| 28 | Carrie Dann | 37119 | @NBCNews / @NBCPolitics: RTs not endorsements |
| 29 | Willie Geist | 807536 | Host @NBC #SundayTODAY, Co-Host @Morning_Joe … |
| 30 | Morning Joe | 563650 | Live tweet during the show!: links to must-read op-eds … |
| Rank . | Name . | Followers . | Description . |
|---|---|---|---|
| 1 | Melania Trump | 11242283 | This account is run by the Office of First Lady Melania … |
| 2 | The White House | 17625630 | Welcome to @WhiteHouse!: follow for the latest from … |
| 3 | Chuck Todd | 2032038 | Moderator of @meetthepress and @nbcnews political … |
| 4 | NBC News | 6280551 | The leading source of global news and info for more than … |
| 5 | NBC Nightly News | 962290 | Breaking news, in-depth reporting, context on news from … |
| 6 | Andrea Mitchell | 1737764 | NBC News Chief Foreign Affairs Correspondent/ … |
| 7 | Savannah Guthrie | 881669 | Mom to Vale & Charley, TODAY Co-Anchor, … |
| 8 | Joe Scarborough | 2521215 | With Malice Toward None |
| 9 | MSNBC | 2261911 | The place for in-depth analysis, political commentary … |
| 10 | Rachel Maddow MSNBC | 9498076 | I see political people … |
| 11 | Breaking News | 9223158 | |
| 12 | NBC News First Read | 53847 | The first place for news and analysis from the @NBC … |
| 13 | TODAY | 4276453 | America’s favorite morning show | Snapchat: todayshow |
| 14 | Meet the Press | 566713 | Meet the Press is the longest-running television show … |
| 15 | The Wall Street Journal | 16188842 | Breaking news and features from the WSJ |
| 16 | Pete Williams | 70062 | NBC News Justice Correspondent: covers US Supreme … |
| 17 | Mark Murray | 97571 | Mark Murray is the senior political editor for NBC … |
| 18 | POLITICO | 3695835 | Nobody knows politics like POLITICO: got a news tip … |
| 19 | Katy Tur | 587474 | MSNBC anchor @2pm, NBC News correspondent, … |
| 20 | Bill Clinton | 10697521 | Founder, Clinton Foundation and 42nd President of the … |
| 21 | Kasie Hunt | 381704 | @NBCNews Capitol Hill Correspondent: host, … |
| 22 | TIME | 15584815 | Breaking news and current events from around the globe: … |
| 23 | Kelly O’Donnell | 195765 | White House Correspondent @NBCNews Veteran of … |
| 24 | John McCain | 3181773 | Memorial account for U.S. Senator John McCain, … |
| 25 | Peter Alexander | 283522 | @NBCNews White House Correspondent/Weekend … |
| 26 | Hallie Jackson | 359099 | Chief White House Correspondent/@NBCNews/ … |
| 27 | Kristen Welker | 182244 | @NBCNews White House Correspondent: links and … |
| 28 | Carrie Dann | 37119 | @NBCNews / @NBCPolitics: RTs not endorsements |
| 29 | Willie Geist | 807536 | Host @NBC #SundayTODAY, Co-Host @Morning_Joe … |
| 30 | Morning Joe | 563650 | Live tweet during the show!: links to must-read op-eds … |
Through the PPR vector, the top 30 handles returned to @NBCPolitics include NBC’s news-related programmes and celebrity reporters and comparable mainstream media outlets, as well as prominent political and public figures and institutions. Such results line up with its status as a mainstream political news source, demonstrating clustering effectiveness. Those Twitter handles tend to have millions of followers, showing the PPR vector’s bias towards high in-degree.
- (a)
NBC’s news-related programmes such as NBC News, TODAY and Meet the Press,
- (b)
NBC’s political reporters, anchors and editors, from well-known figures like Chuck Todd and Andrea Mitchell to less-known figures like Pete Williams (Justice Correspondent) and Mark Murray (Senior Political Editor);
- (c)
other mainstream news outlets such as the Wall Street Journal, POLITICO and TIME, and
- (d)
prominent public figures and politicians like Melania Trump, Bill Clinton and John McCain.
In light of NBC’s status as a mainstream news outlet and the political focus of @NBCPolitics, such results make sound sense. It must also be noted that all the top 30 handles are direct friends of @NBCPolitics’s and have at least tens of thousands of followers. The median follower count is 1.4 million, suggesting high in-degrees. In fact, the pattern that is observed in the top 30 extends to the top 200 handles with the highest PPR values, which include NBC’s own programmes, journalists, editors and staff, fellow mainstream media outlets and their staff, and prominent public figures, politicians and government institutions (see the on-line supplementary materials section S4). The median in-degree of the top 200 handles is around 184000, though there are four handles with fewer than 1000 followers. One important thing to note is that, among the top 200 handles, the first 139 are all directly followed by @NBCPolitics, with handles having high in-degrees generally ranked higher than those having low in-degrees (although @NBCPolitics follows 145 handles, six of them might have privacy protection that has prevented us from accessing their information). The remaining handles that are on the list, although not directly followed by @NBCPolitics, include five handles that are associated with NBC, from its news anchor Lester Holt to its News International President. However, the majority of those indirectly followed by @NBCPolitics are mainly high profile political and public figures (like President Trump, Vice-President Pence, Hillary Clinton and Stephen Colbert), government organizations (like the White House Office of Cabinet Affairs and National Security Council) and mainstream news outlets (like the New York Times, Cable News Network and the Associated Press) and well-known journalists (like John Dickerson and Anderson Cooper). We can thus conclude that the PPR vector is biased towards popular accounts followed directly by the seed node or indirectly by its friends, reflecting the popular Twitter handles that are followed by them. This property of the PPR vector can be harnessed by researchers who are interested in identifying the upstream of a handle, i.e. those Twitter elites who are followed by and might influence the seed node and by extension its followers.
In contrast, the APPR vector upweights handles that are much less popular (i.e. those with low in-degrees). As shown in Table 8, the 30 handles with the highest APPR values include NBC’s reporters, writers, editors, producers and programmes, all of whom have a few hundred to a few thousand followers. The 30 handles also include those unaffiliated with NBC, such as the director of a non-profit (Enroll America), the director of digital programming at National Geographic and @CNNPolitics’s editor. All of them are professionally related to the seed node. This testifies to the applicability of APPR for locating an idiosyncratic local cluster around a seed node. However, more than half (17) of the 30 handles are obscure and not directly followed by @NBCPolitics. The reason why they appear on the list is probably that they have just one and at most a dozen followers (recall that APPR divides by in-degree). In fact, 160 of the top 200 handles are not direct friends of @NBCPolitics; the median in-degree of the top 200 handles is merely 8 (on-line supplementary materials section S4). Those handles might have ended up on the list through a combination of luck and, more importantly, their extremely low in-degrees. In this regard, noise can be introduced by the APPR vector because it prioritizes handles with extremely low in-degrees that are possibly several degrees separated from the seed node.
Top 30 handles of APPR with seed node @NBCPolitics and the teleportation constant α = 0.15 in December 2018†
| Rank . | Name . | Followers . | Description . |
|---|---|---|---|
| 1 | Stephanie Palla | 198 | Enroll America National Regional Director … |
| 2 | Jennifer Sizemore | 386 | |
| 3 | Alissa Swango | 441 | Director of Digital Programming at @natgeo: all things … |
| 4 | Making a Difference | 670 | @NBCNightlyNews’ popular feature profiles ordinary … |
| 5 | Ron Whittemore | 1 | |
| 6 | Svante Stockselius | 3 | |
| 7 | Greg Martin | 1161 | Political Booking Producer at @nbcnews @todayshow |
| 8 | Area Man | 1 | I am Area Man. I pwn your news feed |
| 9 | CELESTIA ROBINSON | 2 | |
| 10 | NBC Field Notes | 1390 | NBC News correspondents and http://t.co/1eSopOQt8s … |
| 11 | rob adams | 2 | |
| 12 | JL | 2 | |
| 13 | David Kelsey | 1 | |
| 14 | Hank Morris | 1 | |
| 15 | Jesse Marks | 1 | |
| 16 | Brayden Rainey | 1 | |
| 17 | child of the tiger | 3 | Yet another activist twitter, fighting all those fun … |
| 18 | Julie Swango | 4 | |
| 19 | Author Dianne Kube | 7 | Dianne Kube is an Author with a passion, for family, … |
| 20 | Consider the Source | 7 | |
| 21 | Adam Edelman | 2341 | Political reporter @nbcnews: Wisconsin native, … |
| 22 | Phil McCausland | 2519 | @NBCNews Digital reporter focused on the rural-urban … |
| 23 | Corky Siemaszko | 2538 | Senior Writer at NBC News Digital (former NY Daily … |
| 24 | Sam Petulla | 2588 | Editor @cnnpolitics: Usually looking for datasets … |
| 25 | Ken Strickland | 2693 | NBC News Washington Bureau Chief |
| 26 | Mike Mullen | 7 | |
| 27 | Elyse PG | 2697 | White House producer @nbcnews |@USCAnnenberg alum … |
| 28 | A. Johnson | 2 | Change your thoughts & you change your world: -Normal … |
| 29 | Steve Fenton | 4 | |
| 30 | Dobe Pitty Mami | 13 |
| Rank . | Name . | Followers . | Description . |
|---|---|---|---|
| 1 | Stephanie Palla | 198 | Enroll America National Regional Director … |
| 2 | Jennifer Sizemore | 386 | |
| 3 | Alissa Swango | 441 | Director of Digital Programming at @natgeo: all things … |
| 4 | Making a Difference | 670 | @NBCNightlyNews’ popular feature profiles ordinary … |
| 5 | Ron Whittemore | 1 | |
| 6 | Svante Stockselius | 3 | |
| 7 | Greg Martin | 1161 | Political Booking Producer at @nbcnews @todayshow |
| 8 | Area Man | 1 | I am Area Man. I pwn your news feed |
| 9 | CELESTIA ROBINSON | 2 | |
| 10 | NBC Field Notes | 1390 | NBC News correspondents and http://t.co/1eSopOQt8s … |
| 11 | rob adams | 2 | |
| 12 | JL | 2 | |
| 13 | David Kelsey | 1 | |
| 14 | Hank Morris | 1 | |
| 15 | Jesse Marks | 1 | |
| 16 | Brayden Rainey | 1 | |
| 17 | child of the tiger | 3 | Yet another activist twitter, fighting all those fun … |
| 18 | Julie Swango | 4 | |
| 19 | Author Dianne Kube | 7 | Dianne Kube is an Author with a passion, for family, … |
| 20 | Consider the Source | 7 | |
| 21 | Adam Edelman | 2341 | Political reporter @nbcnews: Wisconsin native, … |
| 22 | Phil McCausland | 2519 | @NBCNews Digital reporter focused on the rural-urban … |
| 23 | Corky Siemaszko | 2538 | Senior Writer at NBC News Digital (former NY Daily … |
| 24 | Sam Petulla | 2588 | Editor @cnnpolitics: Usually looking for datasets … |
| 25 | Ken Strickland | 2693 | NBC News Washington Bureau Chief |
| 26 | Mike Mullen | 7 | |
| 27 | Elyse PG | 2697 | White House producer @nbcnews |@USCAnnenberg alum … |
| 28 | A. Johnson | 2 | Change your thoughts & you change your world: -Normal … |
| 29 | Steve Fenton | 4 | |
| 30 | Dobe Pitty Mami | 13 |
Through the APPR vector, the top 30 handles returned to @NBCPolitics include some relevant handles (NBC’s news team and their counterparts in other mainstream news organizations) and many obscure handles (handles with few followers and no profile descriptions). This results from the APPR vector’s bias towards extreme low degree and introduces noise to the clustering results.
Top 30 handles of APPR with seed node @NBCPolitics and the teleportation constant α = 0.15 in December 2018†
| Rank . | Name . | Followers . | Description . |
|---|---|---|---|
| 1 | Stephanie Palla | 198 | Enroll America National Regional Director … |
| 2 | Jennifer Sizemore | 386 | |
| 3 | Alissa Swango | 441 | Director of Digital Programming at @natgeo: all things … |
| 4 | Making a Difference | 670 | @NBCNightlyNews’ popular feature profiles ordinary … |
| 5 | Ron Whittemore | 1 | |
| 6 | Svante Stockselius | 3 | |
| 7 | Greg Martin | 1161 | Political Booking Producer at @nbcnews @todayshow |
| 8 | Area Man | 1 | I am Area Man. I pwn your news feed |
| 9 | CELESTIA ROBINSON | 2 | |
| 10 | NBC Field Notes | 1390 | NBC News correspondents and http://t.co/1eSopOQt8s … |
| 11 | rob adams | 2 | |
| 12 | JL | 2 | |
| 13 | David Kelsey | 1 | |
| 14 | Hank Morris | 1 | |
| 15 | Jesse Marks | 1 | |
| 16 | Brayden Rainey | 1 | |
| 17 | child of the tiger | 3 | Yet another activist twitter, fighting all those fun … |
| 18 | Julie Swango | 4 | |
| 19 | Author Dianne Kube | 7 | Dianne Kube is an Author with a passion, for family, … |
| 20 | Consider the Source | 7 | |
| 21 | Adam Edelman | 2341 | Political reporter @nbcnews: Wisconsin native, … |
| 22 | Phil McCausland | 2519 | @NBCNews Digital reporter focused on the rural-urban … |
| 23 | Corky Siemaszko | 2538 | Senior Writer at NBC News Digital (former NY Daily … |
| 24 | Sam Petulla | 2588 | Editor @cnnpolitics: Usually looking for datasets … |
| 25 | Ken Strickland | 2693 | NBC News Washington Bureau Chief |
| 26 | Mike Mullen | 7 | |
| 27 | Elyse PG | 2697 | White House producer @nbcnews |@USCAnnenberg alum … |
| 28 | A. Johnson | 2 | Change your thoughts & you change your world: -Normal … |
| 29 | Steve Fenton | 4 | |
| 30 | Dobe Pitty Mami | 13 |
| Rank . | Name . | Followers . | Description . |
|---|---|---|---|
| 1 | Stephanie Palla | 198 | Enroll America National Regional Director … |
| 2 | Jennifer Sizemore | 386 | |
| 3 | Alissa Swango | 441 | Director of Digital Programming at @natgeo: all things … |
| 4 | Making a Difference | 670 | @NBCNightlyNews’ popular feature profiles ordinary … |
| 5 | Ron Whittemore | 1 | |
| 6 | Svante Stockselius | 3 | |
| 7 | Greg Martin | 1161 | Political Booking Producer at @nbcnews @todayshow |
| 8 | Area Man | 1 | I am Area Man. I pwn your news feed |
| 9 | CELESTIA ROBINSON | 2 | |
| 10 | NBC Field Notes | 1390 | NBC News correspondents and http://t.co/1eSopOQt8s … |
| 11 | rob adams | 2 | |
| 12 | JL | 2 | |
| 13 | David Kelsey | 1 | |
| 14 | Hank Morris | 1 | |
| 15 | Jesse Marks | 1 | |
| 16 | Brayden Rainey | 1 | |
| 17 | child of the tiger | 3 | Yet another activist twitter, fighting all those fun … |
| 18 | Julie Swango | 4 | |
| 19 | Author Dianne Kube | 7 | Dianne Kube is an Author with a passion, for family, … |
| 20 | Consider the Source | 7 | |
| 21 | Adam Edelman | 2341 | Political reporter @nbcnews: Wisconsin native, … |
| 22 | Phil McCausland | 2519 | @NBCNews Digital reporter focused on the rural-urban … |
| 23 | Corky Siemaszko | 2538 | Senior Writer at NBC News Digital (former NY Daily … |
| 24 | Sam Petulla | 2588 | Editor @cnnpolitics: Usually looking for datasets … |
| 25 | Ken Strickland | 2693 | NBC News Washington Bureau Chief |
| 26 | Mike Mullen | 7 | |
| 27 | Elyse PG | 2697 | White House producer @nbcnews |@USCAnnenberg alum … |
| 28 | A. Johnson | 2 | Change your thoughts & you change your world: -Normal … |
| 29 | Steve Fenton | 4 | |
| 30 | Dobe Pitty Mami | 13 |
Through the APPR vector, the top 30 handles returned to @NBCPolitics include some relevant handles (NBC’s news team and their counterparts in other mainstream news organizations) and many obscure handles (handles with few followers and no profile descriptions). This results from the APPR vector’s bias towards extreme low degree and introduces noise to the clustering results.
To reduce noise, we applied a regularization step to the APPR vector to remove those ‘distant’ and small nodes while preserving the close and relevant nodes. In Table 9, the majority of the top 30 handles with the highest regularized APPR (i.e. RPPR) values have three- or four-digit numbers of followers. Similarly to the APPR results, they include NBC’s news crew. But the difference is that the overwhelming majority (18) of the top 30 handles work at NBC. Some handles who work for other news organizations (e.g. Sam Petulla at @cnnpolitics and Emmanuelle Saliba at @Euronews) might have previously worked at NBC or have a close connection with its news team. Even the four handles that are not directly followed by @NBCPolitics are interesting—they are non-profit organizations (NYC Clothing Bank and Voices United) and a news-related individual or organization (James Miklaszewski and Social Headlines). This pattern can also be observed in the top 200 handles, 72 of whom are directly followed by @NBCPolitics. The overwhelming majority of those who are directly followed by it are affiliated with NBC, comprising its day-to-day news team, who enjoy much less publicity than the celebrity reporters. The remaining 128 of them, who are not directly followed by @NBCPolitics, actually also include 20 of NBC’s journalists and staff, such as Ray Farmer (NBC News photographer) and Jim Miklaszewski (Chief Pentagon Correspondent for NBC News). Others are non-profit organizations like Vets Helping Heroes and professionals from other news organizations or companies such as the Wall Street Journal, National Football League Network and Microsoft, who might have worked for NBC or have a close connection with it. Although there still appear to be obscure handles with few followers, they decrease significantly in number—the median in-degree of the top 200 handles is 340 (on-line supplementary materials section S4): a precipitous drop from that of the top PPR handles yet not too small compared with that of the top APPR handles. We thus conclude that the regularized APPR vector returns a local cluster with little noise, reflecting a seed node’s close circles, either directly or indirectly related.
Top 30 handles of RPPR with seed node @NBCPolitics and the teleportation constant α = 0.15 in December 2018†
| Rank . | Name . | Followers . | Description . |
|---|---|---|---|
| 1 | Stephanie Palla | 198 | Enroll America National Regional Director http://t.co/X6jJIE … |
| 2 | Jennifer Sizemore | 386 | |
| 3 | Alissa Swango | 441 | Director of Digital Programming at @natgeo: all things food … |
| 4 | Making a Difference | 670 | @NBCNightlyNews’ popular feature profiles ordinary people do … |
| 5 | Greg Martin | 1161 | Political Booking Producer at @nbcnews @todayshow |
| 6 | NBC Field Notes | 1390 | NBC News correspondents and http://t.co/1eSopOQt8s reporters … |
| 7 | Adam Edelman | 2341 | Political reporter @nbcnews: Wisconsin native, Bestchester … |
| 8 | Phil McCausland | 2519 | @NBCNews Digital reporter focused on the rural-urban divide … |
| 9 | Corky Siemaszko | 2538 | Senior Writer at NBC News Digital (former NY Daily News … |
| 10 | Sam Petulla | 2588 | Editor @cnnpolitics: usually looking for datasets; you can … |
| 11 | Ken Strickland | 2693 | NBC News Washington Bureau Chief |
| 12 | Elyse PG | 2697 | White House producer @nbcnews |@USCAnnenberg alum | LA kid … |
| 13 | Hasani Gittens | 3002 | Level 29 Mage: senior News Ed. @NBCNews; sheriff of Nattahna … |
| 14 | Scott Foster | 3464 | Senior Producer, Washington @NBCNEWS @TODAYshow |
| 15 | Zach Haberman | 3693 | Lead Breaking News Editor, @NBCNews: previously had other jobs … |
| 16 | Emmanuelle Saliba | 4004 | Head of Social Media Strategy @Euronews | Launched #THECUBE … |
| 17 | Alex Johnson | 4371 | News, data and analysis for @NBCNews; data geek; … |
| 18 | Savannah Sellers | 4637 | News junkie: host of NBC’s "Stay Tuned" on Snapchat … |
| 19 | NYC Clothing Bank | 154 | We distribute new, never-worn clothing and merchandise … |
| 20 | Shaquille Brewster | 5362 | @NBCNews Producer/Politics | @HowardU Alum| Journalist … |
| 21 | Joey Scarborough | 6277 | NBC News Social Media Editor: New York Daily News Alum; RTs … |
| 22 | Jane C. Timm | 6478 | @nbcnews political reporter and fact checker: more fun than … |
| 23 | Anthony Terrell | 6827 | Emmy Award winning journalist: political observer; covered … |
| 24 | NBC News Videos | 7838 | The latest video from http://t.co/xPyvMOTEF6 |
| 25 | Libby Leist | 7946 | Executive Producer @todayshow |
| 26 | Voices United | 310 | Voices United is a non profit educational organization … |
| 27 | Social Headlines | 344 | Daily roundup of top social media and networking stories |
| 28 | James Miklaszewski | 337 | Writer, Photographer, Editor, Director, Producer, Newshound … |
| 29 | Courtney Kube | 9494 | NBC News National Security & Military Reporter … |
| 30 | Bob Corker | 10042 | Serving Tennesseans in the U.S. Senate |
| Rank . | Name . | Followers . | Description . |
|---|---|---|---|
| 1 | Stephanie Palla | 198 | Enroll America National Regional Director http://t.co/X6jJIE … |
| 2 | Jennifer Sizemore | 386 | |
| 3 | Alissa Swango | 441 | Director of Digital Programming at @natgeo: all things food … |
| 4 | Making a Difference | 670 | @NBCNightlyNews’ popular feature profiles ordinary people do … |
| 5 | Greg Martin | 1161 | Political Booking Producer at @nbcnews @todayshow |
| 6 | NBC Field Notes | 1390 | NBC News correspondents and http://t.co/1eSopOQt8s reporters … |
| 7 | Adam Edelman | 2341 | Political reporter @nbcnews: Wisconsin native, Bestchester … |
| 8 | Phil McCausland | 2519 | @NBCNews Digital reporter focused on the rural-urban divide … |
| 9 | Corky Siemaszko | 2538 | Senior Writer at NBC News Digital (former NY Daily News … |
| 10 | Sam Petulla | 2588 | Editor @cnnpolitics: usually looking for datasets; you can … |
| 11 | Ken Strickland | 2693 | NBC News Washington Bureau Chief |
| 12 | Elyse PG | 2697 | White House producer @nbcnews |@USCAnnenberg alum | LA kid … |
| 13 | Hasani Gittens | 3002 | Level 29 Mage: senior News Ed. @NBCNews; sheriff of Nattahna … |
| 14 | Scott Foster | 3464 | Senior Producer, Washington @NBCNEWS @TODAYshow |
| 15 | Zach Haberman | 3693 | Lead Breaking News Editor, @NBCNews: previously had other jobs … |
| 16 | Emmanuelle Saliba | 4004 | Head of Social Media Strategy @Euronews | Launched #THECUBE … |
| 17 | Alex Johnson | 4371 | News, data and analysis for @NBCNews; data geek; … |
| 18 | Savannah Sellers | 4637 | News junkie: host of NBC’s "Stay Tuned" on Snapchat … |
| 19 | NYC Clothing Bank | 154 | We distribute new, never-worn clothing and merchandise … |
| 20 | Shaquille Brewster | 5362 | @NBCNews Producer/Politics | @HowardU Alum| Journalist … |
| 21 | Joey Scarborough | 6277 | NBC News Social Media Editor: New York Daily News Alum; RTs … |
| 22 | Jane C. Timm | 6478 | @nbcnews political reporter and fact checker: more fun than … |
| 23 | Anthony Terrell | 6827 | Emmy Award winning journalist: political observer; covered … |
| 24 | NBC News Videos | 7838 | The latest video from http://t.co/xPyvMOTEF6 |
| 25 | Libby Leist | 7946 | Executive Producer @todayshow |
| 26 | Voices United | 310 | Voices United is a non profit educational organization … |
| 27 | Social Headlines | 344 | Daily roundup of top social media and networking stories |
| 28 | James Miklaszewski | 337 | Writer, Photographer, Editor, Director, Producer, Newshound … |
| 29 | Courtney Kube | 9494 | NBC News National Security & Military Reporter … |
| 30 | Bob Corker | 10042 | Serving Tennesseans in the U.S. Senate |
Through the RPPR vector, the top 30 handles returned to @NBCPolitics include much fewer low in-degree and obscure handles and many more moderately connected nodes that are relevant to @NBCPolitics, including its reporters and editors and media professionals from other organizations.
Top 30 handles of RPPR with seed node @NBCPolitics and the teleportation constant α = 0.15 in December 2018†
| Rank . | Name . | Followers . | Description . |
|---|---|---|---|
| 1 | Stephanie Palla | 198 | Enroll America National Regional Director http://t.co/X6jJIE … |
| 2 | Jennifer Sizemore | 386 | |
| 3 | Alissa Swango | 441 | Director of Digital Programming at @natgeo: all things food … |
| 4 | Making a Difference | 670 | @NBCNightlyNews’ popular feature profiles ordinary people do … |
| 5 | Greg Martin | 1161 | Political Booking Producer at @nbcnews @todayshow |
| 6 | NBC Field Notes | 1390 | NBC News correspondents and http://t.co/1eSopOQt8s reporters … |
| 7 | Adam Edelman | 2341 | Political reporter @nbcnews: Wisconsin native, Bestchester … |
| 8 | Phil McCausland | 2519 | @NBCNews Digital reporter focused on the rural-urban divide … |
| 9 | Corky Siemaszko | 2538 | Senior Writer at NBC News Digital (former NY Daily News … |
| 10 | Sam Petulla | 2588 | Editor @cnnpolitics: usually looking for datasets; you can … |
| 11 | Ken Strickland | 2693 | NBC News Washington Bureau Chief |
| 12 | Elyse PG | 2697 | White House producer @nbcnews |@USCAnnenberg alum | LA kid … |
| 13 | Hasani Gittens | 3002 | Level 29 Mage: senior News Ed. @NBCNews; sheriff of Nattahna … |
| 14 | Scott Foster | 3464 | Senior Producer, Washington @NBCNEWS @TODAYshow |
| 15 | Zach Haberman | 3693 | Lead Breaking News Editor, @NBCNews: previously had other jobs … |
| 16 | Emmanuelle Saliba | 4004 | Head of Social Media Strategy @Euronews | Launched #THECUBE … |
| 17 | Alex Johnson | 4371 | News, data and analysis for @NBCNews; data geek; … |
| 18 | Savannah Sellers | 4637 | News junkie: host of NBC’s "Stay Tuned" on Snapchat … |
| 19 | NYC Clothing Bank | 154 | We distribute new, never-worn clothing and merchandise … |
| 20 | Shaquille Brewster | 5362 | @NBCNews Producer/Politics | @HowardU Alum| Journalist … |
| 21 | Joey Scarborough | 6277 | NBC News Social Media Editor: New York Daily News Alum; RTs … |
| 22 | Jane C. Timm | 6478 | @nbcnews political reporter and fact checker: more fun than … |
| 23 | Anthony Terrell | 6827 | Emmy Award winning journalist: political observer; covered … |
| 24 | NBC News Videos | 7838 | The latest video from http://t.co/xPyvMOTEF6 |
| 25 | Libby Leist | 7946 | Executive Producer @todayshow |
| 26 | Voices United | 310 | Voices United is a non profit educational organization … |
| 27 | Social Headlines | 344 | Daily roundup of top social media and networking stories |
| 28 | James Miklaszewski | 337 | Writer, Photographer, Editor, Director, Producer, Newshound … |
| 29 | Courtney Kube | 9494 | NBC News National Security & Military Reporter … |
| 30 | Bob Corker | 10042 | Serving Tennesseans in the U.S. Senate |
| Rank . | Name . | Followers . | Description . |
|---|---|---|---|
| 1 | Stephanie Palla | 198 | Enroll America National Regional Director http://t.co/X6jJIE … |
| 2 | Jennifer Sizemore | 386 | |
| 3 | Alissa Swango | 441 | Director of Digital Programming at @natgeo: all things food … |
| 4 | Making a Difference | 670 | @NBCNightlyNews’ popular feature profiles ordinary people do … |
| 5 | Greg Martin | 1161 | Political Booking Producer at @nbcnews @todayshow |
| 6 | NBC Field Notes | 1390 | NBC News correspondents and http://t.co/1eSopOQt8s reporters … |
| 7 | Adam Edelman | 2341 | Political reporter @nbcnews: Wisconsin native, Bestchester … |
| 8 | Phil McCausland | 2519 | @NBCNews Digital reporter focused on the rural-urban divide … |
| 9 | Corky Siemaszko | 2538 | Senior Writer at NBC News Digital (former NY Daily News … |
| 10 | Sam Petulla | 2588 | Editor @cnnpolitics: usually looking for datasets; you can … |
| 11 | Ken Strickland | 2693 | NBC News Washington Bureau Chief |
| 12 | Elyse PG | 2697 | White House producer @nbcnews |@USCAnnenberg alum | LA kid … |
| 13 | Hasani Gittens | 3002 | Level 29 Mage: senior News Ed. @NBCNews; sheriff of Nattahna … |
| 14 | Scott Foster | 3464 | Senior Producer, Washington @NBCNEWS @TODAYshow |
| 15 | Zach Haberman | 3693 | Lead Breaking News Editor, @NBCNews: previously had other jobs … |
| 16 | Emmanuelle Saliba | 4004 | Head of Social Media Strategy @Euronews | Launched #THECUBE … |
| 17 | Alex Johnson | 4371 | News, data and analysis for @NBCNews; data geek; … |
| 18 | Savannah Sellers | 4637 | News junkie: host of NBC’s "Stay Tuned" on Snapchat … |
| 19 | NYC Clothing Bank | 154 | We distribute new, never-worn clothing and merchandise … |
| 20 | Shaquille Brewster | 5362 | @NBCNews Producer/Politics | @HowardU Alum| Journalist … |
| 21 | Joey Scarborough | 6277 | NBC News Social Media Editor: New York Daily News Alum; RTs … |
| 22 | Jane C. Timm | 6478 | @nbcnews political reporter and fact checker: more fun than … |
| 23 | Anthony Terrell | 6827 | Emmy Award winning journalist: political observer; covered … |
| 24 | NBC News Videos | 7838 | The latest video from http://t.co/xPyvMOTEF6 |
| 25 | Libby Leist | 7946 | Executive Producer @todayshow |
| 26 | Voices United | 310 | Voices United is a non profit educational organization … |
| 27 | Social Headlines | 344 | Daily roundup of top social media and networking stories |
| 28 | James Miklaszewski | 337 | Writer, Photographer, Editor, Director, Producer, Newshound … |
| 29 | Courtney Kube | 9494 | NBC News National Security & Military Reporter … |
| 30 | Bob Corker | 10042 | Serving Tennesseans in the U.S. Senate |
Through the RPPR vector, the top 30 handles returned to @NBCPolitics include much fewer low in-degree and obscure handles and many more moderately connected nodes that are relevant to @NBCPolitics, including its reporters and editors and media professionals from other organizations.
(a) Illustration of 5840 Twitter handles examined by algorithm 3 and three samples of size 200 by PPR, APPR and RPPR (each dot represents a user in Twitter) (
, top 200 handles by PPR vector (vertices above the line are PPR’s sample);
, sample returned by algorithm 4 given n = 200 (vertices above this boundary correspond to APPR’s sample);
, sample of RPPR;
, boundary of this sample) and (b) in-and-out ratio of local clusters identified by PPR (
), APPR (
) and RPPR (
), as the sample sizes vary (a higher in-and-out ratio indicates a more internally connected cluster)
The PPR clustering is fairly robust to the choice of teleportation constant, despite the size of local cluster. To illustrate this, we also performed the same pipeline of analysis with the seed @NBCPolitics while varying the value of α (e.g. 0.05, 0.25 and 0.33) in parallel. We observed that those local clusters returned by algorithm 4 all share a great portion of members in common. For example, there are 280 (93.3%) overlapping members between two targeted samples of size n = 300, using α = 0.15 and α = 0.25 respectively. These suggest a low sensitivity to the teleportation constant (see the on-line supplementary materials section S2).
Fig. 3(a) depicts the behaviours of PPR, APPR and RPPR. Each handle queried in this sampling is displayed as a dot, with the y-axis representing the PPR value and x-axis the number of followers (i.e. in-degree). Top handles with the highest PPR values are above the blue broken line, which tend to concentrate on the right-hand end of the x-axis and thus are biased towards high in-degrees. Top handles with the highest APPR values are dots to the left of the yellow chain curve, which gather on the left-hand end of the x-axis and thus in favour of low in-degrees. Regularized APPR, by purple dots, excludes the very low degree nodes and very high degree nodes. As the empirical results show, these three vectors can be thought of as lenses through which we view the local structure of a given Twitter handle with varying foci, rendering high, moderate and low in-degree blocks and serving different needs and purposes.
7 Discussion
This paper studies the PPR vector under the DCSBM and PPR clustering in massive block model graphs. We establish some consistency results for this method and examine its performance through analysis of Twitter friendship graph. As shown in the results, the PPR vectors with and without adjustment have distinct properties and can be used to sample a massive graph effectively for various purposes. However, there are limitations that are worthy of future investigations.
In Section 3, we provide a representation of the PPR vector under the DCSBM and its extension into directed graphs. The result does not impose extra structural restrictions on the model parameters, except that B corresponds to a strongly connected blockwise graph. We consider a positive definite connectivity matrix particularly so that it is intuitive to conceive the notion of a local cluster. In practice (and many of our experiments; see the on-line supplementary materials section S2), however, a PPR-type algorithm appears to continue working for a broader range of B (e.g. singular or indefinite), provided that the teleportation constant is sufficiently large (e.g. α > 0.1). It is unclear yet what is the minimum constraint needed on B for the PPR clustering to function. In addition, the DCSBM does have its limits. For example, the model fails to capture either mixed block membership or popularity features which are potentially informative in real world networks. The behaviour of a PPR vector under other extensions of SBMs, such as the mixed membership SBM and popularity-adjusted block model, remains unknown (Airoldi et al., 2008; Sengupta and Chen, 2018). Future studies on the PPR vector under these models could shed further light on the PPR clustering and offer more practical guidelines on their application.
In Section 4, we proved the consistency of PPR clustering, requiring the average expected node degree to grow of the order of log (N), which hits the boundary between the theoretical guarantees and the realistic observation. In contrast, scale-free networks such as the preferential attachment model (Barabási and Albert, 1999) have finite expected node degrees. Future investigations into variants of PPR that could possibly overcome this limitation yet ensure a fine local cluster discovery would be particularly interesting and useful.
In Section 6, we introduce the regularized version of the APPR (the RPPR) vector, with a series of empirical evidence showing its efficacy in targeted sampling. Although the results appear promising, theoretical guarantees for this technique remain unexplored. For some mathematical analyses, one may resort to the techniques that were used in Le et al. (2016). It has previously been shown that the regularized graph Laplacian (or transition matrix) enjoys finite sample convergence properties, which facilitate the consistency of many regularized spectral methods. It thus is a reasonable conjecture that RPPR vectors are also suitable for local clustering.
An R implementation of PPR clustering is available from https://github.com/RoheLab/aPPR.
Supporting information
Additional ‘supporting information’ may be found in the on-line version of this article:
‘Supplementary materials to “Targeted sampling from massive Blockmodel graphs with personalized PageRank”’.
Acknowledgements
This research is supported by National Science Foundation grant DMS-1612456 and DMS-1916378 and Army Research Office grant W911NF-15-1-0423. We thank Yuling Yan and E. Auden Krauska for their helpful comments. We thank Alex Hayes for kindly advising on the software development.
References
Appendix A Technical proofs
A.1 Proof of proposition 1
We apply the Perron–Frobenius theorem for the first part (Perron, 1907; Frobenius, 1912) and complete the proof by construction.
- (a)
First, we show that Q is a Markov transition matrix by modifying G = (V, E). For this, we shrink the weights of every existing edge by a factor 1 − α and add an edge-weighted α between seed node and all nodes in the graph. Then Q represents the new graph G′(V, E′), which is strongly connected by construction. Hence Q is irreducible. The PPR vector p is all positive. To see this, note that the equation pT = pTQ implies that p is a stationary distribution for the standard random walk on G′. Since G′ is strongly connected, it follows that the stationary distribution must be all positive. From the Perron–Frobenius theorem, the only all positive eigenvector of a non-negative irreducible matrix is associated with the leading eigenvalue, which is 1 in our case. Since the leading eigenvalue of a non-negative irreducible matrix is simple, we conclude that p is unique.
- (b)We finish the proof by constructing an explicit form of the PPR vector. Let The infinite sum converges for α ∈ (0, 1]. Then, satisfies the definition of the PPR vector,Since the solution is unique, we have .
A.2 Proof of proposition 2
The desired result follows from recognizing that pε + p(r) is initially 0 + p(π) and that, when the algorithm terminates, [p(r)]u ⩽ ru for any sampled u.
Remark 1If εd1 > 1, algorithm 1 terminates after the first round and simply outputs p = 0. Under this circumstance, proposition 2 still holds, because .
A.3 Lemmas for the degree-corrected stochastic block model
Lemma 2(properties of the DCSBM)
Under the population directed DCSBM with K blocks and parameters ,
- (a)
, and , and
- (b)
and
ProofResult (a) is an alternative way of writing the definition. For result (b), we prove the first equation. Recall that, for any i, ; then, by definition,
Remark 2Since ZTΘinZ = IK, result (a) implies that
Lemma 3(explicit form of and its powers)
Under the population directed DCSBM with K blocks and parameters , the population graph transition is the productand its matrix powers are
ProofBy definition and lemma 2, part (b), for any u, v ∈ V,For the powers of , noting that ZTΘinZ = IK,The result desired follows from the principle of induction on the kth power.
A.4 Proof of theorem 1
By proposition 1 and lemma 3, we have
In addition, it follows from lemma 2, part (a), that
This completes the proof.
A.5 Proof of lemma 1
For any α > 0, the PPR vector with seed node is the solution to the equation where Define a sequence of probability distributions such that where is an arbitrary initial probability distribution. Then, . For simplicity, we assume that is close to , i.e., for any ɛ > 0 and s ⩾ 0,
This can be achieved by finding an integer S(ɛ) that is sufficiently large and setting
We first claim that
In fact, for any u ≠ 1,
We then show that for any v ≠ 1 by contradiction. Suppose otherwise that ; then equation 11 implies that, for any ,where Hence, . In addition, applying equation 12 recursively we have
The inequality means that, if is fixed, can be arbitrarily small when s → ∞, which contradicts the fact that is a probability distribution. This completes the proof.
Remark 3When the teleportation constant is 0, the PPR vector becomes the stationary probability distribution of a standard random walk:After adjusting by node degrees, every entry becomes identical (). Lemma 1 is intuitive, recognizing that the teleportation introduces a particular favour of the seed node.
Remark 4When the edges are weighted (non-negative), the stationary distribution of a random walk is still proportional to node degrees, if one defines the degree as the sum of edge weights incident to the node (Lovász, 1993). Note also that the stationary distribution of a random walk in a directed graph is characterized by the in-degree of nodes (Ghoshal and Barabási, 2011; Lu et al., 2013). The conclusion and a modified proof apply to directed or weighted graphs.
A.6 Proof of corollary 1
Since Δα ⩽ 1, assumption (8) contains condition (7) in theorem 2, which together with proposition 2 implies that
if Δ2δ/ log (N) is sufficiently large. These collectively imply that as desired.


