Frequent Subgraph Mining Based on Pregel

Graphs are an increasingly popular way to model complex data, and single graphs are growing to massive sizes. Nonetheless, executing graph algorithms efficiently and at scale is surprisingly challenging. As a consequence, distributed programming frameworks have emerged to empower large-graph processing. Pregel, a popular computational model for processing billion-vertex graphs, has been employed to improve the scalability of many algorithms. In this paper, we investigate frequent subgraph mining on single large graphs using Pregel. We present the first distributed algorithm based on Pregel for single massive graphs. In addition, two optimizations are proposed to enhance the algorithm, reducing communication cost and distribution overhead. Extensive experiments conducted on real-life data confirm the effectiveness and efficiency of the proposed algorithm and techniques.


INTRODUCTION
The graph data model is an increasingly popular way to represent data in various application fields, including social networks, bioinformatics, web graphs, etc. Recent decades have witnessed a rapid proliferation in the size and volume of graph-structured data. A variety of fundamental problems have been investigated on graphs, including subgraph search and matching [1,2], structure similarity queries [3,4], distance and reachability queries [5,6], etc.
Frequent pattern mining has been a focal theme in data mining for over a decade. Abundant literature has been dedicated to this area, making tremendous progress on frequent itemset mining, sequential pattern mining and so forth. Frequent subgraphs are subgraphs found in a collection of graphs, or in a single large graph, with support no less than a user-specified threshold. Frequent subgraphs are useful for characterizing graph datasets, classifying and clustering graphs, and building structural indices [7]. We differentiate the two aforementioned scenarios, multi-graph and single-graph; this paper focuses on frequent subgraph mining (FSM) in the single-graph setting.
Nowadays, graphs (networks) are growing to massive scales. Take Facebook as an example. The number of active Facebook users reached one billion in late 2012, less than nine years after its founding. By modeling users as vertices and friendships as edges, we obtain an overwhelmingly large graph of a billion vertices; other typical massive networks arise as phone call networks, protein interaction networks, the world wide web, etc. Given the rapid growth of applications of FSM in various disciplines, as well as the sheer size of real-life graphs, an efficient method for distributed FSM at scale is in high demand.
Distributed FSM on single massive graphs is challenging, due not only to the special constraints of FSM algorithm design, but also to the deficient support from existing distributed programming frameworks. First, an FSM algorithm computes the support of a candidate subgraph over the entire input graph. On a distributed platform, if the input graph is partitioned over various worker nodes, the local support of a subgraph is of little use for deciding whether the subgraph is globally frequent. Also, the support computation cannot be delayed arbitrarily, since candidate frequent subgraphs can be generated only from frequent subgraphs as per the Apriori principle. Additionally, although there are several existing models, including MapReduce [8], the de facto big data processing framework, they do not accommodate graph algorithms well [9]. Among the distributed graph processing frameworks, Pregel [10] is recognized for its scalability, flexibility, fault tolerance and a number of other attractive features. It is a vertex-centric programming model in which developers usually only need to submit processing scripts on vertices to the framework, which handles the remaining issues such as graph partitioning and synchronization. However, it has been suggested that structure-related computation may not fit Pregel naturally [11]. Therefore, mapping a structure mining algorithm onto Pregel requires non-trivial effort, since Pregel does not specify implementation details for self-defined functions.
In this paper, we focus on efficient implementation of FSM over single massive graphs on a Pregel-like extensible computing platform.To the best of our knowledge, this is among the first attempts to address the problem at scale under a modern distributed programming framework.
In summary, we make the following contributions: (i) we propose a systematic solution for FSM in single massive graphs using a Pregel-like distributed programming paradigm; (ii) we devise two optimization techniques to enhance the baseline algorithm, reducing communication cost and distribution overhead, respectively; (iii) we evaluate the resulting algorithm pegi with extensive experiments on public real-life data. The experimental results confirm the efficiency and scalability of the proposed methods.
Organization. Sections 2 and 3 discuss related work and preliminaries, respectively. Section 4 presents the baseline algorithm, followed by optimizations in Section 5. Section 6 describes our experiments, and we conclude the paper in Section 7.

RELATED WORK
While FSM has been extensively studied, how to cope with the rapid growth of graph data remains open. The following discusses related work in three directions: FSM in the multi-graph setting, FSM in the single-graph setting, and distributed graph processing.
Mining Graph Collections. Apriori was utilized to mine frequent subgraphs in transaction settings by AGM [12]. AGM generates candidate graphs by adding one vertex at a time; as an improvement, FSG [13] put forward edge-growth mining. Both methods adopt breadth-first search, i.e. they first compute size-k frequent subgraphs, from which size-(k + 1) frequent subgraphs are computed thereafter.
Distinctively, recent approaches follow depth-first search, with gSpan [14] as a representative. gSpan relies on a novel canonical graph labeling to assist with search space pruning. Later, another graph representation was employed to reduce the overhead of subgraph isomorphism tests [15]. Recently, GASTON [16] proposed to categorize graphs into paths, trees and cyclic graphs, and developed accelerative techniques for each, respectively. A follow-up work [17] provides more insight into the categorization of graphs for speedup.
Compared with the aforementioned in-memory algorithms, ADI-Mine [18] is a disk-based algorithm leveraging a three-level ADI-index. To exploit parallel computing, SUBDUE [19] describes a shared-memory parallel approach based on partitioning tasks. Recently, MapReduce was employed, with the pattern size growing in each round of a MapReduce job. The state-of-the-art MapReduce-based solution is a two-step filter-and-refinement method [20], which incorporates techniques to predict candidate patterns and reduce inter-machine communication cost. Analogous works that adapt single-machine sequential algorithms to the MapReduce platform include [21,22]. Note that their objective differs from ours in that they mine a large collection of graphs, rather than a single large network. In a similar flavor, we contend that recasting a single-machine FSM algorithm on single graphs into a distributed fashion is also worth a dedicated effort.
Mining Single Graphs. This line of research, to which our work belongs, addresses the equally important problem of mining single large graphs, though it has received less attention. Efforts were first dedicated to defining appropriate support measures, e.g. MI, HO and MIS. SIGRAM uses MIS [23], and follows a grow-and-store approach: it needs to store the intermediate results for support evaluation. A parallel version on a multi-core machine was also implemented to enhance its efficiency [24]. To avoid the computational complexity of MIS, HO [25] and MI [26] were proposed.
The most recent work is GraMi [27], which formulates the FSM problem as a constraint satisfaction problem. It finds only the minimal set of instances needed to satisfy the support threshold, and hence improves performance. While we are not aware of any published distributed solution for the identical problem, this paper presents the first such solution, which outperforms GraMi in terms of both efficiency and scalability.
Among others, we are also aware of several approximate FSM algorithms, e.g. Grew [28] and gApprox [29].
Distributed Graph Processing. Distributed graph processing frameworks are indispensable for handling massive graphs. MapReduce [8] is a universal tool for processing large volumes of data, including graphs [30]. Hence, it has been used to compute personalized PageRank [31], connected components [32], etc. Lately, a high-level query language, GLog [33], was introduced on MapReduce for graph analysis.

PRELIMINARIES
Section 3.1 introduces the concepts related to FSM on single graphs, and Section 3.2 is a brief Pregel primer.

Frequent subgraph mining
For ease of exposition, we focus on simple graphs, i.e. undirected graphs with neither self-loops nor multiple edges. A graph g for which there exists a subgraph isomorphism f from g to G is also called a subgraph of G, and f(g) is an embedding of g in G.
Consider two graphs g and G, and a minimum support threshold τ; assume there is a function φ that measures the support of g in G. If φ(g) ≥ τ, g is a frequent subgraph of the data graph G. There are several ways to measure the support of a subgraph g in a single graph G, and the most intuitive is to count the isomorphisms of g in G.
Example 1. Consider the collaboration network G in Fig. 1, with authors represented by vertices and co-authorship by edges; the label of a vertex indicates the community the author belongs to. Given a subgraph g, there are three subgraph isomorphisms from g to G, i.e. v1-v2-v3 maps to u4-u3-u2, u5-u3-u2 and u9-u8-u6, respectively. Further, consider a minimum support threshold τ = 3, and let φ be defined as in Definition 3.1. φ(g) = 3 ≥ τ, and thus g is a frequent subgraph of G.
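To make the counting in Example 1 concrete, the following sketch enumerates label-preserving subgraph isomorphisms by brute force. The graph encoding (label dictionaries and frozenset edges) and all identifiers are our own illustrative choices; the paper does not prescribe a representation.

```python
from itertools import permutations

def embeddings(pattern_labels, pattern_edges, data_labels, data_edges):
    """Enumerate label-preserving subgraph isomorphisms by brute force.
    pattern_labels/data_labels: vertex -> label; edges: frozenset pairs."""
    pv = list(pattern_labels)
    found = []
    for cand in permutations(data_labels, len(pv)):
        # the mapping must preserve vertex labels ...
        if any(pattern_labels[p] != data_labels[c] for p, c in zip(pv, cand)):
            continue
        m = dict(zip(pv, cand))
        # ... and every pattern edge must map to an existing data edge
        if all(frozenset((m[a], m[b])) in data_edges for a, b in pattern_edges):
            found.append(m)
    return found
```

On a toy graph with labels A-B-A in a path, a single-edge pattern A-B has exactly two isomorphisms; real mining algorithms of course avoid this factorial enumeration.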
Unfortunately, the aforementioned metric is not anti-monotonic [23,25,26], since a subgraph may appear fewer times than its extensions. For instance, consider in Fig. 1 the vertex IR and its extension IR-DB in G. It is easy to verify that the support of the former is 1 while the support of its extension is 2. Anti-monotonicity is crucial to developing algorithms that can effectively prune the search space, without which they have to carry out exhaustive search. As a consequence, existing literature presents several anti-monotonic support metrics based on (1) minimum image (MI) [26], (2) harmful overlap (HO) [25] and (3) maximum independent sets (MIS) [23]. These measures are all established on subgraph isomorphisms, but differ in the extent of compatible overlap among them, and hence in computational complexity. In particular, MI is the only metric that can be computed efficiently, while HO and MIS involve solving NP-complete problems; the result set of MI is always a superset of those of HO and MIS, and therefore the desired results can be further derived from the results of MI with additional computation. We thus adopt MI as the support measure in the sequel, whereas the algorithms can be readily extended to the other measures with minor effort.
Further to the definition, F(v) denotes the images of v ∈ Vg in G with respect to g; hence, the conditional support of v with respect to g, denoted by φg(v) = |F(v)|, may be less than |F|. We shorten 'minimum image-based support' to 'support' from here onwards; to distinguish, 'images' always refers to single vertices, while 'embeddings' can be vertices if g is a single vertex, or subgraphs if g has at least two vertices.
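Under MI, the support of a pattern is the minimum, over its vertices, of the number of distinct images that vertex takes across all embeddings. A minimal sketch, reusing the three embeddings of Example 1 (embedding representation as dicts is our own choice):

```python
def mi_support(embs):
    """Minimum image-based support: for each pattern vertex, count its
    distinct images across all embeddings, then take the minimum."""
    if not embs:
        return 0
    images = {}
    for emb in embs:                 # emb maps pattern vertex -> data vertex
        for pv, dv in emb.items():
            images.setdefault(pv, set()).add(dv)
    return min(len(s) for s in images.values())
```

For Example 1's three embeddings, v2 has only two distinct images (u3 and u8), so the MI support is 2 even though the raw isomorphism count is 3, illustrating how MI discounts overlapping occurrences.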
Formally, the problem of FSM in a single graph considers a data graph G and a minimum support threshold τ, and finds all subgraphs g in G such that φ(g) ≥ τ. This paper concerns solving the problem exactly and at scale, and the algorithm works for both connected and disconnected G. Additionally, while several existing works propose to mine maximal patterns, we argue that such subgraphs can be derived from our answers with further computation. In the following, we focus on producing all frequent subgraphs.

Pregel overview
The iterative graph processing architecture Pregel [10] is based on the bulk synchronous parallel model of distributed computation. Pregel uses a master/workers model: one instance acts as the master, while the others become workers. The basic computation model of Pregel is shown in Fig. 2, illustrated with three supersteps on a master and two workers.
Given a single large graph, the vertices, identified by an ID, are firstly distributed by a partitioner across workers running on different computing nodes.The default partitioner is a hash function on vertex IDs.Computation is achieved by iterations, namely supersteps.The master performs serial computation and coordination between supersteps, and all workers conduct parallel computation and synchronize at the end of supersteps.
All algorithms in Pregel are implemented in a vertex-centric fashion. Specifically, every vertex has a vertex value, a set of edges and a set of messages sent to it in the previous superstep. In other terms, in superstep i, a vertex can receive the messages sent by other vertices in superstep i − 1, query and update the information of the current vertex and its edges, initiate topology mutation, communicate via global aggregation, and send messages to other vertices for superstep i + 1. After all vertices finish their computation, a global synchronization allows global data to be aggregated, and messages to be delivered.
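The superstep semantics above can be sketched as a single-process simulation of the BSP loop, using the classic maximum-value propagation example; this is only an illustration of the message-delivery timing, not the Giraph API.

```python
class Vertex:
    def __init__(self, vid, value, neighbors):
        self.id, self.value, self.neighbors = vid, value, neighbors

def run_supersteps(vertices, compute, max_steps=30):
    """Minimal BSP loop: messages sent in superstep i are delivered at the
    synchronization barrier and consumed in superstep i + 1."""
    inbox = {v.id: [] for v in vertices}
    for step in range(max_steps):
        outbox = {v.id: [] for v in vertices}
        for v in vertices:
            compute(v, step, inbox[v.id], outbox)
        if not any(outbox.values()):   # no messages in flight: halt
            break
        inbox = outbox                 # barrier: deliver messages
    return {v.id: v.value for v in vertices}

def max_value(v, step, messages, outbox):
    """Classic example: every vertex converges to the global maximum."""
    new = max([v.value] + messages)
    if step == 0 or new > v.value:
        v.value = new
        for n in v.neighbors:
            outbox[n].append(v.value)
```

On a path a-b-c with values 3, 1, 2, all three vertices end with value 3 after a few supersteps.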
Giraph (https://giraph.apache.org/) originated as the open-source counterpart to Pregel. To implement a graph algorithm, users instantiate the methods master.compute() and vertex.compute() of the master and vertex classes, respectively. To enable the master/vertex to perform multiple functions, compute() is executed based on a switch of multiple cases per superstep, corresponding to the different functions of the master/vertex. Hence, the interleaving of these cases between the master and vertices working together accomplishes a task.
Aggregators are a mechanism for global communication and data transmission. Each vertex can provide a value to an aggregator in superstep i; the system combines those values and makes the result available to all vertices in superstep i + 1. It is possible to define a sticky aggregator that accumulates input values from all supersteps, and an aggregator may hold a value, an array or even a map. We rely on these functions for global coordination. In particular, two methods are to be instantiated: (i) Aggregate(α, x), which enables the master/current vertex to send a value x to the aggregator named α, and (ii) GetAggregatedValue(β), which enables the master/current vertex to retrieve the data stored in the aggregator named β.
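The visibility rule (values provided in superstep i become readable in superstep i + 1) and the sticky variant can be sketched as follows; the real Giraph aggregator API differs in signatures, and this toy only models the barrier semantics for a sum combiner.

```python
class SumAggregator:
    """Toy aggregator: pending values become visible only after the
    synchronization barrier; a sticky aggregator keeps accumulating."""
    def __init__(self, sticky=False):
        self.sticky = sticky
        self.current = 0     # value visible this superstep
        self.pending = 0     # values provided this superstep

    def aggregate(self, x):
        self.pending += x

    def get_aggregated_value(self):
        return self.current

    def barrier(self):       # invoked at the end of each superstep
        self.current = self.current + self.pending if self.sticky else self.pending
        self.pending = 0
```

A regular aggregator resets at every barrier, whereas the sticky one behaves like a running total across supersteps.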

THE MINING ALGORITHM
This section introduces the algorithm pegi for FSM in Pregel.
We first present a single-machine sequential algorithm for FSM on single graphs, and then map it onto the Pregel model, resulting in pegi. Its compute methods on the master and vertex are detailed thereafter.

Baseline algorithm
An FSM algorithm has to traverse all possible subgraphs of the data graph. A commonly adopted approach is to carefully organize the subgraphs into a candidate generation tree and conduct a depth-first search on the tree. A tree node represents a subgraph, or pattern, and the parent-child relation depicts the growth of a pattern. In particular, a subgraph is extended to one of its children by attaching one new edge at a time. The extended subgraph is included as an answer if it is frequent and has not been discovered previously. This generation process ensures that the unambiguously defined candidate generation tree comprises all patterns. Additionally, anti-monotone pruning is utilized to shrink the search space, i.e. any extension of an infrequent graph cannot be frequent. We differentiate two types of edges that can be used to extend a subgraph p: (1) a forward edge, which introduces a new vertex, namely the target vertex, to p, and (2) a backward edge, which is added between two vertices of p and did not exist before. While both extend p, backward edges do not affect the vertex set of p.
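The forward/backward distinction can be sketched as a small enumerator of 1-edge extensions over a data graph's adjacency lists; the pattern encoding (vertex set plus frozenset edge set) is our own simplification of the paper's pattern representation.

```python
def one_edge_extensions(pattern, data_adj):
    """Enumerate 1-edge extensions of an embedded pattern: forward edges
    reach a new target vertex; backward edges connect two pattern
    vertices not yet adjacent in the pattern."""
    pv, pe = pattern                     # vertex set, frozenset edge set
    fwd, bwd = set(), set()
    for u in pv:
        for w in data_adj[u]:
            e = frozenset((u, w))
            if w in pv:
                if e not in pe:          # closes a cycle within the pattern
                    bwd.add(e)
            else:                        # introduces target vertex w
                fwd.add(e)
    return fwd, bwd
```

Note that backward edges grow the edge set but leave the vertex set of the pattern unchanged, exactly as stated above.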

Algorithm 1: Baseline(G, τ )
Input : G is a graph; τ is a support threshold.
Output: P is a set of frequent subgraphs, initialized to ∅.

The pseudo-code in Algorithm 1 implements our baseline FSM algorithm on single graphs. Algorithm 1 takes as input a graph G and a support threshold τ, and outputs the complete set of frequent subgraphs. It first collects the set of frequent single edges in G (Line 1), which are essentially the size-1 frequent subgraphs. Then, for each frequent subgraph found, it carries out an iterative DFS mining via function DFSMine (Line 2). Specifically, DFSMine first enumerates the 1-edge extensions of the current subgraph p (Line 4), which are p's children in the candidate generation tree, as well as their occurrences. For each enumerated edge e, we construct an extended subgraph p′ (Line 5). If e is a forward edge with target vertex v, we first compute the conditional support of v with respect to p′. If it is below the support threshold, it will never contribute a new frequent subgraph, and we continue to examine other edges (Lines 6-7). Then, we evaluate the support of p′ on the vertices of the original subgraph p. If φ(p′) is not less than the threshold under anti-monotone pruning, we find an answer, as long as p′ has not been seen in P (Lines 8-9). Afterwards, it puts p′ into another round of DFSMine (Line 10). The algorithm terminates when no more candidate subgraphs can be generated.
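As a concrete starting point, the following sketch computes the first step of the baseline (the size-1 frequent subgraphs, i.e. frequent single edges) under the minimum-image support. The adjacency-dictionary encoding and all identifiers are our own illustrative assumptions.

```python
def frequent_edges(adj, labels, tau):
    """Collect frequent single-edge patterns (label pairs) whose
    minimum-image support reaches tau."""
    images = {}   # (label_a, label_b) -> (images of endpoint a, of endpoint b)
    for u, nbrs in adj.items():           # adjacency lists, both directions
        for w in nbrs:
            la, lb = sorted((labels[u], labels[w]))
            s1, s2 = images.setdefault((la, lb), (set(), set()))
            if labels[u] == la:           # for la == lb both sets coincide
                s1.add(u); s2.add(w)
            else:
                s1.add(w); s2.add(u)
    return {e for e, (s1, s2) in images.items()
            if min(len(s1), len(s2)) >= tau}
```

Each surviving label pair would then seed a round of DFS mining, as in Line 2 of Algorithm 1.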

Distributed paradigm of pegi
An important observation about the baseline algorithm is that it tests whether an edge can be used to extend the current frequent subgraph, and then proceeds if the edge meets the support threshold. In the distributed setting, as the data graph is distributed to the workers, a local support of an edge lower than the threshold does not necessarily imply failure globally. Another observation is that the 1-edge extensions are enumerated on the basis of the occurrences of the current subgraph; that is, when the algorithm generates the candidates for the subsequent round, it requires the embeddings of the current subgraph. Thus, the most intuitive implementation is to store the embeddings and track the changes. As a consequence, it is non-trivial to adapt the algorithm developed for a single machine to run under Pregel, which involves a complex design of computation and interaction between the master and workers. We address these challenges by proposing pegi (Pregel-based frequent subgraph mining).
Given a massive graph, pegi first distributes the graph in partitions across the available workers. We do not leverage an advanced graph partitioner in this work, and apply the default random partitioner. Then, it iteratively conducts two types of jobs, i.e. pattern growth and embedding discovery, on the master and workers, respectively. In other terms, the step-control of pattern space traversal is carefully handled by the master node, and each step of pattern growth requires updates of newly discovered embeddings, which are carried out on the distributed workers.
To achieve the aforementioned functions, we conceive three cases for the master, and four cases for the vertices, as abstracted in Algorithms 2 and 3, respectively. We list the functional cases of the master and vertex in Tables 1 and 2, respectively; Table 3 summarizes the aggregators involved in the algorithms. Through interaction among these cases, the baseline pegi executes and flows as depicted in Fig. 3.
(i) In the first superstep, the master skips its compute method, and each vertex runs case VERTEX to send its vertex label for aggregation. In particular, the label along with the vertex ID is sent to aggregator frq_v for statistics (Line 2 of Algorithm 3), in order to produce the set of frequent single vertices. (ii) In the second superstep, the master runs case VERTEX, where it first derives the frequent single vertices by accumulating the images of every vertex label in aggregator frq_v. The results are referred to as V_f, a map acting as a posting list, with frequent single vertices as entries and images as postings. Then, one frequent vertex is chosen as the target vertex, i.e. the vertex to be extended to (Lines 3-5 of Algorithm 2). Its embeddings, namely the target images, are distributed via aggregator nxt_v.
Afterwards, the vertices run case EXTEND and execute ExploreEdge (detailed in Algorithm 5). Each vertex first updates the local embedding information by attaching the target images. Then, starting from these newly extended vertices, it explores the neighborhood to find candidate edges for subsequent supersteps among neighboring vertices. These edges are put in aggregator cnd_e. Additionally, in order to evaluate the support of target vertices for candidate forward edges, it sends a message to the neighboring vertices, i.e. the potential target images. The message contains the edge that it follows, which is to be accumulated on those vertices shortly. (iii) In the third superstep, the master idles, while the vertices run case SUPPORT. Specifically, each vertex reads the incoming messages and, for every distinct edge e, increments e's counter at the vertex. Recall that each candidate forward edge is associated with a target vertex to be extended to. Thus, the counter records in essence the local conditional support of the target vertex for a candidate forward edge. The values are then transmitted via a designated aggregator sup_v (Lines 5-6 of Algorithm 3), which will be used to determine the next growing edge. (iv) In the fourth superstep, the master initiates case GROW, executing GrowPattern (detailed in Algorithm 4), where DFS-based pattern growth is conducted. Instead of proceeding iteratively as in Algorithm 1, we break from the procedure when an edge is chosen as the next growing edge, and set it in aggregator nxt_e.
After that, the vertices start case TARGET, where they collect the target images following the chosen edge, and transmit them via aggregator nxt_v (Lines 9-10 of Algorithm 3). If a backtrack has just been executed in GrowPattern on the master, the workers also synchronize with the master here by removing the last updated edges. (v) In the fifth superstep, the master runs case UPDATE, where it obtains the target images from aggregator nxt_v, which are used to update the embedding trees for the newly extended vertices on the master. This completes the growth of the first edge, finishing one round of pattern growth. Next, the vertices start case EXTEND again, initiating another round of embedding discovery.
The mining process proceeds iteratively, and the distributed grow-and-backtrack terminates when no more edge extension is possible.
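Steps (i)-(v) above can be codified as a superstep schedule: two set-up supersteps followed by a repeating SUPPORT → GROW/TARGET → UPDATE/EXTEND cycle. The case names come from the description above; packaging them as a Python list is purely our illustration.

```python
def pegi_schedule(rounds):
    """Superstep schedule of baseline pegi: (superstep, master case,
    vertex case); None means the corresponding side idles or skips."""
    steps = [(1, None, 'VERTEX'),        # vertices send labels to frq_v
             (2, 'VERTEX', 'EXTEND')]    # master picks target vertex
    cycle = [(None, 'SUPPORT'),          # vertices count conditional support
             ('GROW', 'TARGET'),         # master grows, vertices find images
             ('UPDATE', 'EXTEND')]       # master updates embedding trees
    s = 3
    for _ in range(rounds):
        for m, v in cycle:
            steps.append((s, m, v))
            s += 1
    return steps
```

Each pass through the three-superstep cycle grows the current pattern by one edge (or backtracks), matching the "two supersteps per grown edge plus confirmation" accounting in the analysis below.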
Remark. One may note that Algorithm 1 starts from frequent edges; in contrast, we propose to find the set of frequent vertices and grow patterns from fixed vertices. This consideration is due to the excessively large number of candidate edges that would be aggregated in the first step. They may not be accommodated by the master, and hence easily become a bottleneck that brings down the performance. Starting from a fixed frequent vertex effectively reduces the number of first-round candidate edges, which is bounded by O(ψ·d_G), where ψ is the average support of a frequent vertex, and d_G is the average vertex degree of G.
Thus far, we have not explained the management of embeddings, or the procedures GrowPattern on the master and ExploreEdge on the vertex. The following discusses our considerations and details the implementations.

On embeddings
To facilitate edge support evaluation, we adopt the grow-and-store approach [23]; thus, the embeddings of the current subgraph are carefully materialized in a tree structure, namely the embedding tree. In particular, we employ the DFS encoding scheme [14] to assist with candidate generation, such that each subgraph in the candidate generation tree is expressed by a corresponding DFS code. We can thus linearize the vertices of an embedding according to their order in the DFS code, so the linearized embedding of a pattern has the same length (or depth) as its DFS code. Using null as the tree root, we gradually merge two embeddings from the first vertex to the last, forming a tree structure, as long as they share the same data vertex at the identical depth.
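The prefix-merging of linearized embeddings can be sketched as building a trie rooted at null; nested dictionaries are our own representation choice for the tree nodes.

```python
def build_embedding_tree(linearized):
    """Merge linearized embeddings (vertex lists in DFS-code order) into a
    prefix tree: two embeddings share a branch as long as they share the
    same data vertex at the same depth."""
    root = {}                         # the empty dict stands for the null root
    for emb in linearized:
        node = root
        for v in emb:
            node = node.setdefault(v, {})
    return root
```

For example, embeddings u4-u3-u2 and u4-u3-u1 share the prefix branch u4 → u3 and fork only at the last level, so storing them as a tree deduplicates common prefixes.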
Note that we maintain all the embeddings of the current subgraph on the master, namely the global embedding tree; on every worker, only the embeddings starting from vertices on that worker are kept, referred to as local embedding trees. There is a choice of whether or not to store the embeddings of the current subgraph [23,27]. The existing single-machine solution [27] contends that storing all the embeddings may prevent the algorithm from processing large graphs due to memory constraints. While we do not dispute this, the case is different in a distributed system: distributing the embeddings to their owner workers enables the system to function as a 'memory cloud', and hence alleviates the space overhead of maintaining embeddings. Seeing the advantage of fast embedding discovery, we therefore choose to store the embeddings of the current subgraph as intermediate results. In implementation, the embedding tree is realized using the context function of Pregel, such that all the vertices on the same worker are able to access it.

On master
Among the various functions performed on the master by master.compute(), as shown in Algorithm 2, case GROW does the crucial work on the master: pattern growth. We outline the major steps of case GROW in Algorithm 4. In particular, it takes as input the set of candidate edges and the conditional supports of their target vertices, in aggregators cnd_e and sup_v respectively, and produces the edge chosen to grow in aggregator nxt_e. Note that the candidate edges E_c include both forward and backward edges, while sup_v holds the conditional supports of the target vertices of candidate forward edges. In other words, for every candidate forward edge e, there is a corresponding value equal to the conditional support of the target vertex for e. It will be looked up shortly in DFSGrow to determine the next growing edge. The chosen frequent edge is distributed via aggregator nxt_e such that embedding discovery can be carried out on the distributed workers thereafter (Line 3).
We then proceed to explain DFSGrow in Algorithm 4, whose implementation is similar to Algorithm 1.

Algorithm 4: GrowPattern
Input: E_c in aggregator cnd_e is a set of edges; aggregator sup_v holds the conditional supports of their target vertices.
Output: e in aggregator nxt_e is a chosen edge.

To simulate the DFS process in a distributed fashion, we employ a stack S to preserve the iteration states for backtracking. S is a globally defined stack of edge sets, initialized to ∅. Algorithm 4 takes as input a set of candidate edges E_c, the conditional supports of target vertices and the current subgraph p, and iteratively computes frequent subgraphs based on p as output. Specifically, for each candidate edge e, we first append it to p, forming p′, and remove it from E_c to ensure the search space rooted at this subgraph will not be explored multiple times (Line 5). Then, we conduct edge support evaluation to test whether p′ is frequent. Specifically, if e is a forward edge with target vertex v, we look up the conditional support of v; the procedure proceeds only if φ_p′(v) passes the support threshold (Lines 6-8); otherwise, it continues to examine other edges. Then, we further evaluate the support of p′. If φ(p′) exceeds the support threshold, and p′ was not discovered previously, we find an answer, and then push the remaining edges in E_c as a set onto stack S for backtracking (Lines 9-12). At last, we break from the procedure by returning the next growing edge.
After screening all edges in E_c such that no more edge can be chosen as the next growing edge, we start backtracking. We first check whether stack S is empty to decide whether to halt (Line 13), as an empty S implies the completion of mining under one frequent vertex. If not, we go back to the preceding node in the candidate generation tree (Lines 15-17). Specifically, we remove e from the current subgraph, discard the last updated edges from the embedding tree, and pop the top edge set out of S. The mining procedure is then called again. Iterating in this way, we conduct a complete traversal of the candidate generation tree, examining all possible subgraphs.
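The break-and-resume discipline (choose one frequent extension per round, push the unexplored siblings, pop to backtrack) can be sketched over an abstract pattern space. Here `children` and `is_frequent` are abstract stand-ins for candidate enumeration and the aggregator-based support test, so this shows only the stack mechanics, not the actual edge logic.

```python
def grow_with_stack(root, children, is_frequent):
    """DFS over a candidate generation tree with an explicit stack S:
    push remaining siblings when an extension is chosen, pop them when
    no extension at the current node survives the frequency test."""
    stack, pattern, found = [], [root], []
    candidates = list(children(tuple(pattern)))
    while True:
        chosen = None
        while candidates:
            e = candidates.pop()
            if is_frequent(tuple(pattern) + (e,)):   # anti-monotone prune
                chosen = e
                break
        if chosen is not None:
            stack.append(candidates)       # reserve siblings for backtrack
            pattern.append(chosen)
            found.append(tuple(pattern))
            candidates = list(children(tuple(pattern)))
        else:
            if not stack:                  # empty S: mining under root done
                return found
            pattern.pop()                  # backtrack one tree level
            candidates = stack.pop()
```

Because infrequent extensions are discarded before being pushed, whole subtrees below them are never visited, mirroring the pruning argument above.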

On vertex
Embedding discovery is carried out on the workers in a distributed, vertex-centric fashion, and the major step is to explore candidate edges. Hence, the core function ExploreEdge of case EXTEND in vertex.compute() is presented in Algorithm 5, which finds local candidate edges starting from a newly extended vertex.
Algorithm 5: ExploreEdge(V_t)
Input: V_t in aggregator nxt_v is a set of vertices.
Output: E_c in aggregator cnd_e is a set of edges.

Algorithm 5 takes as input the newly extended vertices V_t in aggregator nxt_v, and outputs candidate edges E_c for the subsequent rounds in aggregator cnd_e. In particular, as a new edge is attached to the current subgraph, we instantiate it on the local embeddings incident on V_t (Line 1). Next, we collect the candidate edges starting from this vertex in E_c (Line 2), which are to be sent to the master via aggregator cnd_e (Line 6). For each candidate forward edge e, the corresponding target images are retrieved in V_t; additionally, we send the candidate forward edge as a message to each of the target images (Lines 3-5), which will later be accumulated into the conditional supports of target vertices.
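The vertex-local part of this exploration can be sketched as follows. The vertex names and labels echo the running example (u4 with neighbors labeled DB and ML), but the function signature and message encoding are our own illustrative assumptions, not the paper's Algorithm 5.

```python
def explore_edge(vertex, labels, adj, pattern_vertices, outbox):
    """Sketch of a newly extended vertex exploring its neighborhood:
    classify each incident edge as forward or backward (by label pair),
    and message potential target images of forward edges so they can
    accumulate conditional support in the next superstep."""
    candidates = set()
    for w in adj[vertex]:
        edge = (labels[vertex], labels[w])
        if w in pattern_vertices:
            candidates.add(('backward', edge))
        else:
            candidates.add(('forward', edge))
            outbox.setdefault(w, []).append(edge)   # message the target image
    return candidates
```

Each recipient would then count distinct received edges, which is exactly the local conditional support accumulated in case SUPPORT.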

Illustration and analysis
So far, we have presented the complete algorithm of pegi.
Putting them together, we illustrate one round of pattern growth and embedding discovery in Example 3.
Example 3. Consider in Fig. 4 the data graph G and current subgraph p (black) with candidate edges (red in the color version, or gray in black and white), and assume τ = 3. Five example embeddings of p in G are listed below, and hence it is easy to verify that p is a frequent subgraph. To grow patterns based on p, we first find candidate edges in case EXTEND of the vertex compute method. In particular, we detail the computation on u4. Every neighboring edge of u4 is examined with respect to the embeddings of p. For instance, by checking edge (u4, u5), four candidate forward edges can be discovered, which may be used to grow p. Subsequently, each candidate forward edge is sent to the corresponding potential target images. In the next superstep, the target images receive the messages containing the candidate edges; for each received distinct candidate edge, a target image contributes 1 to the local conditional support of the target vertex. Then, the master obtains the candidate edges in case GROW. For instance, consider (u4, u5) w.r.t. f1, and denote the new subgraph by p′. u0, u2 and u5 receive messages containing the candidate forward edge (DM, ML), and φ_p′(v5) is evaluated as 3.
Recall from Section 3.1 that we employ MI as the support metric. Hence, we further check the global embedding tree, and verify that the conditional support of each of the other vertices also meets the threshold τ = 3 (cf. Section 3.1 and Example 2). That is, ∀i ∈ {0, 1, 2, 3, 4, 5}, φ_p′(v_i) ≥ 3; thus, φ(p′) ≥ 3, and p′ is determined to be frequent based on MI. Afterwards, if a candidate forward edge is selected, one more superstep is required to aggregate the target images, e.g. u0, u2 and u5 for p′, for updating the local and global embedding trees.
Analysis.The correctness of the algorithm is guaranteed by the completeness of the search procedure and the correctness of the support evaluation.It is easy to verify that the proposed algorithm computes the set of frequent subgraphs in a given massive graph, without missing or redundant results.
Space cost, communication cost and the number of supersteps are the major concerns when investigating a Pregel-based algorithm [34]. We first study memory consumption. The major memory usage comes from the storage of embeddings. Consider a pattern p, and assume the extreme case that all the vertices and edges of the data graph possess identical labels. The total number of embeddings is then O(|V_G|·d_G^{|V_p|−1}), where d_G is the average vertex degree of G and |V_p| is the number of vertices of p. Hence, the maximum memory required on each worker is of the same order of magnitude.
As a consequence, the memory consumption is heavily related to two factors, namely the label distribution and the density of the graph. The aforementioned worst case occurs only when the assumption is realized. We will see in Section 6 that real-life graphs with general label distributions usually incur a much smaller memory footprint.
Then, we analyze the communication cost. In particular, we are interested in the total number of messages passed in the system, since these require costly communication among the workers over the network. Recall from Algorithm 5 that for each candidate forward edge, the newly extended vertices send messages to their neighboring vertices. Therefore, in one round of pattern growth, the total number of messages is O(|Ec| · |Vt| · dG), where Ec is the candidate edge set, Vt the set of newly extended vertices and dG the average vertex degree.

Next, we investigate the number of supersteps required for discovering a pattern. Intuitively, the more supersteps, the more synchronization barriers, and the larger the distribution overhead. As explained in the algorithm, every time one edge is grown on the pattern, exactly two supersteps are required. Recall that all patterns are discovered through a tree-structured search space. Moreover, for each leaf node in the search tree, one more superstep is required to confirm that the candidate edge set is empty and the current pattern is at a leaf. As the number of leaves is O(|P|), the total number of supersteps required is 2|P| + O(|P|) = O(|P|), where P is the set of frequent subgraphs.
In light of the analysis above, we may improve algorithmic efficiency if the numbers of aggregated vertices and required supersteps can be reduced. In the sequel, we devise two optimization techniques that work orthogonally to achieve better overall performance.

OPTIMIZATIONS
This section introduces optimizations on top of the baseline algorithm, to reduce communication cost (Section 5.1) and synchronization overhead (Section 5.2), respectively.

Filtering for less message passing
A large amount of data aggregation incurs high communication cost due to network delay. Algorithm performance can be enhanced if such cost can be reduced. We address the issue below, starting with a motivating example.
Example 4. Consider in Fig. 4 graph G and current pattern p, and threshold τ = 3. Recall from Algorithm 3 that messages are sent to {u6, u9} for candidate edge (DM, DB), in order to obtain the conditional support of the target vertex. However, the candidate edge is shortly verified to be infrequent, and hence, the two messages were sent in vain.
Motivated by the example, we contend that if we can filter out such non-promising target vertices, we can decrease the message passing in the system, and thus reduce communication cost. To this end, we propose to first obtain an upper bound of the support that never underestimates the real value, and compare it with the threshold. If the upper bound is less than the threshold, the target vertex cannot be extended to, and the candidate forward edge need not be tested.
To obtain the estimate, for each candidate edge, we simply aggregate the number of neighboring vertices that match the candidate edge; the aggregated number is an upper bound of the conditional support of the target vertex.
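The filtering can be sketched as follows. This Python fragment is a simplified, single-machine illustration (the data structures and the example figures are assumptions, not the paper's aggregator code): for each candidate edge it sums |M(v)| over the newly extended vertices, which is at least the size of the union appearing in Lemma 5.1 and hence also upper-bounds the conditional support, while being cheaper for a distributed aggregator to accumulate. Targets whose bound falls below τ are pruned before any message is sent.

```python
def filter_candidate_targets(candidate_edges, neighbor_matches, tau):
    """For each candidate forward edge, sum |M(v)| over the newly
    extended vertices v (neighbor_matches[edge] maps v to M(v), the
    neighbours of v matching the edge's target label). The sum upper-
    bounds the conditional support of the target vertex, so edges
    whose bound is below tau are pruned before any message is sent."""
    surviving = []
    for edge in candidate_edges:
        bound = sum(len(m) for m in neighbor_matches[edge].values())
        if bound >= tau:
            surviving.append(edge)
    return surviving

# Hypothetical counts echoing Example 4, with tau = 3:
matches = {
    ("DM", "DB"): {"u4": {"u6"}, "u8": {"u9"}},             # bound 2 -> pruned
    ("DM", "ML"): {"u4": {"u0", "u2"}, "u8": {"u0", "u5"}}, # bound 4 -> kept
}
print(filter_candidate_targets(matches, matches, 3))  # [('DM', 'ML')]
```

In the distributed algorithm, the summation is performed by an aggregator and the comparison by the master, as detailed below.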
Lemma 5.1. Consider current subgraph p and target vertex u, let v ∈ Vt be a newly extended vertex, and let M(v) be the set of v's neighboring vertices that match u. Then, φp′(u) ≤ |⋃v∈Vt M(v)|.
Proof. (Sketch) ⋃v∈Vt M(v) includes all the target images with respect to u. Based on Definition 3.2, the vertices that contribute to the conditional support of u always compose a subset of ⋃v∈Vt M(v), since there may be duplicate images of u among the embeddings of p. Therefore, the inequality follows.
To implement this idea, we need to augment the compute methods of the master and the vertices. In particular, an extra superstep is used for executing the filtering. In Algorithm 5, after getting the candidate edges, for each candidate forward edge e with target vertex u, we aggregate the number of target images via aggregator est_v. In the subsequent superstep, the master obtains the upper bound of φp′(u) from the aggregator, and then compares it with the threshold. If the upper bound is less than the threshold, we discard the target vertex u; otherwise, we put it into aggregator cnd_v. After testing all the target vertices, the surviving ones are distributed to the workers. Then, the algorithm flows back to sending messages to target images, in order to compute the real support. Note that this time messages are only sent to target images of the vertices in aggregator cnd_v. We omit the pseudo-code in the interest of space.
We remark that this optimization technique reduces not only the message passing but also the data aggregation in the system, since fewer target vertices and images are transmitted to compute the exact conditional support. Nevertheless, it is admitted that the aforementioned benefits come at the cost of one extra superstep, in comparison with the basic pegi. Thus, the technique is particularly useful in reducing communication cost when there are more infrequent candidate forward edges.

Coupling growth of multiple edges
As there is a synchronization barrier between supersteps, a large number of supersteps increases the distribution overhead. To achieve better responsiveness of the algorithm, the fewer supersteps the better. In the following, we investigate whether supersteps can be reduced. Let us first look at an example.

Example 5. Consider the graphs in Fig. 4, and four candidate edges a, b, c, d (red in color version, or gray in black and white version). The search space subtree rooted at p is depicted in Fig. 5. Each node represents a pattern; for example, node +abcd denotes the pattern with all four edges attached to p. All the subgraphs in Fig. 5 are frequent, and duplicate nodes are removed. Specifically, 15 patterns are discovered in 15 steps of pattern growth in basic pegi. We observe that if backward edge a can grow on p, b and c can continue to grow without breaking from DFSGrow, as backward edges do not increase but only prune the existing embeddings. Based on the pruned embeddings, the support of b and c can be calculated, and hence, DFSGrow can proceed. In comparison, this approach requires only eight mining steps. It ceases when we reach d, as d is a forward edge, which may bring embeddings on extra vertices.
Motivated by the example, we formalize the idea into a recursive version of Algorithm 4. The first difference of the recursive algorithm lies in that we check whether the current subgraph p has been visited, rather than removing e from Ec (Line 5 of Algorithm 4). Afterwards, when e is verified to be frequent and p is canonical, we do not break; instead, we check whether e is a backward edge. If yes, we continue with a recursive calculation until the next edge is a forward edge. Upon encountering a forward edge, which is the exit of the recursion, we synchronize it to the workers, and push the remaining Ec onto the stack for backtracking. This completes the batch processing of backward edges in the current round, and then, we carry on the iterative pattern growth.
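The key observation, that a backward edge only prunes existing embeddings, can be sketched in isolation. The snippet below is a hypothetical single-machine rendering, not the recursive Pregel algorithm itself; it applies a run of backward candidate edges in one batch, using the plain embedding count as a stand-in for the full MI support check.

```python
def batch_backward_growth(embeddings, backward_edges, data_edges, tau):
    """Apply a run of backward candidate edges in one batch. A backward
    edge (a, b) connects two vertices already in the pattern, so it can
    only prune embeddings: keep f iff (f[a], f[b]) is an edge of the
    data graph. Stop as soon as the pruned set no longer meets the
    threshold tau (plain embedding count stands in for the MI check)."""
    applied = []
    for (a, b) in backward_edges:
        pruned = [f for f in embeddings if (f[a], f[b]) in data_edges]
        if len(pruned) < tau:
            break
        embeddings = pruned
        applied.append((a, b))
    return applied, embeddings

# Toy data: two embeddings of a 3-vertex path pattern v0-v1-v2
data_edges = {("u0", "u1"), ("u1", "u2"), ("u2", "u0"), ("u3", "u1")}
embs = [{"v0": "u0", "v1": "u1", "v2": "u2"},
        {"v0": "u3", "v1": "u1", "v2": "u2"}]
applied, remaining = batch_backward_growth(embs, [("v2", "v0")], data_edges, 1)
print(applied, remaining)
```

Since no new vertices are introduced, no synchronization with the workers is needed between consecutive backward edges, which is exactly why the recursion only exits at a forward edge.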

EXPERIMENTS
This section presents our experimental results and analysis.

Experiment set-up
We used Giraph for the experiments. All experiments were conducted using Java JRE v1.6.0 on Amazon EC2. By default, the instance for the NameNode was m1.medium, with one CPU and 3.75 GB memory; the remaining instances were m3.xlarge, with four CPUs and 15 GB memory. To ensure adequate memory for every worker, the number of workers was set to 20, i.e. one instance for each worker. We experimented on different workload settings over the following real-life datasets: (i) Twitter (TT)2 : This graph models the social news of Twitter, and consists of 11 316 811 vertices and 85 331 846 edges. A vertex represents a Twitter user, and an edge represents an interaction between the two users connected by the edge. (ii) LiveJournal (LJ)3 : LiveJournal is a free online community with almost 10 million members, a significant fraction of whom are highly active. LiveJournal allows members to maintain journals, individual and group blogs, and to declare which other members are their friends and to which communities they belong. (iii) US Patents (UP)4 : This graph represents the reference relations between US patents, and contains 3 774 768 vertices and 16 522 438 edges. We used the property class as the label collection, and hence, it has 418 labels in total.
Among the three datasets, UP possesses labels; for TT and LJ, we randomly added labels to the vertices and edges. That is, 100 distinct labels were used for vertices and six for edges on TT, and 30 for vertices and one for edges on LJ. To better mimic the label distribution of real-life networks, the randomization followed a Gaussian distribution. Table 4 lists the statistics of the datasets, where γ = |E|/|V|. Through Table 4, we can see that the three datasets have different characteristics, i.e. LJ is much denser than the other two, TT is larger in terms of vertex number, while the vertices of UP have the most distinct labels.
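As a rough illustration of the label randomization (the exact Gaussian parameters are not specified here, so the mean and deviation below are assumptions for demonstration), labels can be drawn with frequencies following a truncated Gaussian over the label identifiers:

```python
import random

def gaussian_labels(num_labels, n, seed=0):
    """Assign one of num_labels labels to each of n items, with label
    frequencies following a truncated Gaussian over the label ids
    (centre and spread are illustrative assumptions)."""
    rng = random.Random(seed)
    labels = []
    for _ in range(n):
        x = int(rng.gauss(num_labels / 2, num_labels / 6))
        labels.append(min(max(x, 0), num_labels - 1))  # clamp into range
    return labels

tt_vertex_labels = gaussian_labels(100, 10000)  # e.g. 100 labels, as on TT
assert all(0 <= l < 100 for l in tt_vertex_labels)
```

The effect is that a few labels near the centre are common while labels in the tails are rare, mimicking the skew of real-life networks.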
The following values were measured and reported: (1) peak memory consumption per machine; (2) total data transmission; (3) number of supersteps and (4) elapsed time.

Evaluating proposed techniques
We implemented the baseline algorithm under the distributed paradigm to demonstrate the effectiveness of the framework, denoted by 'Baseline' (BA), constituted of Algorithms 2 and 3. On top of Baseline, we further implemented the two optimization techniques, resulting in (i) +Filter, labeled by 'FT', which incorporates the filtering technique of Section 5.1 to reduce communication cost; (ii) +Backward, labeled by 'BE', which employs the optimization technique leveraging backward edges of Section 5.2 to reduce distribution overhead and (iii) +All, labeled by 'AL', which integrates all the proposed techniques.
Through Fig. 6a-l, we observe that Baseline, +Filter and +Backward work well on the three massive graphs, and can effectively obtain the frequent subgraphs. In particular, we first evaluate the effect of +Filter and +Backward on memory consumption, and plot the results in Fig. 6a-c. The results reveal that more memory is required to carry out +Backward than Baseline. This is justifiable, since the compute method on the master has to go through a recursive function, bringing a slight increase of memory footprint. In comparison, the memory requirement is much smaller when +Filter is adopted, saving up to around 2000 MB. Rather than aggregating all the data in one superstep, which can be excessively memory-consuming, updating the embedding information separately saves memory at the cost of more supersteps. However, as illustrated in Fig. 6j-l, incorporating +Filter actually results in a drop in total running time.
Next, we study the effect of these techniques on communication cost, with results plotted in Fig. 6d-f. Communication cost is measured by the total data transmission in TB. As +Backward does not result in any change of data transmission, we omit it in this comparison. Specifically, communication cost is divided into two categories, namely data aggregation cost and message-passing cost. Data aggregation includes the data synchronized between the master and workers through the aggregators, while message passing refers to the messages sent among vertices. In general, message-passing cost accounts for the majority, which is reflected by the bar of 'BA' in Fig. 8a-c. As +Filter is designed to reduce message passing, the results demonstrate its effectiveness: the total communication cost drops dramatically, though there is a slight increase in the data aggregation cost. Note that the effectiveness of +Filter is moderate on LJ. This can be explained by the shortage of target vertices in each round, due to the shortage of distinct labels. Specifically, the widest gap is 0.610 TB when the threshold τ = 3000 on TT, 0.405 TB on LJ with τ = 500 and 0.24 TB on UP with τ = 600.
We also recorded the number of supersteps, and plot the results in Fig. 6g-i. The results reveal that +Filter actually increases the number of supersteps, while +Backward reduces it, performing differently across datasets. By nature, +Filter reduces the amount of messages at the expense of one extra superstep per round, and +Backward shrinks the number of supersteps when backward edges are involved. Consequently, +All enlarges the number of supersteps at a moderate rate, and the gap between Baseline and +All narrows when the threshold τ becomes large.
The overall performance is compared in Fig. 6j-l. We see that the elapsed time of all the proposed algorithms drops with the increase of τ, and the proposed optimization techniques bring a remarkable improvement. It is justifiable that by introducing +Backward, running time is saved owing to fewer supersteps and fewer synchronization barriers. However, it is not immediately clear that +Filter can save time, due to the increase in the number of supersteps. In comparison with Baseline, +Filter filters out a considerable number of non-promising candidate edges, and only a small number of edges are sent as messages, largely reducing the communication cost. As message passing requires network communication, it can be rather expensive. Despite the increase of supersteps, the contribution to reducing communication cost is more significant. Finally, by incorporating both +Backward and +Filter, +All achieves the best overall performance.

We also compared pegi with GraMi, which was run on a single instance m3.xlarge, and the experiments were carried out on parts of the three datasets (20% and 40% samples, respectively).
Forest Fire [44] was used to sample the data, which well maintains the properties of the original graphs. Note that we shrank the support threshold to ensure an adequate number of frequent subgraphs in the sample graphs. We plot the results in Fig. 7a-k.
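For reference, a minimal sketch of Forest Fire sampling [44] is given below; the burning probability and the toy graph are illustrative choices, not the parameters used in our experiments:

```python
import random
from collections import deque

def forest_fire_sample(adj, target_size, p=0.7, seed=0):
    """Forest Fire sampling: repeatedly ignite a random vertex and
    'burn' outward; at each vertex, a geometrically distributed number
    of unvisited neighbours is burned (forward-burning probability p),
    until target_size vertices have been sampled."""
    rng = random.Random(seed)
    sampled = set()
    nodes = list(adj)
    while len(sampled) < target_size:
        frontier = deque([rng.choice(nodes)])      # ignite a random vertex
        while frontier and len(sampled) < target_size:
            v = frontier.popleft()
            if v in sampled:
                continue
            sampled.add(v)
            fresh = [u for u in adj[v] if u not in sampled]
            rng.shuffle(fresh)
            k = 0                                  # geometric burn count
            while k < len(fresh) and rng.random() < p:
                k += 1
            frontier.extend(fresh[:k])
    return sampled

# Toy graph: a ring of 10 vertices
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
sample = forest_fire_sample(ring, 4)
assert len(sample) == 4
```

The burning process preserves local neighbourhood structure, which is why Forest Fire samples tend to retain the degree and community characteristics of the original graph.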
As to the memory required for running the algorithms (pegi measured by the peak memory usage of a single machine among all workers, and GraMi measured by the peak value of the single instance), we read in Fig. 7a and d that the advantage of pegi over GraMi is evident on the 20% sample, nearly one order of magnitude smaller; that is, 10^3 MB for pegi, and 10^4 MB for GraMi. The advantage is even more significant on the 40% sample, as GraMi ran out of memory when τ is small, e.g. τ = 1500 and 2000; nonetheless, pegi levels off at the 10^3 magnitude. This superiority is achieved by storing the data in multiple instances and applying the optimization technique for reducing data aggregation. The results on LJ and UP show similar trends in Fig. 7b, e, c and f. Note that although GraMi does not maintain the embeddings, it still needs to materialize many, if not all, embeddings for support computation, as long as the embeddings overlap on identical images. Additionally, by doing this, it misses the opportunity of computation sharing, and hence, has to grow embeddings every time from scratch. Shortly, we will see how this affects time efficiency.
As to the overall performance, we plot the results in Fig. 7g-l. Figure 7g-i show that pegi outperforms GraMi, particularly when τ is small. As GraMi did not finish within 10 h, the actual time was not recorded. The superiority is more notable on the 40% samples in Fig. 7j-l, shown in logarithmic scale. The running time of pegi is shorter than that of GraMi by almost one order of magnitude. Likewise, we could not record the time of GraMi at small τ's, due to running out of memory (more than 15 GB space).
In short, the comparison experiment verifies that pegi performs better in general than GraMi, especially on large data sizes and small thresholds.

Evaluating scalability
Lastly, we demonstrate the scalability of the proposed methods. In this set of experiments, we randomly sampled fractions of the original graphs to run the algorithms. In particular, we sampled each original graph, resulting in five sampled graphs with {20%, 40%, 60%, 80%, 100%} of the vertices selected. While the number of vertices grows linearly, the numbers of frequent subgraphs and their embeddings increase sharply. This is due to the fact that the number of edges of the sampled graphs mounts exponentially. The results of scalability against dataset size are provided in Fig. 8a-c. Figure 8a showcases a significant growth of running time against the dataset size. Although the growing trend is noteworthy, it is slower than exponential growth, especially when τ = 5000, which does not increase linearly in the logarithmic scale of the figure. Figure 8b presents a similar trend; although the lines fluctuate slightly, this does not impact the general conclusion. Figure 8c shows even better scalability. As τ is fixed, the numbers of patterns and embeddings should rise exponentially against the data size, and it would be reasonable to see an exponential increase in running time. However, we witness a slowdown in the growth rate on the three datasets, demonstrating that pegi scales well on real-life data.

To further evaluate its ability to handle massive graphs, we also conducted experiments on synthetic graphs with an order of magnitude more edges than UP. We used a synthetic graph generator,5 which naturally measures graph size in terms of |E|. For instance, on a synthetic graph of 100M edges with γ = 5 and the numbers of distinct vertex and edge labels equal to 100 and 5, respectively, when the support threshold was 5000, it took 6832 s to respond.
Besides data scalability, it is important to see how the algorithm performs against the number of computing nodes. We conducted the experiment by adding 10 worker instances each time, from 10 to 50. Intuitively, the runtime performance improves along with the growth of computing nodes. The results in Fig. 8d-f confirm this prediction. In general, the three lines representing different τ's drop gradually with the increase of instance number, though the time saved becomes less significant. Particularly, on TT, the maximum speedup is 1.55 when the number of instances increases from 10 to 20 and τ = 4500; the minimum speedup is still as large as 1.17 when the number of instances increases from 40 to 50 and τ = 5000. On LJ with τ = 1500, the speedups are 1.49, 1.28, 1.12 and 1.08, respectively, each time we added 10 more instances to the system. For UP, the maximum speedup is achieved when increasing the number of instances from 10 to 20 and τ = 600, and the average value is as large as 1.64 for that support threshold. In the ideal scenario, which is hard to meet in practice, the elapsed time would be inversely proportional to the number of instances. The lines in the figure show good scalability, though the decreasing trend gradually slows down, because the increase in the number of instances may burden the synchronization within the cluster. It is also worth noting that scalability is better when the threshold is larger. This can be justified, as a larger threshold results in shorter discovered patterns. When the current subgraph is shorter, the possible growing paths from the current embeddings are more numerous and more likely to be evenly distributed in the graph data, so the mining tasks are more evenly dispatched over the workers, leading to better scalability.
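The reported speedups are simply ratios of consecutive elapsed times. As a sanity check, the hypothetical timings below (chosen only so that they reproduce the LJ figures; they are not the measured values) illustrate the computation:

```python
def speedups(times):
    """Speedup gained each time 10 workers are added: t_i / t_(i+1),
    rounded to two decimals."""
    return [round(a / b, 2) for a, b in zip(times, times[1:])]

# Hypothetical elapsed times (s) for 10, 20, 30, 40 and 50 workers:
print(speedups([1000, 671, 524, 468, 433]))  # [1.49, 1.28, 1.12, 1.08]
```

In the ideal linear case, every entry would equal 2, 1.5, 1.33 and 1.25, respectively; the diminishing measured ratios quantify the synchronization overhead discussed above.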

CONCLUSION
In this paper, we have studied the problem of FSM in a distributed environment. In particular, we base our solution on Pregel, a popular big-graph processing framework, and present the first solution of its kind. The compute methods on the master and vertices are carefully designed, working together to achieve a collective goal. Moreover, in order to enhance the mining performance, we propose two optimization techniques to reduce the message passing and the number of supersteps, respectively. Comprehensive experiments on real-life data confirm the efficiency and scalability of the proposal.
In the future, we plan to apply the proposed algorithm to various real-life applications, e.g. anti-spam email and network security monitoring, to obtain potentially interesting patterns. Additionally, it is of interest to investigate whether Pregel can be leveraged to mine constrained subgraph patterns and significant subgraph patterns on massive graphs.
l is a labeling function that assigns labels to vertices and edges. |VG| and |EG| are the numbers of vertices and edges in G, respectively. lG(v) denotes the label of vertex v ∈ VG, and lG(u, v) denotes the label of edge e = (u, v) ∈ EG.

Definition 3.2 (Minimum image-based support). Consider the set of distinct subgraph isomorphisms F = {f_i} from g to G, where i ∈ [1, |F|]. Let F(v) denote the set of distinct vertices u ∈ VG such that there exists an isomorphism f_i mapping v ∈ Vg to u. The minimum image-based support of g in G is defined as φ(g) = min_{v∈Vg} |F(v)|.



FIGURE 5. Candidate generation tree rooted at p.

TABLE 4. Data statistics.