12
Research Article Efficient and Scalable Graph Similarity Joins in MapReduce Yifan Chen, 1 Xiang Zhao, 1 Chuan Xiao, 2 Weiming Zhang, 1 and Jiuyang Tang 1 1 College of Information System and Management, National University of Defense Technology, Changsha 410073, China 2 Nagoya University, Nagoya, Japan Correspondence should be addressed to Xiang Zhao; [email protected] Received 17 March 2014; Accepted 29 May 2014; Published 8 July 2014 Academic Editor: Jian J. Zhang Copyright © 2014 Yifan Chen et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity joins due to their wide applications for multiple purposes, including data cleaning, and near duplicate detection. is paper considers graph similarity joins with edit distance constraints, which return pairs of graphs such that their edit distances are no larger than a given threshold. Leveraging the MapReduce programming model, we propose MGSJoin, a scalable algorithm following the filtering-verification framework for efficient graph similarity joins. It relies on counting overlapping graph signatures for filtering out nonpromising candidates. With the potential issue of too many key-value pairs in the filtering phase, spectral Bloom filters are introduced to reduce the number of key-value pairs. Furthermore, we integrate the multiway join strategy to boost the verification, where a MapReduce-based method is proposed for GED calculation. e superior efficiency and scalability of the proposed algorithms are demonstrated by extensive experimental results. 1. Introduction As the most commonly used abstract data structure, graphs have been widely used for modeling the data in the fields of bioinformatics, multimedia, social networking, and the like. As a consequence, efforts were dedicated to various problems in managing and analyzing graph data, for example, frequent subgraph mining [1], structure search and indexing [2, 3], similarity search [4, 5], and so forth. is paper focuses on graph similarity join, one basic operation for processing graph data. Given two graph object sets and and a distance threshold, a graph similarity join returns all the pairs of graph objects, respectively, from and , the distances of which are no larger than the threshold. Graph similarity join has a wide spectral of applications, especially in preprocessing of graph mining, for example, structural data cleaning and near replicate structure detection. e most widely applied measure for determining graph similarity is graph edit distance (GED) [6, 7]. Compared with alternative measures, GED has at least three advantages: (1) it allows changes in both vertices and edges; (2) it reflects the topological information of graphs; (3) it is a metric that can be applied to any type of graphs. Consequently, we employ GED to quantify graph similarity in this paper. It is shown that exact computation of GED is NP-hard [8]. e state-of-the-art algorithm for graph similarity join is GSimJoin [9], which adopts the filtering-verification framework. In particular, signatures are generated for every graph with path-based -gram approach for count filtering (cf. Section 2.2). In the phase of verification, GED com- putation is invoked for candidate pairs via an A -based algorithm. GSimJoin is an in-memory algorithm, the perfor- mance and scalability of which are restricted to the available memory of a machine. e dataset size presented in the experimental study is limited in the thousands, since a graph similarity join operation in the worst case needs (||||) count filtering condition checks and similarity computations thereaſter. e era of big data calls for scalable algorithms to support large-scale data processing. is paper attempts to address the challenges on massive graph data. MapReduce is a well-known programming framework to facilitate processing large-scale data in parallel [10]. MassJoin [11] is a MapReduce-based algorithm for similarity join on strings. Nonetheless, there has been no existing distributed algorithm for graph similarity joins. Hindawi Publishing Corporation e Scientific World Journal Volume 2014, Article ID 749028, 11 pages http://dx.doi.org/10.1155/2014/749028

Research Article Efficient and Scalable Graph …downloads.hindawi.com/journals/tswj/2014/749028.pdfResearch Article Efficient and Scalable Graph Similarity Joins in MapReduce YifanChen,

  • Upload
    others

  • View
    12

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Research Article Efficient and Scalable Graph …downloads.hindawi.com/journals/tswj/2014/749028.pdfResearch Article Efficient and Scalable Graph Similarity Joins in MapReduce YifanChen,

Research ArticleEfficient and Scalable Graph Similarity Joins in MapReduce

Yifan Chen1 Xiang Zhao1 Chuan Xiao2 Weiming Zhang1 and Jiuyang Tang1

1 College of Information System and Management National University of Defense Technology Changsha 410073 China2Nagoya University Nagoya Japan

Correspondence should be addressed to Xiang Zhao xiangzhaonudteducn

Received 17 March 2014 Accepted 29 May 2014 Published 8 July 2014

Academic Editor Jian J Zhang

Copyright copy 2014 Yifan Chen et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Along with the emergence of massive graph-modeled data it is of great importance to investigate graph similarity joins due to theirwide applications formultiple purposes including data cleaning and near duplicate detectionThis paper considers graph similarityjoins with edit distance constraints which return pairs of graphs such that their edit distances are no larger than a given thresholdLeveraging the MapReduce programming model we propose MGSJoin a scalable algorithm following the filtering-verificationframework for efficient graph similarity joins It relies on counting overlapping graph signatures for filtering out nonpromisingcandidates With the potential issue of too many key-value pairs in the filtering phase spectral Bloom filters are introduced toreduce the number of key-value pairs Furthermore we integrate the multiway join strategy to boost the verification where aMapReduce-based method is proposed for GED calculation The superior efficiency and scalability of the proposed algorithms aredemonstrated by extensive experimental results

1 Introduction

As the most commonly used abstract data structure graphshave been widely used for modeling the data in the fields ofbioinformatics multimedia social networking and the likeAs a consequence efforts were dedicated to various problemsin managing and analyzing graph data for example frequentsubgraph mining [1] structure search and indexing [2 3]similarity search [4 5] and so forth

This paper focuses on graph similarity join one basicoperation for processing graph data Given two graph objectsets 119877 and 119878 and a distance threshold a graph similarityjoin returns all the pairs of graph objects respectivelyfrom 119877 and 119878 the distances of which are no larger thanthe threshold Graph similarity join has a wide spectral ofapplications especially in preprocessing of graph mining forexample structural data cleaning and near replicate structuredetection

The most widely applied measure for determining graphsimilarity is graph edit distance (GED) [6 7] Compared withalternative measures GED has at least three advantages (1) itallows changes in both vertices and edges (2) it reflects thetopological information of graphs (3) it is a metric that can

be applied to any type of graphs Consequently we employGED to quantify graph similarity in this paper It is shownthat exact computation of GED is NP-hard [8]

The state-of-the-art algorithm for graph similarity joinis GSimJoin [9] which adopts the filtering-verificationframework In particular signatures are generated for everygraph with path-based 119902-gram approach for count filtering(cf Section 22) In the phase of verification GED com-putation is invoked for candidate pairs via anAlowast-basedalgorithm GSimJoin is an in-memory algorithm the perfor-mance and scalability of which are restricted to the availablememory of a machine The dataset size presented in theexperimental study is limited in the thousands since a graphsimilarity join operation in the worst case needs 119874(|119877||119878|)

count filtering condition checks and similarity computationsthereafter The era of big data calls for scalable algorithms tosupport large-scale data processing This paper attempts toaddress the challenges on massive graph data

MapReduce is a well-known programming frameworkto facilitate processing large-scale data in parallel[10] MassJoin [11] is a MapReduce-based algorithm forsimilarity join on strings Nonetheless there has been noexisting distributed algorithm for graph similarity joins

Hindawi Publishing Corporatione Scientific World JournalVolume 2014 Article ID 749028 11 pageshttpdxdoiorg1011552014749028

2 The Scientific World Journal

Inspired by [11] this paper investigates graph similarity joinsbased on MapReduce

We firstly propose MGSJoin a MapReduce-based algo-rithm following the filtering-verification framework Itemploys signatures of path-base 119902-grams as keys and thecorresponding graphs as values forming key-value pairsThrough filtering the graph pairs whose common signaturesare less than the threshold given by count filtering conditionare filtered out The remaining pairs constitute the candidateset sent to verification thereafter Due to the potentially largenumber of key-value pairs that may incur large communi-cation cost we incorporate the Bloom filter technique byadding generated signatures to spectral Bloom filters Thiseffectively reduces the number of intermediate key-valuepairs and thus the complexity of network transmissionwhile the filtering capacity is mostly preserved Furthermorewe employmultiway join to improve the verification phase bycondensing two MapReduce rounds into one while devisinga MapReduce-based method to calculate GED

To the best of our knowledge it is among the firstattempts to present a MapReduce-based graph similarity joinalgorithm Our contribution can be summarized as follows

(i) We redesign the current in-memory graph similarityjoin algorithm and adapt it to the MapReduce frame-work The resulting baseline algorithm is capable ofprocessing large-scale graph datasets

(ii) We propose to use Bloom filters to reduce inter-mediate key-value pairs in the filtering phase whilesacrificing little filtering capacity Besides we presenta multiway join optimized verification strategy suchthat the number of required MapReduce rounds isreduced too Moreover a MapReduce-based methodis designed forGED calculation which can handle thecalculation for large graphs

(iii) We implement the proposed algorithm MGSJoin andconduct a wide range of experiments on a real datasetThe results show that both the efficiency and thescalability of MGSJoin are superior to the currentsolutions

This paper is constructed as follows In Section 2 prob-lem definition and background are provided We proposethe basic algorithm in Section 3 integrate Bloom filtersin Section 4 and optimize the verification in Section 5Section 6 lists the experimental results and analyses Relatedworks are described in Section 7 followed by conclusion inSection 8

2 Preliminaries

21 Problem Definition In this paper we focus on simplegraph namely an undirected graph without self-loops ormultiple edges A labeled graph can be represented as aquadruple 119892 = ⟨119881 119864 119897

119881 119897119864⟩ where 119881 is a set of vertices and

119864 sube 119881 times 119881 is a set of edges 119897119881and 119897119864are label functions

that assign labels to vertices and edges respectively 119881(119903)

denotes the vertex set of 119903 and119864(119903) denotes its edge set |119881(119903)|

and |119864(119903)| represent the numbers of vertices and edges in 119903

respectively 119897119881

(119906) denotes the label of 119906 isin 119881 and 119897119864(119890(119906 V))

denotes the label of the edge between 119906 and V 119906 V isin 119881Additionally we use |119892| = |119881| + |119864| to depict the size of graph119892

Definition 1 (graph pair) A graph pair is a tuple denoted by⟨119903 119904⟩ where 119903 isin 119877 119904 isin 119878119877 and 119878 are two sets of graph objectsrespectively

Definition 2 (graph isomorphism) A graph 119903 is isomorphicto another graph 119904 denoted by 119903 = 119904 if there exists a bijection119891 119881(119903) rarr 119881(119904) such that

(1) forall119906 isin 119881(119903)(119891(119906) isin 119881(119904) and 119897119881

(119906) = 119897119881

(119891(119906)))(2) forall119890(119906 V) isin 119864(119903)(119890(119891(119906) 119891(V)) isin 119864(119904) and 119897

119864(119890(119906 V)) =

119897119864(119890(119891(119906) 119891(V))))

Definition 3 (graph edit operation) A graph edit operation isan edit operation to transform one graph to another It can beone of the following six operations

(i) insert an isolated vertex into the graph(ii) delete an isolated vertex from the graph(iii) change the label of a vertex(iv) insert an edge between two disconnected vertices(v) delete an edge from the graph(vi) change the label of an edge

Definition 4 (graph edit distance) The graph edit distance(GED) between graphs 119903 and 119904 denoted by GED(119903 119904) is theminimum number of edit operations that transform 119903 to agraph isomorphic to 119904

Example 5 Figure 1(a) illustrates the molecule namedcyclopropanone while Figure 1(b) shows another moleculewhich does not exist When recording cyclopropanone intodatabase errors may be made and the molecule can becomethe one shown in Figure 1(b) Manual checking is requiredto find the errors which is very difficult Seeing that thetwo molecules are very similar we can adapt the GED tomeasure the similarity so that graph similarity join can beapplied to resolve the problem Take the two molecules asan example First change the bond (1198621 119874) from double tosingle and then transform one of the atoms 119867 bonded with1198623 into 119873 through which a molecule isomorphic to the onein Figure 1(b) is obtained So at least two edit operationsare required namely the graph edit distance Given thethreshold 120591 = 4 the two graphs are regarded similar

Problem 6 (graph similarity join) Given two sets of graphobjects 119877 and 119878 and a distance threshold 120591 as input a graphsimilarity join returns a result set ⟨119903 119904⟩ | 119903 isin 119877 and 119904 isin

119878 and GED(119903 119904) le 120591

22 Count Filtering

Definition 7 (path-based 119902-gram [9]) A path-based 119902-gramin a graph is a simple path of length 119902 ldquoSimplerdquo means thatthere is no repeated vertex in the path

The Scientific World Journal 3

0

C1

C2 C3 H

HH

H

(a) Cyclopropanone

0

C1

C2 C3H

H H

N

(b) ldquoCyclopropanonerdquo

Figure 1 Molecules

The path-based 119902-grams of a graph constitute the graphsignatures denoted by 119904119894119892 = ⟨119881 119864 119897

119881 119897119864⟩ where 119904119894119892 sube 119903|119881| =

119902+1 and |119864| = 119902 Let 119876119903be graph 119903rsquos signature set We say 119904119894119892

and 1199041198941198921015840 are common if 119904119894119892 = 119904119894119892

1015840 Note there can be multiple119902-grams that correspond to one particular signature

Lemma 8 (count filtering [9]) Graphs 119903 and 119904 satisfy thedistance constraints 120591 if the number of common signatures for⟨119903 119904⟩ is no less than 119871119861

119871119861 = max (1003816100381610038161003816119876119903

1003816100381610038161003816 minus 120591 sdot 119863 (119903) 1003816100381610038161003816119876119904

1003816100381610038161003816 minus 120591 sdot 119863 (119904)) (1)

where 119863(119903) (resp 119863(119904)) is the maximum number of affectedsignatures in 119876

119903(resp 119876

119904) when one edit operation is invoked

on 119903 (resp 119904)

The count filtering condition check requires119874(max(119889

119902

119903|119881(119903)| 119889

119902

119904|119881(119904)|)) where 119889

119903(resp 119889

119904) is the

average degree of 119903 (resp 119904) A graph similarity join requirespairwise count filtering condition checks thus resulting ina complexity of 119874(max(119889

119902

119903|119881(119903)| 119889

119902

119904|119881(119904)|)|119877||119878|) This can

be intolerable on large-scale datasets Next we present ascalable solution leveraging the MapReduce paradigm

3 Framework

This section presents a MapReduce-based graph similarityjoin algorithm following the filtering-verification fashionThe outline of the algorithm is listed below

31 Filtering Weallocate twoMapReduce jobs to the filteringphase

Job 1 Job 1 counts the same type of common signatures forgraph pairs We use graphs as values and their correspond-ing 119894119889s as keys to compose input key-value pairs 119902-gramsignatures (denoted by 119904119894119892) are generated in the Map taskWe use the generated signatures as keys and 119894119889s of graphs asvalues to form the output key-value pairs As a consequencein the Reduce task we can obtain graphs with commonsignatures For the same signature a graph 119894119889 may appear

more than once since there may exist several 119902-grams in agraph corresponding to an identical signature The functionfor Map task is shown as follows

Map⟨119903119894119889 119903⟩⟨119904119894119889 119904⟩ rarr ⟨119904119894119892 119903119894119889119904119894119889⟩

(1) input ⟨119903119894119889 119903⟩ or ⟨119904119894119889 119904⟩(2) generate 119902-gram signatures for each input

graph(3) emit ⟨119904119894119892 119903119894119889⟩ or ⟨119904119894119892 119904119894119889⟩ for each signature

with the 119894119889 of its corresponding graph

The Reducer gets ⟨119904119894119892 119897119894119904119905(119903119894119889119904119894119889)⟩ in the Reduce taskWedefine119891

119903(resp119891

119904) to denote the occurrences of 119903119894119889 (resp

119904119894119889) in the 119897119894119904119905(V119886119897119906119890) of Reduce input and 119898119891to denote the

occurrences of graph pairs that share the input key (119904119894119892) ofReduce 119898

119891= min(119891

119903 119891119904) 119898119891is calculated and output with

the corresponding graph pair The function of Reduce task isas follows

Reduce⟨119904119894119892 119897119894119904119905(119903119894119889119904119894119889)⟩ rarr ⟨119903119894119889 (119904119894119889 119898119891)⟩

(1) get list of values consisting of 119903119894119889 and 119904119894119889 withthe specific key (119904119894119892)

(2) split the list of values into list of 119903119894119889 and list of119904119894119889

(3) for each pair ⟨119903119894119889 119904119894119889⟩ where 119903119894119889 is from119897119894119904119905(119903119894119889) and 119904119894119889 is from 119897119894119904119905(119904119894119889) calculate 119898

119891

and output ⟨119903119894119889 (119904119894119889 119898119891)⟩

Job 2 Job 2 counts the total common signatures for graphpairs and checks the count filtering condition after which theset of candidate pairs is obtained The Map function is listedas follows

Map⟨119903119894119889 (119904119894119889 119898119891)⟩ rarr ⟨119903119894119889 (119904119894119889 119898

119891)⟩

(1) read a ⟨119903119894119889 (119904119894119889 119898119891)⟩ into Map task

(2) emit ⟨119903119894119889 (119904119894119889 119898119891)⟩ which is exactly the same

as it reads

4 The Scientific World Journal

We input ⟨119903119894119889 (119904119894119889 119898119891)⟩ to theMap task and then output

exactly the same key-value pair Therefore the Reduce taskreceives all the graph pairs generated in job 1 with the specific119904119894119889 As the graph pair may have more than one type ofcommon signatures 119898

119891is summed with the same graph pair

for each identical signature respectively Subsequently thegraph pairs with less than 119871119861 (1) common signatures will bediscarded The remaining graph pairs are candidate pairs forverification The function for Reduce is as follows

Reduce⟨119903119894119889 119897119894119904119905(119904119894119889 + 119898119891)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(1) receive output from Mappers the specific 119903119894119889and a list of 119904119894119889 with 119898

119891

(2) sum 119898119891and calculate the number of common

signatures for each pair ⟨119903119894119889 119904119894119889⟩(3) conduct the count filtering for each pair and

output pairs to DFS whose common signaturesare more than 119871119861

32 Verification In the verification phase candidate pairs⟨119903119894119889 119904119894119889⟩ are to be verified where the graphs 119903 and 119904 arerequired that is join operations are necessary to retrievegraphs 119903 and 119904 by their 119894119889rsquos Hence we allocate two MapRe-duce jobs to join (119903119894119889 119903) (119903119894119889 119904119894119889) and (119904119894119889 119904)

Job 1 Job 1 replaces 119904119894119889 with graph 119904 The map function takesgraph set 119878 and ⟨119903119894119889 119904119894119889⟩ as input and emits ⟨119903119894119889 119904⟩ which islisted as follows

Map⟨119903119894119889 119904119894119889⟩⟨119904119894119889 119904⟩ rarr ⟨119904119894119889 119903119894119889119904⟩

(1) input candidate pair ⟨119903119894119889 119904119894119889⟩ and graph set 119878(2) for ⟨119903119894119889 119904119894119889⟩ emit ⟨119904119894119889 119903119894119889⟩ and for ⟨119904119894119889 119904⟩

emit it exactly both of which take 119904119894119889 as the key

TheReduce task gathers the list 119903119894119889 and graph 119904 for the key119904119894119889 and then outputs the key-value pair ⟨119903119894119889 119904⟩ The functionfor Reducer is as follows

Reduce⟨119904119894119889 119897119894119904119905(119903119894119889119904)⟩ rarr ⟨119903119894119889 119904⟩

(1) receive a list of 119903119894119889 and graph set 119878 with thespecific key 119904119894119889

(2) for ⟨119903119894119889 119904119894119889⟩ replace 119904119894119889 with 119904 and output pair⟨119903119894119889 119904⟩

Job 2 Job 2 replaces 119903119894119889 with graph 119903 invokes label filteringconditions and calculates GED to find the similar graphpairs

The function for Map task is as follows

Map⟨119903119894119889 119904⟩⟨119903119894119889 119903⟩ rarr ⟨119903119894119889 119903119904⟩

(1) input the candidate pair ⟨119903119894119889 119904⟩ and graph set119877(2) emit the key-value pair it reads where the group

key is 119903119894119889 for both cases

The function for Reduce task is as follows

Reduce⟨119903119894119889 119897119894119904119905(119903119904)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(1) receive a list of values that consisted of graphs 119903

and 119904 corresponding to the key 119903119894119889(2) replace 119903119894119889 with graph 119903(3) calculate GED for pairs ⟨119903 119904⟩ and output similar

pairs

33 Correctness and Complexity Analysis All graph pairs thatsatisfy the edit distance constraints are returned error-freewhich justifies its correctness

For algorithm complexity we take all three phasesmdashMap Reduce and Shufflemdashinto consideration IO readingoverhead from distributed file system (DFS) is considered forMap task whereas IO writing overhead into DFS is analyzedfor Reduce task Both tasks also take time complexity intoconsideration Shuffle considers the network communicationcost

Some parameters are defined preceding the analysis |119892|

denotes the average size of a graph 120572 means candidate ratiothat is the percentage of candidate pairs from all graph pairsand 120573 means the ratio of similar pairs from pairs that passedcount filtering We assume that the size of key-value pair thatcontains only graph IDs is 1

In job 1 of filter phase Map reads the data of graph sets119877 and 119878 from DFS the cost of which is 119874(|119892| sdot (|119877| + |119878|))Consider the worst case in Map task where 119902-grams aregenerated for each graph 119903(119904) and the number of signaturesgenerated is |119881|

119902

so the time complexity for generating119902-grams can be estimated as 119874(|119881|

119902

) Therefore the timecomplexity for Map task is 119874(|119881|

119902

sdot (|119877| + |119878|)) As eachgenerated 119902-gram signature forms an output key-value pair(the size for the pair is 1) the communication cost for Shuffleis119874(|119881|

119902

sdot(|119877|+|119878|)) In theReduce task all graphs representedby 119894119889 containing the same signature are acquired and pairedThus the IO cost is 119874(|119877||119878|) Then the occurrences of119903119894119889 and 119904119894119889 are counted In other words all the key-valuepairs generated from the Map task are counted so the timecomplexity for Reduce is 119874(|119881|

119902

sdot (|119877| + |119878|))Consider job 2 in filter phase where all the output key-

value pairs from job 1 are read intoMap task of job 2 so the IOcost is 119874(|119877||119878|) Map outputs the key-value pairs it reads sothe time complexity forMap task and communication cost forShuffle are119874(|119877||119878|) In Reduce task all pairs are traversed sothe time complexity is119874(|119877||119878|) As we have 120572|119877||119878| candidatekey-value pairs the IO cost for Reduce task is 119874(120572|119877||119878|)

In verification phase of job 1 Map reads the candidatepairs represented by ⟨119903119894119889 119904119894119889⟩ and graph set 119878 from DFSwhere the IO overhead requires 119874(120572|119877||119878| + |119892||119878|) The timecomplexity of Map task and communication cost of Shuffleare also 119874(120572|119877||119878| + |119892||119878|) because in Map function we justemit what has been exactly input into Then in Reduce taskwe replace 119904119894119889with graph 119904 the time complexity is119874(120572|119877||119878|)and the IO overhead is 119874(120572|119892||119877||119878|)

In verification phase of job 2 Map reads the candidatepairs represented by ⟨119903119894119889 119904⟩ and graph set 119877 from DFSwhere the IO overhead requires 119874(|119892||119877|(120572|119878| + 1)) The

The Scientific World Journal 5

Input graph object sets 119877 and 119878 GED threshold 120591

Output similar graph pairs ⟨119903 119904⟩

(1) Filter(2) Job 1 count the same type of common signatures for graph pairs(3) Job 2 count the total common signatures and check the count filtering fir graph pairs(4) Verification(5) Job 1 replace sid with graph 119904(6) Job 2 replace rid with graph 119903 and calculate GED for candidate pairs

Algorithm 1MGSJoin

(1)Map ⟨119903119894119889 119903⟩ ⟨119904119894119889 119904⟩ rarr ⟨119903119894119889 119904119887119891119903⟩ ⟨119904119894119889 119904119887119891

119904⟩

(2) create SBF for each graph and insert the generated 119902-gram signatures into it(3) output ⟨119903119894119889 119904119887119891

119903⟩ or ⟨119904119894119889 119904119887119891

119904⟩

(4) Shuffle conduct Cartesian product between pairs ⟨119903119894119889 119904119887119891119903⟩ and pairs ⟨119904119894119889 119904119887119891

119904⟩

(5) Reduce ⟨⟨119903119894119889 119904119887119891119903⟩ 119897119894119904119905 (⟨119904119894119889 119904119887119891

119904⟩)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(6) for any ⟨119903119894119889 119904119894119889⟩ calculate the intersection of 119904119887119891119903and 119904119887119891

119904and estimate

the number of common signatures(7) invoke the count filtering for each pair ⟨119903119894119889 119904119894119889⟩ and output it if it passes the condition

Algorithm 2 Replacement of filter of Algorithm 1

time complexity of Map task and communication cost ofShuffle are also 119874(|119892||119877|(120572|119878| + 1)) for the same reason aboveThen in Reduce task we first replace 119903119894119889 with graph 119903 andthen calculate GED for graph pairs The GED calculation by

theAlowast-based algorithm requires 119874(|119881||119881|

) In the worst case

the time complexity of Reduce is 119874(120572|119877||119878||119881||119881|

) Finallysimilar graph pairs are emitted into DFS which requires119874(120572120573|119877||119878|)

4 Incorporating Bloom Filters

In the filtering phase of Algorithm 1 twoMapReduce jobs arerequired with many intermediate key-value pairs generatedand transmitted These increase the IO and communicationcost which can be fairly time-consuming This sectionintroduces the Bloom filter technique to reduce such costNext we first recall the concept of spectral Bloom filters

41 Spectral Bloom Filter Bloom filters [12] are space efficientdata structures which allow fast membership queries over agiven set A Bloom filter uses 119896 hash functions ℎ

1 ℎ2 ℎ

119896

to hash elements into an array of size 119898 For an element 119890 inthe set the bit at positions ℎ

1(119890) ℎ2(119890) ℎ

119896(119890) in the array is

set to 1 Given a query item 119902 we check its membership in theset by examining the bits at positions ℎ

1(119902) ℎ2(119902) ℎ

119896(119902)

of the array The item 119902 is reported to be contained in theset if (and only if) all the aforementioned bits are 1 Thismethod brings a small probability of false-positive that isit may return a positive result for an item which actually isnot contained in the set but no false-negative while gainingsubstantial space savings [13]

Spectral Bloom filter (SBF) [14] generalized the basicBloom filter to be able to record the element frequency

which is thus adopted in this paper SBF is represented by⟨119860 119891⟩ where 119860 is a set and 119891 is a map from 119860 to naturalnumbers that is 119891 119860 rarr 119873 where 119873 is the universe ofnatural numbers SBF replaces the bit vector with a vectorof 119898 counters For insertion of item 119890 the counters 119862 =

119862ℎ1(119890)

119862ℎ2(119890)

119862ℎ119896(119890)

are increased by 1 for insertion anddecreased by 1 for deletion Let 119891(119902) denote the frequency of119902 A basic query for SBF on an item 119902 returns an estimationon 119891(119902) that is 119891(119902) = min

119894isin1119896119862ℎ119894(119902)

Note that similarto Bloom filters SBF also never underestimate 119891(119902)

42 Algorithm Incorporating SBF not only reduces the num-ber of key-value pairs but also contracts the two MapReducejobs into one in the filtering phase In particular the Maptask takes graph sets 119877 and 119878 as input Then we createa SBF for each graph by adding the 119902-gram signaturesCartesian product is conducted for the output key-valuepairs ⟨119903119894119889 119904119887119891

119903⟩ and ⟨119904119894119889 119904119887119891

119904⟩ In the Reduce task for

each graph pair their SBFs are intersected to estimate thenumber of common signatures By intersecting two SBFswithcounters 119862

(1) and 119862(2) respectively it returns another SBF

with counters119862lowast119862lowast119894

= min119862(1)

119894 119862(2)

119894 119894 isin 1 119898 Hence

the number of common signatures could be estimated bylfloor(1119896) sum

119898

119894=1119862lowast

119894rfloor Subsequently the graph pairs which have

less than 119871119861 common signatures given by count filteringcondition will be discarded and the remaining pairs formthe candidate set

We provide the pseudocode of the aforementioned pro-cess in Algorithm 2

43 Correctness and Complexity Analysis There is smallprobability that the false positive case happens with SBFSpecifically a query for an item 119902 in SBF on 119891(119902) 119891(119902)

6 The Scientific World Journal

Table 1 Complexity analysis of filtering

Phase Job 1 Job 2 +SBF

Map IO 119874 (1003816100381610038161003816119892

1003816100381610038161003816 sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 sdot (|119877| + |119878|))

Time 119874 ((|119881|119902

sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119896|119881|119902

sdot (|119877| + |119878|))

Shuffle 119874 ((|119881|119902

sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119898 |119877| |119878|)

Reduce IO 119874(|119877||119878|) 119874 (120572 |119877| |119878|) 119874 (120572 |119877| |119878|)

Time 119874 ((|119881|119902

sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (|119877| |119878|)

may be larger than 119891(119902) Therefore the number of commonsignatures estimated this way may be larger than the actualvalue Nonetheless false-negative will never happen whichensures the correctness of the algorithm in a certain case thepruning power of count filtering will be impaired Besidesfalse-positive will be less likely to happen if one carefullychooses the hash functions and configures the sizes ofcounters

Then we analyze the complexity Let 119898 be the size ofa SBF The Map task reads the entire sets 119877 and 119878 so itsIO cost is 119874(|119892|(|119877| + |119878|)) Then signatures are generatedand added to SBF and 119896 hash values are calculated for eachsignatureThus the time complexity forMap is119874(119896|119881|

119902

(|119877|+

|119878|)) Map emits the SBF for each input graph and thenCartesian product is conductedThe communication cost forShuffle is119874(119898

2

|119877||119878|) In the Reduce task for each graph pairthe number of common signatures is calculated and countfiltering condition is checked Thus the time complexity is119874(|119877||119878|) Regardless of false-positive cases the IO cost is thesame as before Denoting the improved algorithm by ldquo+SBFrdquowe summarize the complexity results in Table 1

5 Optimizing Verification Phase

In verification phase we need to calculate the GED ofcandidate pairs Nevertheless it is not capable of finishingthe calculation with large graphs and threshold 120591 Thereforewe devise a MapReduce-based method for GED calculationwhich is able of handling large-scale graphs Besides joinoperations are required preceding the GED calculation to getthe entire graphThus three relations 119877(119903119894119889 119903)⋈119862(119903119894119889 119904119894119889)⋈

119878(119904119894119889 119904) are joined to obtain the input of the GED algorithmwhere 119862(119903119894119889 119904119894119889) is the output of the filtering phase Inspiredby the idea of multiway join we can reduce the number ofrequired MapReduce jobs from two to one

51 MapReduce for GED Calculation TheGED calculation isbased onAlowast algorithmAlowast constructs a search-tree the nodeof which represents a mapping status A mapping status isstored in an array (denoted by 119909) the index of which standsfor different vertices in graph 119903 and the corresponding valuestands for the vertices of graph 119904Alowast explores the space of allpossible vertex mappings between two graphs in a best-firstsearch fashion with function (denoted by 119891(119909)) establishedto determine the order in which the search visits vertexmappings 119891(119909) is a sum of two functions (1) the distancefrom the initial state to the current state (denoted by 119892(119909))

(2) a heuristic estimate of the distance from the current stateto the goal (denoted by ℎ(119909)) 119892(119909) and ℎ(119909) are calculated bythe following equations

119892 (119909) = GED (119903119901 119904119901)

ℎ (119909) = Γ (119871119881

(119903119901) 119871119881

(119904119902)) + Γ (119871

119864(119903119902) 119871119864

(119904119902))

(2)

119903119901consists of the vertices that have been mapped and edges

connecting them while 119903119902consists of the vertices unmapped

yet as well as their resident edges The equation for ℎ(119909)

represents the label difference of two graphsThe search space forAlowast-based approach is very large

requiring 119874(|119881||119881|

) In order to boost up the searchingprocedure parallelization is a common way to think aboutThe naive way is to allocate different branches of search-tree to different workers so that the searching procedure canproceed in parallel However the load is not balanced this wayso that the final runtime is determined on the worker withthe heaviest load As a consequence we devise aMapReduce-based method to calculate GED denoted by MRGED whichreallocates the works after each MapReduce round

The format of the key-value pairs to be manipulated byMapReduce is ⟨119909 119891(119909)⟩ The procedure of searching for theresult is through iterations that each iterationwalks down onelayer of the search-tree

52 Multiway Join In relational database a multiway joincan process 119879

1(119860 119861) ⋈ 119879

2(119861 119862) ⋈ 119879

3(119862 119863) together in one

round where 119879119894(119883 119884) is a relational table with attributes

119883 and 119884 (Algorithm 3) Following the same idea we canconsolidate the twoMapReduce jobs required for verificationin Algorithm 1 Specifically let ℎ be a hash function withrange 1 2 119899 where 119899

2 is the number of Reducers Weassociate eachReduce taskwith a pair (119894 119895) where 119894 119895 isin [1 119899]Each tuple 119905

2(119887 119888) isin 119879

2(119861 119862) is sent to theReducer numbered

(ℎ(119887) ℎ(119888)) while each tuple in1198791(119886 119887) (resp119879

3(119888 119889)) is sent

to Reducers numbered (ℎ(119887) 119909) (resp (119910 ℎ(119888))) for any 119909

(resp 119910) Each Reduce task computes the join of the tuplesit receives It is shown that multiway join is more efficient inpractice than two simple joins [15]

We encapsulate the improved verification procedure inAlgorithm 4

53 Correctness and Complexity Analysis One may immedi-ately verify that Algorithm 4 correctly conducts the verifica-tion

The Scientific World Journal 7

Input 119903 119904 are graphs from candidate pair ⟨119903 119904⟩

Output Graph edit distance of 119903 and 119904

(1) construct the search tree with the root node ⟨0 0 0 0⟩(2) read nodes into Mapper if the node read in is fully mapped namely 119903

119902is empty go to line (8)

(3)Map ⟨119909 119891(119909)⟩ rarr ⟨119909 119891(119909)⟩

(4) calculate 119891(119909) for node 119909 and check whether 119891(119909) gt 120591 if it is continue(5) create new nodes by mapping a vertex V

1isin 119903119902(119909) to a vertex V

2isin 119904119902(119909) for any V

1and V

2

(6) output all the newly created nodes(7) back to line (2)(8) gather the nodes calculate the minimum 119891(119909) and return it

Algorithm 3 MRGED (119903 119904)

(1)Map ⟨119903119894119889 119903⟩ ⟨119903119894119889 119904119894119889⟩ ⟨119904119894119889 119904⟩

(2) if ⟨119903119894119889 119903⟩ then emit(⟨(ℎ(119903119894119889) 119909) ⟨119903119894119889 119903⟩⟩) for any 119909 if ⟨119903119894119889 119904119894119889⟩ thenemit(⟨(ℎ (119903119894119889) ℎ (119904119894119889)) ⟨119903119894119889 119904⟩⟩) for any 119910 else emit(⟨(119910 ℎ(119904119894119889)) ⟨119904119894119889 119904⟩⟩)

(3) Reduce ⟨(ℎ(119903119894119889) ℎ(119904119894119889)) 119897119894119904119905(V119886119897119906119890)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(4) split 119897119894119904119905(V119886119897119906119890) into 119897119894119904119905 (⟨119903119894119889 119903⟩) 119897119894119904119905 (⟨119904119894119889 119904⟩) and 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩)(5) denote 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩) as joinkey(6) for ⟨119903119894119889 119904119894119889⟩ isin 119895119900119894119899119896119890119910119904 119908ℎ119890119903119890 ⟨119903119894119889 119903⟩ isin 119897119894119904119905 (⟨119903119894119889 119903⟩) 119886119899119889 ⟨119904119894119889 119904⟩ isin 119897119894119904119905 (⟨119904119894119889 119904⟩) do(7) if 119892119903119886119901ℎ119904 119903 119886119899119889 119904 119894119899 ⟨119903 119904⟩ 119886119903119890 V119890119903119910 119897119886119903119892119890 then invoke MRGED

calculation else invoke the Alowast-based in-memory algorithm

Algorithm 4 Replacement of verification of Algorithm 1

In the multiway join based verification phase the Maptask takes graph sets 119877 and 119878 and the candidate pairs repre-sented by their 119894119889s as input The input IO cost is 119874(|119892|(|119877| +

|119878|) + 120572|119877||119878|) The key-value pair ⟨119903119894119889 119903⟩ (resp ⟨119904119894119889 119904⟩) issent to 119899 Reduce tasks numbered (119903119894119889 119909) (resp (119910 119904119894119889)) forany119909 (resp119910) whereas the key-value pair ⟨119903119894119889 119904119894119889⟩ is sent tothe only Reduce task numbered (ℎ(119903119894119889) ℎ(119904119894119889)) As a resultthe communication cost for Shuffle is 119874(119899|119892|(|119877| + |119878|) +

120572|119877||119878|) where 1198992 is the number of Reducers In the Reduce

task all candidate graph pairs go through edit distancecomputation Pairs with larger size go through MRGEDwhile pairs with smaller size go throughAlowast For simplicitythe complexity of GED calculation is regarded the same asthe baseline algorithm Labelling the resulting algorithmwithldquo+MJrdquo we summarize the complexity results in Table 2

6 Experiments

61 Experiment Setup We conducted experiments on severalpublicly available real datasets but only present the resultson Pubchem (httppubchemncbinlmnihgov) due to theinterest of space The dataset is constructed by sampling1000000 graphs from Pubchem

Amazon cloud services were used as our experimentplatform Specifically we used Elastic Compute Cloud (EC2)in which the computing nodes are called instances In theexperiment 31 instances were used by defaultmdashone set asmaster node and others as worker nodes The standardconfiguration of all EC2 instances is m1small one CPUof single core with 17 GB memory running Hadoop 112(Table 3)

62 Evaluating Filters In order to evaluate the effectivenessof our filtering techniques we use the term ldquoBasicrdquo for thebaseline algorithm for processing graph similarity joins basedonMapReduce ldquo+SBFrdquo denotes filtering improved algorithmof Basic by incorporating SBF

The algorithm efficiency has been studied and shownin Figure 2(a) However the pruning power is somewhatimpaired so we conducted the experiment to record theincrease of candidate pairs through ldquo+SBFrdquo (cf Figure 2(b))It can be revealed that ldquo+SBFrdquo outweigh ldquoBasicrdquo in efficiencyby sacrificing little pruning power When 120591 = 5 less than 300more candidate pairs are generatedwhile about 5000 secondsare reserved

63 Evaluating Verification The verification was evaluatedwith candidate pairs generated by Basic Term ldquo+MJrdquodenotes applying multiway join in verification whileldquo+MRGEDrdquo denotes adapting alternative MapReduce-basedGED calculation We use term ldquoMGSJoinrdquo to indicate thebasic algorithm incorporating both techniques Figure 3(a)shows the runtime comparison between +MRGED

and MGSJoin The result illustrates the superiority ofapplying multiway join When 120591 = 5 algorithm withmultiway join is about 6000 seconds faster than withordinary joins Figure 3(b) presents the result of evaluatingMRGED where Basic and +MRGED are compared It canbe observed that the runtime of both algorithms growsexponentially When 120591 equals 1 the Basic finishes quickerthan +MRGED (162 s and 204 s resp) while 120591 equals 2 the+MRGED outweighs Basic (the runtime is 712 s and 580 sresp) This is because the calculation required for 120591 = 1

8 The Scientific World Journal

Table 2 Complexity analysis of verification

Phase Job 1 Job 2 +MJ

Map IO 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)

Time 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (119899 (|119877| + |119878|) + 120572 |119877| |119878|)

Shuffle 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1198991003816100381610038161003816119892

1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)

Reduce IO 119874 (1205721003816100381610038161003816119892

1003816100381610038161003816 sdot |119877| |119878|) 119874 (120572120573 |119877| |119878|) 119874 (120572120573 |119877| |119878|)

Time 119874 (120572 |119877| |119878|) 119874 (120572|119877||119878|119881||119881|

) 119874 (120572|119877||119878|119881||119881|

)

0 1 2 3 4 5

Elap

sed

time (

s)

GED threshold (120591)

Basic

+SBF

15k

12k

9k

6k

3k

(a)

0

50

100

150

200

250

300

1 2 3 4 5GED threshold (120591)

Num

ber o

f Δ (c

andi

date

pai

rs)

(b)

Figure 2 Evaluating filtering

Table 3 Dataset statistics

Dataset |119877| |119881| |119864| Disk size (GB)Enamine 1000000 5232 5037 07

is small where the MRGED is clumsy compared withAlowastwhereas when 120591 = 2 more calculation is required so that theadvantage of MRGED comes out When 120591 is within the rangeof 3ndash5 Basic is unable to finish because the large calculationdrives Basic out of memory

64 Comparing with State-of-the-Art Method We comparedour algorithm with the state-of-the-art method GSimJoinIn Figure 4(a) we chose 10000 graphs in order to comparewith GSimJoin and the result witnesses the obvious supe-riority of MGSJoin over GSimJoin Figure 4(b) is drawn inlog scale which varies the number of graphs and records theelapsed time Both algorithms grow linear in the figure whichreflect their exponential growth MGSJoin rises mush slowerthan GSimJoin The runtime of GSimJoin is about 10 timeslonger than MGSJoinwhen joining 100 graphs and 100 timeslonger than MGSJoinwhen joining 10000 graphsMoreoverwhen we have to join 100000 graphs GSimJoin is runningout of memory so that no result is recorded

65 Speedup We evaluated the speedup of our algorithm byvarying the number of instances from 10 to 50 The exper-imental results are shown in Figure 5(a) We can see that

with the increase of instances in the cluster the performanceof MGSJoin significantly improved The improvement issignificantly shown when threshold 120591 equals 4 With moreinstances running the count filtering is getting faster bycounting for the common signatures simultaneously andthe verification is getting quicker by joining relations andcalculating GED in parallel

66 Scale-Up We evaluated the scale-up of our algorithmby increasing both dataset sizes and number of nodes in thecluster The result is shown in Figure 5(b) It is worth notingthat as the dataset increased the results for different values of120591 get similar in the trend of increase All lines rise smoothlywhich reveals good scalability of MGSJoin

7 Related Work

Graph SimilarityQueries Similarity joins retrieve similar dataobject pairs which can be strings sets trees and graphs [16]As to GED-based graph similarity search [17] proposed 120581-AT a tree-based 119902-gram approach However it is associatedwith the drawback of usually loose lower bound for countfiltering Seeing the drawback [9] presented a path-based 119902-gram approach In comparison with 120581-AT GSimJoin is moreefficient by leveraging more advanced filtering techniquesThus we adopt the path-based 119902-gram approach in this paper

MapReduce-Based Graph Algorithms MapReduce is a dis-tributed programming framework [10] which has been

The Scientific World Journal 9

Elap

sed

time (

s)

1 2 3 4 5GED threshold (120591)

16k

12k

8k

4k

MGSJoin

+MRGED

(a)

Elap

sed

time (

s)

Basic

+MR

105

104

103

102

1 2 3 4 5GED threshold (120591)

(b)

Figure 3 Evaluating verification

0

Elap

sed

time (

s)

GSimJoin

MGSJoin

1 2 3 4 5GED threshold (120591)

5k

4k

3k

2k

1k

(a)

Elap

sed

time (

s)

Number of graphs

GSimJoin

MGSJoin

105

104

103

102

102

101

103

104

106

105

(b)

Figure 4 Comparing with state-of-the-art method

2k

4k

6k

8k

10k

504540353025201510

Elap

sed

time

Number of instances

120591 = 1

120591 = 2

120591 = 3

120591 = 4

(a)

010 20 30 40 50

Elap

sed

time (

s)

Number of instances and number of graphs (times2000)

5k

4k

3k

2k

1k

(b)

Figure 5 Speedup and Scale-up

10 The Scientific World Journal

applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce

Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]

8 Conclusion

In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs

As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)

References

[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002

[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004

[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006

[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005

[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008

[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983

[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983

[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979

[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012

[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA

[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970

[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004

[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003

[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010

[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012

[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012

The Scientific World Journal 11

[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009

[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011

[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011

[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011

[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012

[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013

[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013

[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010

[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011

[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012

[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013

[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 2: Research Article Efficient and Scalable Graph …downloads.hindawi.com/journals/tswj/2014/749028.pdfResearch Article Efficient and Scalable Graph Similarity Joins in MapReduce YifanChen,

2 The Scientific World Journal

Inspired by [11] this paper investigates graph similarity joinsbased on MapReduce

We firstly propose MGSJoin a MapReduce-based algo-rithm following the filtering-verification framework Itemploys signatures of path-base 119902-grams as keys and thecorresponding graphs as values forming key-value pairsThrough filtering the graph pairs whose common signaturesare less than the threshold given by count filtering conditionare filtered out The remaining pairs constitute the candidateset sent to verification thereafter Due to the potentially largenumber of key-value pairs that may incur large communi-cation cost we incorporate the Bloom filter technique byadding generated signatures to spectral Bloom filters Thiseffectively reduces the number of intermediate key-valuepairs and thus the complexity of network transmissionwhile the filtering capacity is mostly preserved Furthermorewe employmultiway join to improve the verification phase bycondensing two MapReduce rounds into one while devisinga MapReduce-based method to calculate GED

To the best of our knowledge it is among the firstattempts to present a MapReduce-based graph similarity joinalgorithm Our contribution can be summarized as follows

(i) We redesign the current in-memory graph similarityjoin algorithm and adapt it to the MapReduce frame-work The resulting baseline algorithm is capable ofprocessing large-scale graph datasets

(ii) We propose to use Bloom filters to reduce inter-mediate key-value pairs in the filtering phase whilesacrificing little filtering capacity Besides we presenta multiway join optimized verification strategy suchthat the number of required MapReduce rounds isreduced too Moreover a MapReduce-based methodis designed forGED calculation which can handle thecalculation for large graphs

(iii) We implement the proposed algorithm MGSJoin andconduct a wide range of experiments on a real datasetThe results show that both the efficiency and thescalability of MGSJoin are superior to the currentsolutions

This paper is constructed as follows In Section 2 prob-lem definition and background are provided We proposethe basic algorithm in Section 3 integrate Bloom filtersin Section 4 and optimize the verification in Section 5Section 6 lists the experimental results and analyses Relatedworks are described in Section 7 followed by conclusion inSection 8

2 Preliminaries

21 Problem Definition In this paper we focus on simplegraph namely an undirected graph without self-loops ormultiple edges A labeled graph can be represented as aquadruple 119892 = ⟨119881 119864 119897

119881 119897119864⟩ where 119881 is a set of vertices and

119864 sube 119881 times 119881 is a set of edges 119897119881and 119897119864are label functions

that assign labels to vertices and edges respectively 119881(119903)

denotes the vertex set of 119903 and119864(119903) denotes its edge set |119881(119903)|

and |119864(119903)| represent the numbers of vertices and edges in 119903

respectively 119897119881

(119906) denotes the label of 119906 isin 119881 and 119897119864(119890(119906 V))

denotes the label of the edge between 119906 and V 119906 V isin 119881Additionally we use |119892| = |119881| + |119864| to depict the size of graph119892

Definition 1 (graph pair) A graph pair is a tuple denoted by⟨119903 119904⟩ where 119903 isin 119877 119904 isin 119878119877 and 119878 are two sets of graph objectsrespectively

Definition 2 (graph isomorphism) A graph 119903 is isomorphicto another graph 119904 denoted by 119903 = 119904 if there exists a bijection119891 119881(119903) rarr 119881(119904) such that

(1) forall119906 isin 119881(119903)(119891(119906) isin 119881(119904) and 119897119881

(119906) = 119897119881

(119891(119906)))(2) forall119890(119906 V) isin 119864(119903)(119890(119891(119906) 119891(V)) isin 119864(119904) and 119897

119864(119890(119906 V)) =

119897119864(119890(119891(119906) 119891(V))))

Definition 3 (graph edit operation) A graph edit operation isan edit operation to transform one graph to another It can beone of the following six operations

(i) insert an isolated vertex into the graph(ii) delete an isolated vertex from the graph(iii) change the label of a vertex(iv) insert an edge between two disconnected vertices(v) delete an edge from the graph(vi) change the label of an edge

Definition 4 (graph edit distance) The graph edit distance(GED) between graphs 119903 and 119904 denoted by GED(119903 119904) is theminimum number of edit operations that transform 119903 to agraph isomorphic to 119904

Example 5 Figure 1(a) illustrates the molecule namedcyclopropanone while Figure 1(b) shows another moleculewhich does not exist When recording cyclopropanone intodatabase errors may be made and the molecule can becomethe one shown in Figure 1(b) Manual checking is requiredto find the errors which is very difficult Seeing that thetwo molecules are very similar we can adapt the GED tomeasure the similarity so that graph similarity join can beapplied to resolve the problem Take the two molecules asan example First change the bond (1198621 119874) from double tosingle and then transform one of the atoms 119867 bonded with1198623 into 119873 through which a molecule isomorphic to the onein Figure 1(b) is obtained So at least two edit operationsare required namely the graph edit distance Given thethreshold 120591 = 4 the two graphs are regarded similar

Problem 6 (graph similarity join) Given two sets of graphobjects 119877 and 119878 and a distance threshold 120591 as input a graphsimilarity join returns a result set ⟨119903 119904⟩ | 119903 isin 119877 and 119904 isin

119878 and GED(119903 119904) le 120591

22 Count Filtering

Definition 7 (path-based 119902-gram [9]) A path-based 119902-gramin a graph is a simple path of length 119902 ldquoSimplerdquo means thatthere is no repeated vertex in the path

The Scientific World Journal 3

0

C1

C2 C3 H

HH

H

(a) Cyclopropanone

0

C1

C2 C3H

H H

N

(b) ldquoCyclopropanonerdquo

Figure 1 Molecules

The path-based 119902-grams of a graph constitute the graphsignatures denoted by 119904119894119892 = ⟨119881 119864 119897

119881 119897119864⟩ where 119904119894119892 sube 119903|119881| =

119902+1 and |119864| = 119902 Let 119876119903be graph 119903rsquos signature set We say 119904119894119892

and 1199041198941198921015840 are common if 119904119894119892 = 119904119894119892

1015840 Note there can be multiple119902-grams that correspond to one particular signature

Lemma 8 (count filtering [9]) Graphs 119903 and 119904 satisfy thedistance constraints 120591 if the number of common signatures for⟨119903 119904⟩ is no less than 119871119861

119871119861 = max (1003816100381610038161003816119876119903

1003816100381610038161003816 minus 120591 sdot 119863 (119903) 1003816100381610038161003816119876119904

1003816100381610038161003816 minus 120591 sdot 119863 (119904)) (1)

where 119863(119903) (resp 119863(119904)) is the maximum number of affectedsignatures in 119876

119903(resp 119876

119904) when one edit operation is invoked

on 119903 (resp 119904)

The count filtering condition check requires119874(max(119889

119902

119903|119881(119903)| 119889

119902

119904|119881(119904)|)) where 119889

119903(resp 119889

119904) is the

average degree of 119903 (resp 119904) A graph similarity join requirespairwise count filtering condition checks thus resulting ina complexity of 119874(max(119889

119902

119903|119881(119903)| 119889

119902

119904|119881(119904)|)|119877||119878|) This can

be intolerable on large-scale datasets Next we present ascalable solution leveraging the MapReduce paradigm

3 Framework

This section presents a MapReduce-based graph similarityjoin algorithm following the filtering-verification fashionThe outline of the algorithm is listed below

31 Filtering Weallocate twoMapReduce jobs to the filteringphase

Job 1 Job 1 counts the same type of common signatures forgraph pairs We use graphs as values and their correspond-ing 119894119889s as keys to compose input key-value pairs 119902-gramsignatures (denoted by 119904119894119892) are generated in the Map taskWe use the generated signatures as keys and 119894119889s of graphs asvalues to form the output key-value pairs As a consequencein the Reduce task we can obtain graphs with commonsignatures For the same signature a graph 119894119889 may appear

more than once since there may exist several 119902-grams in agraph corresponding to an identical signature The functionfor Map task is shown as follows

Map⟨119903119894119889 119903⟩⟨119904119894119889 119904⟩ rarr ⟨119904119894119892 119903119894119889119904119894119889⟩

(1) input ⟨119903119894119889 119903⟩ or ⟨119904119894119889 119904⟩(2) generate 119902-gram signatures for each input

graph(3) emit ⟨119904119894119892 119903119894119889⟩ or ⟨119904119894119892 119904119894119889⟩ for each signature

with the 119894119889 of its corresponding graph

The Reducer gets ⟨119904119894119892 119897119894119904119905(119903119894119889119904119894119889)⟩ in the Reduce taskWedefine119891

119903(resp119891

119904) to denote the occurrences of 119903119894119889 (resp

119904119894119889) in the 119897119894119904119905(V119886119897119906119890) of Reduce input and 119898119891to denote the

occurrences of graph pairs that share the input key (119904119894119892) ofReduce 119898

119891= min(119891

119903 119891119904) 119898119891is calculated and output with

the corresponding graph pair The function of Reduce task isas follows

Reduce⟨119904119894119892 119897119894119904119905(119903119894119889119904119894119889)⟩ rarr ⟨119903119894119889 (119904119894119889 119898119891)⟩

(1) get list of values consisting of 119903119894119889 and 119904119894119889 withthe specific key (119904119894119892)

(2) split the list of values into list of 119903119894119889 and list of119904119894119889

(3) for each pair ⟨119903119894119889 119904119894119889⟩ where 119903119894119889 is from119897119894119904119905(119903119894119889) and 119904119894119889 is from 119897119894119904119905(119904119894119889) calculate 119898

119891

and output ⟨119903119894119889 (119904119894119889 119898119891)⟩

Job 2 Job 2 counts the total common signatures for graphpairs and checks the count filtering condition after which theset of candidate pairs is obtained The Map function is listedas follows

Map⟨119903119894119889 (119904119894119889 119898119891)⟩ rarr ⟨119903119894119889 (119904119894119889 119898

119891)⟩

(1) read a ⟨119903119894119889 (119904119894119889 119898119891)⟩ into Map task

(2) emit ⟨119903119894119889 (119904119894119889 119898119891)⟩ which is exactly the same

as it reads

4 The Scientific World Journal

We input ⟨119903119894119889 (119904119894119889 119898119891)⟩ to theMap task and then output

exactly the same key-value pair Therefore the Reduce taskreceives all the graph pairs generated in job 1 with the specific119904119894119889 As the graph pair may have more than one type ofcommon signatures 119898

119891is summed with the same graph pair

for each identical signature respectively Subsequently thegraph pairs with less than 119871119861 (1) common signatures will bediscarded The remaining graph pairs are candidate pairs forverification The function for Reduce is as follows

Reduce⟨119903119894119889 119897119894119904119905(119904119894119889 + 119898119891)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(1) receive output from Mappers the specific 119903119894119889and a list of 119904119894119889 with 119898

119891

(2) sum 119898119891and calculate the number of common

signatures for each pair ⟨119903119894119889 119904119894119889⟩(3) conduct the count filtering for each pair and

output pairs to DFS whose common signaturesare more than 119871119861

32 Verification In the verification phase candidate pairs⟨119903119894119889 119904119894119889⟩ are to be verified where the graphs 119903 and 119904 arerequired that is join operations are necessary to retrievegraphs 119903 and 119904 by their 119894119889rsquos Hence we allocate two MapRe-duce jobs to join (119903119894119889 119903) (119903119894119889 119904119894119889) and (119904119894119889 119904)

Job 1 Job 1 replaces 119904119894119889 with graph 119904 The map function takesgraph set 119878 and ⟨119903119894119889 119904119894119889⟩ as input and emits ⟨119903119894119889 119904⟩ which islisted as follows

Map⟨119903119894119889 119904119894119889⟩⟨119904119894119889 119904⟩ rarr ⟨119904119894119889 119903119894119889119904⟩

(1) input candidate pair ⟨119903119894119889 119904119894119889⟩ and graph set 119878(2) for ⟨119903119894119889 119904119894119889⟩ emit ⟨119904119894119889 119903119894119889⟩ and for ⟨119904119894119889 119904⟩

emit it exactly both of which take 119904119894119889 as the key

TheReduce task gathers the list 119903119894119889 and graph 119904 for the key119904119894119889 and then outputs the key-value pair ⟨119903119894119889 119904⟩ The functionfor Reducer is as follows

Reduce⟨119904119894119889 119897119894119904119905(119903119894119889119904)⟩ rarr ⟨119903119894119889 119904⟩

(1) receive a list of 119903119894119889 and graph set 119878 with thespecific key 119904119894119889

(2) for ⟨119903119894119889 119904119894119889⟩ replace 119904119894119889 with 119904 and output pair⟨119903119894119889 119904⟩

Job 2 Job 2 replaces 119903119894119889 with graph 119903 invokes label filteringconditions and calculates GED to find the similar graphpairs

The function for Map task is as follows

Map⟨119903119894119889 119904⟩⟨119903119894119889 119903⟩ rarr ⟨119903119894119889 119903119904⟩

(1) input the candidate pair ⟨119903119894119889 119904⟩ and graph set119877(2) emit the key-value pair it reads where the group

key is 119903119894119889 for both cases

The function for Reduce task is as follows

Reduce⟨119903119894119889 119897119894119904119905(119903119904)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(1) receive a list of values that consisted of graphs 119903

and 119904 corresponding to the key 119903119894119889(2) replace 119903119894119889 with graph 119903(3) calculate GED for pairs ⟨119903 119904⟩ and output similar

pairs

33 Correctness and Complexity Analysis All graph pairs thatsatisfy the edit distance constraints are returned error-freewhich justifies its correctness

For algorithm complexity we take all three phasesmdashMap Reduce and Shufflemdashinto consideration IO readingoverhead from distributed file system (DFS) is considered forMap task whereas IO writing overhead into DFS is analyzedfor Reduce task Both tasks also take time complexity intoconsideration Shuffle considers the network communicationcost

Some parameters are defined preceding the analysis |119892|

denotes the average size of a graph 120572 means candidate ratiothat is the percentage of candidate pairs from all graph pairsand 120573 means the ratio of similar pairs from pairs that passedcount filtering We assume that the size of key-value pair thatcontains only graph IDs is 1

In job 1 of filter phase Map reads the data of graph sets119877 and 119878 from DFS the cost of which is 119874(|119892| sdot (|119877| + |119878|))Consider the worst case in Map task where 119902-grams aregenerated for each graph 119903(119904) and the number of signaturesgenerated is |119881|

119902

so the time complexity for generating119902-grams can be estimated as 119874(|119881|

119902

) Therefore the timecomplexity for Map task is 119874(|119881|

119902

sdot (|119877| + |119878|)) As eachgenerated 119902-gram signature forms an output key-value pair(the size for the pair is 1) the communication cost for Shuffleis119874(|119881|

119902

sdot(|119877|+|119878|)) In theReduce task all graphs representedby 119894119889 containing the same signature are acquired and pairedThus the IO cost is 119874(|119877||119878|) Then the occurrences of119903119894119889 and 119904119894119889 are counted In other words all the key-valuepairs generated from the Map task are counted so the timecomplexity for Reduce is 119874(|119881|

119902

sdot (|119877| + |119878|))Consider job 2 in filter phase where all the output key-

value pairs from job 1 are read intoMap task of job 2 so the IOcost is 119874(|119877||119878|) Map outputs the key-value pairs it reads sothe time complexity forMap task and communication cost forShuffle are119874(|119877||119878|) In Reduce task all pairs are traversed sothe time complexity is119874(|119877||119878|) As we have 120572|119877||119878| candidatekey-value pairs the IO cost for Reduce task is 119874(120572|119877||119878|)

In verification phase of job 1 Map reads the candidatepairs represented by ⟨119903119894119889 119904119894119889⟩ and graph set 119878 from DFSwhere the IO overhead requires 119874(120572|119877||119878| + |119892||119878|) The timecomplexity of Map task and communication cost of Shuffleare also 119874(120572|119877||119878| + |119892||119878|) because in Map function we justemit what has been exactly input into Then in Reduce taskwe replace 119904119894119889with graph 119904 the time complexity is119874(120572|119877||119878|)and the IO overhead is 119874(120572|119892||119877||119878|)

In verification phase of job 2 Map reads the candidatepairs represented by ⟨119903119894119889 119904⟩ and graph set 119877 from DFSwhere the IO overhead requires 119874(|119892||119877|(120572|119878| + 1)) The

The Scientific World Journal 5

Input graph object sets 119877 and 119878 GED threshold 120591

Output similar graph pairs ⟨119903 119904⟩

(1) Filter(2) Job 1 count the same type of common signatures for graph pairs(3) Job 2 count the total common signatures and check the count filtering fir graph pairs(4) Verification(5) Job 1 replace sid with graph 119904(6) Job 2 replace rid with graph 119903 and calculate GED for candidate pairs

Algorithm 1MGSJoin

(1)Map ⟨119903119894119889 119903⟩ ⟨119904119894119889 119904⟩ rarr ⟨119903119894119889 119904119887119891119903⟩ ⟨119904119894119889 119904119887119891

119904⟩

(2) create SBF for each graph and insert the generated 119902-gram signatures into it(3) output ⟨119903119894119889 119904119887119891

119903⟩ or ⟨119904119894119889 119904119887119891

119904⟩

(4) Shuffle conduct Cartesian product between pairs ⟨119903119894119889 119904119887119891119903⟩ and pairs ⟨119904119894119889 119904119887119891

119904⟩

(5) Reduce ⟨⟨119903119894119889 119904119887119891119903⟩ 119897119894119904119905 (⟨119904119894119889 119904119887119891

119904⟩)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(6) for any ⟨119903119894119889 119904119894119889⟩ calculate the intersection of 119904119887119891119903and 119904119887119891

119904and estimate

the number of common signatures(7) invoke the count filtering for each pair ⟨119903119894119889 119904119894119889⟩ and output it if it passes the condition

Algorithm 2 Replacement of filter of Algorithm 1

time complexity of Map task and communication cost ofShuffle are also 119874(|119892||119877|(120572|119878| + 1)) for the same reason aboveThen in Reduce task we first replace 119903119894119889 with graph 119903 andthen calculate GED for graph pairs The GED calculation by

theAlowast-based algorithm requires 119874(|119881||119881|

) In the worst case

the time complexity of Reduce is 119874(120572|119877||119878||119881||119881|

) Finallysimilar graph pairs are emitted into DFS which requires119874(120572120573|119877||119878|)

4 Incorporating Bloom Filters

In the filtering phase of Algorithm 1 twoMapReduce jobs arerequired with many intermediate key-value pairs generatedand transmitted These increase the IO and communicationcost which can be fairly time-consuming This sectionintroduces the Bloom filter technique to reduce such costNext we first recall the concept of spectral Bloom filters

41 Spectral Bloom Filter Bloom filters [12] are space efficientdata structures which allow fast membership queries over agiven set A Bloom filter uses 119896 hash functions ℎ

1 ℎ2 ℎ

119896

to hash elements into an array of size 119898 For an element 119890 inthe set the bit at positions ℎ

1(119890) ℎ2(119890) ℎ

119896(119890) in the array is

set to 1 Given a query item 119902 we check its membership in theset by examining the bits at positions ℎ

1(119902) ℎ2(119902) ℎ

119896(119902)

of the array The item 119902 is reported to be contained in theset if (and only if) all the aforementioned bits are 1 Thismethod brings a small probability of false-positive that isit may return a positive result for an item which actually isnot contained in the set but no false-negative while gainingsubstantial space savings [13]

Spectral Bloom filter (SBF) [14] generalized the basicBloom filter to be able to record the element frequency

which is thus adopted in this paper SBF is represented by⟨119860 119891⟩ where 119860 is a set and 119891 is a map from 119860 to naturalnumbers that is 119891 119860 rarr 119873 where 119873 is the universe ofnatural numbers SBF replaces the bit vector with a vectorof 119898 counters For insertion of item 119890 the counters 119862 =

119862ℎ1(119890)

119862ℎ2(119890)

119862ℎ119896(119890)

are increased by 1 for insertion anddecreased by 1 for deletion Let 119891(119902) denote the frequency of119902 A basic query for SBF on an item 119902 returns an estimationon 119891(119902) that is 119891(119902) = min

119894isin1119896119862ℎ119894(119902)

Note that similarto Bloom filters SBF also never underestimate 119891(119902)

42 Algorithm Incorporating SBF not only reduces the num-ber of key-value pairs but also contracts the two MapReducejobs into one in the filtering phase In particular the Maptask takes graph sets 119877 and 119878 as input Then we createa SBF for each graph by adding the 119902-gram signaturesCartesian product is conducted for the output key-valuepairs ⟨119903119894119889 119904119887119891

119903⟩ and ⟨119904119894119889 119904119887119891

119904⟩ In the Reduce task for

each graph pair their SBFs are intersected to estimate thenumber of common signatures By intersecting two SBFswithcounters 119862

(1) and 119862(2) respectively it returns another SBF

with counters119862lowast119862lowast119894

= min119862(1)

119894 119862(2)

119894 119894 isin 1 119898 Hence

the number of common signatures could be estimated bylfloor(1119896) sum

119898

119894=1119862lowast

119894rfloor Subsequently the graph pairs which have

less than 119871119861 common signatures given by count filteringcondition will be discarded and the remaining pairs formthe candidate set

We provide the pseudocode of the aforementioned pro-cess in Algorithm 2

43 Correctness and Complexity Analysis There is smallprobability that the false positive case happens with SBFSpecifically a query for an item 119902 in SBF on 119891(119902) 119891(119902)

6 The Scientific World Journal

Table 1 Complexity analysis of filtering

Phase Job 1 Job 2 +SBF

Map IO 119874 (1003816100381610038161003816119892

1003816100381610038161003816 sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 sdot (|119877| + |119878|))

Time 119874 ((|119881|119902

sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119896|119881|119902

sdot (|119877| + |119878|))

Shuffle 119874 ((|119881|119902

sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119898 |119877| |119878|)

Reduce IO 119874(|119877||119878|) 119874 (120572 |119877| |119878|) 119874 (120572 |119877| |119878|)

Time 119874 ((|119881|119902

sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (|119877| |119878|)

may be larger than 119891(119902) Therefore the number of commonsignatures estimated this way may be larger than the actualvalue Nonetheless false-negative will never happen whichensures the correctness of the algorithm in a certain case thepruning power of count filtering will be impaired Besidesfalse-positive will be less likely to happen if one carefullychooses the hash functions and configures the sizes ofcounters

Then we analyze the complexity Let 119898 be the size ofa SBF The Map task reads the entire sets 119877 and 119878 so itsIO cost is 119874(|119892|(|119877| + |119878|)) Then signatures are generatedand added to SBF and 119896 hash values are calculated for eachsignatureThus the time complexity forMap is119874(119896|119881|

119902

(|119877|+

|119878|)) Map emits the SBF for each input graph and thenCartesian product is conductedThe communication cost forShuffle is119874(119898

2

|119877||119878|) In the Reduce task for each graph pairthe number of common signatures is calculated and countfiltering condition is checked Thus the time complexity is119874(|119877||119878|) Regardless of false-positive cases the IO cost is thesame as before Denoting the improved algorithm by ldquo+SBFrdquowe summarize the complexity results in Table 1

5 Optimizing Verification Phase

In verification phase we need to calculate the GED ofcandidate pairs Nevertheless it is not capable of finishingthe calculation with large graphs and threshold 120591 Thereforewe devise a MapReduce-based method for GED calculationwhich is able of handling large-scale graphs Besides joinoperations are required preceding the GED calculation to getthe entire graphThus three relations 119877(119903119894119889 119903)⋈119862(119903119894119889 119904119894119889)⋈

119878(119904119894119889 119904) are joined to obtain the input of the GED algorithmwhere 119862(119903119894119889 119904119894119889) is the output of the filtering phase Inspiredby the idea of multiway join we can reduce the number ofrequired MapReduce jobs from two to one

51 MapReduce for GED Calculation TheGED calculation isbased onAlowast algorithmAlowast constructs a search-tree the nodeof which represents a mapping status A mapping status isstored in an array (denoted by 119909) the index of which standsfor different vertices in graph 119903 and the corresponding valuestands for the vertices of graph 119904Alowast explores the space of allpossible vertex mappings between two graphs in a best-firstsearch fashion with function (denoted by 119891(119909)) establishedto determine the order in which the search visits vertexmappings 119891(119909) is a sum of two functions (1) the distancefrom the initial state to the current state (denoted by 119892(119909))

(2) a heuristic estimate of the distance from the current stateto the goal (denoted by ℎ(119909)) 119892(119909) and ℎ(119909) are calculated bythe following equations

119892 (119909) = GED (119903119901 119904119901)

ℎ (119909) = Γ (119871119881

(119903119901) 119871119881

(119904119902)) + Γ (119871

119864(119903119902) 119871119864

(119904119902))

(2)

119903119901consists of the vertices that have been mapped and edges

connecting them while 119903119902consists of the vertices unmapped

yet as well as their resident edges The equation for ℎ(119909)

represents the label difference of two graphsThe search space forAlowast-based approach is very large

requiring 119874(|119881||119881|

) In order to boost up the searchingprocedure parallelization is a common way to think aboutThe naive way is to allocate different branches of search-tree to different workers so that the searching procedure canproceed in parallel However the load is not balanced this wayso that the final runtime is determined on the worker withthe heaviest load As a consequence we devise aMapReduce-based method to calculate GED denoted by MRGED whichreallocates the works after each MapReduce round

The format of the key-value pairs to be manipulated byMapReduce is ⟨119909 119891(119909)⟩ The procedure of searching for theresult is through iterations that each iterationwalks down onelayer of the search-tree

52 Multiway Join In relational database a multiway joincan process 119879

1(119860 119861) ⋈ 119879

2(119861 119862) ⋈ 119879

3(119862 119863) together in one

round where 119879119894(119883 119884) is a relational table with attributes

119883 and 119884 (Algorithm 3) Following the same idea we canconsolidate the twoMapReduce jobs required for verificationin Algorithm 1 Specifically let ℎ be a hash function withrange 1 2 119899 where 119899

2 is the number of Reducers Weassociate eachReduce taskwith a pair (119894 119895) where 119894 119895 isin [1 119899]Each tuple 119905

2(119887 119888) isin 119879

2(119861 119862) is sent to theReducer numbered

(ℎ(119887) ℎ(119888)) while each tuple in1198791(119886 119887) (resp119879

3(119888 119889)) is sent

to Reducers numbered (ℎ(119887) 119909) (resp (119910 ℎ(119888))) for any 119909

(resp 119910) Each Reduce task computes the join of the tuplesit receives It is shown that multiway join is more efficient inpractice than two simple joins [15]

We encapsulate the improved verification procedure inAlgorithm 4

53 Correctness and Complexity Analysis One may immedi-ately verify that Algorithm 4 correctly conducts the verifica-tion

The Scientific World Journal 7

Input 119903 119904 are graphs from candidate pair ⟨119903 119904⟩

Output Graph edit distance of 119903 and 119904

(1) construct the search tree with the root node ⟨0 0 0 0⟩(2) read nodes into Mapper if the node read in is fully mapped namely 119903

119902is empty go to line (8)

(3)Map ⟨119909 119891(119909)⟩ rarr ⟨119909 119891(119909)⟩

(4) calculate 119891(119909) for node 119909 and check whether 119891(119909) gt 120591 if it is continue(5) create new nodes by mapping a vertex V

1isin 119903119902(119909) to a vertex V

2isin 119904119902(119909) for any V

1and V

2

(6) output all the newly created nodes(7) back to line (2)(8) gather the nodes calculate the minimum 119891(119909) and return it

Algorithm 3 MRGED (119903 119904)

(1)Map ⟨119903119894119889 119903⟩ ⟨119903119894119889 119904119894119889⟩ ⟨119904119894119889 119904⟩

(2) if ⟨119903119894119889 119903⟩ then emit(⟨(ℎ(119903119894119889) 119909) ⟨119903119894119889 119903⟩⟩) for any 119909 if ⟨119903119894119889 119904119894119889⟩ thenemit(⟨(ℎ (119903119894119889) ℎ (119904119894119889)) ⟨119903119894119889 119904⟩⟩) for any 119910 else emit(⟨(119910 ℎ(119904119894119889)) ⟨119904119894119889 119904⟩⟩)

(3) Reduce ⟨(ℎ(119903119894119889) ℎ(119904119894119889)) 119897119894119904119905(V119886119897119906119890)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(4) split 119897119894119904119905(V119886119897119906119890) into 119897119894119904119905 (⟨119903119894119889 119903⟩) 119897119894119904119905 (⟨119904119894119889 119904⟩) and 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩)(5) denote 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩) as joinkey(6) for ⟨119903119894119889 119904119894119889⟩ isin 119895119900119894119899119896119890119910119904 119908ℎ119890119903119890 ⟨119903119894119889 119903⟩ isin 119897119894119904119905 (⟨119903119894119889 119903⟩) 119886119899119889 ⟨119904119894119889 119904⟩ isin 119897119894119904119905 (⟨119904119894119889 119904⟩) do(7) if 119892119903119886119901ℎ119904 119903 119886119899119889 119904 119894119899 ⟨119903 119904⟩ 119886119903119890 V119890119903119910 119897119886119903119892119890 then invoke MRGED

calculation else invoke the Alowast-based in-memory algorithm

Algorithm 4 Replacement of verification of Algorithm 1

In the multiway join based verification phase the Maptask takes graph sets 119877 and 119878 and the candidate pairs repre-sented by their 119894119889s as input The input IO cost is 119874(|119892|(|119877| +

|119878|) + 120572|119877||119878|) The key-value pair ⟨119903119894119889 119903⟩ (resp ⟨119904119894119889 119904⟩) issent to 119899 Reduce tasks numbered (119903119894119889 119909) (resp (119910 119904119894119889)) forany119909 (resp119910) whereas the key-value pair ⟨119903119894119889 119904119894119889⟩ is sent tothe only Reduce task numbered (ℎ(119903119894119889) ℎ(119904119894119889)) As a resultthe communication cost for Shuffle is 119874(119899|119892|(|119877| + |119878|) +

120572|119877||119878|) where 1198992 is the number of Reducers In the Reduce

task all candidate graph pairs go through edit distancecomputation Pairs with larger size go through MRGEDwhile pairs with smaller size go throughAlowast For simplicitythe complexity of GED calculation is regarded the same asthe baseline algorithm Labelling the resulting algorithmwithldquo+MJrdquo we summarize the complexity results in Table 2

6 Experiments

61 Experiment Setup We conducted experiments on severalpublicly available real datasets but only present the resultson Pubchem (httppubchemncbinlmnihgov) due to theinterest of space The dataset is constructed by sampling1000000 graphs from Pubchem

Amazon cloud services were used as our experimentplatform Specifically we used Elastic Compute Cloud (EC2)in which the computing nodes are called instances In theexperiment 31 instances were used by defaultmdashone set asmaster node and others as worker nodes The standardconfiguration of all EC2 instances is m1small one CPUof single core with 17 GB memory running Hadoop 112(Table 3)

62 Evaluating Filters In order to evaluate the effectivenessof our filtering techniques we use the term ldquoBasicrdquo for thebaseline algorithm for processing graph similarity joins basedonMapReduce ldquo+SBFrdquo denotes filtering improved algorithmof Basic by incorporating SBF

The algorithm efficiency has been studied and shownin Figure 2(a) However the pruning power is somewhatimpaired so we conducted the experiment to record theincrease of candidate pairs through ldquo+SBFrdquo (cf Figure 2(b))It can be revealed that ldquo+SBFrdquo outweigh ldquoBasicrdquo in efficiencyby sacrificing little pruning power When 120591 = 5 less than 300more candidate pairs are generatedwhile about 5000 secondsare reserved

63 Evaluating Verification The verification was evaluatedwith candidate pairs generated by Basic Term ldquo+MJrdquodenotes applying multiway join in verification whileldquo+MRGEDrdquo denotes adapting alternative MapReduce-basedGED calculation We use term ldquoMGSJoinrdquo to indicate thebasic algorithm incorporating both techniques Figure 3(a)shows the runtime comparison between +MRGED

and MGSJoin The result illustrates the superiority ofapplying multiway join When 120591 = 5 algorithm withmultiway join is about 6000 seconds faster than withordinary joins Figure 3(b) presents the result of evaluatingMRGED where Basic and +MRGED are compared It canbe observed that the runtime of both algorithms growsexponentially When 120591 equals 1 the Basic finishes quickerthan +MRGED (162 s and 204 s resp) while 120591 equals 2 the+MRGED outweighs Basic (the runtime is 712 s and 580 sresp) This is because the calculation required for 120591 = 1

8 The Scientific World Journal

Table 2 Complexity analysis of verification

Phase Job 1 Job 2 +MJ

Map IO 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)

Time 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (119899 (|119877| + |119878|) + 120572 |119877| |119878|)

Shuffle 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1198991003816100381610038161003816119892

1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)

Reduce IO 119874 (1205721003816100381610038161003816119892

1003816100381610038161003816 sdot |119877| |119878|) 119874 (120572120573 |119877| |119878|) 119874 (120572120573 |119877| |119878|)

Time 119874 (120572 |119877| |119878|) 119874 (120572|119877||119878|119881||119881|

) 119874 (120572|119877||119878|119881||119881|

)

0 1 2 3 4 5

Elap

sed

time (

s)

GED threshold (120591)

Basic

+SBF

15k

12k

9k

6k

3k

(a)

0

50

100

150

200

250

300

1 2 3 4 5GED threshold (120591)

Num

ber o

f Δ (c

andi

date

pai

rs)

(b)

Figure 2 Evaluating filtering

Table 3 Dataset statistics

Dataset |119877| |119881| |119864| Disk size (GB)Enamine 1000000 5232 5037 07

is small where the MRGED is clumsy compared withAlowastwhereas when 120591 = 2 more calculation is required so that theadvantage of MRGED comes out When 120591 is within the rangeof 3ndash5 Basic is unable to finish because the large calculationdrives Basic out of memory

64 Comparing with State-of-the-Art Method We comparedour algorithm with the state-of-the-art method GSimJoinIn Figure 4(a) we chose 10000 graphs in order to comparewith GSimJoin and the result witnesses the obvious supe-riority of MGSJoin over GSimJoin Figure 4(b) is drawn inlog scale which varies the number of graphs and records theelapsed time Both algorithms grow linear in the figure whichreflect their exponential growth MGSJoin rises mush slowerthan GSimJoin The runtime of GSimJoin is about 10 timeslonger than MGSJoinwhen joining 100 graphs and 100 timeslonger than MGSJoinwhen joining 10000 graphsMoreoverwhen we have to join 100000 graphs GSimJoin is runningout of memory so that no result is recorded

65 Speedup We evaluated the speedup of our algorithm byvarying the number of instances from 10 to 50 The exper-imental results are shown in Figure 5(a) We can see that

with the increase of instances in the cluster the performanceof MGSJoin significantly improved The improvement issignificantly shown when threshold 120591 equals 4 With moreinstances running the count filtering is getting faster bycounting for the common signatures simultaneously andthe verification is getting quicker by joining relations andcalculating GED in parallel

66 Scale-Up We evaluated the scale-up of our algorithmby increasing both dataset sizes and number of nodes in thecluster The result is shown in Figure 5(b) It is worth notingthat as the dataset increased the results for different values of120591 get similar in the trend of increase All lines rise smoothlywhich reveals good scalability of MGSJoin

7 Related Work

Graph SimilarityQueries Similarity joins retrieve similar dataobject pairs which can be strings sets trees and graphs [16]As to GED-based graph similarity search [17] proposed 120581-AT a tree-based 119902-gram approach However it is associatedwith the drawback of usually loose lower bound for countfiltering Seeing the drawback [9] presented a path-based 119902-gram approach In comparison with 120581-AT GSimJoin is moreefficient by leveraging more advanced filtering techniquesThus we adopt the path-based 119902-gram approach in this paper

MapReduce-Based Graph Algorithms MapReduce is a dis-tributed programming framework [10] which has been

The Scientific World Journal 9

Elap

sed

time (

s)

1 2 3 4 5GED threshold (120591)

16k

12k

8k

4k

MGSJoin

+MRGED

(a)

Elap

sed

time (

s)

Basic

+MR

105

104

103

102

1 2 3 4 5GED threshold (120591)

(b)

Figure 3 Evaluating verification

0

Elap

sed

time (

s)

GSimJoin

MGSJoin

1 2 3 4 5GED threshold (120591)

5k

4k

3k

2k

1k

(a)

Elap

sed

time (

s)

Number of graphs

GSimJoin

MGSJoin

105

104

103

102

102

101

103

104

106

105

(b)

Figure 4 Comparing with state-of-the-art method

2k

4k

6k

8k

10k

504540353025201510

Elap

sed

time

Number of instances

120591 = 1

120591 = 2

120591 = 3

120591 = 4

(a)

010 20 30 40 50

Elap

sed

time (

s)

Number of instances and number of graphs (times2000)

5k

4k

3k

2k

1k

(b)

Figure 5 Speedup and Scale-up

10 The Scientific World Journal

applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce

Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]

8 Conclusion

In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs

As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)

References

[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002

[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004

[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006

[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005

[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008

[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983

[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983

[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979

[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012

[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA

[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970

[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004

[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003

[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010

[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012

[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012

The Scientific World Journal 11

[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009

[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011

[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011

[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011

[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012

[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013

[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013

[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010

[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011

[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012

[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013

[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 3: Research Article Efficient and Scalable Graph …downloads.hindawi.com/journals/tswj/2014/749028.pdfResearch Article Efficient and Scalable Graph Similarity Joins in MapReduce YifanChen,

The Scientific World Journal 3

0

C1

C2 C3 H

HH

H

(a) Cyclopropanone

0

C1

C2 C3H

H H

N

(b) ldquoCyclopropanonerdquo

Figure 1 Molecules

The path-based 119902-grams of a graph constitute the graphsignatures denoted by 119904119894119892 = ⟨119881 119864 119897

119881 119897119864⟩ where 119904119894119892 sube 119903|119881| =

119902+1 and |119864| = 119902 Let 119876119903be graph 119903rsquos signature set We say 119904119894119892

and 1199041198941198921015840 are common if 119904119894119892 = 119904119894119892

1015840 Note there can be multiple119902-grams that correspond to one particular signature

Lemma 8 (count filtering [9]) Graphs 119903 and 119904 satisfy thedistance constraints 120591 if the number of common signatures for⟨119903 119904⟩ is no less than 119871119861

119871119861 = max (1003816100381610038161003816119876119903

1003816100381610038161003816 minus 120591 sdot 119863 (119903) 1003816100381610038161003816119876119904

1003816100381610038161003816 minus 120591 sdot 119863 (119904)) (1)

where 119863(119903) (resp 119863(119904)) is the maximum number of affectedsignatures in 119876

119903(resp 119876

119904) when one edit operation is invoked

on 119903 (resp 119904)

The count filtering condition check requires119874(max(119889

119902

119903|119881(119903)| 119889

119902

119904|119881(119904)|)) where 119889

119903(resp 119889

119904) is the

average degree of 119903 (resp 119904) A graph similarity join requirespairwise count filtering condition checks thus resulting ina complexity of 119874(max(119889

119902

119903|119881(119903)| 119889

119902

119904|119881(119904)|)|119877||119878|) This can

be intolerable on large-scale datasets Next we present ascalable solution leveraging the MapReduce paradigm

3 Framework

This section presents a MapReduce-based graph similarityjoin algorithm following the filtering-verification fashionThe outline of the algorithm is listed below

31 Filtering Weallocate twoMapReduce jobs to the filteringphase

Job 1 Job 1 counts the same type of common signatures forgraph pairs We use graphs as values and their correspond-ing 119894119889s as keys to compose input key-value pairs 119902-gramsignatures (denoted by 119904119894119892) are generated in the Map taskWe use the generated signatures as keys and 119894119889s of graphs asvalues to form the output key-value pairs As a consequencein the Reduce task we can obtain graphs with commonsignatures For the same signature a graph 119894119889 may appear

more than once since there may exist several 119902-grams in agraph corresponding to an identical signature The functionfor Map task is shown as follows

Map⟨119903119894119889 119903⟩⟨119904119894119889 119904⟩ rarr ⟨119904119894119892 119903119894119889119904119894119889⟩

(1) input ⟨119903119894119889 119903⟩ or ⟨119904119894119889 119904⟩(2) generate 119902-gram signatures for each input

graph(3) emit ⟨119904119894119892 119903119894119889⟩ or ⟨119904119894119892 119904119894119889⟩ for each signature

with the 119894119889 of its corresponding graph

The Reducer gets ⟨119904119894119892 119897119894119904119905(119903119894119889119904119894119889)⟩ in the Reduce taskWedefine119891

119903(resp119891

119904) to denote the occurrences of 119903119894119889 (resp

119904119894119889) in the 119897119894119904119905(V119886119897119906119890) of Reduce input and 119898119891to denote the

occurrences of graph pairs that share the input key (119904119894119892) ofReduce 119898

119891= min(119891

119903 119891119904) 119898119891is calculated and output with

the corresponding graph pair The function of Reduce task isas follows

Reduce⟨119904119894119892 119897119894119904119905(119903119894119889119904119894119889)⟩ rarr ⟨119903119894119889 (119904119894119889 119898119891)⟩

(1) get list of values consisting of 119903119894119889 and 119904119894119889 withthe specific key (119904119894119892)

(2) split the list of values into list of 119903119894119889 and list of119904119894119889

(3) for each pair ⟨119903119894119889 119904119894119889⟩ where 119903119894119889 is from119897119894119904119905(119903119894119889) and 119904119894119889 is from 119897119894119904119905(119904119894119889) calculate 119898

119891

and output ⟨119903119894119889 (119904119894119889 119898119891)⟩

Job 2 Job 2 counts the total common signatures for graphpairs and checks the count filtering condition after which theset of candidate pairs is obtained The Map function is listedas follows

Map⟨119903119894119889 (119904119894119889 119898119891)⟩ rarr ⟨119903119894119889 (119904119894119889 119898

119891)⟩

(1) read a ⟨119903119894119889 (119904119894119889 119898119891)⟩ into Map task

(2) emit ⟨119903119894119889 (119904119894119889 119898119891)⟩ which is exactly the same

as it reads

4 The Scientific World Journal

We input ⟨119903119894119889 (119904119894119889 119898119891)⟩ to theMap task and then output

exactly the same key-value pair Therefore the Reduce taskreceives all the graph pairs generated in job 1 with the specific119904119894119889 As the graph pair may have more than one type ofcommon signatures 119898

119891is summed with the same graph pair

for each identical signature respectively Subsequently thegraph pairs with less than 119871119861 (1) common signatures will bediscarded The remaining graph pairs are candidate pairs forverification The function for Reduce is as follows

Reduce⟨119903119894119889 119897119894119904119905(119904119894119889 + 119898119891)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(1) receive output from Mappers the specific 119903119894119889and a list of 119904119894119889 with 119898

119891

(2) sum 119898119891and calculate the number of common

signatures for each pair ⟨119903119894119889 119904119894119889⟩(3) conduct the count filtering for each pair and

output pairs to DFS whose common signaturesare more than 119871119861

32 Verification In the verification phase candidate pairs⟨119903119894119889 119904119894119889⟩ are to be verified where the graphs 119903 and 119904 arerequired that is join operations are necessary to retrievegraphs 119903 and 119904 by their 119894119889rsquos Hence we allocate two MapRe-duce jobs to join (119903119894119889 119903) (119903119894119889 119904119894119889) and (119904119894119889 119904)

Job 1 Job 1 replaces 119904119894119889 with graph 119904 The map function takesgraph set 119878 and ⟨119903119894119889 119904119894119889⟩ as input and emits ⟨119903119894119889 119904⟩ which islisted as follows

Map⟨119903119894119889 119904119894119889⟩⟨119904119894119889 119904⟩ rarr ⟨119904119894119889 119903119894119889119904⟩

(1) input candidate pair ⟨119903119894119889 119904119894119889⟩ and graph set 119878(2) for ⟨119903119894119889 119904119894119889⟩ emit ⟨119904119894119889 119903119894119889⟩ and for ⟨119904119894119889 119904⟩

emit it exactly both of which take 119904119894119889 as the key

TheReduce task gathers the list 119903119894119889 and graph 119904 for the key119904119894119889 and then outputs the key-value pair ⟨119903119894119889 119904⟩ The functionfor Reducer is as follows

Reduce⟨119904119894119889 119897119894119904119905(119903119894119889119904)⟩ rarr ⟨119903119894119889 119904⟩

(1) receive a list of 119903119894119889 and graph set 119878 with thespecific key 119904119894119889

(2) for ⟨119903119894119889 119904119894119889⟩ replace 119904119894119889 with 119904 and output pair⟨119903119894119889 119904⟩

Job 2 Job 2 replaces 119903119894119889 with graph 119903 invokes label filteringconditions and calculates GED to find the similar graphpairs

The function for Map task is as follows

Map⟨119903119894119889 119904⟩⟨119903119894119889 119903⟩ rarr ⟨119903119894119889 119903119904⟩

(1) input the candidate pair ⟨119903119894119889 119904⟩ and graph set119877(2) emit the key-value pair it reads where the group

key is 119903119894119889 for both cases

The function for Reduce task is as follows

Reduce⟨119903119894119889 119897119894119904119905(119903119904)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(1) receive a list of values that consisted of graphs 119903

and 119904 corresponding to the key 119903119894119889(2) replace 119903119894119889 with graph 119903(3) calculate GED for pairs ⟨119903 119904⟩ and output similar

pairs

33 Correctness and Complexity Analysis All graph pairs thatsatisfy the edit distance constraints are returned error-freewhich justifies its correctness

For algorithm complexity we take all three phasesmdashMap Reduce and Shufflemdashinto consideration IO readingoverhead from distributed file system (DFS) is considered forMap task whereas IO writing overhead into DFS is analyzedfor Reduce task Both tasks also take time complexity intoconsideration Shuffle considers the network communicationcost

Some parameters are defined preceding the analysis |119892|

denotes the average size of a graph 120572 means candidate ratiothat is the percentage of candidate pairs from all graph pairsand 120573 means the ratio of similar pairs from pairs that passedcount filtering We assume that the size of key-value pair thatcontains only graph IDs is 1

In job 1 of filter phase Map reads the data of graph sets119877 and 119878 from DFS the cost of which is 119874(|119892| sdot (|119877| + |119878|))Consider the worst case in Map task where 119902-grams aregenerated for each graph 119903(119904) and the number of signaturesgenerated is |119881|

119902

so the time complexity for generating119902-grams can be estimated as 119874(|119881|

119902

) Therefore the timecomplexity for Map task is 119874(|119881|

119902

sdot (|119877| + |119878|)) As eachgenerated 119902-gram signature forms an output key-value pair(the size for the pair is 1) the communication cost for Shuffleis119874(|119881|

119902

sdot(|119877|+|119878|)) In theReduce task all graphs representedby 119894119889 containing the same signature are acquired and pairedThus the IO cost is 119874(|119877||119878|) Then the occurrences of119903119894119889 and 119904119894119889 are counted In other words all the key-valuepairs generated from the Map task are counted so the timecomplexity for Reduce is 119874(|119881|

119902

sdot (|119877| + |119878|))Consider job 2 in filter phase where all the output key-

value pairs from job 1 are read intoMap task of job 2 so the IOcost is 119874(|119877||119878|) Map outputs the key-value pairs it reads sothe time complexity forMap task and communication cost forShuffle are119874(|119877||119878|) In Reduce task all pairs are traversed sothe time complexity is119874(|119877||119878|) As we have 120572|119877||119878| candidatekey-value pairs the IO cost for Reduce task is 119874(120572|119877||119878|)

In verification phase of job 1 Map reads the candidatepairs represented by ⟨119903119894119889 119904119894119889⟩ and graph set 119878 from DFSwhere the IO overhead requires 119874(120572|119877||119878| + |119892||119878|) The timecomplexity of Map task and communication cost of Shuffleare also 119874(120572|119877||119878| + |119892||119878|) because in Map function we justemit what has been exactly input into Then in Reduce taskwe replace 119904119894119889with graph 119904 the time complexity is119874(120572|119877||119878|)and the IO overhead is 119874(120572|119892||119877||119878|)

In verification phase of job 2 Map reads the candidatepairs represented by ⟨119903119894119889 119904⟩ and graph set 119877 from DFSwhere the IO overhead requires 119874(|119892||119877|(120572|119878| + 1)) The

The Scientific World Journal 5

Input graph object sets 119877 and 119878 GED threshold 120591

Output similar graph pairs ⟨119903 119904⟩

(1) Filter(2) Job 1 count the same type of common signatures for graph pairs(3) Job 2 count the total common signatures and check the count filtering fir graph pairs(4) Verification(5) Job 1 replace sid with graph 119904(6) Job 2 replace rid with graph 119903 and calculate GED for candidate pairs

Algorithm 1MGSJoin

(1)Map ⟨119903119894119889 119903⟩ ⟨119904119894119889 119904⟩ rarr ⟨119903119894119889 119904119887119891119903⟩ ⟨119904119894119889 119904119887119891

119904⟩

(2) create SBF for each graph and insert the generated 119902-gram signatures into it(3) output ⟨119903119894119889 119904119887119891

119903⟩ or ⟨119904119894119889 119904119887119891

119904⟩

(4) Shuffle conduct Cartesian product between pairs ⟨119903119894119889 119904119887119891119903⟩ and pairs ⟨119904119894119889 119904119887119891

119904⟩

(5) Reduce ⟨⟨119903119894119889 119904119887119891119903⟩ 119897119894119904119905 (⟨119904119894119889 119904119887119891

119904⟩)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(6) for any ⟨119903119894119889 119904119894119889⟩ calculate the intersection of 119904119887119891119903and 119904119887119891

119904and estimate

the number of common signatures(7) invoke the count filtering for each pair ⟨119903119894119889 119904119894119889⟩ and output it if it passes the condition

Algorithm 2 Replacement of filter of Algorithm 1

time complexity of Map task and communication cost ofShuffle are also 119874(|119892||119877|(120572|119878| + 1)) for the same reason aboveThen in Reduce task we first replace 119903119894119889 with graph 119903 andthen calculate GED for graph pairs The GED calculation by

theAlowast-based algorithm requires 119874(|119881||119881|

) In the worst case

the time complexity of Reduce is 119874(120572|119877||119878||119881||119881|

) Finallysimilar graph pairs are emitted into DFS which requires119874(120572120573|119877||119878|)

4 Incorporating Bloom Filters

In the filtering phase of Algorithm 1 twoMapReduce jobs arerequired with many intermediate key-value pairs generatedand transmitted These increase the IO and communicationcost which can be fairly time-consuming This sectionintroduces the Bloom filter technique to reduce such costNext we first recall the concept of spectral Bloom filters

41 Spectral Bloom Filter Bloom filters [12] are space efficientdata structures which allow fast membership queries over agiven set A Bloom filter uses 119896 hash functions ℎ

1 ℎ2 ℎ

119896

to hash elements into an array of size 119898 For an element 119890 inthe set the bit at positions ℎ

1(119890) ℎ2(119890) ℎ

119896(119890) in the array is

set to 1 Given a query item 119902 we check its membership in theset by examining the bits at positions ℎ

1(119902) ℎ2(119902) ℎ

119896(119902)

of the array The item 119902 is reported to be contained in theset if (and only if) all the aforementioned bits are 1 Thismethod brings a small probability of false-positive that isit may return a positive result for an item which actually isnot contained in the set but no false-negative while gainingsubstantial space savings [13]

Spectral Bloom filter (SBF) [14] generalized the basicBloom filter to be able to record the element frequency

which is thus adopted in this paper SBF is represented by⟨119860 119891⟩ where 119860 is a set and 119891 is a map from 119860 to naturalnumbers that is 119891 119860 rarr 119873 where 119873 is the universe ofnatural numbers SBF replaces the bit vector with a vectorof 119898 counters For insertion of item 119890 the counters 119862 =

119862ℎ1(119890)

119862ℎ2(119890)

119862ℎ119896(119890)

are increased by 1 for insertion anddecreased by 1 for deletion Let 119891(119902) denote the frequency of119902 A basic query for SBF on an item 119902 returns an estimationon 119891(119902) that is 119891(119902) = min

119894isin1119896119862ℎ119894(119902)

Note that similarto Bloom filters SBF also never underestimate 119891(119902)

42 Algorithm Incorporating SBF not only reduces the num-ber of key-value pairs but also contracts the two MapReducejobs into one in the filtering phase In particular the Maptask takes graph sets 119877 and 119878 as input Then we createa SBF for each graph by adding the 119902-gram signaturesCartesian product is conducted for the output key-valuepairs ⟨119903119894119889 119904119887119891

119903⟩ and ⟨119904119894119889 119904119887119891

119904⟩ In the Reduce task for

each graph pair their SBFs are intersected to estimate thenumber of common signatures By intersecting two SBFswithcounters 119862

(1) and 119862(2) respectively it returns another SBF

with counters119862lowast119862lowast119894

= min119862(1)

119894 119862(2)

119894 119894 isin 1 119898 Hence

the number of common signatures could be estimated bylfloor(1119896) sum

119898

119894=1119862lowast

119894rfloor Subsequently the graph pairs which have

less than 119871119861 common signatures given by count filteringcondition will be discarded and the remaining pairs formthe candidate set

We provide the pseudocode of the aforementioned pro-cess in Algorithm 2

43 Correctness and Complexity Analysis There is smallprobability that the false positive case happens with SBFSpecifically a query for an item 119902 in SBF on 119891(119902) 119891(119902)

6 The Scientific World Journal

Table 1 Complexity analysis of filtering

Phase Job 1 Job 2 +SBF

Map IO 119874 (1003816100381610038161003816119892

1003816100381610038161003816 sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 sdot (|119877| + |119878|))

Time 119874 ((|119881|119902

sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119896|119881|119902

sdot (|119877| + |119878|))

Shuffle 119874 ((|119881|119902

sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119898 |119877| |119878|)

Reduce IO 119874(|119877||119878|) 119874 (120572 |119877| |119878|) 119874 (120572 |119877| |119878|)

Time 119874 ((|119881|119902

sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (|119877| |119878|)

may be larger than 119891(119902) Therefore the number of commonsignatures estimated this way may be larger than the actualvalue Nonetheless false-negative will never happen whichensures the correctness of the algorithm in a certain case thepruning power of count filtering will be impaired Besidesfalse-positive will be less likely to happen if one carefullychooses the hash functions and configures the sizes ofcounters

Then we analyze the complexity Let 119898 be the size ofa SBF The Map task reads the entire sets 119877 and 119878 so itsIO cost is 119874(|119892|(|119877| + |119878|)) Then signatures are generatedand added to SBF and 119896 hash values are calculated for eachsignatureThus the time complexity forMap is119874(119896|119881|

119902

(|119877|+

|119878|)) Map emits the SBF for each input graph and thenCartesian product is conductedThe communication cost forShuffle is119874(119898

2

|119877||119878|) In the Reduce task for each graph pairthe number of common signatures is calculated and countfiltering condition is checked Thus the time complexity is119874(|119877||119878|) Regardless of false-positive cases the IO cost is thesame as before Denoting the improved algorithm by ldquo+SBFrdquowe summarize the complexity results in Table 1

5 Optimizing Verification Phase

In verification phase we need to calculate the GED ofcandidate pairs Nevertheless it is not capable of finishingthe calculation with large graphs and threshold 120591 Thereforewe devise a MapReduce-based method for GED calculationwhich is able of handling large-scale graphs Besides joinoperations are required preceding the GED calculation to getthe entire graphThus three relations 119877(119903119894119889 119903)⋈119862(119903119894119889 119904119894119889)⋈

119878(119904119894119889 119904) are joined to obtain the input of the GED algorithmwhere 119862(119903119894119889 119904119894119889) is the output of the filtering phase Inspiredby the idea of multiway join we can reduce the number ofrequired MapReduce jobs from two to one

51 MapReduce for GED Calculation TheGED calculation isbased onAlowast algorithmAlowast constructs a search-tree the nodeof which represents a mapping status A mapping status isstored in an array (denoted by 119909) the index of which standsfor different vertices in graph 119903 and the corresponding valuestands for the vertices of graph 119904Alowast explores the space of allpossible vertex mappings between two graphs in a best-firstsearch fashion with function (denoted by 119891(119909)) establishedto determine the order in which the search visits vertexmappings 119891(119909) is a sum of two functions (1) the distancefrom the initial state to the current state (denoted by 119892(119909))

(2) a heuristic estimate of the distance from the current stateto the goal (denoted by ℎ(119909)) 119892(119909) and ℎ(119909) are calculated bythe following equations

119892 (119909) = GED (119903119901 119904119901)

ℎ (119909) = Γ (119871119881

(119903119901) 119871119881

(119904119902)) + Γ (119871

119864(119903119902) 119871119864

(119904119902))

(2)

119903119901consists of the vertices that have been mapped and edges

connecting them while 119903119902consists of the vertices unmapped

yet as well as their resident edges The equation for ℎ(119909)

represents the label difference of two graphsThe search space forAlowast-based approach is very large

requiring 119874(|119881||119881|

) In order to boost up the searchingprocedure parallelization is a common way to think aboutThe naive way is to allocate different branches of search-tree to different workers so that the searching procedure canproceed in parallel However the load is not balanced this wayso that the final runtime is determined on the worker withthe heaviest load As a consequence we devise aMapReduce-based method to calculate GED denoted by MRGED whichreallocates the works after each MapReduce round

The format of the key-value pairs to be manipulated byMapReduce is ⟨119909 119891(119909)⟩ The procedure of searching for theresult is through iterations that each iterationwalks down onelayer of the search-tree

52 Multiway Join In relational database a multiway joincan process 119879

1(119860 119861) ⋈ 119879

2(119861 119862) ⋈ 119879

3(119862 119863) together in one

round where 119879119894(119883 119884) is a relational table with attributes

119883 and 119884 (Algorithm 3) Following the same idea we canconsolidate the twoMapReduce jobs required for verificationin Algorithm 1 Specifically let ℎ be a hash function withrange 1 2 119899 where 119899

2 is the number of Reducers Weassociate eachReduce taskwith a pair (119894 119895) where 119894 119895 isin [1 119899]Each tuple 119905

2(119887 119888) isin 119879

2(119861 119862) is sent to theReducer numbered

(ℎ(119887) ℎ(119888)) while each tuple in1198791(119886 119887) (resp119879

3(119888 119889)) is sent

to Reducers numbered (ℎ(119887) 119909) (resp (119910 ℎ(119888))) for any 119909

(resp 119910) Each Reduce task computes the join of the tuplesit receives It is shown that multiway join is more efficient inpractice than two simple joins [15]

We encapsulate the improved verification procedure inAlgorithm 4

53 Correctness and Complexity Analysis One may immedi-ately verify that Algorithm 4 correctly conducts the verifica-tion

The Scientific World Journal 7

Input 119903 119904 are graphs from candidate pair ⟨119903 119904⟩

Output Graph edit distance of 119903 and 119904

(1) construct the search tree with the root node ⟨0 0 0 0⟩(2) read nodes into Mapper if the node read in is fully mapped namely 119903

119902is empty go to line (8)

(3)Map ⟨119909 119891(119909)⟩ rarr ⟨119909 119891(119909)⟩

(4) calculate 119891(119909) for node 119909 and check whether 119891(119909) gt 120591 if it is continue(5) create new nodes by mapping a vertex V

1isin 119903119902(119909) to a vertex V

2isin 119904119902(119909) for any V

1and V

2

(6) output all the newly created nodes(7) back to line (2)(8) gather the nodes calculate the minimum 119891(119909) and return it

Algorithm 3 MRGED (119903 119904)

(1)Map ⟨119903119894119889 119903⟩ ⟨119903119894119889 119904119894119889⟩ ⟨119904119894119889 119904⟩

(2) if ⟨119903119894119889 119903⟩ then emit(⟨(ℎ(119903119894119889) 119909) ⟨119903119894119889 119903⟩⟩) for any 119909 if ⟨119903119894119889 119904119894119889⟩ thenemit(⟨(ℎ (119903119894119889) ℎ (119904119894119889)) ⟨119903119894119889 119904⟩⟩) for any 119910 else emit(⟨(119910 ℎ(119904119894119889)) ⟨119904119894119889 119904⟩⟩)

(3) Reduce ⟨(ℎ(119903119894119889) ℎ(119904119894119889)) 119897119894119904119905(V119886119897119906119890)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(4) split 119897119894119904119905(V119886119897119906119890) into 119897119894119904119905 (⟨119903119894119889 119903⟩) 119897119894119904119905 (⟨119904119894119889 119904⟩) and 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩)(5) denote 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩) as joinkey(6) for ⟨119903119894119889 119904119894119889⟩ isin 119895119900119894119899119896119890119910119904 119908ℎ119890119903119890 ⟨119903119894119889 119903⟩ isin 119897119894119904119905 (⟨119903119894119889 119903⟩) 119886119899119889 ⟨119904119894119889 119904⟩ isin 119897119894119904119905 (⟨119904119894119889 119904⟩) do(7) if 119892119903119886119901ℎ119904 119903 119886119899119889 119904 119894119899 ⟨119903 119904⟩ 119886119903119890 V119890119903119910 119897119886119903119892119890 then invoke MRGED

calculation else invoke the Alowast-based in-memory algorithm

Algorithm 4 Replacement of verification of Algorithm 1

In the multiway join based verification phase the Maptask takes graph sets 119877 and 119878 and the candidate pairs repre-sented by their 119894119889s as input The input IO cost is 119874(|119892|(|119877| +

|119878|) + 120572|119877||119878|) The key-value pair ⟨119903119894119889 119903⟩ (resp ⟨119904119894119889 119904⟩) issent to 119899 Reduce tasks numbered (119903119894119889 119909) (resp (119910 119904119894119889)) forany119909 (resp119910) whereas the key-value pair ⟨119903119894119889 119904119894119889⟩ is sent tothe only Reduce task numbered (ℎ(119903119894119889) ℎ(119904119894119889)) As a resultthe communication cost for Shuffle is 119874(119899|119892|(|119877| + |119878|) +

120572|119877||119878|) where 1198992 is the number of Reducers In the Reduce

task all candidate graph pairs go through edit distancecomputation Pairs with larger size go through MRGEDwhile pairs with smaller size go throughAlowast For simplicitythe complexity of GED calculation is regarded the same asthe baseline algorithm Labelling the resulting algorithmwithldquo+MJrdquo we summarize the complexity results in Table 2

6 Experiments

61 Experiment Setup We conducted experiments on severalpublicly available real datasets but only present the resultson Pubchem (httppubchemncbinlmnihgov) due to theinterest of space The dataset is constructed by sampling1000000 graphs from Pubchem

Amazon cloud services were used as our experimentplatform Specifically we used Elastic Compute Cloud (EC2)in which the computing nodes are called instances In theexperiment 31 instances were used by defaultmdashone set asmaster node and others as worker nodes The standardconfiguration of all EC2 instances is m1small one CPUof single core with 17 GB memory running Hadoop 112(Table 3)

62 Evaluating Filters In order to evaluate the effectivenessof our filtering techniques we use the term ldquoBasicrdquo for thebaseline algorithm for processing graph similarity joins basedonMapReduce ldquo+SBFrdquo denotes filtering improved algorithmof Basic by incorporating SBF

The algorithm efficiency has been studied and shownin Figure 2(a) However the pruning power is somewhatimpaired so we conducted the experiment to record theincrease of candidate pairs through ldquo+SBFrdquo (cf Figure 2(b))It can be revealed that ldquo+SBFrdquo outweigh ldquoBasicrdquo in efficiencyby sacrificing little pruning power When 120591 = 5 less than 300more candidate pairs are generatedwhile about 5000 secondsare reserved

63 Evaluating Verification The verification was evaluatedwith candidate pairs generated by Basic Term ldquo+MJrdquodenotes applying multiway join in verification whileldquo+MRGEDrdquo denotes adapting alternative MapReduce-basedGED calculation We use term ldquoMGSJoinrdquo to indicate thebasic algorithm incorporating both techniques Figure 3(a)shows the runtime comparison between +MRGED

and MGSJoin The result illustrates the superiority ofapplying multiway join When 120591 = 5 algorithm withmultiway join is about 6000 seconds faster than withordinary joins Figure 3(b) presents the result of evaluatingMRGED where Basic and +MRGED are compared It canbe observed that the runtime of both algorithms growsexponentially When 120591 equals 1 the Basic finishes quickerthan +MRGED (162 s and 204 s resp) while 120591 equals 2 the+MRGED outweighs Basic (the runtime is 712 s and 580 sresp) This is because the calculation required for 120591 = 1

8 The Scientific World Journal

Table 2 Complexity analysis of verification

Phase Job 1 Job 2 +MJ

Map IO 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)

Time 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (119899 (|119877| + |119878|) + 120572 |119877| |119878|)

Shuffle 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1198991003816100381610038161003816119892

1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)

Reduce IO 119874 (1205721003816100381610038161003816119892

1003816100381610038161003816 sdot |119877| |119878|) 119874 (120572120573 |119877| |119878|) 119874 (120572120573 |119877| |119878|)

Time 119874 (120572 |119877| |119878|) 119874 (120572|119877||119878|119881||119881|

) 119874 (120572|119877||119878|119881||119881|

)

0 1 2 3 4 5

Elap

sed

time (

s)

GED threshold (120591)

Basic

+SBF

15k

12k

9k

6k

3k

(a)

0

50

100

150

200

250

300

1 2 3 4 5GED threshold (120591)

Num

ber o

f Δ (c

andi

date

pai

rs)

(b)

Figure 2 Evaluating filtering

Table 3 Dataset statistics

Dataset |119877| |119881| |119864| Disk size (GB)Enamine 1000000 5232 5037 07

is small where the MRGED is clumsy compared withAlowastwhereas when 120591 = 2 more calculation is required so that theadvantage of MRGED comes out When 120591 is within the rangeof 3ndash5 Basic is unable to finish because the large calculationdrives Basic out of memory

64 Comparing with State-of-the-Art Method We comparedour algorithm with the state-of-the-art method GSimJoinIn Figure 4(a) we chose 10000 graphs in order to comparewith GSimJoin and the result witnesses the obvious supe-riority of MGSJoin over GSimJoin Figure 4(b) is drawn inlog scale which varies the number of graphs and records theelapsed time Both algorithms grow linear in the figure whichreflect their exponential growth MGSJoin rises mush slowerthan GSimJoin The runtime of GSimJoin is about 10 timeslonger than MGSJoinwhen joining 100 graphs and 100 timeslonger than MGSJoinwhen joining 10000 graphsMoreoverwhen we have to join 100000 graphs GSimJoin is runningout of memory so that no result is recorded

65 Speedup We evaluated the speedup of our algorithm byvarying the number of instances from 10 to 50 The exper-imental results are shown in Figure 5(a) We can see that

with the increase of instances in the cluster the performanceof MGSJoin significantly improved The improvement issignificantly shown when threshold 120591 equals 4 With moreinstances running the count filtering is getting faster bycounting for the common signatures simultaneously andthe verification is getting quicker by joining relations andcalculating GED in parallel

66 Scale-Up We evaluated the scale-up of our algorithmby increasing both dataset sizes and number of nodes in thecluster The result is shown in Figure 5(b) It is worth notingthat as the dataset increased the results for different values of120591 get similar in the trend of increase All lines rise smoothlywhich reveals good scalability of MGSJoin

7 Related Work

Graph SimilarityQueries Similarity joins retrieve similar dataobject pairs which can be strings sets trees and graphs [16]As to GED-based graph similarity search [17] proposed 120581-AT a tree-based 119902-gram approach However it is associatedwith the drawback of usually loose lower bound for countfiltering Seeing the drawback [9] presented a path-based 119902-gram approach In comparison with 120581-AT GSimJoin is moreefficient by leveraging more advanced filtering techniquesThus we adopt the path-based 119902-gram approach in this paper

MapReduce-Based Graph Algorithms MapReduce is a dis-tributed programming framework [10] which has been

The Scientific World Journal 9

Elap

sed

time (

s)

1 2 3 4 5GED threshold (120591)

16k

12k

8k

4k

MGSJoin

+MRGED

(a)

Elap

sed

time (

s)

Basic

+MR

105

104

103

102

1 2 3 4 5GED threshold (120591)

(b)

Figure 3 Evaluating verification

0

Elap

sed

time (

s)

GSimJoin

MGSJoin

1 2 3 4 5GED threshold (120591)

5k

4k

3k

2k

1k

(a)

Elap

sed

time (

s)

Number of graphs

GSimJoin

MGSJoin

105

104

103

102

102

101

103

104

106

105

(b)

Figure 4 Comparing with state-of-the-art method

2k

4k

6k

8k

10k

504540353025201510

Elap

sed

time

Number of instances

120591 = 1

120591 = 2

120591 = 3

120591 = 4

(a)

010 20 30 40 50

Elap

sed

time (

s)

Number of instances and number of graphs (times2000)

5k

4k

3k

2k

1k

(b)

Figure 5 Speedup and Scale-up

10 The Scientific World Journal

applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce

Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]

8 Conclusion

In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs

As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)

References

[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002

[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004

[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006

[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005

[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008

[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983

[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983

[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979

[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012

[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA

[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970

[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004

[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003

[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010

[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012

[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012

The Scientific World Journal 11

[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009

[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011

[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011

[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011

[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012

[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013

[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013

[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010

[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011

[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012

[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013

[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 4: Research Article Efficient and Scalable Graph …downloads.hindawi.com/journals/tswj/2014/749028.pdfResearch Article Efficient and Scalable Graph Similarity Joins in MapReduce YifanChen,

4 The Scientific World Journal

We input ⟨119903119894119889 (119904119894119889 119898119891)⟩ to theMap task and then output

exactly the same key-value pair Therefore the Reduce taskreceives all the graph pairs generated in job 1 with the specific119904119894119889 As the graph pair may have more than one type ofcommon signatures 119898

119891is summed with the same graph pair

for each identical signature respectively Subsequently thegraph pairs with less than 119871119861 (1) common signatures will bediscarded The remaining graph pairs are candidate pairs forverification The function for Reduce is as follows

Reduce⟨119903119894119889 119897119894119904119905(119904119894119889 + 119898119891)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(1) receive output from Mappers the specific 119903119894119889and a list of 119904119894119889 with 119898

119891

(2) sum 119898119891and calculate the number of common

signatures for each pair ⟨119903119894119889 119904119894119889⟩(3) conduct the count filtering for each pair and

output pairs to DFS whose common signaturesare more than 119871119861

32 Verification In the verification phase candidate pairs⟨119903119894119889 119904119894119889⟩ are to be verified where the graphs 119903 and 119904 arerequired that is join operations are necessary to retrievegraphs 119903 and 119904 by their 119894119889rsquos Hence we allocate two MapRe-duce jobs to join (119903119894119889 119903) (119903119894119889 119904119894119889) and (119904119894119889 119904)

Job 1 Job 1 replaces 119904119894119889 with graph 119904 The map function takesgraph set 119878 and ⟨119903119894119889 119904119894119889⟩ as input and emits ⟨119903119894119889 119904⟩ which islisted as follows

Map⟨119903119894119889 119904119894119889⟩⟨119904119894119889 119904⟩ rarr ⟨119904119894119889 119903119894119889119904⟩

(1) input candidate pair ⟨119903119894119889 119904119894119889⟩ and graph set 119878(2) for ⟨119903119894119889 119904119894119889⟩ emit ⟨119904119894119889 119903119894119889⟩ and for ⟨119904119894119889 119904⟩

emit it exactly both of which take 119904119894119889 as the key

TheReduce task gathers the list 119903119894119889 and graph 119904 for the key119904119894119889 and then outputs the key-value pair ⟨119903119894119889 119904⟩ The functionfor Reducer is as follows

Reduce⟨119904119894119889 119897119894119904119905(119903119894119889119904)⟩ rarr ⟨119903119894119889 119904⟩

(1) receive a list of 119903119894119889 and graph set 119878 with thespecific key 119904119894119889

(2) for ⟨119903119894119889 119904119894119889⟩ replace 119904119894119889 with 119904 and output pair⟨119903119894119889 119904⟩

Job 2 Job 2 replaces 119903119894119889 with graph 119903 invokes label filteringconditions and calculates GED to find the similar graphpairs

The function for Map task is as follows

Map⟨119903119894119889 119904⟩⟨119903119894119889 119903⟩ rarr ⟨119903119894119889 119903119904⟩

(1) input the candidate pair ⟨119903119894119889 119904⟩ and graph set119877(2) emit the key-value pair it reads where the group

key is 119903119894119889 for both cases

The function for Reduce task is as follows

Reduce⟨119903119894119889 119897119894119904119905(119903119904)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(1) receive a list of values that consisted of graphs 119903

and 119904 corresponding to the key 119903119894119889(2) replace 119903119894119889 with graph 119903(3) calculate GED for pairs ⟨119903 119904⟩ and output similar

pairs

33 Correctness and Complexity Analysis All graph pairs thatsatisfy the edit distance constraints are returned error-freewhich justifies its correctness

For algorithm complexity we take all three phasesmdashMap Reduce and Shufflemdashinto consideration IO readingoverhead from distributed file system (DFS) is considered forMap task whereas IO writing overhead into DFS is analyzedfor Reduce task Both tasks also take time complexity intoconsideration Shuffle considers the network communicationcost

Some parameters are defined preceding the analysis |119892|

denotes the average size of a graph 120572 means candidate ratiothat is the percentage of candidate pairs from all graph pairsand 120573 means the ratio of similar pairs from pairs that passedcount filtering We assume that the size of key-value pair thatcontains only graph IDs is 1

In job 1 of filter phase Map reads the data of graph sets119877 and 119878 from DFS the cost of which is 119874(|119892| sdot (|119877| + |119878|))Consider the worst case in Map task where 119902-grams aregenerated for each graph 119903(119904) and the number of signaturesgenerated is |119881|

119902

so the time complexity for generating119902-grams can be estimated as 119874(|119881|

119902

) Therefore the timecomplexity for Map task is 119874(|119881|

119902

sdot (|119877| + |119878|)) As eachgenerated 119902-gram signature forms an output key-value pair(the size for the pair is 1) the communication cost for Shuffleis119874(|119881|

119902

sdot(|119877|+|119878|)) In theReduce task all graphs representedby 119894119889 containing the same signature are acquired and pairedThus the IO cost is 119874(|119877||119878|) Then the occurrences of119903119894119889 and 119904119894119889 are counted In other words all the key-valuepairs generated from the Map task are counted so the timecomplexity for Reduce is 119874(|119881|

119902

sdot (|119877| + |119878|))Consider job 2 in filter phase where all the output key-

value pairs from job 1 are read intoMap task of job 2 so the IOcost is 119874(|119877||119878|) Map outputs the key-value pairs it reads sothe time complexity forMap task and communication cost forShuffle are119874(|119877||119878|) In Reduce task all pairs are traversed sothe time complexity is119874(|119877||119878|) As we have 120572|119877||119878| candidatekey-value pairs the IO cost for Reduce task is 119874(120572|119877||119878|)

In verification phase of job 1 Map reads the candidatepairs represented by ⟨119903119894119889 119904119894119889⟩ and graph set 119878 from DFSwhere the IO overhead requires 119874(120572|119877||119878| + |119892||119878|) The timecomplexity of Map task and communication cost of Shuffleare also 119874(120572|119877||119878| + |119892||119878|) because in Map function we justemit what has been exactly input into Then in Reduce taskwe replace 119904119894119889with graph 119904 the time complexity is119874(120572|119877||119878|)and the IO overhead is 119874(120572|119892||119877||119878|)

In verification phase of job 2 Map reads the candidatepairs represented by ⟨119903119894119889 119904⟩ and graph set 119877 from DFSwhere the IO overhead requires 119874(|119892||119877|(120572|119878| + 1)) The

The Scientific World Journal 5

Input graph object sets 119877 and 119878 GED threshold 120591

Output similar graph pairs ⟨119903 119904⟩

(1) Filter(2) Job 1 count the same type of common signatures for graph pairs(3) Job 2 count the total common signatures and check the count filtering fir graph pairs(4) Verification(5) Job 1 replace sid with graph 119904(6) Job 2 replace rid with graph 119903 and calculate GED for candidate pairs

Algorithm 1MGSJoin

(1)Map ⟨119903119894119889 119903⟩ ⟨119904119894119889 119904⟩ rarr ⟨119903119894119889 119904119887119891119903⟩ ⟨119904119894119889 119904119887119891

119904⟩

(2) create SBF for each graph and insert the generated 119902-gram signatures into it(3) output ⟨119903119894119889 119904119887119891

119903⟩ or ⟨119904119894119889 119904119887119891

119904⟩

(4) Shuffle conduct Cartesian product between pairs ⟨119903119894119889 119904119887119891119903⟩ and pairs ⟨119904119894119889 119904119887119891

119904⟩

(5) Reduce ⟨⟨119903119894119889 119904119887119891119903⟩ 119897119894119904119905 (⟨119904119894119889 119904119887119891

119904⟩)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(6) for any ⟨119903119894119889 119904119894119889⟩ calculate the intersection of 119904119887119891119903and 119904119887119891

119904and estimate

the number of common signatures(7) invoke the count filtering for each pair ⟨119903119894119889 119904119894119889⟩ and output it if it passes the condition

Algorithm 2 Replacement of filter of Algorithm 1

time complexity of Map task and communication cost ofShuffle are also 119874(|119892||119877|(120572|119878| + 1)) for the same reason aboveThen in Reduce task we first replace 119903119894119889 with graph 119903 andthen calculate GED for graph pairs The GED calculation by

theAlowast-based algorithm requires 119874(|119881||119881|

) In the worst case

the time complexity of Reduce is 119874(120572|119877||119878||119881||119881|

) Finallysimilar graph pairs are emitted into DFS which requires119874(120572120573|119877||119878|)

4 Incorporating Bloom Filters

In the filtering phase of Algorithm 1 twoMapReduce jobs arerequired with many intermediate key-value pairs generatedand transmitted These increase the IO and communicationcost which can be fairly time-consuming This sectionintroduces the Bloom filter technique to reduce such costNext we first recall the concept of spectral Bloom filters

41 Spectral Bloom Filter Bloom filters [12] are space efficientdata structures which allow fast membership queries over agiven set A Bloom filter uses 119896 hash functions ℎ

1 ℎ2 ℎ

119896

to hash elements into an array of size 119898 For an element 119890 inthe set the bit at positions ℎ

1(119890) ℎ2(119890) ℎ

119896(119890) in the array is

set to 1 Given a query item 119902 we check its membership in theset by examining the bits at positions ℎ

1(119902) ℎ2(119902) ℎ

119896(119902)

of the array The item 119902 is reported to be contained in theset if (and only if) all the aforementioned bits are 1 Thismethod brings a small probability of false-positive that isit may return a positive result for an item which actually isnot contained in the set but no false-negative while gainingsubstantial space savings [13]

Spectral Bloom filter (SBF) [14] generalized the basicBloom filter to be able to record the element frequency

which is thus adopted in this paper SBF is represented by⟨119860 119891⟩ where 119860 is a set and 119891 is a map from 119860 to naturalnumbers that is 119891 119860 rarr 119873 where 119873 is the universe ofnatural numbers SBF replaces the bit vector with a vectorof 119898 counters For insertion of item 119890 the counters 119862 =

119862ℎ1(119890)

119862ℎ2(119890)

119862ℎ119896(119890)

are increased by 1 for insertion anddecreased by 1 for deletion Let 119891(119902) denote the frequency of119902 A basic query for SBF on an item 119902 returns an estimationon 119891(119902) that is 119891(119902) = min

119894isin1119896119862ℎ119894(119902)

Note that similarto Bloom filters SBF also never underestimate 119891(119902)

42 Algorithm Incorporating SBF not only reduces the num-ber of key-value pairs but also contracts the two MapReducejobs into one in the filtering phase In particular the Maptask takes graph sets 119877 and 119878 as input Then we createa SBF for each graph by adding the 119902-gram signaturesCartesian product is conducted for the output key-valuepairs ⟨119903119894119889 119904119887119891

119903⟩ and ⟨119904119894119889 119904119887119891

119904⟩ In the Reduce task for

each graph pair their SBFs are intersected to estimate thenumber of common signatures By intersecting two SBFswithcounters 119862

(1) and 119862(2) respectively it returns another SBF

with counters119862lowast119862lowast119894

= min119862(1)

119894 119862(2)

119894 119894 isin 1 119898 Hence

the number of common signatures could be estimated bylfloor(1119896) sum

119898

119894=1119862lowast

119894rfloor Subsequently the graph pairs which have

less than 119871119861 common signatures given by count filteringcondition will be discarded and the remaining pairs formthe candidate set

We provide the pseudocode of the aforementioned pro-cess in Algorithm 2

43 Correctness and Complexity Analysis There is smallprobability that the false positive case happens with SBFSpecifically a query for an item 119902 in SBF on 119891(119902) 119891(119902)

6 The Scientific World Journal

Table 1 Complexity analysis of filtering

Phase Job 1 Job 2 +SBF

Map IO 119874 (1003816100381610038161003816119892

1003816100381610038161003816 sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 sdot (|119877| + |119878|))

Time 119874 ((|119881|119902

sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119896|119881|119902

sdot (|119877| + |119878|))

Shuffle 119874 ((|119881|119902

sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119898 |119877| |119878|)

Reduce IO 119874(|119877||119878|) 119874 (120572 |119877| |119878|) 119874 (120572 |119877| |119878|)

Time 119874 ((|119881|119902

sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (|119877| |119878|)

may be larger than 119891(119902) Therefore the number of commonsignatures estimated this way may be larger than the actualvalue Nonetheless false-negative will never happen whichensures the correctness of the algorithm in a certain case thepruning power of count filtering will be impaired Besidesfalse-positive will be less likely to happen if one carefullychooses the hash functions and configures the sizes ofcounters

Then we analyze the complexity Let 119898 be the size ofa SBF The Map task reads the entire sets 119877 and 119878 so itsIO cost is 119874(|119892|(|119877| + |119878|)) Then signatures are generatedand added to SBF and 119896 hash values are calculated for eachsignatureThus the time complexity forMap is119874(119896|119881|

119902

(|119877|+

|119878|)) Map emits the SBF for each input graph and thenCartesian product is conductedThe communication cost forShuffle is119874(119898

2

|119877||119878|) In the Reduce task for each graph pairthe number of common signatures is calculated and countfiltering condition is checked Thus the time complexity is119874(|119877||119878|) Regardless of false-positive cases the IO cost is thesame as before Denoting the improved algorithm by ldquo+SBFrdquowe summarize the complexity results in Table 1

5 Optimizing Verification Phase

In verification phase we need to calculate the GED ofcandidate pairs Nevertheless it is not capable of finishingthe calculation with large graphs and threshold 120591 Thereforewe devise a MapReduce-based method for GED calculationwhich is able of handling large-scale graphs Besides joinoperations are required preceding the GED calculation to getthe entire graphThus three relations 119877(119903119894119889 119903)⋈119862(119903119894119889 119904119894119889)⋈

119878(119904119894119889 119904) are joined to obtain the input of the GED algorithmwhere 119862(119903119894119889 119904119894119889) is the output of the filtering phase Inspiredby the idea of multiway join we can reduce the number ofrequired MapReduce jobs from two to one

51 MapReduce for GED Calculation TheGED calculation isbased onAlowast algorithmAlowast constructs a search-tree the nodeof which represents a mapping status A mapping status isstored in an array (denoted by 119909) the index of which standsfor different vertices in graph 119903 and the corresponding valuestands for the vertices of graph 119904Alowast explores the space of allpossible vertex mappings between two graphs in a best-firstsearch fashion with function (denoted by 119891(119909)) establishedto determine the order in which the search visits vertexmappings 119891(119909) is a sum of two functions (1) the distancefrom the initial state to the current state (denoted by 119892(119909))

(2) a heuristic estimate of the distance from the current stateto the goal (denoted by ℎ(119909)) 119892(119909) and ℎ(119909) are calculated bythe following equations

119892 (119909) = GED (119903119901 119904119901)

ℎ (119909) = Γ (119871119881

(119903119901) 119871119881

(119904119902)) + Γ (119871

119864(119903119902) 119871119864

(119904119902))

(2)

119903119901consists of the vertices that have been mapped and edges

connecting them while 119903119902consists of the vertices unmapped

yet as well as their resident edges The equation for ℎ(119909)

represents the label difference of two graphsThe search space forAlowast-based approach is very large

requiring 119874(|119881||119881|

) In order to boost up the searchingprocedure parallelization is a common way to think aboutThe naive way is to allocate different branches of search-tree to different workers so that the searching procedure canproceed in parallel However the load is not balanced this wayso that the final runtime is determined on the worker withthe heaviest load As a consequence we devise aMapReduce-based method to calculate GED denoted by MRGED whichreallocates the works after each MapReduce round

The format of the key-value pairs to be manipulated byMapReduce is ⟨119909 119891(119909)⟩ The procedure of searching for theresult is through iterations that each iterationwalks down onelayer of the search-tree

52 Multiway Join In relational database a multiway joincan process 119879

1(119860 119861) ⋈ 119879

2(119861 119862) ⋈ 119879

3(119862 119863) together in one

round where 119879119894(119883 119884) is a relational table with attributes

119883 and 119884 (Algorithm 3) Following the same idea we canconsolidate the twoMapReduce jobs required for verificationin Algorithm 1 Specifically let ℎ be a hash function withrange 1 2 119899 where 119899

2 is the number of Reducers Weassociate eachReduce taskwith a pair (119894 119895) where 119894 119895 isin [1 119899]Each tuple 119905

2(119887 119888) isin 119879

2(119861 119862) is sent to theReducer numbered

(ℎ(119887) ℎ(119888)) while each tuple in1198791(119886 119887) (resp119879

3(119888 119889)) is sent

to Reducers numbered (ℎ(119887) 119909) (resp (119910 ℎ(119888))) for any 119909

(resp 119910) Each Reduce task computes the join of the tuplesit receives It is shown that multiway join is more efficient inpractice than two simple joins [15]

We encapsulate the improved verification procedure inAlgorithm 4

53 Correctness and Complexity Analysis One may immedi-ately verify that Algorithm 4 correctly conducts the verifica-tion

The Scientific World Journal 7

Input 119903 119904 are graphs from candidate pair ⟨119903 119904⟩

Output Graph edit distance of 119903 and 119904

(1) construct the search tree with the root node ⟨0 0 0 0⟩(2) read nodes into Mapper if the node read in is fully mapped namely 119903

119902is empty go to line (8)

(3)Map ⟨119909 119891(119909)⟩ rarr ⟨119909 119891(119909)⟩

(4) calculate 119891(119909) for node 119909 and check whether 119891(119909) gt 120591 if it is continue(5) create new nodes by mapping a vertex V

1isin 119903119902(119909) to a vertex V

2isin 119904119902(119909) for any V

1and V

2

(6) output all the newly created nodes(7) back to line (2)(8) gather the nodes calculate the minimum 119891(119909) and return it

Algorithm 3 MRGED (119903 119904)

(1)Map ⟨119903119894119889 119903⟩ ⟨119903119894119889 119904119894119889⟩ ⟨119904119894119889 119904⟩

(2) if ⟨119903119894119889 119903⟩ then emit(⟨(ℎ(119903119894119889) 119909) ⟨119903119894119889 119903⟩⟩) for any 119909 if ⟨119903119894119889 119904119894119889⟩ thenemit(⟨(ℎ (119903119894119889) ℎ (119904119894119889)) ⟨119903119894119889 119904⟩⟩) for any 119910 else emit(⟨(119910 ℎ(119904119894119889)) ⟨119904119894119889 119904⟩⟩)

(3) Reduce ⟨(ℎ(119903119894119889) ℎ(119904119894119889)) 119897119894119904119905(V119886119897119906119890)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(4) split 119897119894119904119905(V119886119897119906119890) into 119897119894119904119905 (⟨119903119894119889 119903⟩) 119897119894119904119905 (⟨119904119894119889 119904⟩) and 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩)(5) denote 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩) as joinkey(6) for ⟨119903119894119889 119904119894119889⟩ isin 119895119900119894119899119896119890119910119904 119908ℎ119890119903119890 ⟨119903119894119889 119903⟩ isin 119897119894119904119905 (⟨119903119894119889 119903⟩) 119886119899119889 ⟨119904119894119889 119904⟩ isin 119897119894119904119905 (⟨119904119894119889 119904⟩) do(7) if 119892119903119886119901ℎ119904 119903 119886119899119889 119904 119894119899 ⟨119903 119904⟩ 119886119903119890 V119890119903119910 119897119886119903119892119890 then invoke MRGED

calculation else invoke the Alowast-based in-memory algorithm

Algorithm 4 Replacement of verification of Algorithm 1

In the multiway join based verification phase the Maptask takes graph sets 119877 and 119878 and the candidate pairs repre-sented by their 119894119889s as input The input IO cost is 119874(|119892|(|119877| +

|119878|) + 120572|119877||119878|) The key-value pair ⟨119903119894119889 119903⟩ (resp ⟨119904119894119889 119904⟩) issent to 119899 Reduce tasks numbered (119903119894119889 119909) (resp (119910 119904119894119889)) forany119909 (resp119910) whereas the key-value pair ⟨119903119894119889 119904119894119889⟩ is sent tothe only Reduce task numbered (ℎ(119903119894119889) ℎ(119904119894119889)) As a resultthe communication cost for Shuffle is 119874(119899|119892|(|119877| + |119878|) +

120572|119877||119878|) where 1198992 is the number of Reducers In the Reduce

task all candidate graph pairs go through edit distancecomputation Pairs with larger size go through MRGEDwhile pairs with smaller size go throughAlowast For simplicitythe complexity of GED calculation is regarded the same asthe baseline algorithm Labelling the resulting algorithmwithldquo+MJrdquo we summarize the complexity results in Table 2

6 Experiments

61 Experiment Setup We conducted experiments on severalpublicly available real datasets but only present the resultson Pubchem (httppubchemncbinlmnihgov) due to theinterest of space The dataset is constructed by sampling1000000 graphs from Pubchem

Amazon cloud services were used as our experimentplatform Specifically we used Elastic Compute Cloud (EC2)in which the computing nodes are called instances In theexperiment 31 instances were used by defaultmdashone set asmaster node and others as worker nodes The standardconfiguration of all EC2 instances is m1small one CPUof single core with 17 GB memory running Hadoop 112(Table 3)

62 Evaluating Filters In order to evaluate the effectivenessof our filtering techniques we use the term ldquoBasicrdquo for thebaseline algorithm for processing graph similarity joins basedonMapReduce ldquo+SBFrdquo denotes filtering improved algorithmof Basic by incorporating SBF

The algorithm efficiency has been studied and shownin Figure 2(a) However the pruning power is somewhatimpaired so we conducted the experiment to record theincrease of candidate pairs through ldquo+SBFrdquo (cf Figure 2(b))It can be revealed that ldquo+SBFrdquo outweigh ldquoBasicrdquo in efficiencyby sacrificing little pruning power When 120591 = 5 less than 300more candidate pairs are generatedwhile about 5000 secondsare reserved

63 Evaluating Verification The verification was evaluatedwith candidate pairs generated by Basic Term ldquo+MJrdquodenotes applying multiway join in verification whileldquo+MRGEDrdquo denotes adapting alternative MapReduce-basedGED calculation We use term ldquoMGSJoinrdquo to indicate thebasic algorithm incorporating both techniques Figure 3(a)shows the runtime comparison between +MRGED

and MGSJoin The result illustrates the superiority ofapplying multiway join When 120591 = 5 algorithm withmultiway join is about 6000 seconds faster than withordinary joins Figure 3(b) presents the result of evaluatingMRGED where Basic and +MRGED are compared It canbe observed that the runtime of both algorithms growsexponentially When 120591 equals 1 the Basic finishes quickerthan +MRGED (162 s and 204 s resp) while 120591 equals 2 the+MRGED outweighs Basic (the runtime is 712 s and 580 sresp) This is because the calculation required for 120591 = 1

8 The Scientific World Journal

Table 2 Complexity analysis of verification

Phase Job 1 Job 2 +MJ

Map IO 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)

Time 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (119899 (|119877| + |119878|) + 120572 |119877| |119878|)

Shuffle 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1198991003816100381610038161003816119892

1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)

Reduce IO 119874 (1205721003816100381610038161003816119892

1003816100381610038161003816 sdot |119877| |119878|) 119874 (120572120573 |119877| |119878|) 119874 (120572120573 |119877| |119878|)

Time 119874 (120572 |119877| |119878|) 119874 (120572|119877||119878|119881||119881|

) 119874 (120572|119877||119878|119881||119881|

)

0 1 2 3 4 5

Elap

sed

time (

s)

GED threshold (120591)

Basic

+SBF

15k

12k

9k

6k

3k

(a)

0

50

100

150

200

250

300

1 2 3 4 5GED threshold (120591)

Num

ber o

f Δ (c

andi

date

pai

rs)

(b)

Figure 2 Evaluating filtering

Table 3 Dataset statistics

Dataset |119877| |119881| |119864| Disk size (GB)Enamine 1000000 5232 5037 07

is small where the MRGED is clumsy compared withAlowastwhereas when 120591 = 2 more calculation is required so that theadvantage of MRGED comes out When 120591 is within the rangeof 3ndash5 Basic is unable to finish because the large calculationdrives Basic out of memory

64 Comparing with State-of-the-Art Method We comparedour algorithm with the state-of-the-art method GSimJoinIn Figure 4(a) we chose 10000 graphs in order to comparewith GSimJoin and the result witnesses the obvious supe-riority of MGSJoin over GSimJoin Figure 4(b) is drawn inlog scale which varies the number of graphs and records theelapsed time Both algorithms grow linear in the figure whichreflect their exponential growth MGSJoin rises mush slowerthan GSimJoin The runtime of GSimJoin is about 10 timeslonger than MGSJoinwhen joining 100 graphs and 100 timeslonger than MGSJoinwhen joining 10000 graphsMoreoverwhen we have to join 100000 graphs GSimJoin is runningout of memory so that no result is recorded

65 Speedup We evaluated the speedup of our algorithm byvarying the number of instances from 10 to 50 The exper-imental results are shown in Figure 5(a) We can see that

with the increase of instances in the cluster the performanceof MGSJoin significantly improved The improvement issignificantly shown when threshold 120591 equals 4 With moreinstances running the count filtering is getting faster bycounting for the common signatures simultaneously andthe verification is getting quicker by joining relations andcalculating GED in parallel

66 Scale-Up We evaluated the scale-up of our algorithmby increasing both dataset sizes and number of nodes in thecluster The result is shown in Figure 5(b) It is worth notingthat as the dataset increased the results for different values of120591 get similar in the trend of increase All lines rise smoothlywhich reveals good scalability of MGSJoin

7 Related Work

Graph SimilarityQueries Similarity joins retrieve similar dataobject pairs which can be strings sets trees and graphs [16]As to GED-based graph similarity search [17] proposed 120581-AT a tree-based 119902-gram approach However it is associatedwith the drawback of usually loose lower bound for countfiltering Seeing the drawback [9] presented a path-based 119902-gram approach In comparison with 120581-AT GSimJoin is moreefficient by leveraging more advanced filtering techniquesThus we adopt the path-based 119902-gram approach in this paper

MapReduce-Based Graph Algorithms MapReduce is a dis-tributed programming framework [10] which has been

The Scientific World Journal 9

Elap

sed

time (

s)

1 2 3 4 5GED threshold (120591)

16k

12k

8k

4k

MGSJoin

+MRGED

(a)

Elap

sed

time (

s)

Basic

+MR

105

104

103

102

1 2 3 4 5GED threshold (120591)

(b)

Figure 3 Evaluating verification

0

Elap

sed

time (

s)

GSimJoin

MGSJoin

1 2 3 4 5GED threshold (120591)

5k

4k

3k

2k

1k

(a)

Elap

sed

time (

s)

Number of graphs

GSimJoin

MGSJoin

105

104

103

102

102

101

103

104

106

105

(b)

Figure 4 Comparing with state-of-the-art method

2k

4k

6k

8k

10k

504540353025201510

Elap

sed

time

Number of instances

120591 = 1

120591 = 2

120591 = 3

120591 = 4

(a)

010 20 30 40 50

Elap

sed

time (

s)

Number of instances and number of graphs (times2000)

5k

4k

3k

2k

1k

(b)

Figure 5 Speedup and Scale-up

10 The Scientific World Journal

applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce

Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]

8 Conclusion

In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs

As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)

References

[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002

[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004

[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006

[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005

[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008

[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983

[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983

[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979

[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012

[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA

[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970

[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004

[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003

[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010

[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012

[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012

The Scientific World Journal 11

[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009

[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011

[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011

[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011

[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012

[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013

[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013

[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010

[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011

[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012

[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013

[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 5: Research Article Efficient and Scalable Graph …downloads.hindawi.com/journals/tswj/2014/749028.pdfResearch Article Efficient and Scalable Graph Similarity Joins in MapReduce YifanChen,

The Scientific World Journal 5

Input graph object sets 119877 and 119878 GED threshold 120591

Output similar graph pairs ⟨119903 119904⟩

(1) Filter(2) Job 1 count the same type of common signatures for graph pairs(3) Job 2 count the total common signatures and check the count filtering fir graph pairs(4) Verification(5) Job 1 replace sid with graph 119904(6) Job 2 replace rid with graph 119903 and calculate GED for candidate pairs

Algorithm 1MGSJoin

(1)Map ⟨119903119894119889 119903⟩ ⟨119904119894119889 119904⟩ rarr ⟨119903119894119889 119904119887119891119903⟩ ⟨119904119894119889 119904119887119891

119904⟩

(2) create SBF for each graph and insert the generated 119902-gram signatures into it(3) output ⟨119903119894119889 119904119887119891

119903⟩ or ⟨119904119894119889 119904119887119891

119904⟩

(4) Shuffle conduct Cartesian product between pairs ⟨119903119894119889 119904119887119891119903⟩ and pairs ⟨119904119894119889 119904119887119891

119904⟩

(5) Reduce ⟨⟨119903119894119889 119904119887119891119903⟩ 119897119894119904119905 (⟨119904119894119889 119904119887119891

119904⟩)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(6) for any ⟨119903119894119889 119904119894119889⟩ calculate the intersection of 119904119887119891119903and 119904119887119891

119904and estimate

the number of common signatures(7) invoke the count filtering for each pair ⟨119903119894119889 119904119894119889⟩ and output it if it passes the condition

Algorithm 2 Replacement of filter of Algorithm 1

time complexity of Map task and communication cost ofShuffle are also 119874(|119892||119877|(120572|119878| + 1)) for the same reason aboveThen in Reduce task we first replace 119903119894119889 with graph 119903 andthen calculate GED for graph pairs The GED calculation by

theAlowast-based algorithm requires 119874(|119881||119881|

) In the worst case

the time complexity of Reduce is 119874(120572|119877||119878||119881||119881|

) Finallysimilar graph pairs are emitted into DFS which requires119874(120572120573|119877||119878|)

4 Incorporating Bloom Filters

In the filtering phase of Algorithm 1 twoMapReduce jobs arerequired with many intermediate key-value pairs generatedand transmitted These increase the IO and communicationcost which can be fairly time-consuming This sectionintroduces the Bloom filter technique to reduce such costNext we first recall the concept of spectral Bloom filters

41 Spectral Bloom Filter Bloom filters [12] are space efficientdata structures which allow fast membership queries over agiven set A Bloom filter uses 119896 hash functions ℎ

1 ℎ2 ℎ

119896

to hash elements into an array of size 119898 For an element 119890 inthe set the bit at positions ℎ

1(119890) ℎ2(119890) ℎ

119896(119890) in the array is

set to 1 Given a query item 119902 we check its membership in theset by examining the bits at positions ℎ

1(119902) ℎ2(119902) ℎ

119896(119902)

of the array The item 119902 is reported to be contained in theset if (and only if) all the aforementioned bits are 1 Thismethod brings a small probability of false-positive that isit may return a positive result for an item which actually isnot contained in the set but no false-negative while gainingsubstantial space savings [13]

Spectral Bloom filter (SBF) [14] generalized the basicBloom filter to be able to record the element frequency

which is thus adopted in this paper SBF is represented by⟨119860 119891⟩ where 119860 is a set and 119891 is a map from 119860 to naturalnumbers that is 119891 119860 rarr 119873 where 119873 is the universe ofnatural numbers SBF replaces the bit vector with a vectorof 119898 counters For insertion of item 119890 the counters 119862 =

119862ℎ1(119890)

119862ℎ2(119890)

119862ℎ119896(119890)

are increased by 1 for insertion anddecreased by 1 for deletion Let 119891(119902) denote the frequency of119902 A basic query for SBF on an item 119902 returns an estimationon 119891(119902) that is 119891(119902) = min

119894isin1119896119862ℎ119894(119902)

Note that similarto Bloom filters SBF also never underestimate 119891(119902)

42 Algorithm Incorporating SBF not only reduces the num-ber of key-value pairs but also contracts the two MapReducejobs into one in the filtering phase In particular the Maptask takes graph sets 119877 and 119878 as input Then we createa SBF for each graph by adding the 119902-gram signaturesCartesian product is conducted for the output key-valuepairs ⟨119903119894119889 119904119887119891

119903⟩ and ⟨119904119894119889 119904119887119891

119904⟩ In the Reduce task for

each graph pair their SBFs are intersected to estimate thenumber of common signatures By intersecting two SBFswithcounters 119862

(1) and 119862(2) respectively it returns another SBF

with counters119862lowast119862lowast119894

= min119862(1)

119894 119862(2)

119894 119894 isin 1 119898 Hence

the number of common signatures could be estimated bylfloor(1119896) sum

119898

119894=1119862lowast

119894rfloor Subsequently the graph pairs which have

less than 119871119861 common signatures given by count filteringcondition will be discarded and the remaining pairs formthe candidate set

We provide the pseudocode of the aforementioned pro-cess in Algorithm 2

43 Correctness and Complexity Analysis There is smallprobability that the false positive case happens with SBFSpecifically a query for an item 119902 in SBF on 119891(119902) 119891(119902)

6 The Scientific World Journal

Table 1 Complexity analysis of filtering

Phase Job 1 Job 2 +SBF

Map IO 119874 (1003816100381610038161003816119892

1003816100381610038161003816 sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 sdot (|119877| + |119878|))

Time 119874 ((|119881|119902

sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119896|119881|119902

sdot (|119877| + |119878|))

Shuffle 119874 ((|119881|119902

sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119898 |119877| |119878|)

Reduce IO 119874(|119877||119878|) 119874 (120572 |119877| |119878|) 119874 (120572 |119877| |119878|)

Time 119874 ((|119881|119902

sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (|119877| |119878|)

may be larger than 119891(119902) Therefore the number of commonsignatures estimated this way may be larger than the actualvalue Nonetheless false-negative will never happen whichensures the correctness of the algorithm in a certain case thepruning power of count filtering will be impaired Besidesfalse-positive will be less likely to happen if one carefullychooses the hash functions and configures the sizes ofcounters

Then we analyze the complexity Let 119898 be the size ofa SBF The Map task reads the entire sets 119877 and 119878 so itsIO cost is 119874(|119892|(|119877| + |119878|)) Then signatures are generatedand added to SBF and 119896 hash values are calculated for eachsignatureThus the time complexity forMap is119874(119896|119881|

119902

(|119877|+

|119878|)) Map emits the SBF for each input graph and thenCartesian product is conductedThe communication cost forShuffle is119874(119898

2

|119877||119878|) In the Reduce task for each graph pairthe number of common signatures is calculated and countfiltering condition is checked Thus the time complexity is119874(|119877||119878|) Regardless of false-positive cases the IO cost is thesame as before Denoting the improved algorithm by ldquo+SBFrdquowe summarize the complexity results in Table 1

5 Optimizing Verification Phase

In verification phase we need to calculate the GED ofcandidate pairs Nevertheless it is not capable of finishingthe calculation with large graphs and threshold 120591 Thereforewe devise a MapReduce-based method for GED calculationwhich is able of handling large-scale graphs Besides joinoperations are required preceding the GED calculation to getthe entire graphThus three relations 119877(119903119894119889 119903)⋈119862(119903119894119889 119904119894119889)⋈

119878(119904119894119889 119904) are joined to obtain the input of the GED algorithmwhere 119862(119903119894119889 119904119894119889) is the output of the filtering phase Inspiredby the idea of multiway join we can reduce the number ofrequired MapReduce jobs from two to one

51 MapReduce for GED Calculation TheGED calculation isbased onAlowast algorithmAlowast constructs a search-tree the nodeof which represents a mapping status A mapping status isstored in an array (denoted by 119909) the index of which standsfor different vertices in graph 119903 and the corresponding valuestands for the vertices of graph 119904Alowast explores the space of allpossible vertex mappings between two graphs in a best-firstsearch fashion with function (denoted by 119891(119909)) establishedto determine the order in which the search visits vertexmappings 119891(119909) is a sum of two functions (1) the distancefrom the initial state to the current state (denoted by 119892(119909))

(2) a heuristic estimate of the distance from the current stateto the goal (denoted by ℎ(119909)) 119892(119909) and ℎ(119909) are calculated bythe following equations

119892 (119909) = GED (119903119901 119904119901)

ℎ (119909) = Γ (119871119881

(119903119901) 119871119881

(119904119902)) + Γ (119871

119864(119903119902) 119871119864

(119904119902))

(2)

119903119901consists of the vertices that have been mapped and edges

connecting them while 119903119902consists of the vertices unmapped

yet as well as their resident edges The equation for ℎ(119909)

represents the label difference of two graphsThe search space forAlowast-based approach is very large

requiring 119874(|119881||119881|

) In order to boost up the searchingprocedure parallelization is a common way to think aboutThe naive way is to allocate different branches of search-tree to different workers so that the searching procedure canproceed in parallel However the load is not balanced this wayso that the final runtime is determined on the worker withthe heaviest load As a consequence we devise aMapReduce-based method to calculate GED denoted by MRGED whichreallocates the works after each MapReduce round

The format of the key-value pairs to be manipulated byMapReduce is ⟨119909 119891(119909)⟩ The procedure of searching for theresult is through iterations that each iterationwalks down onelayer of the search-tree

52 Multiway Join In relational database a multiway joincan process 119879

1(119860 119861) ⋈ 119879

2(119861 119862) ⋈ 119879

3(119862 119863) together in one

round where 119879119894(119883 119884) is a relational table with attributes

119883 and 119884 (Algorithm 3) Following the same idea we canconsolidate the twoMapReduce jobs required for verificationin Algorithm 1 Specifically let ℎ be a hash function withrange 1 2 119899 where 119899

2 is the number of Reducers Weassociate eachReduce taskwith a pair (119894 119895) where 119894 119895 isin [1 119899]Each tuple 119905

2(119887 119888) isin 119879

2(119861 119862) is sent to theReducer numbered

(ℎ(119887) ℎ(119888)) while each tuple in1198791(119886 119887) (resp119879

3(119888 119889)) is sent

to Reducers numbered (ℎ(119887) 119909) (resp (119910 ℎ(119888))) for any 119909

(resp 119910) Each Reduce task computes the join of the tuplesit receives It is shown that multiway join is more efficient inpractice than two simple joins [15]

We encapsulate the improved verification procedure inAlgorithm 4

53 Correctness and Complexity Analysis One may immedi-ately verify that Algorithm 4 correctly conducts the verifica-tion

The Scientific World Journal 7

Input 119903 119904 are graphs from candidate pair ⟨119903 119904⟩

Output Graph edit distance of 119903 and 119904

(1) construct the search tree with the root node ⟨0 0 0 0⟩(2) read nodes into Mapper if the node read in is fully mapped namely 119903

119902is empty go to line (8)

(3)Map ⟨119909 119891(119909)⟩ rarr ⟨119909 119891(119909)⟩

(4) calculate 119891(119909) for node 119909 and check whether 119891(119909) gt 120591 if it is continue(5) create new nodes by mapping a vertex V

1isin 119903119902(119909) to a vertex V

2isin 119904119902(119909) for any V

1and V

2

(6) output all the newly created nodes(7) back to line (2)(8) gather the nodes calculate the minimum 119891(119909) and return it

Algorithm 3 MRGED (119903 119904)

(1)Map ⟨119903119894119889 119903⟩ ⟨119903119894119889 119904119894119889⟩ ⟨119904119894119889 119904⟩

(2) if ⟨119903119894119889 119903⟩ then emit(⟨(ℎ(119903119894119889) 119909) ⟨119903119894119889 119903⟩⟩) for any 119909 if ⟨119903119894119889 119904119894119889⟩ thenemit(⟨(ℎ (119903119894119889) ℎ (119904119894119889)) ⟨119903119894119889 119904⟩⟩) for any 119910 else emit(⟨(119910 ℎ(119904119894119889)) ⟨119904119894119889 119904⟩⟩)

(3) Reduce ⟨(ℎ(119903119894119889) ℎ(119904119894119889)) 119897119894119904119905(V119886119897119906119890)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(4) split 119897119894119904119905(V119886119897119906119890) into 119897119894119904119905 (⟨119903119894119889 119903⟩) 119897119894119904119905 (⟨119904119894119889 119904⟩) and 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩)(5) denote 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩) as joinkey(6) for ⟨119903119894119889 119904119894119889⟩ isin 119895119900119894119899119896119890119910119904 119908ℎ119890119903119890 ⟨119903119894119889 119903⟩ isin 119897119894119904119905 (⟨119903119894119889 119903⟩) 119886119899119889 ⟨119904119894119889 119904⟩ isin 119897119894119904119905 (⟨119904119894119889 119904⟩) do(7) if 119892119903119886119901ℎ119904 119903 119886119899119889 119904 119894119899 ⟨119903 119904⟩ 119886119903119890 V119890119903119910 119897119886119903119892119890 then invoke MRGED

calculation else invoke the Alowast-based in-memory algorithm

Algorithm 4 Replacement of verification of Algorithm 1

In the multiway join based verification phase the Maptask takes graph sets 119877 and 119878 and the candidate pairs repre-sented by their 119894119889s as input The input IO cost is 119874(|119892|(|119877| +

|119878|) + 120572|119877||119878|) The key-value pair ⟨119903119894119889 119903⟩ (resp ⟨119904119894119889 119904⟩) issent to 119899 Reduce tasks numbered (119903119894119889 119909) (resp (119910 119904119894119889)) forany119909 (resp119910) whereas the key-value pair ⟨119903119894119889 119904119894119889⟩ is sent tothe only Reduce task numbered (ℎ(119903119894119889) ℎ(119904119894119889)) As a resultthe communication cost for Shuffle is 119874(119899|119892|(|119877| + |119878|) +

120572|119877||119878|) where 1198992 is the number of Reducers In the Reduce

task all candidate graph pairs go through edit distancecomputation Pairs with larger size go through MRGEDwhile pairs with smaller size go throughAlowast For simplicitythe complexity of GED calculation is regarded the same asthe baseline algorithm Labelling the resulting algorithmwithldquo+MJrdquo we summarize the complexity results in Table 2

6 Experiments

61 Experiment Setup We conducted experiments on severalpublicly available real datasets but only present the resultson Pubchem (httppubchemncbinlmnihgov) due to theinterest of space The dataset is constructed by sampling1000000 graphs from Pubchem

Amazon cloud services were used as our experimentplatform Specifically we used Elastic Compute Cloud (EC2)in which the computing nodes are called instances In theexperiment 31 instances were used by defaultmdashone set asmaster node and others as worker nodes The standardconfiguration of all EC2 instances is m1small one CPUof single core with 17 GB memory running Hadoop 112(Table 3)

62 Evaluating Filters In order to evaluate the effectivenessof our filtering techniques we use the term ldquoBasicrdquo for thebaseline algorithm for processing graph similarity joins basedonMapReduce ldquo+SBFrdquo denotes filtering improved algorithmof Basic by incorporating SBF

The algorithm efficiency has been studied and shownin Figure 2(a) However the pruning power is somewhatimpaired so we conducted the experiment to record theincrease of candidate pairs through ldquo+SBFrdquo (cf Figure 2(b))It can be revealed that ldquo+SBFrdquo outweigh ldquoBasicrdquo in efficiencyby sacrificing little pruning power When 120591 = 5 less than 300more candidate pairs are generatedwhile about 5000 secondsare reserved

63 Evaluating Verification The verification was evaluatedwith candidate pairs generated by Basic Term ldquo+MJrdquodenotes applying multiway join in verification whileldquo+MRGEDrdquo denotes adapting alternative MapReduce-basedGED calculation We use term ldquoMGSJoinrdquo to indicate thebasic algorithm incorporating both techniques Figure 3(a)shows the runtime comparison between +MRGED

and MGSJoin The result illustrates the superiority ofapplying multiway join When 120591 = 5 algorithm withmultiway join is about 6000 seconds faster than withordinary joins Figure 3(b) presents the result of evaluatingMRGED where Basic and +MRGED are compared It canbe observed that the runtime of both algorithms growsexponentially When 120591 equals 1 the Basic finishes quickerthan +MRGED (162 s and 204 s resp) while 120591 equals 2 the+MRGED outweighs Basic (the runtime is 712 s and 580 sresp) This is because the calculation required for 120591 = 1

8 The Scientific World Journal

Table 2 Complexity analysis of verification

Phase Job 1 Job 2 +MJ

Map IO 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)

Time 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (119899 (|119877| + |119878|) + 120572 |119877| |119878|)

Shuffle 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1198991003816100381610038161003816119892

1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)

Reduce IO 119874 (1205721003816100381610038161003816119892

1003816100381610038161003816 sdot |119877| |119878|) 119874 (120572120573 |119877| |119878|) 119874 (120572120573 |119877| |119878|)

Time 119874 (120572 |119877| |119878|) 119874 (120572|119877||119878|119881||119881|

) 119874 (120572|119877||119878|119881||119881|

)

0 1 2 3 4 5

Elap

sed

time (

s)

GED threshold (120591)

Basic

+SBF

15k

12k

9k

6k

3k

(a)

0

50

100

150

200

250

300

1 2 3 4 5GED threshold (120591)

Num

ber o

f Δ (c

andi

date

pai

rs)

(b)

Figure 2 Evaluating filtering

Table 3 Dataset statistics

Dataset |119877| |119881| |119864| Disk size (GB)Enamine 1000000 5232 5037 07

is small where the MRGED is clumsy compared withAlowastwhereas when 120591 = 2 more calculation is required so that theadvantage of MRGED comes out When 120591 is within the rangeof 3ndash5 Basic is unable to finish because the large calculationdrives Basic out of memory

64 Comparing with State-of-the-Art Method We comparedour algorithm with the state-of-the-art method GSimJoinIn Figure 4(a) we chose 10000 graphs in order to comparewith GSimJoin and the result witnesses the obvious supe-riority of MGSJoin over GSimJoin Figure 4(b) is drawn inlog scale which varies the number of graphs and records theelapsed time Both algorithms grow linear in the figure whichreflect their exponential growth MGSJoin rises mush slowerthan GSimJoin The runtime of GSimJoin is about 10 timeslonger than MGSJoinwhen joining 100 graphs and 100 timeslonger than MGSJoinwhen joining 10000 graphsMoreoverwhen we have to join 100000 graphs GSimJoin is runningout of memory so that no result is recorded

65 Speedup We evaluated the speedup of our algorithm byvarying the number of instances from 10 to 50 The exper-imental results are shown in Figure 5(a) We can see that

with the increase of instances in the cluster the performanceof MGSJoin significantly improved The improvement issignificantly shown when threshold 120591 equals 4 With moreinstances running the count filtering is getting faster bycounting for the common signatures simultaneously andthe verification is getting quicker by joining relations andcalculating GED in parallel

66 Scale-Up We evaluated the scale-up of our algorithmby increasing both dataset sizes and number of nodes in thecluster The result is shown in Figure 5(b) It is worth notingthat as the dataset increased the results for different values of120591 get similar in the trend of increase All lines rise smoothlywhich reveals good scalability of MGSJoin

7 Related Work

Graph SimilarityQueries Similarity joins retrieve similar dataobject pairs which can be strings sets trees and graphs [16]As to GED-based graph similarity search [17] proposed 120581-AT a tree-based 119902-gram approach However it is associatedwith the drawback of usually loose lower bound for countfiltering Seeing the drawback [9] presented a path-based 119902-gram approach In comparison with 120581-AT GSimJoin is moreefficient by leveraging more advanced filtering techniquesThus we adopt the path-based 119902-gram approach in this paper

MapReduce-Based Graph Algorithms MapReduce is a dis-tributed programming framework [10] which has been

The Scientific World Journal 9

Elap

sed

time (

s)

1 2 3 4 5GED threshold (120591)

16k

12k

8k

4k

MGSJoin

+MRGED

(a)

Elap

sed

time (

s)

Basic

+MR

105

104

103

102

1 2 3 4 5GED threshold (120591)

(b)

Figure 3 Evaluating verification

0

Elap

sed

time (

s)

GSimJoin

MGSJoin

1 2 3 4 5GED threshold (120591)

5k

4k

3k

2k

1k

(a)

Elap

sed

time (

s)

Number of graphs

GSimJoin

MGSJoin

105

104

103

102

102

101

103

104

106

105

(b)

Figure 4 Comparing with state-of-the-art method

2k

4k

6k

8k

10k

504540353025201510

Elap

sed

time

Number of instances

120591 = 1

120591 = 2

120591 = 3

120591 = 4

(a)

010 20 30 40 50

Elap

sed

time (

s)

Number of instances and number of graphs (times2000)

5k

4k

3k

2k

1k

(b)

Figure 5 Speedup and Scale-up

10 The Scientific World Journal

applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce

Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]

8 Conclusion

In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs

As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)

References

[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002

[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004

[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006

[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005

[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008

[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983

[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983

[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979

[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012

[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA

[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970

[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004

[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003

[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010

[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012

[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012

The Scientific World Journal 11

[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009

[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011

[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011

[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011

[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012

[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013

[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013

[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010

[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011

[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012

[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013

[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 6: Research Article Efficient and Scalable Graph …downloads.hindawi.com/journals/tswj/2014/749028.pdfResearch Article Efficient and Scalable Graph Similarity Joins in MapReduce YifanChen,

6 The Scientific World Journal

Table 1 Complexity analysis of filtering

Phase Job 1 Job 2 +SBF

Map IO 119874 (1003816100381610038161003816119892

1003816100381610038161003816 sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 sdot (|119877| + |119878|))

Time 119874 ((|119881|119902

sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119896|119881|119902

sdot (|119877| + |119878|))

Shuffle 119874 ((|119881|119902

sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119898 |119877| |119878|)

Reduce IO 119874(|119877||119878|) 119874 (120572 |119877| |119878|) 119874 (120572 |119877| |119878|)

Time 119874 ((|119881|119902

sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (|119877| |119878|)

may be larger than 119891(119902) Therefore the number of commonsignatures estimated this way may be larger than the actualvalue Nonetheless false-negative will never happen whichensures the correctness of the algorithm in a certain case thepruning power of count filtering will be impaired Besidesfalse-positive will be less likely to happen if one carefullychooses the hash functions and configures the sizes ofcounters

Then we analyze the complexity Let 119898 be the size ofa SBF The Map task reads the entire sets 119877 and 119878 so itsIO cost is 119874(|119892|(|119877| + |119878|)) Then signatures are generatedand added to SBF and 119896 hash values are calculated for eachsignatureThus the time complexity forMap is119874(119896|119881|

119902

(|119877|+

|119878|)) Map emits the SBF for each input graph and thenCartesian product is conductedThe communication cost forShuffle is119874(119898

2

|119877||119878|) In the Reduce task for each graph pairthe number of common signatures is calculated and countfiltering condition is checked Thus the time complexity is119874(|119877||119878|) Regardless of false-positive cases the IO cost is thesame as before Denoting the improved algorithm by ldquo+SBFrdquowe summarize the complexity results in Table 1

5 Optimizing Verification Phase

In verification phase we need to calculate the GED ofcandidate pairs Nevertheless it is not capable of finishingthe calculation with large graphs and threshold 120591 Thereforewe devise a MapReduce-based method for GED calculationwhich is able of handling large-scale graphs Besides joinoperations are required preceding the GED calculation to getthe entire graphThus three relations 119877(119903119894119889 119903)⋈119862(119903119894119889 119904119894119889)⋈

119878(119904119894119889 119904) are joined to obtain the input of the GED algorithmwhere 119862(119903119894119889 119904119894119889) is the output of the filtering phase Inspiredby the idea of multiway join we can reduce the number ofrequired MapReduce jobs from two to one

51 MapReduce for GED Calculation TheGED calculation isbased onAlowast algorithmAlowast constructs a search-tree the nodeof which represents a mapping status A mapping status isstored in an array (denoted by 119909) the index of which standsfor different vertices in graph 119903 and the corresponding valuestands for the vertices of graph 119904Alowast explores the space of allpossible vertex mappings between two graphs in a best-firstsearch fashion with function (denoted by 119891(119909)) establishedto determine the order in which the search visits vertexmappings 119891(119909) is a sum of two functions (1) the distancefrom the initial state to the current state (denoted by 119892(119909))

(2) a heuristic estimate of the distance from the current stateto the goal (denoted by ℎ(119909)) 119892(119909) and ℎ(119909) are calculated bythe following equations

119892 (119909) = GED (119903119901 119904119901)

ℎ (119909) = Γ (119871119881

(119903119901) 119871119881

(119904119902)) + Γ (119871

119864(119903119902) 119871119864

(119904119902))

(2)

119903119901consists of the vertices that have been mapped and edges

connecting them while 119903119902consists of the vertices unmapped

yet as well as their resident edges The equation for ℎ(119909)

represents the label difference of two graphsThe search space forAlowast-based approach is very large

requiring 119874(|119881||119881|

) In order to boost up the searchingprocedure parallelization is a common way to think aboutThe naive way is to allocate different branches of search-tree to different workers so that the searching procedure canproceed in parallel However the load is not balanced this wayso that the final runtime is determined on the worker withthe heaviest load As a consequence we devise aMapReduce-based method to calculate GED denoted by MRGED whichreallocates the works after each MapReduce round

The format of the key-value pairs to be manipulated byMapReduce is ⟨119909 119891(119909)⟩ The procedure of searching for theresult is through iterations that each iterationwalks down onelayer of the search-tree

52 Multiway Join In relational database a multiway joincan process 119879

1(119860 119861) ⋈ 119879

2(119861 119862) ⋈ 119879

3(119862 119863) together in one

round where 119879119894(119883 119884) is a relational table with attributes

119883 and 119884 (Algorithm 3) Following the same idea we canconsolidate the twoMapReduce jobs required for verificationin Algorithm 1 Specifically let ℎ be a hash function withrange 1 2 119899 where 119899

2 is the number of Reducers Weassociate eachReduce taskwith a pair (119894 119895) where 119894 119895 isin [1 119899]Each tuple 119905

2(119887 119888) isin 119879

2(119861 119862) is sent to theReducer numbered

(ℎ(119887) ℎ(119888)) while each tuple in1198791(119886 119887) (resp119879

3(119888 119889)) is sent

to Reducers numbered (ℎ(119887) 119909) (resp (119910 ℎ(119888))) for any 119909

(resp 119910) Each Reduce task computes the join of the tuplesit receives It is shown that multiway join is more efficient inpractice than two simple joins [15]

We encapsulate the improved verification procedure inAlgorithm 4

53 Correctness and Complexity Analysis One may immedi-ately verify that Algorithm 4 correctly conducts the verifica-tion

The Scientific World Journal 7

Input 119903 119904 are graphs from candidate pair ⟨119903 119904⟩

Output Graph edit distance of 119903 and 119904

(1) construct the search tree with the root node ⟨0 0 0 0⟩(2) read nodes into Mapper if the node read in is fully mapped namely 119903

119902is empty go to line (8)

(3)Map ⟨119909 119891(119909)⟩ rarr ⟨119909 119891(119909)⟩

(4) calculate 119891(119909) for node 119909 and check whether 119891(119909) gt 120591 if it is continue(5) create new nodes by mapping a vertex V

1isin 119903119902(119909) to a vertex V

2isin 119904119902(119909) for any V

1and V

2

(6) output all the newly created nodes(7) back to line (2)(8) gather the nodes calculate the minimum 119891(119909) and return it

Algorithm 3 MRGED (119903 119904)

(1)Map ⟨119903119894119889 119903⟩ ⟨119903119894119889 119904119894119889⟩ ⟨119904119894119889 119904⟩

(2) if ⟨119903119894119889 119903⟩ then emit(⟨(ℎ(119903119894119889) 119909) ⟨119903119894119889 119903⟩⟩) for any 119909 if ⟨119903119894119889 119904119894119889⟩ thenemit(⟨(ℎ (119903119894119889) ℎ (119904119894119889)) ⟨119903119894119889 119904⟩⟩) for any 119910 else emit(⟨(119910 ℎ(119904119894119889)) ⟨119904119894119889 119904⟩⟩)

(3) Reduce ⟨(ℎ(119903119894119889) ℎ(119904119894119889)) 119897119894119904119905(V119886119897119906119890)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(4) split 119897119894119904119905(V119886119897119906119890) into 119897119894119904119905 (⟨119903119894119889 119903⟩) 119897119894119904119905 (⟨119904119894119889 119904⟩) and 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩)(5) denote 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩) as joinkey(6) for ⟨119903119894119889 119904119894119889⟩ isin 119895119900119894119899119896119890119910119904 119908ℎ119890119903119890 ⟨119903119894119889 119903⟩ isin 119897119894119904119905 (⟨119903119894119889 119903⟩) 119886119899119889 ⟨119904119894119889 119904⟩ isin 119897119894119904119905 (⟨119904119894119889 119904⟩) do(7) if 119892119903119886119901ℎ119904 119903 119886119899119889 119904 119894119899 ⟨119903 119904⟩ 119886119903119890 V119890119903119910 119897119886119903119892119890 then invoke MRGED

calculation else invoke the Alowast-based in-memory algorithm

Algorithm 4 Replacement of verification of Algorithm 1

In the multiway join based verification phase the Maptask takes graph sets 119877 and 119878 and the candidate pairs repre-sented by their 119894119889s as input The input IO cost is 119874(|119892|(|119877| +

|119878|) + 120572|119877||119878|) The key-value pair ⟨119903119894119889 119903⟩ (resp ⟨119904119894119889 119904⟩) issent to 119899 Reduce tasks numbered (119903119894119889 119909) (resp (119910 119904119894119889)) forany119909 (resp119910) whereas the key-value pair ⟨119903119894119889 119904119894119889⟩ is sent tothe only Reduce task numbered (ℎ(119903119894119889) ℎ(119904119894119889)) As a resultthe communication cost for Shuffle is 119874(119899|119892|(|119877| + |119878|) +

120572|119877||119878|) where 1198992 is the number of Reducers In the Reduce

task all candidate graph pairs go through edit distancecomputation Pairs with larger size go through MRGEDwhile pairs with smaller size go throughAlowast For simplicitythe complexity of GED calculation is regarded the same asthe baseline algorithm Labelling the resulting algorithmwithldquo+MJrdquo we summarize the complexity results in Table 2

6 Experiments

61 Experiment Setup We conducted experiments on severalpublicly available real datasets but only present the resultson Pubchem (httppubchemncbinlmnihgov) due to theinterest of space The dataset is constructed by sampling1000000 graphs from Pubchem

Amazon cloud services were used as our experimentplatform Specifically we used Elastic Compute Cloud (EC2)in which the computing nodes are called instances In theexperiment 31 instances were used by defaultmdashone set asmaster node and others as worker nodes The standardconfiguration of all EC2 instances is m1small one CPUof single core with 17 GB memory running Hadoop 112(Table 3)

62 Evaluating Filters In order to evaluate the effectivenessof our filtering techniques we use the term ldquoBasicrdquo for thebaseline algorithm for processing graph similarity joins basedonMapReduce ldquo+SBFrdquo denotes filtering improved algorithmof Basic by incorporating SBF

The algorithm efficiency has been studied and shownin Figure 2(a) However the pruning power is somewhatimpaired so we conducted the experiment to record theincrease of candidate pairs through ldquo+SBFrdquo (cf Figure 2(b))It can be revealed that ldquo+SBFrdquo outweigh ldquoBasicrdquo in efficiencyby sacrificing little pruning power When 120591 = 5 less than 300more candidate pairs are generatedwhile about 5000 secondsare reserved

63 Evaluating Verification The verification was evaluatedwith candidate pairs generated by Basic Term ldquo+MJrdquodenotes applying multiway join in verification whileldquo+MRGEDrdquo denotes adapting alternative MapReduce-basedGED calculation We use term ldquoMGSJoinrdquo to indicate thebasic algorithm incorporating both techniques Figure 3(a)shows the runtime comparison between +MRGED

and MGSJoin The result illustrates the superiority ofapplying multiway join When 120591 = 5 algorithm withmultiway join is about 6000 seconds faster than withordinary joins Figure 3(b) presents the result of evaluatingMRGED where Basic and +MRGED are compared It canbe observed that the runtime of both algorithms growsexponentially When 120591 equals 1 the Basic finishes quickerthan +MRGED (162 s and 204 s resp) while 120591 equals 2 the+MRGED outweighs Basic (the runtime is 712 s and 580 sresp) This is because the calculation required for 120591 = 1

8 The Scientific World Journal

Table 2 Complexity analysis of verification

Phase Job 1 Job 2 +MJ

Map IO 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)

Time 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (119899 (|119877| + |119878|) + 120572 |119877| |119878|)

Shuffle 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1198991003816100381610038161003816119892

1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)

Reduce IO 119874 (1205721003816100381610038161003816119892

1003816100381610038161003816 sdot |119877| |119878|) 119874 (120572120573 |119877| |119878|) 119874 (120572120573 |119877| |119878|)

Time 119874 (120572 |119877| |119878|) 119874 (120572|119877||119878|119881||119881|

) 119874 (120572|119877||119878|119881||119881|

)

0 1 2 3 4 5

Elap

sed

time (

s)

GED threshold (120591)

Basic

+SBF

15k

12k

9k

6k

3k

(a)

0

50

100

150

200

250

300

1 2 3 4 5GED threshold (120591)

Num

ber o

f Δ (c

andi

date

pai

rs)

(b)

Figure 2 Evaluating filtering

Table 3 Dataset statistics

Dataset |119877| |119881| |119864| Disk size (GB)Enamine 1000000 5232 5037 07

is small where the MRGED is clumsy compared withAlowastwhereas when 120591 = 2 more calculation is required so that theadvantage of MRGED comes out When 120591 is within the rangeof 3ndash5 Basic is unable to finish because the large calculationdrives Basic out of memory

64 Comparing with State-of-the-Art Method We comparedour algorithm with the state-of-the-art method GSimJoinIn Figure 4(a) we chose 10000 graphs in order to comparewith GSimJoin and the result witnesses the obvious supe-riority of MGSJoin over GSimJoin Figure 4(b) is drawn inlog scale which varies the number of graphs and records theelapsed time Both algorithms grow linear in the figure whichreflect their exponential growth MGSJoin rises mush slowerthan GSimJoin The runtime of GSimJoin is about 10 timeslonger than MGSJoinwhen joining 100 graphs and 100 timeslonger than MGSJoinwhen joining 10000 graphsMoreoverwhen we have to join 100000 graphs GSimJoin is runningout of memory so that no result is recorded

65 Speedup We evaluated the speedup of our algorithm byvarying the number of instances from 10 to 50 The exper-imental results are shown in Figure 5(a) We can see that

with the increase of instances in the cluster the performanceof MGSJoin significantly improved The improvement issignificantly shown when threshold 120591 equals 4 With moreinstances running the count filtering is getting faster bycounting for the common signatures simultaneously andthe verification is getting quicker by joining relations andcalculating GED in parallel

66 Scale-Up We evaluated the scale-up of our algorithmby increasing both dataset sizes and number of nodes in thecluster The result is shown in Figure 5(b) It is worth notingthat as the dataset increased the results for different values of120591 get similar in the trend of increase All lines rise smoothlywhich reveals good scalability of MGSJoin

7 Related Work

Graph SimilarityQueries Similarity joins retrieve similar dataobject pairs which can be strings sets trees and graphs [16]As to GED-based graph similarity search [17] proposed 120581-AT a tree-based 119902-gram approach However it is associatedwith the drawback of usually loose lower bound for countfiltering Seeing the drawback [9] presented a path-based 119902-gram approach In comparison with 120581-AT GSimJoin is moreefficient by leveraging more advanced filtering techniquesThus we adopt the path-based 119902-gram approach in this paper

MapReduce-Based Graph Algorithms MapReduce is a dis-tributed programming framework [10] which has been

The Scientific World Journal 9

Elap

sed

time (

s)

1 2 3 4 5GED threshold (120591)

16k

12k

8k

4k

MGSJoin

+MRGED

(a)

Elap

sed

time (

s)

Basic

+MR

105

104

103

102

1 2 3 4 5GED threshold (120591)

(b)

Figure 3 Evaluating verification

0

Elap

sed

time (

s)

GSimJoin

MGSJoin

1 2 3 4 5GED threshold (120591)

5k

4k

3k

2k

1k

(a)

Elap

sed

time (

s)

Number of graphs

GSimJoin

MGSJoin

105

104

103

102

102

101

103

104

106

105

(b)

Figure 4 Comparing with state-of-the-art method

2k

4k

6k

8k

10k

504540353025201510

Elap

sed

time

Number of instances

120591 = 1

120591 = 2

120591 = 3

120591 = 4

(a)

010 20 30 40 50

Elap

sed

time (

s)

Number of instances and number of graphs (times2000)

5k

4k

3k

2k

1k

(b)

Figure 5 Speedup and Scale-up

10 The Scientific World Journal

applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce

Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]

8 Conclusion

In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs

As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)

References

[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002

[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004

[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006

[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005

[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008

[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983

[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983

[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979

[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012

[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA

[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970

[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004

[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003

[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010

[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012

[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012

The Scientific World Journal 11

[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009

[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011

[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011

[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011

[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012

[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013

[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013

[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010

[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011

[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012

[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013

[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 7: Research Article Efficient and Scalable Graph …downloads.hindawi.com/journals/tswj/2014/749028.pdfResearch Article Efficient and Scalable Graph Similarity Joins in MapReduce YifanChen,

The Scientific World Journal 7

Input 119903 119904 are graphs from candidate pair ⟨119903 119904⟩

Output Graph edit distance of 119903 and 119904

(1) construct the search tree with the root node ⟨0 0 0 0⟩(2) read nodes into Mapper if the node read in is fully mapped namely 119903

119902is empty go to line (8)

(3)Map ⟨119909 119891(119909)⟩ rarr ⟨119909 119891(119909)⟩

(4) calculate 119891(119909) for node 119909 and check whether 119891(119909) gt 120591 if it is continue(5) create new nodes by mapping a vertex V

1isin 119903119902(119909) to a vertex V

2isin 119904119902(119909) for any V

1and V

2

(6) output all the newly created nodes(7) back to line (2)(8) gather the nodes calculate the minimum 119891(119909) and return it

Algorithm 3 MRGED (119903 119904)

(1)Map ⟨119903119894119889 119903⟩ ⟨119903119894119889 119904119894119889⟩ ⟨119904119894119889 119904⟩

(2) if ⟨119903119894119889 119903⟩ then emit(⟨(ℎ(119903119894119889) 119909) ⟨119903119894119889 119903⟩⟩) for any 119909 if ⟨119903119894119889 119904119894119889⟩ thenemit(⟨(ℎ (119903119894119889) ℎ (119904119894119889)) ⟨119903119894119889 119904⟩⟩) for any 119910 else emit(⟨(119910 ℎ(119904119894119889)) ⟨119904119894119889 119904⟩⟩)

(3) Reduce ⟨(ℎ(119903119894119889) ℎ(119904119894119889)) 119897119894119904119905(V119886119897119906119890)⟩ rarr ⟨119903119894119889 119904119894119889⟩

(4) split 119897119894119904119905(V119886119897119906119890) into 119897119894119904119905 (⟨119903119894119889 119903⟩) 119897119894119904119905 (⟨119904119894119889 119904⟩) and 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩)(5) denote 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩) as joinkey(6) for ⟨119903119894119889 119904119894119889⟩ isin 119895119900119894119899119896119890119910119904 119908ℎ119890119903119890 ⟨119903119894119889 119903⟩ isin 119897119894119904119905 (⟨119903119894119889 119903⟩) 119886119899119889 ⟨119904119894119889 119904⟩ isin 119897119894119904119905 (⟨119904119894119889 119904⟩) do(7) if 119892119903119886119901ℎ119904 119903 119886119899119889 119904 119894119899 ⟨119903 119904⟩ 119886119903119890 V119890119903119910 119897119886119903119892119890 then invoke MRGED

calculation else invoke the Alowast-based in-memory algorithm

Algorithm 4 Replacement of verification of Algorithm 1

In the multiway join based verification phase the Maptask takes graph sets 119877 and 119878 and the candidate pairs repre-sented by their 119894119889s as input The input IO cost is 119874(|119892|(|119877| +

|119878|) + 120572|119877||119878|) The key-value pair ⟨119903119894119889 119903⟩ (resp ⟨119904119894119889 119904⟩) issent to 119899 Reduce tasks numbered (119903119894119889 119909) (resp (119910 119904119894119889)) forany119909 (resp119910) whereas the key-value pair ⟨119903119894119889 119904119894119889⟩ is sent tothe only Reduce task numbered (ℎ(119903119894119889) ℎ(119904119894119889)) As a resultthe communication cost for Shuffle is 119874(119899|119892|(|119877| + |119878|) +

120572|119877||119878|) where 1198992 is the number of Reducers In the Reduce

task all candidate graph pairs go through edit distancecomputation Pairs with larger size go through MRGEDwhile pairs with smaller size go throughAlowast For simplicitythe complexity of GED calculation is regarded the same asthe baseline algorithm Labelling the resulting algorithmwithldquo+MJrdquo we summarize the complexity results in Table 2

6 Experiments

61 Experiment Setup We conducted experiments on severalpublicly available real datasets but only present the resultson Pubchem (httppubchemncbinlmnihgov) due to theinterest of space The dataset is constructed by sampling1000000 graphs from Pubchem

Amazon cloud services were used as our experimentplatform Specifically we used Elastic Compute Cloud (EC2)in which the computing nodes are called instances In theexperiment 31 instances were used by defaultmdashone set asmaster node and others as worker nodes The standardconfiguration of all EC2 instances is m1small one CPUof single core with 17 GB memory running Hadoop 112(Table 3)

62 Evaluating Filters In order to evaluate the effectivenessof our filtering techniques we use the term ldquoBasicrdquo for thebaseline algorithm for processing graph similarity joins basedonMapReduce ldquo+SBFrdquo denotes filtering improved algorithmof Basic by incorporating SBF

The algorithm efficiency has been studied and shownin Figure 2(a) However the pruning power is somewhatimpaired so we conducted the experiment to record theincrease of candidate pairs through ldquo+SBFrdquo (cf Figure 2(b))It can be revealed that ldquo+SBFrdquo outweigh ldquoBasicrdquo in efficiencyby sacrificing little pruning power When 120591 = 5 less than 300more candidate pairs are generatedwhile about 5000 secondsare reserved

63 Evaluating Verification The verification was evaluatedwith candidate pairs generated by Basic Term ldquo+MJrdquodenotes applying multiway join in verification whileldquo+MRGEDrdquo denotes adapting alternative MapReduce-basedGED calculation We use term ldquoMGSJoinrdquo to indicate thebasic algorithm incorporating both techniques Figure 3(a)shows the runtime comparison between +MRGED

and MGSJoin The result illustrates the superiority ofapplying multiway join When 120591 = 5 algorithm withmultiway join is about 6000 seconds faster than withordinary joins Figure 3(b) presents the result of evaluatingMRGED where Basic and +MRGED are compared It canbe observed that the runtime of both algorithms growsexponentially When 120591 equals 1 the Basic finishes quickerthan +MRGED (162 s and 204 s resp) while 120591 equals 2 the+MRGED outweighs Basic (the runtime is 712 s and 580 sresp) This is because the calculation required for 120591 = 1

8 The Scientific World Journal

Table 2 Complexity analysis of verification

Phase Job 1 Job 2 +MJ

Map IO 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)

Time 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (119899 (|119877| + |119878|) + 120572 |119877| |119878|)

Shuffle 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1198991003816100381610038161003816119892

1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)

Reduce IO 119874 (1205721003816100381610038161003816119892

1003816100381610038161003816 sdot |119877| |119878|) 119874 (120572120573 |119877| |119878|) 119874 (120572120573 |119877| |119878|)

Time 119874 (120572 |119877| |119878|) 119874 (120572|119877||119878|119881||119881|

) 119874 (120572|119877||119878|119881||119881|

)

0 1 2 3 4 5

Elap

sed

time (

s)

GED threshold (120591)

Basic

+SBF

15k

12k

9k

6k

3k

(a)

0

50

100

150

200

250

300

1 2 3 4 5GED threshold (120591)

Num

ber o

f Δ (c

andi

date

pai

rs)

(b)

Figure 2 Evaluating filtering

Table 3 Dataset statistics

Dataset |119877| |119881| |119864| Disk size (GB)Enamine 1000000 5232 5037 07

is small where the MRGED is clumsy compared withAlowastwhereas when 120591 = 2 more calculation is required so that theadvantage of MRGED comes out When 120591 is within the rangeof 3ndash5 Basic is unable to finish because the large calculationdrives Basic out of memory

64 Comparing with State-of-the-Art Method We comparedour algorithm with the state-of-the-art method GSimJoinIn Figure 4(a) we chose 10000 graphs in order to comparewith GSimJoin and the result witnesses the obvious supe-riority of MGSJoin over GSimJoin Figure 4(b) is drawn inlog scale which varies the number of graphs and records theelapsed time Both algorithms grow linear in the figure whichreflect their exponential growth MGSJoin rises mush slowerthan GSimJoin The runtime of GSimJoin is about 10 timeslonger than MGSJoinwhen joining 100 graphs and 100 timeslonger than MGSJoinwhen joining 10000 graphsMoreoverwhen we have to join 100000 graphs GSimJoin is runningout of memory so that no result is recorded

65 Speedup We evaluated the speedup of our algorithm byvarying the number of instances from 10 to 50 The exper-imental results are shown in Figure 5(a) We can see that

with the increase of instances in the cluster the performanceof MGSJoin significantly improved The improvement issignificantly shown when threshold 120591 equals 4 With moreinstances running the count filtering is getting faster bycounting for the common signatures simultaneously andthe verification is getting quicker by joining relations andcalculating GED in parallel

66 Scale-Up We evaluated the scale-up of our algorithmby increasing both dataset sizes and number of nodes in thecluster The result is shown in Figure 5(b) It is worth notingthat as the dataset increased the results for different values of120591 get similar in the trend of increase All lines rise smoothlywhich reveals good scalability of MGSJoin

7 Related Work

Graph SimilarityQueries Similarity joins retrieve similar dataobject pairs which can be strings sets trees and graphs [16]As to GED-based graph similarity search [17] proposed 120581-AT a tree-based 119902-gram approach However it is associatedwith the drawback of usually loose lower bound for countfiltering Seeing the drawback [9] presented a path-based 119902-gram approach In comparison with 120581-AT GSimJoin is moreefficient by leveraging more advanced filtering techniquesThus we adopt the path-based 119902-gram approach in this paper

MapReduce-Based Graph Algorithms MapReduce is a dis-tributed programming framework [10] which has been

The Scientific World Journal 9

Elap

sed

time (

s)

1 2 3 4 5GED threshold (120591)

16k

12k

8k

4k

MGSJoin

+MRGED

(a)

Elap

sed

time (

s)

Basic

+MR

105

104

103

102

1 2 3 4 5GED threshold (120591)

(b)

Figure 3 Evaluating verification

0

Elap

sed

time (

s)

GSimJoin

MGSJoin

1 2 3 4 5GED threshold (120591)

5k

4k

3k

2k

1k

(a)

Elap

sed

time (

s)

Number of graphs

GSimJoin

MGSJoin

105

104

103

102

102

101

103

104

106

105

(b)

Figure 4 Comparing with state-of-the-art method

2k

4k

6k

8k

10k

504540353025201510

Elap

sed

time

Number of instances

120591 = 1

120591 = 2

120591 = 3

120591 = 4

(a)

010 20 30 40 50

Elap

sed

time (

s)

Number of instances and number of graphs (times2000)

5k

4k

3k

2k

1k

(b)

Figure 5 Speedup and Scale-up

10 The Scientific World Journal

applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce

Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]

8 Conclusion

In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs

As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)

References

[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002

[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004

[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006

[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005

[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008

[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983

[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983

[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979

[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012

[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA

[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970

[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004

[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003

[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010

[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012

[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012

The Scientific World Journal 11

[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009

[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011

[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011

[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011

[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012

[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013

[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013

[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010

[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011

[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012

[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013

[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 8: Research Article Efficient and Scalable Graph …downloads.hindawi.com/journals/tswj/2014/749028.pdfResearch Article Efficient and Scalable Graph Similarity Joins in MapReduce YifanChen,

8 The Scientific World Journal

Table 2 Complexity analysis of verification

Phase Job 1 Job 2 +MJ

Map IO 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)

Time 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (119899 (|119877| + |119878|) + 120572 |119877| |119878|)

Shuffle 119874 (120572 |119877| |119878| +1003816100381610038161003816119892

1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892

1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1198991003816100381610038161003816119892

1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)

Reduce IO 119874 (1205721003816100381610038161003816119892

1003816100381610038161003816 sdot |119877| |119878|) 119874 (120572120573 |119877| |119878|) 119874 (120572120573 |119877| |119878|)

Time 119874 (120572 |119877| |119878|) 119874 (120572|119877||119878|119881||119881|

) 119874 (120572|119877||119878|119881||119881|

)

0 1 2 3 4 5

Elap

sed

time (

s)

GED threshold (120591)

Basic

+SBF

15k

12k

9k

6k

3k

(a)

0

50

100

150

200

250

300

1 2 3 4 5GED threshold (120591)

Num

ber o

f Δ (c

andi

date

pai

rs)

(b)

Figure 2 Evaluating filtering

Table 3 Dataset statistics

Dataset |119877| |119881| |119864| Disk size (GB)Enamine 1000000 5232 5037 07

is small where the MRGED is clumsy compared withAlowastwhereas when 120591 = 2 more calculation is required so that theadvantage of MRGED comes out When 120591 is within the rangeof 3ndash5 Basic is unable to finish because the large calculationdrives Basic out of memory

64 Comparing with State-of-the-Art Method We comparedour algorithm with the state-of-the-art method GSimJoinIn Figure 4(a) we chose 10000 graphs in order to comparewith GSimJoin and the result witnesses the obvious supe-riority of MGSJoin over GSimJoin Figure 4(b) is drawn inlog scale which varies the number of graphs and records theelapsed time Both algorithms grow linear in the figure whichreflect their exponential growth MGSJoin rises mush slowerthan GSimJoin The runtime of GSimJoin is about 10 timeslonger than MGSJoinwhen joining 100 graphs and 100 timeslonger than MGSJoinwhen joining 10000 graphsMoreoverwhen we have to join 100000 graphs GSimJoin is runningout of memory so that no result is recorded

65 Speedup We evaluated the speedup of our algorithm byvarying the number of instances from 10 to 50 The exper-imental results are shown in Figure 5(a) We can see that

with the increase of instances in the cluster the performanceof MGSJoin significantly improved The improvement issignificantly shown when threshold 120591 equals 4 With moreinstances running the count filtering is getting faster bycounting for the common signatures simultaneously andthe verification is getting quicker by joining relations andcalculating GED in parallel

66 Scale-Up We evaluated the scale-up of our algorithmby increasing both dataset sizes and number of nodes in thecluster The result is shown in Figure 5(b) It is worth notingthat as the dataset increased the results for different values of120591 get similar in the trend of increase All lines rise smoothlywhich reveals good scalability of MGSJoin

7 Related Work

Graph SimilarityQueries Similarity joins retrieve similar dataobject pairs which can be strings sets trees and graphs [16]As to GED-based graph similarity search [17] proposed 120581-AT a tree-based 119902-gram approach However it is associatedwith the drawback of usually loose lower bound for countfiltering Seeing the drawback [9] presented a path-based 119902-gram approach In comparison with 120581-AT GSimJoin is moreefficient by leveraging more advanced filtering techniquesThus we adopt the path-based 119902-gram approach in this paper

MapReduce-Based Graph Algorithms MapReduce is a dis-tributed programming framework [10] which has been

The Scientific World Journal 9

Elap

sed

time (

s)

1 2 3 4 5GED threshold (120591)

16k

12k

8k

4k

MGSJoin

+MRGED

(a)

Elap

sed

time (

s)

Basic

+MR

105

104

103

102

1 2 3 4 5GED threshold (120591)

(b)

Figure 3 Evaluating verification

0

Elap

sed

time (

s)

GSimJoin

MGSJoin

1 2 3 4 5GED threshold (120591)

5k

4k

3k

2k

1k

(a)

Elap

sed

time (

s)

Number of graphs

GSimJoin

MGSJoin

105

104

103

102

102

101

103

104

106

105

(b)

Figure 4 Comparing with state-of-the-art method

2k

4k

6k

8k

10k

504540353025201510

Elap

sed

time

Number of instances

120591 = 1

120591 = 2

120591 = 3

120591 = 4

(a)

010 20 30 40 50

Elap

sed

time (

s)

Number of instances and number of graphs (times2000)

5k

4k

3k

2k

1k

(b)

Figure 5 Speedup and Scale-up

10 The Scientific World Journal

applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce

Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]

8 Conclusion

In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs

As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)

References

[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002

[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004

[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006

[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005

[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008

[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983

[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983

[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979

[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012

[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA

[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970

[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004

[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003

[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010

[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012

[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012

The Scientific World Journal 11

[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009

[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011

[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011

[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011

[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012

[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013

[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013

[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010

[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011

[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012

[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013

[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 9: Research Article Efficient and Scalable Graph …downloads.hindawi.com/journals/tswj/2014/749028.pdfResearch Article Efficient and Scalable Graph Similarity Joins in MapReduce YifanChen,

The Scientific World Journal 9

Elap

sed

time (

s)

1 2 3 4 5GED threshold (120591)

16k

12k

8k

4k

MGSJoin

+MRGED

(a)

Elap

sed

time (

s)

Basic

+MR

105

104

103

102

1 2 3 4 5GED threshold (120591)

(b)

Figure 3 Evaluating verification

0

Elap

sed

time (

s)

GSimJoin

MGSJoin

1 2 3 4 5GED threshold (120591)

5k

4k

3k

2k

1k

(a)

Elap

sed

time (

s)

Number of graphs

GSimJoin

MGSJoin

105

104

103

102

102

101

103

104

106

105

(b)

Figure 4 Comparing with state-of-the-art method

2k

4k

6k

8k

10k

504540353025201510

Elap

sed

time

Number of instances

120591 = 1

120591 = 2

120591 = 3

120591 = 4

(a)

010 20 30 40 50

Elap

sed

time (

s)

Number of instances and number of graphs (times2000)

5k

4k

3k

2k

1k

(b)

Figure 5 Speedup and Scale-up

10 The Scientific World Journal

applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce

Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]

8 Conclusion

In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs

As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)

References

[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002

[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004

[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006

[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005

[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008

[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983

[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983

[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979

[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012

[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA

[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970

[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004

[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003

[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010

[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012

[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012

The Scientific World Journal 11

[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009

[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011

[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011

[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011

[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012

[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013

[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013

[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010

[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011

[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012

[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013

[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 10: Research Article Efficient and Scalable Graph …downloads.hindawi.com/journals/tswj/2014/749028.pdfResearch Article Efficient and Scalable Graph Similarity Joins in MapReduce YifanChen,

10 The Scientific World Journal

applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce

Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]

8 Conclusion

In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs

As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)

References

[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002

[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004

[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006

[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005

[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008

[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983

[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983

[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979

[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012

[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA

[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970

[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004

[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003

[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010

[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012

[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012

The Scientific World Journal 11

[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009

[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011

[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011

[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011

[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012

[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013

[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013

[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010

[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011

[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012

[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013

[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 11: Research Article Efficient and Scalable Graph …downloads.hindawi.com/journals/tswj/2014/749028.pdfResearch Article Efficient and Scalable Graph Similarity Joins in MapReduce YifanChen,

The Scientific World Journal 11

[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009

[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011

[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011

[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011

[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012

[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013

[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013

[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010

[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011

[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012

[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013

[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 12: Research Article Efficient and Scalable Graph …downloads.hindawi.com/journals/tswj/2014/749028.pdfResearch Article Efficient and Scalable Graph Similarity Joins in MapReduce YifanChen,

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014