
Datenbank Spektrum
DOI 10.1007/s13222-014-0154-1

SCHWERPUNKTBEITRAG

Iterative Computation of Connected Graph Components with MapReduce

Lars Kolb · Ziad Sehili · Erhard Rahm

Received: 14 February 2014 / Accepted: 7 April 2014
© Springer-Verlag Berlin Heidelberg 2014

Abstract The use of the MapReduce framework for iterative graph algorithms is challenging. To achieve high performance it is critical to limit the amount of intermediate results as well as the number of necessary iterations. We address these issues for the important problem of finding connected components in large graphs. We analyze an existing MapReduce algorithm, CC-MR, and present techniques to improve its performance including a memory-based connection of subgraphs in the map phase. Our evaluation with several large graph datasets shows that the improvements can substantially reduce the amount of generated data by up to a factor of 8.8 and runtime by up to a factor of 3.5.

Keywords MapReduce · Hadoop · Connected graph components · Transitive closure

1 Introduction

Many Big Data applications require the efficient processing of very large graphs, e.g., for social networks or bibliographic datasets. In enterprise applications, there are also numerous interconnected entities such as customers, products, employees and associated business activities like quotations and invoices that can be represented in large graphs for improved analysis [17]. Finding connected graph components within

L. Kolb ( ) · Z. Sehili · E. Rahm
Institut für Informatik, Universität Leipzig,
PF 100920, 04009, Leipzig, Germany
e-mail: [email protected]

Z. Sehili
e-mail: [email protected]

E. Rahm
e-mail: [email protected]

such graphs is a fundamental step to cluster related entities or to find new patterns among entities. A connected component (CC) of an undirected graph is a maximal subgraph in which any two vertices are interconnected by a path. Fig. 1 shows a graph consisting of two CCs.

Efficiently analyzing large graphs with millions of vertices and edges requires parallel processing. MapReduce (MR) is a popular framework for parallel data processing in cluster environments providing scalability while largely hiding the complexity of a parallel system. We study the problem of efficiently computing the CCs of very large graphs with MR. Finding connected vertices is an inherently iterative process so that an MR-based implementation results in the repeated execution of an MR program where the output of an iteration serves as input for the next iteration. Although the MR framework might not be the best choice for iterative graph algorithms, the huge popularity of the freely available Apache Hadoop implementation makes it important to find efficient MR-based implementations. We implemented our approaches as part of our existing MR-based entity resolution framework [13, 14].

The computation of the transitive closure of a binary relation, expressed as a graph, is closely related to the CC problem. For entity resolution, it is a typical post-processing step to compute the transitive closure of matching entity pairs to find additional, indirectly matching entities. For an efficient MR-based computation of the transitive closure it is important to not explicitly determine all matching pairs which would result in an enormous amount of intermediate data to store and exchange between iterations. For instance, the left CC of Fig. 1 with nine entities results in a transitive closure consisting of 36 pairs (A, B), (B, C), . . ., (H, I). Instead, we only need to determine the set or cluster of matching entities, i.e. {A, B, C, D, E, F, G, H, I}, which all


Fig. 1 Example graph with two connected components

individual pairs can be easily derived from, if necessary. Determining these clusters is analogous to determining the CCs with the minimal number of edges, e.g. by choosing a CC/cluster representative, say A, and having an edge between A and any other element of the CC, resulting in the eight pairs (A, B), (A, C), . . ., (A, I) for our example. The algorithms we consider determine such star-like connected components.
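In general, a CC with n vertices contains n · (n − 1)/2 pairs in its explicit transitive closure but needs only n − 1 edges in such a star-like representation (36 vs. 8 edges for n = 9 in our example).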

For the efficient use of MapReduce for iterative algorithms, it is of critical importance to keep the number of iterations as small as possible because each iteration implies a significant amount of task scheduling and I/O overhead. Second, the amount of generated (intermediate) data should be minimized to reduce the I/O overhead per iteration. Furthermore, one may cache data needed in different iterations (similar to [4]). To this end, we propose and evaluate several enhancements over previous MR-based algorithms. Our specific contributions are as follows:

• We review existing approaches to determine the CCs of an undirected graph (Sect. 2) and present an efficient MR-based algorithm, CC-MR, (Sect. 3) as a basis for comparison.

• We propose several algorithmic extensions over CC-MR to reduce the number of iterations, the amount of intermediate data, and, ultimately, execution time (Sect. 4). A major improvement is to connect edges early during the map phase, so that the remaining work for the reduce phase and further iterations is lowered.

• We perform a comprehensive evaluation for several large datasets to analyze the proposed techniques in comparison with CC-MR (Sect. 5).

2 Related Work

2.1 Parallel Transitive Closure Computation

An early iterative algorithm, TCPO, to compute the transitive closure of a binary relation was proposed in the context of parallel database systems [22]. It relies on a distributed computation of join and union operations.

An improved version of TCPO proposed in [6] uses a double hashing technique to reduce the amount of data repartitioning in each round. Both approaches are parallel implementations of sequential iterative algorithms [3] which terminate after d iterations where d is the depth of the graph. The Smart algorithm [11] improves these sequential approaches by limiting the number of required iterations to log d + 1. Although not evaluated, a possible parallel MR implementation of the Smart algorithm was discussed in [1]. The proposed approach translates each iteration of the Smart algorithm into multiple MR jobs that must be executed sequentially. However, because MR relies on materializing (intermediate) results, the proposed approach is not feasible for large graphs.

2.2 Detection of Connected Components

Finding the connected components of a graph is a well studied problem. Traditional approaches have a linear runtime complexity and traverse the graph using depth first (or breadth first) search to discover connected components [21]. For the efficient handling of large graphs, parallel algorithms with logarithmic time complexity were proposed [2, 10, 20] (see [9] for a comparison). Those approaches rely on a shared memory system and are not applicable for the MR programming model which relies on shared nothing clusters. The authors of [5] proposed an algorithm for distributed memory cluster environments in which nodes communicate with each other to access remote memory. However, the MR framework is designed for independent parallel batch processing of disjoint data partitions.

An MR algorithm to detect CCs in graphs was proposed in [7]. The main drawback of this algorithm is that it needs three MapReduce jobs for each iteration. There are further approaches that strive to minimize the number of required iterations [12, 15]. All approaches are clearly outperformed by the CC-MR algorithm proposed in [19]. For a graph with depth d, CC-MR requires d iterations in the worst case but needs in practice only a logarithmic number of iterations according to [19].

This algorithm will be described in more detail in the following section. A very similar approach was independently proposed at the same time in [18]. The authors suggest four different algorithms of which the one with the best performance for large graphs corresponds to CC-MR.

Recent distributed graph processing frameworks like Google Pregel [16] rely on the Bulk Synchronous Parallel (BSP) paradigm which segments distributed computations into a sequence of supersteps consisting of a parallel computation phase followed by a data exchange phase and a synchronization barrier. BSP algorithms are generally considered to be more efficient for iterative graph algorithms


than MR, mainly due to the significantly smaller overhead per iteration. However, [18] showed that in a congested cluster, MR algorithms can outperform BSP algorithms for large graphs.

2.3 MapReduce

MapReduce (MR) is a programming model designed for parallelizing data-intensive computing in clusters [8]. MR implementations such as Hadoop rely on a distributed file system (DFS) that can be accessed by all nodes. Data is represented by key-value pairs and a computation is expressed employing two user-defined functions, map and reduce, which are processed by a fixed number of map and reduce tasks.

map: (key_in, val_in) → list(key_tmp, val_tmp)

reduce: (key_tmp, list(val_tmp)) → list(key_out, val_out)

For each intermediate key-value pair produced in the map phase, a target reduce task is determined by applying a partitioning function that operates on the pair's key. The reduce tasks first sort incoming pairs by their intermediate keys. The sorted pairs are then grouped and the reduce function is invoked on all adjacent pairs of the same group. This simple processing model supports an automatic parallel processing on partitioned data for many resource-intensive tasks.
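The following minimal sketch (our illustration, not code from the paper) shows how these two abstract signatures translate into Hadoop's old (0.20.x) API that is also used in our evaluation setup; the class names and the Text-typed edge representation are assumptions.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class EdgeIdentity {

  // map: (key_in, val_in) -> list(key_tmp, val_tmp)
  public static class IdentityEdgeMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    public void map(Text source, Text target,
                    OutputCollector<Text, Text> out, Reporter rep) throws IOException {
      out.collect(source, target);            // emit an intermediate key-value pair
    }
  }

  // reduce: (key_tmp, list(val_tmp)) -> list(key_out, val_out)
  public static class IdentityEdgeReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text source, Iterator<Text> targets,
                       OutputCollector<Text, Text> out, Reporter rep) throws IOException {
      while (targets.hasNext()) {
        out.collect(source, targets.next());  // values arrive grouped by key
      }
    }
  }
}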

3 The CC-MR Algorithm

The input of the CC-MR algorithm [19] is a graph (V, E) with a set of vertices V and a set of edges E ⊆ V × V. The goal of CC-MR is to transform the input graph into a set of star-like subgraphs by iteratively assigning each vertex to its "smallest" neighbor, using a total ordering of the vertices such as the lexicographic order of the vertex labels. During the computation of the CCs, CC-MR checks for each vertex v and its (current) adjacent vertices adj(v) whether v is the smallest of these vertices. If this is already the case (local max state), all u ∈ adj(v) are assigned to v. A subgraph in the local max state does already constitute a CC but there may be further vertices that belong to this CC but still need to be discovered. If v is not smaller than all its adjacent vertices, then there is a vertex u ∈ adj(v) with u < v. In this merge case, v and adj(v)\{u} are assigned to u so that the component of v becomes a component of u. The described steps are applied iteratively until no more merges occur.

Algorithm 1: CC-MR (reduce)
1  reduce(Vertex source, Iterator<Vertex> values)
2    locMaxState ← false;
3    first ← values.next();
4    if source.id < first.id then
5      locMaxState ← true;
6      output(source, first);           // Forward edge
7    last ← first;
8    while values.hasNext() do
9      cur ← values.next();
10     if cur.id = last.id then
11       continue;                      // Remove duplicates
12     if locMaxState then
13       output(source, cur);           // Forward edge
14     else
15       output(first, cur);            // Forward edge
16       output(cur, first);            // Backward edge
17     last ← cur;
18   if ¬locMaxState ∧ (source.id < last.id) then
19     output(source, first);           // Backward edge

In the following we sketch the MR-based implementation of this approach. We also discuss CC-MR's load balancing to deal with skewed component sizes. Figure 2 illustrates how CC-MR finds the two CCs for the graph of Fig. 1. There are three iterations necessary resulting in the two star-like components shown in the lower right corner of Fig. 2 with the component centers A and J.

3.1 MapReduce Processing

An edge v − u of the graph is represented by a key-value pair (v, u). To decide for each vertex v, whether it is in local max state or in merge state, v and each u ∈ adj(v) are redistributed to the same reduce task. As illustrated in Fig. 2, for each edge (v, u) of the input graph, the map function of the first iteration outputs a key-value pair (v.u, u) as well as an inversed pair (u.v, v). The output pairs are redistributed to the reduce tasks by applying a partitioning function which utilizes only the first component of the composite map output keys so that all vertices u that are connected to vertex v will be sent to the same reduce task and vice versa. CC-MR makes use of the secondary sorting technique. The reduce tasks sort the incoming key-value pairs by the entire key and group adjacent pairs by the first key component only. A group v : [val1, . . ., valn] consists of a key for vertex v and a sorted list of values val1, . . ., valn corresponding to v's neighbors adj(v). For example, the first reduce task in Fig. 2 receives the group A : [B, C, D].
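A possible sketch of this first-iteration map logic and of the partitioning on the first key component is shown below (our illustration, not the authors' code). For brevity the composite key v.u is encoded as a Text value "v|u"; an actual implementation would rather use a custom WritableComparable together with Hadoop's output key comparator (sorting by the full key) and value grouping comparator (grouping by the first component) to realize the secondary sorting.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class FirstIteration {

  public static class EdgeMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    public void map(Text v, Text u, OutputCollector<Text, Text> out, Reporter rep)
        throws IOException {
      out.collect(new Text(v + "|" + u), u);   // (v.u, u)
      out.collect(new Text(u + "|" + v), v);   // inversed pair (u.v, v)
    }
  }

  // partitioning function that uses only the first component of the composite key
  public static class SourcePartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) { }
    public int getPartition(Text key, Text value, int numReduceTasks) {
      String source = key.toString().split("\\|")[0];
      return (source.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  }
}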

The pseudo-code of the reduce function is shown in Algorithm 1. The reduce function compares each vertex v with its smallest neighbor first ∈ adj(v). If v < first (local max state), then v is already the smallest vertex in the (sub)component and a key-value pair (v, u) is outputted for each value u ∈ adj(v) (see Lines 4–6 and 12–13 of Algorithm 1).


Fig. 2 Example dataflow (upper part) of the CC-MR algorithm for the example graph of Fig. 1. A red L indicates a local max state, whereas relevant merge states are indicated by a red M. Newly discovered edges are highlighted in boldface. The algorithm terminates after three iterations since no new backward edges (italic) are generated. The map phase of each iteration i > 1 emits each output edge of the previous iteration unchanged and is omitted to save space. The lower part of the figure shows the resulting graph after each iteration. Red vertices indicate the smallest vertices of the components. Green vertices are already assigned correctly whereas blue vertices still need to be reassigned

For example, in the first iteration of Fig. 2, the (sub)components A : [B, C, D] and J : [K, L] are in the local max state (marked with an L).

For the merge case, i.e. v > first, all u ∈ adj(v)\{first} in the value list are assigned to vertex first. To this end, the reduce function emits a key-value pair (first, u) for each such vertex u (Line 15). Additionally, reduce outputs the inverse key-value pair (u, first) for each such u (Line 16) as well as a final pair (v, first) if v is not the largest vertex in adj(first) (Line 19). The latter two kinds of reduce output pairs represent so-called backward edges that are temporarily added to the graph. Backward edges serve as bridges to connect first with further neighbors of u and v (that might even be smaller than first) in the following iterations.

In the example of Fig. 2, relevant merge states are marked with an M. The reduce input group F : [D, G] of the first iteration results in a newly discovered edge (D, G) and backward edges (G, D), (F, D). Group D : [A, E, F, H, I] also reaches the merge state and generates (amongst others) the backward edge (D, A). In the second iteration, the first reduce task extends the component for vertex A by adding the newly generated neighbors of A. The second reduce task merges A and G (for group D : [A, G]) as well as A and D. Note that edges might be detected multiple times, e.g. the edge (A, D) is generated by both reduce tasks in the second iteration. Such duplicates are removed in the next iteration

(by the first reduce task of the third iteration in the example) according to Line 11 of Algorithm 1.

The output of iteration i serves as input for iteration i + 1. The map phase of the following iterations outputs each input edge unchanged (aside from the construction of composite keys), i.e., no reverse edges are generated as in the first iteration. The algorithm terminates when no further backward edges are generated. This can be determined by a driver program which repeatedly executes the same MR job and analyzes the job counter values that are collected by the slave nodes and aggregated by the master node at the end of the job execution. As iteration i > 1 only depends on the output of iteration i − 1, the driver program can replace the input directory of iteration i − 1 with the output directory of iteration i − 1.
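A driver loop of this kind could look roughly as follows (a sketch, not the authors' code; it assumes a user-defined counter, here called BACKWARD_EDGES, that the reduce tasks increment via the Reporter for every backward edge they emit):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class CCDriver {
  // hypothetical counter incremented by the reduce tasks for each backward edge
  public enum EdgeCounter { BACKWARD_EDGES }

  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);
    int iteration = 0;
    long backwardEdges;
    do {
      JobConf conf = new JobConf(CCDriver.class);
      Path output = new Path(args[1] + "/iteration-" + (++iteration));
      FileInputFormat.setInputPaths(conf, input);
      FileOutputFormat.setOutputPath(conf, output);
      // ... set mapper, reducer, partitioner, and comparator classes here ...
      RunningJob job = JobClient.runJob(conf);   // blocks until the job has finished
      backwardEdges = job.getCounters().getCounter(EdgeCounter.BACKWARD_EDGES);
      input = output;                            // output of iteration i is input of i+1
    } while (backwardEdges > 0);                 // terminate when no backward edges left
  }
}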

3.2 Load Balancing

In the basic approach, all vertices of a component are processed by a single reduce task. This deteriorates the scalability and runtime efficiency of CC-MR if the graph has many small but also a few large components containing the majority of all vertices. CC-MR therefore provides a simple load balancing approach to deal with large (sub)components. Such components are identified in the reduce phase by counting the number of neighbors per group. For large groups


Fig. 3 Load balancing mechanism of CC-MR

whose size exceeds a certain threshold, the reduce task records the smallest group vertex v in a separate output file used by the map tasks in subsequent iterations. For a forward edge (v, u) of such large components, the map task now applies a different redistribution of the generated key-value pair (v.u, u) by applying the partitioning function on the second instead of the first part of the key. This evenly distributes all u ∈ adj(v) across the available reduce tasks and, thus, achieves load balancing. Backward edges (v, u), with v being the smallest element of a large component, are broadcast to all reduce tasks to ensure that all w ∈ adj(v) can be connected to u. Apart from these modifications, the algorithm remains unchanged.

The load balancing approach is illustrated in Fig. 3. In the example, the component B : [C, D, E, F, G, H] is identified as a candidate for load balancing. In the following map phase, the neighbors of B are distributed across all (two) reduce tasks based on the neighbor label. This evenly balances the further processing of this large component. The backward edge (B, A) is broadcast to both reduce tasks and ensures that all neighbors of B can become A's neighbors in merge states of the following iteration.
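The modified partitioning could be sketched as follows (our illustration; the "first|second" key encoding follows the earlier sketches, the loading of the recorded component centers is only stubbed out, and the broadcasting of backward edges would additionally have to be handled in the map function):

import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class LoadBalancingPartitioner implements Partitioner<Text, Text> {
  private Set<String> largeCenters;   // smallest vertices of large components

  public void configure(JobConf job) {
    // placeholder: in a real job the centers recorded by the previous reduce phase
    // would be read here, e.g. from a side file distributed to all nodes
    largeCenters = new HashSet<String>();
  }

  public int getPartition(Text key, Text value, int numReduceTasks) {
    String[] parts = key.toString().split("\\|", 2);   // composite key "first|second"
    // forward edges of large components are partitioned by the second key component
    String partitionBy = (largeCenters.contains(parts[0]) && parts.length > 1)
        ? parts[1] : parts[0];
    return (partitionBy.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}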

4 Optimizing CC-MR

Despite its efficiency, CC-MR can be further optimized to reduce the number of iterations and the amount of intermediate data.

To this end, we present three extensions in this section. First, we propose to select a CC center based on the number of neighbors rather than simply choosing the vertex with the smallest label. Second, we extend the map phase to already connect vertices there to lower the amount of remaining work for the reduce phase and further iterations. Finally, we propose to identify stable components that do not grow anymore (e.g., J : [K, L] in the example) to avoid their further processing. The proposed enhancements do not affect the general complexity of the original algorithm, CC-MR, so that we further expect a logarithmic number of iterations w.r.t. the depth of the input graph.

4.1 Selecting CC Centers

CC-MR assigns all interconnected vertices to the smallest vertex of the component, based on the lexical ordering of vertex labels. This is a straight-forward and natural approach that also exploits the built-in sorting of records in the reduce phase. However, this approach does not consider the existing graph structure so that a high number of iterations may become necessary to find all CCs. For example, consider the left part of the initial example of Fig. 1 in which the five vertices E, F, G, H, I have to be assigned to vertex A (vertices B, C, D are already neighbors of A). If we used vertex D as the CC center instead, only the three vertices B, C, G would need to be reassigned. In the worst case, the input graph consists of a long chain with the smallest vertex located at the head of the chain. In this case, a vertex located in the middle of the chain would be a better component center.

We propose the use of a simple heuristic called CC-MRVD to select the vertex with the highest degree, i.e. the highest number of direct neighbors, as a component center. While this might not be an optimal solution, it promises to reduce the overall number of vertex reassignments, and, thus, the number of edges generated per iteration and possibly the number of required iterations. If the vertex degrees are known, an additional optimization can be applied. The output of backward edges (v, u) in the reduce phase can entirely be saved if v has vertex degree 1 because v has no further neighbors to be connected with u. This simple idea reduces the number of edges to be processed further.

To determine the vertex degree for each graph vertex, CC-MRVD requires an additional, light-weight MR job as a pre-processing step. This job exploits Hadoop's MapFileOutputFormat to produce indexed files supporting an on-disk lookup of the vertex degree for a given vertex label. The resulting data structure is distributed to all cluster nodes using Hadoop's Distributed Cache mechanism. In the first iteration, the map tasks of the adapted CC-MR computation look up (and cache) vertex degrees of their input edges. Throughout the algorithm, each vertex is then annotated with the vertex degree and vertices are not solely sorted by their labels but first by vertex degree in descending order and second by label in ascending order. The first vertex in this sort order, thus, becomes the one with the highest vertex degree.
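A sketch of such a degree-counting pre-processing job is shown below (our reading of the description, not the authors' code); it assumes edge lists readable via KeyValueTextInputFormat, counts every undirected edge once for both endpoints, and writes the summed degrees with MapFileOutputFormat for the later on-disk lookups:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class VertexDegreeJob {

  public static class DegreeMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    public void map(Text v, Text u, OutputCollector<Text, IntWritable> out, Reporter rep)
        throws IOException {
      out.collect(v, ONE);   // edge (v, u) increases the degree of v ...
      out.collect(u, ONE);   // ... and of u
    }
  }

  public static class DegreeReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text vertex, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> out, Reporter rep)
        throws IOException {
      int degree = 0;
      while (counts.hasNext()) {
        degree += counts.next().get();
      }
      out.collect(vertex, new IntWritable(degree));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(VertexDegreeJob.class);
    conf.setMapperClass(DegreeMapper.class);
    conf.setCombinerClass(DegreeReducer.class);      // combiner shrinks the map output
    conf.setReducerClass(DegreeReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setInputFormat(KeyValueTextInputFormat.class);
    conf.setOutputFormat(MapFileOutputFormat.class);  // indexed files for on-disk lookup
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}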

For the running example, Fig. 4 shows the resulting graph after each iteration when applying CC-MRVD. Based on the changed sort order, we choose vertex D as the center of the largest component which saves one iteration. Furthermore, the number of generated edges is reduced by almost half.


Fig. 4 Edges generated when taking the vertex degree into account (lower part) and comparison of the number of generated edges (upper part)

Unfortunately, CC-MRVD also has some drawbacks. First, it needs an additional MR job to determine the vertex degrees. Second, the map and reduce output records are larger due to the augmentation by the vertex degree. Furthermore, the disk-based random access of vertex degrees in the first map phase introduces additional overhead proportional to the graph size. Our evaluation will show whether the expected savings in the number of edges and iterations can outweigh these negative effects.

4.2 Computing local CCs in the map phase

CC-MR applies a stateless processing of the map function where a map task redistributes each key-value pair (i.e. edge) of its input partition to one of the reduce tasks. We propose an extension CC-MRMEM where a map task buffers a predetermined number of input edges in memory to already find (sub)components among the buffered edges. The determination of such local components uses the same "assign-to-smallest-vertex" strategy as in the reduce tasks. The generated edges are emitted and the buffer is cleared for the next set of input records. Overlapping local CCs that are computed in different rounds or by different map tasks are merged in the reduce phase, as before.

An important aspect of CC-MRMEM is an efficient computation of local components. We organize sets of connected vertices as hash tables and maintain the smallest vertex per set. Each vertex is mapped to the component set it belongs to as illustrated in Fig. 5. A set and its minimal vertex are updated when a new vertex is added or a merge with another set (component) occurs.

Fig. 5 Computation of local CCs in the map phase

In the example of Fig. 5, the first two edges result in different sets (components) while the third edge (A, C) leads to the addition of vertex C to the first set. The fourth edge (A, D) connects the two components so that the sets are merged. Merging is realized by adding the elements of the smaller set to the larger set and updating the pointers of the vertices of the smaller set to the larger set. Once all edges in the input buffer are processed, the determined sets are used to generate the output pairs for distribution among the reduce tasks (see right part of Fig. 5).

CC-MRMEM thus already finds some components in the map phase so that the amount of work for the reduce tasks is reduced. Furthermore, the amount of intermediate data (number of edges) to be exchanged via the distributed file system as well as the required number of iterations can be reduced. This comes at the cost of increased memory and processing requirements in the map phase. The size of the map input buffer is a configuration parameter that allows tuning the trade-off between additional map overhead and achievable savings. CC-MRMEM can be combined with the CC-MRVD approach.

Algorithm 2 shows the pseudo-code of the map function of CC-MRMEM. Input edges are added to the components map as described above. If its size exceeds a threshold or if there are no further input edges in the map task's input partition, the computed local CCs will be outputted as follows. For each vertex v ≠ c.min of a local CC c, two key-value pairs (min.v, v) (forward edge) and (v.min, min) (backward edge) are emitted. The partitioning, sorting, and grouping behavior as well as the reduce function is the same as in Algorithm 1. The only exception is that no backward edges need to be generated by the reduce function, since this is already done in the map phase. Therefore, Lines 16 and 19 of


Algorithm 2: CC-MR-Mem (map phase)
1  map_configure(JobConf job)
2    max ← job.getBufferSize();
3    components ← new HashMap<Vertex, MinSet>(max);

4  map(Vertex u, Vertex v)
5    if components.size() ≥ max then
6      generateOutput();
7    comp1 ← components.get(u);
8    comp2 ← components.get(v);
9    if (comp1 ≠ null) ∧ (comp2 ≠ null) then
10     if comp1 ≠ comp2 then            // Merge
11       if comp1.size() ≥ comp2.size() then
12         comp1.addAll(comp2);
13         foreach Vertex w ∈ comp2 do
14           components.put(w, comp1);
15       else
16         comp2.addAll(comp1);
17         foreach Vertex w ∈ comp1 do
18           components.put(w, comp2);
19   else if comp1 ≠ null then          // Add vertex
20     comp1.add(v);
21     components.put(v, comp1);
22   else if comp2 ≠ null then          // Add vertex
23     comp2.add(u);
24     components.put(u, comp2);
25   else                               // New component
26     MinSet component ← new MinSet(u, v);
27     components.put(u, component);
28     components.put(v, component);

29 map_close()
30   generateOutput();

31 generateOutput()
32   foreach component ∈ components.values() do
33     if ¬component.isMarkedAsProcessed() then
34       component.markAsProcessed();
35       min ← component.min;
36       foreach Vertex v ∈ component do
37         if v ≠ min then
38           output(min.v, v);          // Forward edge
39           output(v.min, min);        // Backward edge
40   components.clear();

Algorithm 1 are omitted. As for CC-MR, backward edges of large components are broadcast to all reduce tasks to achieve load balancing.
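Algorithm 2 uses a MinSet container that is not spelled out further; a minimal sketch of such a structure could look as follows (the generic type parameter and the getMin accessor are our assumptions):

import java.util.HashSet;

// hash set of vertices that additionally tracks its smallest element and a
// processed flag as used by generateOutput() in Algorithm 2
public class MinSet<V extends Comparable<V>> extends HashSet<V> {
  private V min;                      // smallest vertex of this component
  private boolean processed = false;  // set once the component has been emitted

  public MinSet(V u, V v) {
    add(u);
    add(v);
  }

  @Override
  public boolean add(V vertex) {
    if (min == null || vertex.compareTo(min) < 0) {
      min = vertex;                   // addAll() also calls add(), so merges keep
    }                                 // the minimum up to date
    return super.add(vertex);
  }

  public V getMin() { return min; }
  public boolean isMarkedAsProcessed() { return processed; }
  public void markAsProcessed() { processed = true; }
}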

4.3 Separation of Stable Components

Large graphs can be composed of many CCs of largely varying sizes. Typically, small and medium-sized CCs are completely discovered much earlier than large components. When a CC neither disappears (due to a merge with another component) nor grows during an iteration, it will not grow any further and, thus, can be considered as stable. For example, component J : [K, L] is identified during the first iteration of CC-MR and does not change in the second iteration so that no further edges are generated. Hence, this component remains stable until the algorithm terminates. We want to efficiently identify such stable components and separate them from unstable components

Fig. 6 Detection and separation of stable components in the reduce phase of the CC-MR algorithm

to avoid their unnecessary processing in further iterations. Stable components are written to different output files which are not read by map tasks of the following iterations.

First, we describe the approach for the original CC-MR (and CC-MRVD) algorithm. Figure 6 illustrates the approach for the running example. A component with the smallest vertex v grows if there is at least one merge case u : [v, w, . . .] with v < u. In this case, the reduce function generates new forward edges (v, w), . . . as well as corresponding backward edges (w, v), . . .. Due to the latter ones, the component may grow in the next iteration. To notify for the next iteration that the component with center v is not yet stable, we augment its first forward edge with a special expanded flag, e.g. (v, w_e). In Fig. 6, we, thus, augment the first forward edge for the merge case D : [A, E, F, H, I] (resulting in a component with the smallest vertex A) with an expanded flag (edge (A, E_e)). If there are several merge cases in a reduce task with the same smallest vertex, we set the expanded flag only for one to limit the further overhead for processing components marked as expanded. For example, in the second iteration of Fig. 6 there are two merge states D : [A, G] and F : [A, D] processed by the same reduce task (see Fig. 2) but only edge (A, G_e) is flagged.

In the reduce function of iteration i > 1, we check for each component in local max state whether it has some expanded vertex w_e indicating that some vertex was newly assigned in the previous iteration. If such a vertex is not found, we consider the component as stable. In the second iteration of Fig. 6, the expanded vertex E_e is found and, thus, A can not be separated. Whether a component is stable is only known after the last edge for the component has been processed. Since components can be very large, it is generally


not possible to keep all its edges in memory at a reduce task. Therefore, the reduce tasks continuously generate the output edges as usual but augment the last output edge with a stable flag, e.g. (v, z_s), if v did not grow.

A stable component is then separated in the following iteration. When a map task of the following iteration reads a stable edge (v, z_s), it outputs a (v.⊥, z_s) instead of a (v.z, z_s) pair (alternatively the empty string or Int.MinValue could be used instead of ⊥). This causes z_s to be the first vertex in the list of v's reduce input values and, thus, the stable component can be found immediately and separated from the regular reduce output. The final result of the algorithm consists of the regular output files generated in the final iteration and the additional stable files of each iteration. The described approach introduces nearly no additional overhead in terms of data volume and memory requirements.
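This map-side handling could be sketched as follows (our illustration, not the authors' code; the stable-flag marker and the "v|z" key encoding are assumptions consistent with the earlier sketches, and the empty string plays the role of ⊥):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class StableAwareMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {
  private static final String STABLE_SUFFIX = "#s";   // hypothetical stable-flag marker

  public void map(Text v, Text z, OutputCollector<Text, Text> out, Reporter rep)
      throws IOException {
    if (z.toString().endsWith(STABLE_SUFFIX)) {
      out.collect(new Text(v + "|"), z);     // (v.⊥, z_s): sorts to the front of v's group
    } else {
      out.collect(new Text(v + "|" + z), z); // regular composite key (v.z, z)
    }
  }
}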

In Fig. 6, there are no expanded vertices for component J : [K, L] in the second iteration so that this component is identified as stable. The last output edge, i.e. J − L_s, is, thus, augmented by a stable flag. Component J is then separated during the third iteration (which is the last one for our small example).

For large components that are split across several reduce tasks for load balancing reasons, expanded and stable edges need to be broadcast to all reduce tasks. To avoid duplicates in the final output, expanded and stable edges of large components in the local max state are outputted only by one reduce task, e.g. the reduce task with index zero.

The described approach can also be used for CC-MRMEM. If a component is known to be stable, the output of backward edges in Line 39 of Algorithm 2 can be saved. An important difference is that components can grow in the map phase as well so that expanded edges (v.w, w_e_map) are also generated in this phase. In reduce, a component in local max state is considered as stable if there is neither a vertex w_e nor a vertex w_e_map in the value list. Furthermore, in contrast to expanded edges generated in the reduce phase of the previous iteration, expanded edges that were generated in the map phase of the current iteration need to be forwarded to the next iteration.

5 Evaluation

5.1 Experimental Setup

In our first experiment, we compare the original CC-MR algorithm with our extensions for the first three graph datasets1 shown in Fig. 7(a). The Google web dataset is the smallest graph containing hyperlinks between web pages. The Patent citations dataset is four times larger and contains citations between granted patents. The Live Journal graph represents friendship relationships between users of an online community. It has a similar number of vertices but about four times as many edges as the Patent graph. The experiments are conducted on Amazon EC2 using 20 worker instances of type c1.medium (providing two virtual cores) and a dedicated master instance of type m1.small. Each node is set up with Hadoop 0.20.2 and a capacity of two map and reduce tasks. The overall number of reduce tasks scheduled per iteration is set to 40.

1 http://snap.stanford.edu/data/

In a second experiment, we evaluate the effect of the early separation of stable components. For this purpose, we use a fourth graph from the Memetracker dataset which tracks web documents (along with their link structure) containing certain frequent quotes or phrases. In contrast to the first three graphs, the Memetracker graph is a sparse graph containing many small CCs and isolated vertices. Due to the large size of the input graph, we increase the cluster size to 40 EC2 worker instances of type m1.xlarge which again can run two map and reduce tasks in parallel.

The third experiment analyzes the scalability of the CC-MR and CC-MRMEM algorithms for cluster sizes of up to 100 nodes. For this experiment, we again use the Memetracker dataset and EC2 worker nodes of type m1.xlarge. Again each node runs at most two map and reduce tasks in parallel. Thus, for n nodes, the cluster's map and reduce task capacity is 2 · n, i.e. adding new nodes leads to additional map and reduce tasks.

All experiments are conducted with load balancing turned on. As in [19], we consider a component as large if its size exceeds a threshold of 1 % of the number of forward edges generated in the previous iteration. For CC-MRMEM, the maximum number of vertices that are buffered in memory is set to 20,000.

5.2 Comparison with CC-MR

We first evaluate the three algorithms CC-MR, CC-MRVD (consideration of vertex degrees) and CC-MRMEM (computation of local CCs in the map phase) for the first three datasets. We consider the following four criteria: number of iterations, execution time, overall number of edges written to the DFS (across all iterations) as well as the corresponding overall data volume. The results are listed in Fig. 7(a) and illustrated in Fig. 7(b).

The results show that CC-MRMEM outperforms CC-MR for all cases. The improvements in execution time increase with larger datasets up to a factor of 3.2 for the Live Journal graph. Furthermore, CC-MRMEM significantly reduces the overall amount of data written to the DFS by up to a factor of 8.8 for Live Journal. It strongly profits from the density of the input graphs and is able to already connect overlapping


Fig. 7 a, b, c Comparison of CC-MR, CC-MRVD, and CC-MRMEM for the Google web, Patent Citations, and Live Journal datasets. The execution times and DFS output volumes of CC-MRVD include the additional overhead of the vertex degree computation

subcomponents in the map input partitions of each iteration that otherwise would have been connected in the reduce phase of later iterations. This causes fewer generated forward and backward edges in the reduce phase which increases the probability of an earlier termination of the computation. For each dataset, CC-MRMEM needs two iterations less than CC-MR.

For all datasets, CC-MRVD's consideration of vertex degrees leads to a significant reduction in the number of generated edges (at the cost of an additional analysis job). The number of merge cases could be reduced which in turn led to improved execution times for the Google web and the Patent Citations graph. However, the resulting data volume is mostly larger than for CC-MR since all vertices (which are represented by an integer in the datasets) are augmented with their vertex degree (an additional integer). Note that for larger vertex representations (e.g. string-valued URLs of websites) this might not hold. For the large Live Journal graph, CC-MRVD suffered from the overhead of pre-computing the vertex degrees beforehand and reading them from the distributed cache during the first iteration, which is its main drawback for large graphs. Hence, CC-MRVD is in its current form not a viable extension for large graphs.

5.3 Separation of Stable Components

In our second experiment, we study the effects of the early separation of stable components that do not grow further in the following iterations. To this end, we utilized a graph derived from the Memetracker dataset which has a significantly lower degree of connectivity compared to the other datasets. Without separating stable components, the execution time of CC-MRMEM to find all CCs is by a factor of 2.6 lower than for CC-MR (Fig. 8).

With separation turned on, we observe only a small improvement of 4 % of the execution time for CC-MR. Apparently, due to the MR overhead for job and task submission, the amount of 85 GB which is saved across 11 iterations has only a small influence on the overall runtime for a cluster consisting of 40 nodes. However, CC-MRMEM strongly benefits from the separation of stable components

Fig. 8 Results of CC-MR and CC-MRMEM for the Memetracker dataset with and without separation of stable components


Fig. 9 Execution times and speedup values using CC-MR and CC-MRMEM for the Memetracker dataset (without separation of stable components)

which leads to an improvement of 25 %. Compared to the regular CC-MR algorithm, the execution time is improved by a factor of 3.5 (21.6 vs. 74.7 min). An interesting observation is that the number of iterations could be reduced from 9 to 8 for CC-MRMEM. This is caused by the fact that, due to the separation of stable components from the regular reduce output, the input data of a map task no longer contains stable components. This in turn leads to a higher probability that two components that would otherwise be separated by a stable component are merged in the map phase already. Compared to the regular CC-MRMEM, the amount of data written to HDFS could be reduced by 37 % which improved the execution time by 25 %.

5.4 Scalability

The third experiment focuses on the scalability of the CC-MR algorithm and the CC-MRMEM extension. We therefore use the large Memetracker dataset and vary the cluster size n from 1 up to 100 nodes. Figure 9 shows the resulting execution times and speedup values.

The execution of the CC-MR algorithm did not succeed for n = 1. This is because in some iteration, the sizes of the input graph, the output of the previous iteration, the map output (containing replicated backward edges for load balancing purposes), and the reduce output exceeded the available disk capacity. Therefore, CC-MR's execution time for n = 1 was estimated as twice as long as for n = 2. By contrast, due to its reduced amount of generated data, CC-MRMEM was able to compute all CCs of the input graph on a single node and did even complete earlier than CC-MR running on two nodes.

Overall, CC-MRMEM clearly outperformed CC-MR for all cluster sizes. The super-linear speedup values in Fig. 9

are caused by the execution times of the single node case with only two parallel map and reduce tasks. Here, both approaches heavily suffer from load balancing problems in early iterations. These load balancing issues are resolved with larger configurations leading to a better utilization of the available nodes. Up to 60 nodes, the execution time for CC-MRMEM was a factor of 2–4 better than with CC-MR. For example, an execution time of 100 min is achieved with ten nodes for CC-MRMEM vs. about 30 nodes for CC-MR. The results show that CC-MRMEM keeps its effectiveness even for an increasing number of map tasks which in principle reduces the local optimization potential per map task. This is influenced by the fact that CC-MRMEM determines local components within fixed-sized (20,000) groups of edges and not within whole map input partitions. Still, the relative improvement of CC-MRMEM over CC-MR decreases somewhat when increasing the number of nodes beyond a certain level. In our experiment, the relative improvement was a factor of 2 for 60 nodes and a factor of 1.7 for 100 nodes (14.9 vs. 25.2 min).

6 Summary

The computation of connected components (CC) for large graphs requires parallel processing, e.g. on the popular Hadoop platform. We proposed and evaluated three extensions over the previously proposed MapReduce-based implementation CC-MR to reduce both the amount of intermediate data and the number of required iterations. The best results are achieved by CC-MRMEM which connects subcomponents in both the reduce and the map phase of each iteration. The evaluation showed that CC-MRMEM significantly outperforms CC-MR for all considered graph datasets, especially for larger graphs. Furthermore, we proposed a strategy to early separate stable components from further processing. This approach introduces nearly no additional overhead but can significantly improve the performance of CC-MRMEM for sparse graphs.

References

1. Afrati FN, Borkar VR, Carey MJ, Polyzotis N, Ullman JD (2011) Map-Reduce extensions and recursive queries. In: Proc. of intl. conference on extending database technology, pp 1–8

2. Awerbuch B, Shiloach Y (1987) New connectivity and MSF algorithms for shuffle-exchange network and PRAM. IEEE Trans Comput 36(10):1258–1263

3. Bancilhon F, Maier D, Sagiv Y, Ullman JD (1986) Magic sets and other strange ways to implement logic programs. In: Proc. of symposium on principles of database systems, pp 1–15

4. Bu Y, Howe B, Balazinska M, Ernst MD (2012) The HaLoop approach to large-scale iterative data analysis. VLDB Journal 21(2):169–190


5. Bus L, Tvrdík P (2001) A parallel algorithm for connected components on distributed memory machines. In: Proc. of European PVM/MPI users' group meeting, pp 280–287

6. Cheiney JP, de Maindreville C (1989) A parallel transitive closure algorithm using hash-based clustering. In: Proc. of intl. workshop on database machines, pp 301–316

7. Cohen J (2009) Graph twiddling in a MapReduce world. Comput Sci Eng 11(4):29–41

8. Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proc. of symposium on operating system design and implementation, pp 137–150

9. Greiner J (1994) A comparison of parallel algorithms for connected components. In: Proc. of symposium on parallelism in algorithms and architectures, pp 16–25

10. Hirschberg DS, Chandra AK, Sarwate DV (1979) Computing connected components on parallel computers. Commun ACM 22(8):461–464

11. Ioannidis YE (1986) On the computation of the transitive closure of relational operators. In: Proc. of intl. conference on very large databases, pp 403–411

12. Kang U, Tsourakakis CE, Faloutsos C (2009) PEGASUS: a peta-scale graph mining system. In: Proc. of intl. conference on data mining, pp 229–238

13. Kolb L, Rahm E (2013) Parallel entity resolution with Dedoop. Datenbank-Spektrum 13(1):23–32

14. Kolb L, Thor A, Rahm E (2012) Dedoop: efficient deduplication with Hadoop. Proceedings of the VLDB Endowment 5(12):1878–1881

15. Lattanzi S, Moseley B, Suri S, Vassilvitskii S (2011) Filtering: a method for solving graph problems in MapReduce. In: Proc. of symposium on parallelism in algorithms and architectures, pp 85–94

16. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proc. of the intl. conference on management of data, pp 135–146

17. Petermann A, Junghanns M, Mueller R, Rahm E (2014) BIIIG: enabling business intelligence with integrated instance graphs. In: Proc. of intl. workshop on graph data management (GDM)

18. Rastogi V, Machanavajjhala A, Chitnis L, Sarma AD (2013) Finding connected components in map-reduce in logarithmic rounds. In: Proc. of intl. conference on data engineering, pp 50–61

19. Seidl T, Boden B, Fries S (2012) CC-MR – finding connected components in huge graphs with MapReduce. In: Proc. of machine learning and knowledge discovery in databases, pp 458–473

20. Shiloach Y, Vishkin U (1982) An O(log n) parallel connectivity algorithm. J Algorithms 3(1):57–67

21. Tarjan RE (1972) Depth-first search and linear graph algorithms. SIAM J Comput 1(2):146–160

22. Valduriez P, Khoshafian S (1988) Parallel evaluation of the transitive closure of a database relation. Int J Parallel Prog 17(1):19–37