Connected Components in MapReduce and Beyond R. Kiveris, S. Lattanzi, V. Mirrokni, V. Rastogi, S. Vassilvitskii Google Monday, June 23, 14



Modern Massive Algorithmics

Communication:
– Can be the overwhelming cost
– In practice constant factors matter a whole lot

Data Skew:
– Most datasets are heavy tailed
– Naive data distribution can be disastrous
– In synchronous environments must wait for slowest shard
  • “Curse of the last reducer”

Algorithms:
– Embarrassingly parallel may also be embarrassingly slow
– New techniques to minimize communication & skew


Today: Graph Connectivity

Classical problem
– Many parallel algorithms
  • PRAM
  • MapReduce
  • Pregel
  • ...
– Subroutine in many other problems
  • MST
  • Clustering
  • Multiway cuts
  • ...

Want to optimize for very large graphs
– Billions of nodes, 100s of billions of edges
– Typically sparse
– Do not fit in memory (10s+ TBs)


Approach

Transform the graph into a union of stars, one for each connected component.

Begin:
– Every node has a unique id
– Assigned arbitrarily

[Figure: example graph on nine nodes with ids 1 to 9.]

Two Local Operations:
– Only look at a node and its neighbors
– Prescribe which edges should exist in the next round


Operations

– LargeStar(v): Connect all strictly larger neighbors of v to the minimum of v and its neighbors.
– Do this in parallel on every node to build a new graph

[Figure: a five-node example (ids 1, 5, 7, 8, 9) before and after LargeStar.]
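A minimal sketch of one parallel LargeStar round may make the definition concrete. It assumes the graph is a set of undirected edges and that node ids double as the labels being compared. The name `large_star`, the edge-set representation, and the five-node input are illustrative choices, not taken from the talk, and the input is not necessarily the graph drawn on the slide.

```python
from collections import defaultdict


def large_star(edges):
    """Apply LargeStar at every node and return the new edge set.

    Edges are undirected pairs of comparable ids; output pairs are
    stored as (larger, smaller). Representation is an assumption.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)

    new_edges = set()
    for u, nbrs in neighbors.items():
        m = min(nbrs | {u})      # minimum over the neighbors and u itself
        for v in nbrs:
            if v > u:            # only strictly larger neighbors move
                new_edges.add((v, m))
    return new_edges


# Hypothetical example: node 7 is adjacent to 1, 5, 8 and 9.
star_at_7 = {(7, 1), (7, 5), (8, 7), (9, 7)}
print(sorted(large_star(star_at_7)))
# [(7, 1), (7, 5), (8, 1), (9, 1)]
```

In this example nodes 8 and 9 are rewired from 7 to 1, the smallest id in 7's neighborhood, while the edges from 7 to its smaller neighbors are left alone.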

Example

[Figure: animation of LargeStar on the nine-node example graph (ids 1 to 9).]

LargeStar Connectivity

Lemma: Executing LargeStar in parallel preserves connectivity

– Consider an edge (A, b); WLOG assume A > b. If b is min neighbor of A we are done
– If b has no smaller neighbors (local min) we are done
– Else: A > b > c, where c is the min neighbor of b, and:
  • Now need to reason about connectivity of (b,c)
  • Show (b,c) connected by induction on node rank

[Figure: the edge (A, b) and the case A > b > c.]
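The lemma can be spot-checked on a small input. The sketch below is only a sanity check, not a proof: it runs one LargeStar round on a hypothetical nine-node graph and compares connected components before and after using union-find. The helper names `large_star` and `components`, and the example edges, are assumptions made for illustration.

```python
from collections import defaultdict


def large_star(edges):
    """One parallel LargeStar round on an undirected edge set."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    return {(v, min(nbrs | {u}))
            for u, nbrs in neighbors.items()
            for v in nbrs if v > u}


def components(edges):
    """Connected components (as a set of frozensets) via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)

    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return {frozenset(g) for g in groups.values()}


# Hypothetical graph with two components: {1,5,7,8,9} and {2,3,4,6}.
before = {(5, 1), (9, 5), (9, 7), (8, 7), (6, 2), (6, 4), (4, 3)}
after = large_star(before)

print(after != before)                          # True: edges were rewired
print(components(before) == components(after))  # True: same components
```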


LargeStar: Reinterpretation

– LargeStar(v): Connect all strictly larger neighbors of v to the minimum of v and its neighbors.
– Orient all edges from larger to smaller
– LargeStar = tell children to connect to smallest parent

[Figure: the five-node example (ids 1, 5, 7, 8, 9) before and after LargeStar.]


LargeStar Fixed Point

Fixed point if:
– Every node is a local min or connected to local minima
– Orient edges from larger nodes to smaller nodes
  • Fixed point if graph is DAG of height 2

Progress:
– Every LargeStar operation reduces height by a constant factor

[Figure: the nine-node example graph (ids 1 to 9) before and after reaching the LargeStar fixed point.]


Operations

– SmallStar(v): Connect v and all its smaller neighbors to the minimum among them.

[Figure: the five-node example (ids 1, 5, 7, 8, 9) before and after SmallStar.]
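A matching sketch of one parallel SmallStar round, under the same assumptions as before: undirected edge set, node ids used as labels, output stored as (larger, smaller) pairs. The name `small_star` and the five-node input are illustrative and not taken from the talk.

```python
from collections import defaultdict


def small_star(edges):
    """Apply SmallStar at every node and return the new edge set."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)

    new_edges = set()
    for u, nbrs in neighbors.items():
        group = {v for v in nbrs if v < u} | {u}   # smaller neighbors and u
        m = min(group)
        for v in group:
            if v != m:                             # no self-loop on the min
                new_edges.add((v, m))
    return new_edges


# Hypothetical example: node 7 is adjacent to 1, 5, 8 and 9.
star_at_7 = {(7, 1), (7, 5), (8, 7), (9, 7)}
print(sorted(small_star(star_at_7)))
# [(5, 1), (7, 1), (8, 7), (9, 7)]
```

Here node 7 hands its smaller neighbor 5 over to node 1, while the edges from 8 and 9 are handled at 8 and 9 respectively and stay attached to 7. Every edge is processed at its larger endpoint.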

Example

[Figure: animation of SmallStar on the nine-node example graph (ids 1 to 9).]

Operations

– SmallStar(v): Connect v and all its smaller neighbors to the minimum among them.
– Connect all parents (and self) to the minimum parent.

[Figure: the five-node example (ids 1, 5, 7, 8, 9) before and after SmallStar.]


SmallStar Analysis

Lemma: SmallStar preserves connectivity
– Similar argument as before

Progress:
– Run LargeStar to completion
– Run one iteration of SmallStar
– Run LargeStar to completion again
– The number of local minima (maximal nodes in the DAG) reduces by a constant factor


Overall Algorithm

Input:
– Set of edges, with a unique label per node

Repeat until convergence:
    Repeat until convergence:
        LargeStar
    SmallStar

Theorem:
– The above algorithm converges in O(log² n) rounds.
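The nested loop can be sketched on a single machine as follows. This is an illustration under stated assumptions, not the authors' implementation: the graph fits in memory as an edge set, node ids serve as labels, and "convergence" is taken to mean that an operation leaves the edge set unchanged. The names `two_phase`, `large_star`, `small_star`, and `adjacency` are hypothetical.

```python
from collections import defaultdict


def adjacency(edges):
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def large_star(edges):
    # Strictly larger neighbors of u connect to min(neighbors of u, u).
    return {(v, min(nbrs | {u}))
            for u, nbrs in adjacency(edges).items()
            for v in nbrs if v > u}


def small_star(edges):
    # u and its smaller neighbors connect to the minimum among them.
    out = set()
    for u, nbrs in adjacency(edges).items():
        group = {v for v in nbrs if v < u} | {u}
        m = min(group)
        out.update((v, m) for v in group if v != m)
    return out


def two_phase(edges):
    """Return (star_edges, rounds); each edge is (node, component_min)."""
    edges = {(max(u, v), min(u, v)) for u, v in edges if u != v}
    rounds = 0
    while True:
        start = edges
        while True:                     # inner loop: LargeStar to a fixed point
            nxt = large_star(edges)
            rounds += 1
            if nxt == edges:
                break
            edges = nxt
        edges = small_star(edges)       # then a single SmallStar round
        rounds += 1
        if edges == start:              # outer loop: nothing changed
            return edges, rounds


stars, _ = two_phase({(1, 3), (2, 3), (4, 5)})
print(sorted(stars))
# [(2, 1), (3, 1), (5, 4)]
```

In the example, LargeStar alone is stuck on the first component because node 3 has no larger neighbor; the SmallStar round is what merges 2 under 1.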


Implementation

Both can be easily implemented in MapReduce
– Or Pregel, or Giraph, or ...

LargeStar:

Map (u; v):
– Emit (u; v), Emit (v; u)

Reduce (u; v1, v2, ..., vk):
– m = argmin label(w) over w in {u, v1, ..., vk}
– Emit (v; m) for all v in {v1, ..., vk} with label(v) > label(u)
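To see the map and reduce steps run end to end, here is a single-process simulation. It is a sketch only: `run_round` is a hypothetical stand-in for the shuffle phase of a real MapReduce system, node ids are used directly as labels, and the SmallStar mapper and reducer are written from the SmallStar definition rather than copied from a slide.

```python
from collections import defaultdict


def map_large_star(edge):
    u, v = edge
    yield u, v            # send each endpoint to the other
    yield v, u


def reduce_large_star(u, values):
    m = min(values + [u])             # min over neighbors and u itself
    for v in values:
        if v > u:                     # strictly larger neighbors only
            yield v, m


def map_small_star(edge):
    u, v = edge
    yield max(u, v), min(u, v)        # only the larger endpoint hears about it


def reduce_small_star(u, values):
    m = min(values + [u])
    for v in values + [u]:
        if v != m:
            yield v, m


def run_round(edges, mapper, reducer):
    """Simulate map, shuffle (group by key) and reduce in one process."""
    groups = defaultdict(list)
    for edge in edges:
        for key, value in mapper(edge):
            groups[key].append(value)
    out = set()
    for key, values in groups.items():
        out.update(reducer(key, values))
    return out


edges = {(7, 1), (7, 5), (8, 7), (9, 7)}   # hypothetical input
print(sorted(run_round(edges, map_large_star, reduce_large_star)))
# [(7, 1), (7, 5), (8, 1), (9, 1)]
print(sorted(run_round(edges, map_small_star, reduce_small_star)))
# [(5, 1), (7, 1), (8, 7), (9, 7)]
```

The size of the `values` list handed to a reducer is the node's degree, which is where skew shows up for high-degree nodes.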


Discussion

Convergence:
– log² n is tight
– The current graph defines the communication structure at each time step
– The number of edges never increases from one time step to the next


Making it Practical

Approach 1 (Systems):
– LargeStar is equivalent to finding one of the maxima in the DAG reachable from each vertex
– Can do this with a fast distributed hash table (DHT) to “walk up to the root”
  • Keep the min id of the parent in a DHT
  • Similar to path compression

Approach 2 (Algorithms):
– Instead of waiting for LargeStar to converge, just interleave LargeStar and SmallStar.
  • Repeat Until Convergence:
      SmallStar
      LargeStar
– Can prove convergence
– Appears to converge even faster (conjecture O(log(n)) rounds)
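A single-machine sketch of the interleaved variant from Approach 2, in the SmallStar-then-LargeStar order shown above. The assumptions are the same as elsewhere: in-memory edge set, node ids as labels. The stopping rule (stop once neither operation changes the edge set) is one reasonable reading of "until convergence", and all function names are hypothetical.

```python
from collections import defaultdict


def adjacency(edges):
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def large_star(edges):
    return {(v, min(nbrs | {u}))
            for u, nbrs in adjacency(edges).items()
            for v in nbrs if v > u}


def small_star(edges):
    out = set()
    for u, nbrs in adjacency(edges).items():
        group = {v for v in nbrs if v < u} | {u}
        m = min(group)
        out.update((v, m) for v in group if v != m)
    return out


def alternating(edges):
    """Interleave SmallStar and LargeStar; return (star_edges, rounds)."""
    edges = {(max(u, v), min(u, v)) for u, v in edges if u != v}
    rounds = 0
    while True:
        after_small = small_star(edges)
        after_large = large_star(after_small)
        # Converged only if neither operation changed anything.
        if after_small == edges and after_large == after_small:
            return edges, rounds
        edges = after_large
        rounds += 2


# Hypothetical input: a path 1-2-3-4 plus a separate edge 5-6.
stars, _ = alternating({(1, 2), (2, 3), (3, 4), (5, 6)})
print(sorted(stars))
# [(2, 1), (3, 1), (4, 1), (6, 5)]
```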


Watching Out for Skew

Final output:
– Union of stars, one for each component
– This should worry you!
  • In case of a single component, get one star with linear degree
  • In case of skewed component sizes, also get one star with linear degree


Dealing with Skew

Divide the computation of the minimum
– Can do this recursively c times
– Increase number of rounds by 1/c, each node’s input at most n^c

[Figure: a star centered at node 1 with leaves a through o; the leaves are divided among intermediate nodes 1.1, 1.2, 1.3 and 1.4.]
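One way to picture "divide the computation of the minimum" is as a tree of partial minima, where no single worker ever sees more than a bounded number of inputs. The sketch below is an illustration of that idea only, not the authors' skew optimization: `hierarchical_min` and its `fan_in` parameter are hypothetical, and each list slice stands in for the input of one reducer.

```python
def hierarchical_min(values, fan_in):
    """Compute min(values) in levels where each worker sees <= fan_in items.

    Returns (minimum, number_of_levels). Each level corresponds to one
    extra round in a distributed setting (an assumption of this sketch).
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(values)
    if not level:
        raise ValueError("values must be non-empty")

    levels = 0
    while len(level) > 1:
        # Each chunk plays the role of one intermediate node (1.1, 1.2, ...).
        level = [min(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
        levels += 1
    return level[0], levels


# 100 neighbor ids, at most 10 per worker: two levels are enough.
print(hierarchical_min(list(range(100, 0, -1)), 10))
# (1, 2)

# 15 leaves split across 4 intermediate nodes, as in the figure's layout.
print(hierarchical_min(list(range(15, 0, -1)), 4))
# (1, 2)
```

With n = 100 inputs and a per-worker bound of n^(1/2) = 10, the sketch needs 2 levels, which matches the trade-off between input size n^c and a 1/c increase in rounds for c = 1/2.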


But does it work?

Data (subset):
– UK Web graph: 106M nodes, 6.6B edges
– Google+ subgraph: 178M nodes, 2.9B edges
– Keyword similarity: 371M nodes, 3.5B edges
– Document similarity: 4,700M nodes, 452B edges

Algorithms:
– Hash-To-Min (previous MapReduce state of the art)
– DHT Implementation
– Alternating algorithm (skew optimized & non-optimized)

Setup:
– Loaded cluster, look at median running times


Speedups

– 20x-40x faster on the document similarity graph
– Smaller improvements on smaller graphs


Graph Size

[Figure: log of the number of edges (y-axis, 22 to 32) versus round (x-axis, 0 to 12) for three algorithms: Alternating, Optimized Alternating, and Hash-To-Min.]


Conclusion

Connected Components
– Simple, local algorithms with O(log² n) round complexity
– Communication efficient (number of edges non-increasing)
– Open: Prove O(log n)
– Open: Prove ~log n lower bounds!

Algorithms:
– Evolve with the underlying system architecture
– Avoid embarrassingly slow embarrassingly parallel implementations


Thank You
