GraphX: Graph Processing in a Distributed Dataflow Framework

Joseph Gonzalez
Postdoc, UC-Berkeley AMPLab
Co-founder, GraphLab Inc.

Joint work with Reynold Xin, Ankur Dave, Daniel Crankshaw, Michael Franklin, and Ion Stoica

OSDI 2014
UC Berkeley


Modern Analytics

[Figure: an analytics pipeline over raw Wikipedia XML. Hyperlinks are extracted into a link table (Title, Link), PageRank runs on the hyperlink graph, and the top 20 pages are reported (Title, PR). A discussion table (User, Disc.) yields an editor graph; community detection assigns each user to a community (User, Com.), and the top communities are ranked (Com., PR).]

The same diagram is shown twice more under the headings "Tables" and "Graphs", highlighting the tabular stages and the graph stages of the pipeline in turn.

Separate Systems

Tables and graphs are handled by separate systems: dataflow systems for tables and graph systems for graphs.

[Figure: a dataflow system turning the rows of a table into a result, next to a graph system operating on a dependency graph.]

Difficult to Use

Users must learn, deploy, and manage multiple systems.

This leads to brittle and often complex interfaces.

Inefficient

Extensive data movement and duplication across the network and file system.

[Figure: the raw XML and every intermediate result passing through HDFS between stages.]

Limited reuse of internal data structures across stages.

GraphX Unifies Computation on Tables and Graphs

[Figure: GraphX layered on the Spark dataflow framework, exposing a table view and a graph view.]

This enables a single system to easily and efficiently support the entire pipeline.

[Figure: the dataflow-systems and graph-systems diagram repeated, now with a question mark.]

PageRank on the LiveJournal Graph

Runtime in seconds (PageRank, 10 iterations):

System      Runtime (s)
GraphLab    22
Spark       354
Hadoop      1340

Hadoop is 60x slower than GraphLab. Spark is 16x slower than GraphLab.

Key Question

How can we naturally express and efficiently execute graph computation in a general-purpose dataflow framework?

Distill the lessons learned from specialized graph systems in two areas: representation and optimizations.


Example Computation: PageRank

Express computation locally:

$$R[i] = 0.15 + \sum_{j \in \mathrm{InLinks}(i)} \frac{R[j]}{\mathrm{OutLinks}(j)}$$

Here R[i] is the rank of page i, 0.15 is the random reset probability, and the sum is the weighted sum of the neighbors' ranks.

Iterate until convergence.
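The local update rule can be tried on a toy graph with plain Scala collections. This is a minimal sketch, not GraphX code: the four-edge link graph, the object name `PageRankSketch`, and the `tol` and `maxIter` parameters are all assumptions made for illustration. The slide's formula has no damping factor, while the GraphX PageRank code later in the deck multiplies the message sum by 0.85; the sketch takes `damping` as a parameter and defaults it to 0.85, because without damping the iteration need not converge on a graph where every page has out-links.

```scala
object PageRankSketch {
  // Hypothetical link graph: an edge (src, dst) is a hyperlink from src to dst.
  val edges: Seq[(String, String)] =
    Seq("A" -> "B", "A" -> "C", "B" -> "C", "C" -> "A")

  val vertices: Set[String] = edges.flatMap { case (s, d) => Seq(s, d) }.toSet

  // OutLinks(j): the number of links leaving page j.
  val outLinks: Map[String, Int] =
    edges.groupBy(_._1).map { case (src, es) => src -> es.size }

  // One synchronous sweep of R[i] = reset + damping * sum_j R[j] / OutLinks(j).
  def step(ranks: Map[String, Double], reset: Double, damping: Double): Map[String, Double] = {
    val sums: Map[String, Double] = edges
      .map { case (src, dst) => dst -> ranks(src) / outLinks(src) }
      .groupMapReduce(_._1)(_._2)(_ + _)
    vertices.map(v => v -> (reset + damping * sums.getOrElse(v, 0.0))).toMap
  }

  // Iterate until no rank moves by more than tol, or maxIter sweeps have run.
  def run(reset: Double = 0.15, damping: Double = 0.85,
          tol: Double = 1e-9, maxIter: Int = 500): Map[String, Double] = {
    var ranks: Map[String, Double] = vertices.map(v => v -> 1.0).toMap
    var delta = Double.MaxValue
    var iter = 0
    while (delta > tol && iter < maxIter) {
      val next = step(ranks, reset, damping)
      delta = vertices.toSeq.map(v => math.abs(next(v) - ranks(v))).max
      ranks = next
      iter += 1
    }
    ranks
  }
}
```

Calling `PageRankSketch.run()` returns one rank per page. On this graph page C, which has two in-links, ends up with the highest rank.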


“Think like a Vertex.” - Malewicz et al., SIGMOD’10


Graph-Parallel Pattern (Gonzalez et al. [OSDI’12])

1. Gather information from neighboring vertices.
2. Apply an update to the vertex property.
3. Scatter information to neighboring vertices.

Many Graph-Parallel Algorithms

MACHINE LEARNING

Collaborative Filtering
» Alternating Least Squares
» Stochastic Gradient Descent
» Tensor Factorization

Structured Prediction
» Loopy Belief Propagation
» Max-Product Linear Programs
» Gibbs Sampling

Semi-supervised ML
» Graph SSL
» CoEM

NETWORK ANALYSIS

Community Detection
» Triangle-Counting
» K-core Decomposition
» K-Truss

Graph Analytics
» PageRank
» Personalized PageRank
» Shortest Path
» Graph Coloring

Specialized Computational Pattern

Specialized Graph Optimizations

Graph System Optimizations

•  Specialized Data-Structures
•  Vertex-Cuts Partitioning
•  Remote Caching / Mirroring
•  Message Combiners
•  Active Set Tracking

Representation
•  Distributed Graphs → Horizontally Partitioned Tables
•  Vertex Programs → Join + Dataflow Operators

Optimizations
•  Advances in Graph Processing Systems → Distributed Join Optimization, Materialized View Maintenance

Property Graph Data Model

[Figure: a property graph over the vertices A through F.]

Vertex Property:
•  User Profile
•  Current PageRank Value

Edge Property:
•  Weights
•  Timestamps

Encoding Property Graphs as Tables

The property graph is encoded as three tables, each stored as a Spark RDD and split across Machine 1 and Machine 2.

Vertex Table (RDD): one row per vertex, A through F.

Edge Table (RDD), partitioned with a vertex cut:
Part. 1: (A, B), (A, C), (C, D), (B, C)
Part. 2: (A, E), (A, F), (E, F), (E, D)

Routing Table (RDD): for each vertex, the edge partitions that reference it:
A → 1, 2
B → 1
C → 1
D → 1, 2
E → 2
F → 2
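One way to read the routing table is as a derived index over the edge partitions. The sketch below builds it with plain Scala collections rather than Spark RDDs. The assignment of the first four edges to partition 1 and the last four to partition 2 is an assumption about the lost diagram layout, and `RoutingTableSketch` and its members are hypothetical names.

```scala
object RoutingTableSketch {
  type VertexId = String
  type PartitionId = Int

  // Edge table split into two partitions (assumed layout, see the lead-in).
  val edgePartitions: Map[PartitionId, Seq[(VertexId, VertexId)]] = Map(
    1 -> Seq("A" -> "B", "A" -> "C", "C" -> "D", "B" -> "C"),
    2 -> Seq("A" -> "E", "A" -> "F", "E" -> "F", "E" -> "D")
  )

  // Routing table: for each vertex, the set of edge partitions that mention it.
  def routingTable(
      parts: Map[PartitionId, Seq[(VertexId, VertexId)]]): Map[VertexId, Set[PartitionId]] =
    parts.toSeq
      .flatMap { case (pid, es) =>
        es.flatMap { case (src, dst) => Seq(src -> pid, dst -> pid) }
      }
      .groupMap(_._1)(_._2)
      .map { case (v, pids) => v -> pids.toSet }
}
```

Vertices A and D are cut: they appear in both partitions, so their properties have to be shipped to both machines, while B, C, E, and F each live next to a single edge partition.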

Separate Properties and Structure

Reuse structural information across multiple graphs.

[Figure: transforming the vertex properties of an input graph yields a transformed graph with a new vertex table. The routing table and edge table are reused from the input graph.]

Table Operators

Table operators are inherited from Spark:

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...

Graph Operators (Scala)

class Graph[V, E] {
  def Graph(vertices: Table[(Id, V)],
            edges: Table[(Id, Id, E)])

  // Table Views -----------------
  def vertices: Table[(Id, V)]
  def edges: Table[(Id, Id, E)]
  def triplets: Table[((Id, V), (Id, V), E)]

  // Transformations ------------------------------
  def reverse: Graph[V, E]
  def subgraph(pV: (Id, V) => Boolean,
               pE: Edge[V, E] => Boolean): Graph[V, E]
  def mapV(m: (Id, V) => T): Graph[T, E]
  def mapE(m: Edge[V, E] => T): Graph[V, T]

  // Joins ----------------------------------------
  def joinV(tbl: Table[(Id, T)]): Graph[(V, T), E]
  def joinE(tbl: Table[(Id, Id, T)]): Graph[V, (E, T)]

  // Computation ----------------------------------
  def mrTriplets(mapF: (Edge[V, E]) => List[(Id, T)],
                 reduceF: (T, T) => T): Graph[T, E]
}

The mrTriplets operator captures the Gather-Scatter pattern from specialized graph-processing systems.

Triplets Join Vertices and Edges

The triplets operator joins vertices and edges:

Vertices: A, B, C, D
Edges: (A, B), (A, C), (B, C), (C, D)
Triplets: one row per edge, carrying the source vertex, the destination vertex, and the edge property.
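The join behind the triplets view can be sketched with in-memory collections. The snippet assumes a vertex table held as a `Map` and an edge table held as a `Seq` of (source, destination, property) tuples; `TripletsSketch` and `Triplet` are hypothetical names, not the GraphX classes.

```scala
object TripletsSketch {
  type Id = String

  // One row of the triplets view: both endpoints with their properties, plus the edge property.
  final case class Triplet[V, E](src: (Id, V), dst: (Id, V), attr: E)

  // Join every edge with the vertex-table rows of its source and destination.
  // Edges whose endpoints are missing from the vertex table are dropped.
  def triplets[V, E](vertices: Map[Id, V], edges: Seq[(Id, Id, E)]): Seq[Triplet[V, E]] =
    for {
      (srcId, dstId, attr) <- edges
      srcProp <- vertices.get(srcId).toSeq
      dstProp <- vertices.get(dstId).toSeq
    } yield Triplet((srcId, srcProp), (dstId, dstProp), attr)
}
```

For the four vertices and four edges on the slide, the result has four rows, one per edge, which is why the triplets view can be treated as a three-way join of the edge table with the vertex table on the source and destination ids.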

Map-Reduce Triplets

Map-Reduce triplets collects information about the neighborhood of each vertex.

Map: the map function runs on each triplet and emits a message addressed to the source or the destination vertex:

MapFunction(A → B) → (B, msg)
MapFunction(A → C) → (C, msg)
MapFunction(B → C) → (C, msg)
MapFunction(C → D) → (D, msg)

Reduce: messages addressed to the same vertex are merged by the reduce function, which acts as a message combiner:

(B, msg)
(C, msg + msg)
(D, msg)
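The map and reduce phases can be written out over plain collections. This is a sketch under stated assumptions, not the GraphX implementation: it returns the reduced messages as a `Map` instead of a new graph, every edge endpoint is assumed to be present in the vertex table, and `MrTripletsSketch` and its `Edge` class are hypothetical names modeled on the operator signature shown earlier in the deck.

```scala
object MrTripletsSketch {
  type Id = String

  // An edge together with the properties of both endpoints (a triplet).
  final case class Edge[V, E](srcId: Id, src: V, dstId: Id, dst: V, attr: E)

  def mrTriplets[V, E, T](
      vertices: Map[Id, V],
      edges: Seq[(Id, Id, E)],
      mapF: Edge[V, E] => List[(Id, T)],
      reduceF: (T, T) => T): Map[Id, T] =
    edges
      // Build the triplet for each edge.
      .map { case (s, d, a) => Edge(s, vertices(s), d, vertices(d), a) }
      // Map phase: each triplet emits zero or more (target vertex, message) pairs.
      .flatMap(mapF)
      // Reduce phase: combine all messages addressed to the same vertex.
      .groupMapReduce(_._1)(_._2)(reduceF)
}
```

As a usage example, `mrTriplets[Int, Int, Int](vs, es, e => List(e.dstId -> 1), _ + _)` counts in-links; on the slide's four edges it yields 1 for B, 2 for C, and 1 for D, matching the reduce column above.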

Using these basic GraphX operators we implemented Pregel and GraphLab in under 50 lines of code!

The GraphX Stack (Lines of Code)

Component           Lines of code
Spark               30,000
GraphX              2,500
Pregel API          34
PageRank            20
Connected Comp.     20
K-core              60
Triangle Count      50
LDA                 220
SVD++               110

Some algorithms are more naturally expressed using the GraphX primitive operators.


Join Site Selection using Routing Tables

[Figure: the vertex table (RDD), routing table (RDD), and edge table (RDD). The routing table records which edge partitions reference each vertex, so vertex rows are sent to the edge partitions that need them.]

Never shuffle edges!

Caching for Iterative mrTriplets

[Figure: each edge-table partition keeps a mirror cache of the vertices it references (A, B, C, D for one partition; A, D, E, F for the other) together with a reusable hash index. Each iteration scans the edge partition against its local mirror cache.]

Incremental Updates for Triplets View

[Figure: when vertices A and E change, only the changed values are sent to the mirror caches that hold them (A to both caches, E to one) before the edge table is scanned.]

Aggregation for Iterative mrTriplets

[Figure: after the changed vertices are sent to the mirror caches, each edge partition scans its edges and computes a local aggregate of the messages per vertex. Only the local aggregates are sent back to the vertex table.]
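The effect of local aggregation can be illustrated by counting how many messages leave each edge partition. The sketch below is a toy model with in-memory collections; the two message lists stand in for the output of the map function on two edge partitions and are made up for illustration, as are the names `LocalAggregateSketch`, `localAggregate`, and `merge`.

```scala
object LocalAggregateSketch {
  type Id = String

  // Combine messages inside each edge partition before they are shipped.
  def localAggregate[T](partitions: Seq[Seq[(Id, T)]], reduceF: (T, T) => T): Seq[Map[Id, T]] =
    partitions.map(_.groupMapReduce(_._1)(_._2)(reduceF))

  // Merge the per-partition aggregates at the vertex table.
  def merge[T](partials: Seq[Map[Id, T]], reduceF: (T, T) => T): Map[Id, T] =
    partials.flatMap(_.toSeq).groupMapReduce(_._1)(_._2)(reduceF)
}
```

With eight raw messages spread over two partitions and three distinct target vertices per partition, only six partial aggregates cross the network, and the merged result is the same as reducing all eight messages directly.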

Reduction in Communication Due to Cached Updates

[Figure: network communication (MB, log scale from 0.1 to 10000) per iteration (0 to 16) for Connected Components on the Twitter graph.]

Most vertices are within 8 hops of all vertices in their component.

Benefit of Indexing Active Vertices

[Figure: runtime (seconds, 0 to 30) per iteration (0 to 16) for Connected Components on the Twitter graph, with active vertex tracking and without active tracking.]
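Active vertex tracking can be illustrated with a small label-propagation version of connected components. This is a sketch with plain Scala collections, not the GraphX implementation: it only shows that the number of edges scanned shrinks once most labels have settled. The object name `ActiveSetCC` and the choice to return the per-iteration scan counts are assumptions made for the example.

```scala
object ActiveSetCC {
  type Id = Int

  // Label propagation: every vertex ends with the smallest vertex id in its component.
  // Returns the final labels and the number of edges scanned in each iteration.
  def run(vertices: Set[Id], edges: Seq[(Id, Id)]): (Map[Id, Id], List[Int]) = {
    // Undirected adjacency lists built from the edge list.
    val neighbors: Map[Id, Seq[Id]] =
      (edges ++ edges.map(_.swap)).groupMap(_._1)(_._2).withDefaultValue(Seq.empty)

    var labels: Map[Id, Id] = vertices.map(v => v -> v).toMap
    var active: Set[Id] = vertices // vertices whose label changed in the last round
    var scanned: List[Int] = Nil

    while (active.nonEmpty) {
      // Only edges leaving active vertices are scanned.
      val messages: Seq[(Id, Id)] =
        active.toSeq.flatMap(v => neighbors(v).map(n => n -> labels(v)))
      scanned = messages.size :: scanned
      // Combine messages per target vertex by keeping the smallest label.
      val best: Map[Id, Id] = messages.groupMapReduce(_._1)(_._2)(_ min _)
      val changed = best.filter { case (v, l) => l < labels(v) }
      labels = labels ++ changed
      active = changed.keySet
    }
    (labels, scanned.reverse)
  }
}
```

On the graph with edges (1, 2), (2, 3), and (4, 5), the scan counts fall from 6 to 4 to 1 across the three iterations, whereas scanning every edge in both directions would cost 6 each time.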

Join Elimination

Identify and bypass joins for unused triplet fields
»  Java bytecode inspection

[Figure: communication (MB, 0 to 14000) per iteration (0 to 20) for PageRank on Twitter, with join elimination and without join elimination. Lower is better.]

Factor of 2 reduction in communication.

Additional Optimizations

Indexing and Bitmaps:
» To accelerate joins across graphs
» To efficiently construct sub-graphs

Lineage-based fault tolerance:
» Exploits Spark lineage to recover in parallel
» Eliminates need for costly checkpoints

Substantial Index and Data Reuse:
» Reuse routing tables across graphs and sub-graphs
» Reuse edge adjacency information and indices

System Comparison

Goal: Demonstrate that GraphX achieves performance parity with specialized graph-processing systems.

Setup: 16-node EC2 cluster (m2.4xLarge) + 1GigE. Compare against GraphLab/PowerGraph (C++), Giraph (Java), and Spark (Java/Scala).

PageRank Benchmark

[Figure: runtime (seconds) on the Twitter graph (42M vertices, 1.5B edges; axis 0 to 3500) and on the UK graph (106M vertices, 3.7B edges; axis 0 to 9000), with annotations 7x and 18x. EC2 cluster of 16 x m2.4xLarge (8 cores) + 1GigE.]

GraphX performs comparably to state-of-the-art graph processing systems.

Connected Comp. Benchmark

[Figure: runtime (seconds) on the Twitter graph (42M vertices, 1.5B edges; axis 0 to 2500) and on the UK graph (106M vertices, 3.7B edges; axis 0 to 9000), with annotations 8x and 10x and one bar marked Out-of-Memory. EC2 cluster of 16 x m2.4xLarge (8 cores) + 1GigE.]

GraphX performs comparably to state-of-the-art graph processing systems.

Graphs are just one stage...

What about a pipeline?

A Small Pipeline in GraphX

[Figure: raw Wikipedia XML is read from HDFS, preprocessed in Spark into hyperlinks, PageRank is computed, and Spark post-processing writes the top 20 pages back to HDFS.]

Total runtime (in seconds), timed end-to-end:

System              Runtime (s)
GraphX              342
GraphLab + Spark    375
Giraph + Spark      605
Spark               1492

GraphX is the fastest.

Adoption and Impact

GraphX is now part of Apache Spark

•  Part of Cloudera Hadoop Distribution

In production at Alibaba Taobao

•  Order of magnitude gains over Spark

Inspired GraphLab Inc. SFrame technology

•  Unifies Tables & Graphs on Disk

GraphX → Unified Tables and Graphs

Enabling users to easily and efficiently express the entire analytics pipeline

New API: Blurs the distinction between Tables and Graphs

New System: Unifies Data-Parallel and Graph-Parallel Systems

What did we Learn?

[Figure: specialized systems (graph systems) set against integrated frameworks (GraphX).]

Future Work

[Figure: the same comparison with the parameter server added as a specialized system and a question mark in place of its integrated counterpart.]

Asynchrony, non-determinism, shared state


Thank You

[email protected]

http://amplab.cs.berkeley.edu/projects/graphx/

Reynold Xin

Ankur Dave

Daniel Crankshaw

Michael Franklin

Ion Stoica

Related Work

Specialized Graph-Processing Systems:
GraphLab [UAI’10], Pregel [SIGMOD’10], Signal-Collect [ISWC’10], Combinatorial BLAS [IJHPCA’11], GraphChi [OSDI’12], PowerGraph [OSDI’12], Ligra [PPoPP’13], X-Stream [SOSP’13]

Alternative to Dataflow framework:
Naiad [SOSP’13]: GraphLINQ
Hyracks: Pregelix [VLDB’15]

Distributed Join Optimization:
Multicast Join [Afrati et al., EDBT’10]
Semi-Join in MapReduce [Blanas et al., SIGMOD’10]

Edge Files Have Locality

[Figure: runtime (seconds, 0 to 1200) on the UK graph (106M vertices, 3.7B edges) for GraphLab, GraphX + Shuffle, and GraphX.]

GraphLab rebalances the edge files on load.

GraphX preserves the on-disk layout through Spark → better vertex cut.

Scalability

[Figure: runtime (0 to 500) against the number of EC2 nodes (8 to 64) on the Twitter graph (42M vertices, 1.5B edges), compared with linear scaling.]

Scales slightly better than PowerGraph/GraphLab.

Apache Spark Dataflow Platform

Resilient Distributed Datasets (RDD): Zaharia et al., NSDI’12

[Figure: data is loaded from HDFS into RDDs, transformed by map, and combined by reduce into a further RDD.]

Optimized for iterative access to data: .cache() persists an RDD in memory.


Shared Memory Advantage

Spark shared-nothing model: the cores on a machine exchange data through shuffle files.

GraphLab shared memory: the cores on a machine share one de-serialized in-memory graph.

[Figure: runtime (seconds, 0 to 500) on the Twitter graph (42M vertices, 1.5B edges) for GraphLab, GraphLab NoSHM, and GraphX. In the NoSHM configuration the GraphLab cores communicate over TCP/IP instead of shared memory.]


Fault-Tolerance

Leverage Spark Fault-Tolerance Mechanism

[Figure: runtime (seconds, 0 to 1000) for three cases: No Failure, Lineage, and Restart.]

Graph-Processing Systems

[Figure: logos and names of graph-processing systems, including Google, CombBLAS, Kineograph, X-Stream, GraphChi, Ligra, and GPS.]

Expose specialized API to simplify graph programming.

Vertex-Program Abstraction

Pregel_PageRank(i, messages):
    // Receive all the messages
    total = 0
    foreach (msg in messages):
        total = total + msg

    // Update the rank of this vertex
    R[i] = 0.15 + total

    // Send new messages to neighbors
    foreach (j in out_neighbors[i]):
        Send msg(R[i]) to vertex j

The Vertex-Program Abstraction (Low, Gonzalez, et al. [UAI’10])

GraphLab_PageRank(i):
    // Compute sum over neighbors
    total = 0
    foreach (j in neighbors(i)):
        total += R[j] * wji

    // Update the PageRank
    R[i] = 0.15 + total

[Figure: vertex 1 summing R[4] * w41 and the corresponding terms from its other neighbors among vertices 2, 3, and 4.]

Example: Oldest Follower

Calculate the number of older followers for each user.

[Figure: a follower graph over the users A through F, each labeled with an age.]

val olderFollowerAge = graph
  .mrTriplets(
    // Map: assuming an edge src -> dst means "src follows dst",
    // send a 1 to the followed user when the follower is older.
    e => if (e.src.age > e.dst.age) List((e.dstId, 1)) else List.empty,
    // Reduce: add up the counts per user.
    (a, b) => a + b)
  .vertices

Enhanced Pregel in GraphX

pregelPR(i, messageList):
    // Receive all the messages
    total = 0
    foreach (msg in messageList):
        total = total + msg

    // Update the rank of this vertex
    R[i] = 0.15 + total

    // Send new messages to neighbors
    foreach (j in out_neighbors[i]):
        Send msg(R[i]/E[i,j]) to vertex j

GraphX makes two changes to this vertex program.

Require message combiners: the vertex program receives messageSum, the combined value of its incoming messages, in place of messageList.

Remove message computation from the vertex program: each message is computed per edge by a separate sendMsg function.

sendMsg(i→j, R[i], R[j], E[i,j]):
    // Compute single message
    return msg(R[i]/E[i,j])

combineMsg(a, b):
    // Compute sum of two messages
    return a + b
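The three-function split can be modeled as a small synchronous loop over in-memory collections. This is a sketch of the idea only, not the GraphX Pregel operator: the `pregel` signature, the `PregelSketch` object, and the convention that a vertex with no incoming message keeps its state are assumptions made for the example.

```scala
object PregelSketch {
  type Id = String

  def pregel[V, M](
      vertices: Map[Id, V],
      edges: Seq[(Id, Id)],
      initialMsg: M,
      iterations: Int)(
      vprog: (Id, V, M) => V,                  // vertex program: sees only the combined message
      sendMsg: (Id, V, Id, V) => Option[M],    // per-edge message for the destination vertex
      combineMsg: (M, M) => M): Map[Id, V] = { // message combiner
    // Every vertex first receives the initial message.
    var state: Map[Id, V] = vertices.map { case (id, v) => id -> vprog(id, v, initialMsg) }
    for (_ <- 1 to iterations) {
      // Compute one message per edge and combine the messages per destination.
      val inbox: Map[Id, M] = edges
        .flatMap { case (s, d) => sendMsg(s, state(s), d, state(d)).map(m => d -> m) }
        .groupMapReduce(_._1)(_._2)(combineMsg)
      // Run the vertex program where a combined message arrived.
      state = state.map { case (id, v) =>
        id -> inbox.get(id).fold(v)(m => vprog(id, v, m))
      }
    }
    state
  }
}
```

PageRank then needs a vertex state of (rank, out-degree), a vertex program `(_, v, msgSum) => (0.15 + 0.85 * msgSum, v._2)`, a send function `(_, src, _, _) => Some(src._1 / src._2)`, and the combiner `_ + _`. Because the message computation sits outside the vertex program, the runtime is free to evaluate it next to the edges and combine messages before shipping them.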

PageRank in GraphX

// Load and initialize the graph
val graph = GraphBuilder.text("hdfs://web.txt")
val prGraph = graph.joinVertices(graph.outDegrees)

// Implement and run PageRank
val pageRank =
  prGraph.pregel(initialMessage = 0.0, iter = 10)(
    // Vertex program: new rank from the combined message
    (oldV, msgSum) => 0.15 + 0.85 * msgSum,
    // Send message: the source's rank divided by its out-degree
    triplet => triplet.src.pr / triplet.src.deg,
    // Combine messages: sum
    (msgA, msgB) => msgA + msgB)

Example Analytics Pipeline

// Load raw data tables
val articles = sc.textFile("hdfs://wiki.xml").map(xmlParser)
val links = articles.flatMap(article => article.outLinks)

// Build the graph from tables
val graph = new Graph(articles, links)

// Run PageRank Algorithm
val pr = graph.PageRank(tol = 1.0e-5)

// Extract and print the top 20 articles
val topArticles = articles.join(pr).top(20).collect
for ((article, pageRank) <- topArticles) {
  println(article.title + "\t" + pageRank)
}

Apache Spark Dataflow Platform: Lineage

Resilient Distributed Datasets (RDD): Zaharia et al., NSDI’12

[Figure: data is loaded from HDFS into RDDs, transformed by map, and combined by reduce into a further RDD.]

Lineage: the chain of operations that produced an RDD (load from HDFS, map, reduce) is recorded, so a lost RDD can be recomputed from its inputs.

.cache() persists an RDD in memory.
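Lineage-based recovery can be illustrated with a toy dataset type that remembers how it was computed. The sketch below is not Spark's RDD implementation; `LineageSketch`, `Dataset`, `Source`, `Mapped`, `Cached`, and `evict` are hypothetical names, and eviction stands in for losing the in-memory copy on a failed machine.

```scala
object LineageSketch {
  // A dataset is described by how to compute it, not only by its contents.
  sealed trait Dataset[A] {
    def compute(): Seq[A]
    def map[B](f: A => B): Dataset[B] = Mapped(this, f)
    def reduce(f: (A, A) => A): A = compute().reduce(f)
    def cache(): Cached[A] = new Cached(this)
  }

  // Lineage root: knows how to load the data (for example from a file system).
  final case class Source[A](load: () => Seq[A]) extends Dataset[A] {
    def compute(): Seq[A] = load()
  }

  // Lineage step: remembers its parent and the function applied to it.
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def compute(): Seq[B] = parent.compute().map(f)
  }

  // Keeps the computed data in memory; recomputes from lineage if it is lost.
  final class Cached[A](parent: Dataset[A]) extends Dataset[A] {
    private var data: Option[Seq[A]] = None
    def compute(): Seq[A] = {
      if (data.isEmpty) data = Some(parent.compute())
      data.get
    }
    def evict(): Unit = { data = None } // simulate losing the in-memory copy
  }
}
```

Repeated reductions over a cached dataset load the source once. After `evict()`, the next reduction walks the lineage back to the source and reloads it, with no checkpoint involved.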