GraphX: Graph Analytics on Spark


Joseph Gonzalez, Reynold Xin, Ion Stoica, Michael Franklin

Developed at the UC Berkeley AMPLab

AMPCamp: August 29, 2013

Graphs are Essential to Data Mining and Machine Learning

• Identify influential people and information
• Find communities
• Understand people’s shared interests
• Model complex data dependencies

Predicting Political Bias

[Figure: a graph of posts, some labeled Liberal or Conservative and the rest marked "?" (unknown).]

Conditional Random Field

Belief Propagation

Triangle Counting

Count the triangles passing through each vertex.

Measures “cohesiveness” of local community:

• More triangles: stronger community
• Fewer triangles: weaker community
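As an illustration of the per-vertex triangle count described above, here is a minimal single-machine sketch in plain Scala. It is not the GraphX or GraphLab implementation; the object name, the edge representation, and the toy data are assumptions made for this example.

```scala
object TriangleCountSketch {
  // Undirected edges as pairs of vertex ids (hypothetical toy representation).
  type Edge = (Int, Int)

  /** Number of triangles passing through each vertex of an undirected graph. */
  def trianglesPerVertex(edges: Seq[Edge]): Map[Int, Int] = {
    // Build neighbor sets, ignoring self-loops and duplicate edges.
    val neighbors: Map[Int, Set[Int]] =
      edges
        .filter { case (u, v) => u != v }
        .flatMap { case (u, v) => Seq((u, v), (v, u)) }
        .groupBy(_._1)
        .map { case (u, pairs) => (u, pairs.map(_._2).toSet) }

    // For vertex u, every pair of neighbors (v, w) that are themselves
    // adjacent closes exactly one triangle through u.
    neighbors.map { case (u, nbrs) =>
      val closed = for {
        v <- nbrs.toSeq
        w <- nbrs.toSeq
        if v < w && neighbors(v).contains(w)
      } yield (v, w)
      (u, closed.size)
    }
  }

  def main(args: Array[String]): Unit = {
    // Triangle 1-2-3 plus a pendant vertex 4 attached to vertex 3.
    val edges = Seq((1, 2), (2, 3), (1, 3), (3, 4))
    trianglesPerVertex(edges).toSeq.sortBy(_._1).foreach(println)
  }
}
```

In the toy graph, vertices 1, 2 and 3 each sit on one triangle, while vertex 4 sits on none, which matches the "more triangles, stronger community" reading.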

Collaborative Filtering

[Figure: a graph linking Users to Items, with Ratings on the edges.]

Many More Graph Algorithms

• Collaborative Filtering
  – Alternating Least Squares
  – Stochastic Gradient Descent
  – Tensor Factorization
  – SVD

• Structured Prediction
  – Loopy Belief Propagation
  – Max-Product Linear Programs
  – Gibbs Sampling

• Semi-supervised ML
  – Graph SSL
  – CoEM

• Graph Analytics
  – PageRank
  – Single Source Shortest Path
  – Triangle-Counting
  – Graph Coloring
  – K-core Decomposition
  – Personalized PageRank

• Classification
  – Neural Networks
  – Lasso
  …

Structure of Computation

[Figure: Data-Parallel (a Table of Rows producing a Result) contrasted with Graph-Parallel (a Dependency Graph, labeled Pregel).]

The Graph-Parallel Abstraction

A user-defined Vertex-Program runs on each vertex.

Graph constrains interaction along edges:

• Using messages (e.g., Pregel [PODC’09, SIGMOD’10])
• Through shared state (e.g., GraphLab [UAI’10, VLDB’12])

Parallelism: run multiple vertex programs simultaneously
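To make the abstraction concrete, the following is a small sketch of a message-passing vertex program in plain Scala, using PageRank as the example. It runs sequentially on in-memory collections and is not Pregel's or GraphLab's actual API; the object name, function names, and the unnormalized rank formula are assumptions chosen for illustration.

```scala
object VertexProgramSketch {
  // Directed edges (src, dst); toy representation, not from the slides.
  type Edge = (Int, Int)

  /** One synchronous superstep: every vertex sends rank / outDegree along
    * its out-edges, then each vertex runs the same program on its inbox. */
  def superstep(ranks: Map[Int, Double], edges: Seq[Edge], damping: Double): Map[Int, Double] = {
    val outDegree: Map[Int, Int] =
      edges.groupBy(_._1).map { case (v, es) => (v, es.size) }

    // Message phase: interaction happens only along edges.
    val inbox: Map[Int, Seq[Double]] =
      edges
        .map { case (src, dst) => (dst, ranks(src) / outDegree(src)) }
        .groupBy(_._1)
        .map { case (dst, msgs) => (dst, msgs.map(_._2)) }

    // Vertex-program phase: each vertex sees only its own state and its
    // messages, so all vertices could be processed in parallel.
    ranks.map { case (v, _) =>
      val received = inbox.getOrElse(v, Seq.empty[Double]).sum
      (v, (1.0 - damping) + damping * received)
    }
  }

  /** Runs a fixed number of supersteps starting from rank 1.0 everywhere. */
  def pageRank(vertices: Set[Int], edges: Seq[Edge], iterations: Int, damping: Double = 0.85): Map[Int, Double] = {
    val initial: Map[Int, Double] = vertices.map(v => (v, 1.0)).toMap
    (1 to iterations).foldLeft(initial)((ranks, _) => superstep(ranks, edges, damping))
  }
}
```

Because the update for a vertex depends only on messages that arrived along its in-edges, the per-vertex work inside a superstep is independent, which is the source of the parallelism noted above.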

By exploiting graph structure, graph-parallel systems can be orders of magnitude faster.

Triangle Counting on Twitter

40M users, 1.4 billion links

Counted: 34.8 billion triangles

• Hadoop [WWW’11]: 1536 machines, 423 minutes
• GraphLab: 64 machines, 15 seconds

1000x faster

S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11

Specialized Graph Systems

1. APIs to capture complex graph dependencies

2. Exploit graph structure to reduce communication and computation

Why GraphX?

The Bigger Picture

[Figure: "Time Spent in Data Pipeline" bar chart (0% to 100%) for a pipeline using GraphLab and Hadoop, with segments Graph Creation, Graph Algorithms, and Post-Processing; the graph is built from Vertices and Edges.]

Limitations of Specialized Graph-Parallel Systems

• No support for construction and post-processing
• Not interactive
• Requires maintaining multiple platforms

Spark excels at these!

GraphX Unifies Data-Parallel and Graph-Parallel Systems

• Spark Table API: RDDs, fault tolerance, and task scheduling
• GraphLab Graph API: graph representation and execution

[Figure: pipeline bar (0% to 100%) covering Graph Construction, Computation, and Post-Processing.]

One system for the entire graph pipeline

Enable Joining Tables and Graphs

[Figure: Tables and Graphs combined in one pipeline. Labels: User Data, Product Ratings, Friend Graph, ETL, Product Rec. Graph, Join, Inf., Prod. Rec.]

The GraphX Resilient Distributed Graph

Vertex table:

Id        Attribute (V)
rxin      (Stu., Berk.)
jegonzal  (PstDoc, Berk.)
franklin  (Prof., Berk.)
istoica   (Prof., Berk.)

Edge table:

SrcId     DstId     Attribute (E)
rxin      jegonzal  Friend
franklin  rxin      Advisor
istoica   franklin  Coworker
franklin  jegonzal  PI
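The vertex and edge tables on this slide can be modeled with ordinary Scala collections. The sketch below is an assumption-laden illustration, not the GraphX representation: it uses in-memory Seq values instead of RDDs, the row pairing is as read from the slide, and the object and type alias names are invented for the example. It also shows how a triplets view can be derived by joining each edge with the attributes of both endpoints.

```scala
object PropertyGraphSketch {
  type Id = String
  type V = (String, String) // (role, affiliation)
  type E = String           // relationship label

  // Vertex table: (Id, V).
  val vertices: Seq[(Id, V)] = Seq(
    ("rxin", ("Stu.", "Berk.")),
    ("jegonzal", ("PstDoc", "Berk.")),
    ("franklin", ("Prof.", "Berk.")),
    ("istoica", ("Prof.", "Berk."))
  )

  // Edge table: (SrcId, DstId, E).
  val edges: Seq[(Id, Id, E)] = Seq(
    ("rxin", "jegonzal", "Friend"),
    ("franklin", "rxin", "Advisor"),
    ("istoica", "franklin", "Coworker"),
    ("franklin", "jegonzal", "PI")
  )

  /** Triplets view: each edge joined with the attributes of both endpoints. */
  def triplets: Seq[((Id, V), (Id, V), E)] = {
    val attr: Map[Id, V] = vertices.toMap
    edges.map { case (src, dst, e) => ((src, attr(src)), (dst, attr(dst)), e) }
  }

  def main(args: Array[String]): Unit =
    triplets.foreach(println)
}
```

Keeping vertices and edges as two separate tables is what lets the same data be treated either as relational rows or as a graph.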

GraphX API

    class Graph[V, E] {
      // Table Views -----------------
      def vertices: RDD[(Id, V)]
      def edges: RDD[(Id, Id, E)]
      def triplets: RDD[((Id, V), (Id, V), E)]

      // Transformations ------------------------------
      def reverse: Graph[V, E]
      def filterV(p: (Id, V) => Boolean): Graph[V, E]
      def filterE(p: Edge[V, E] => Boolean): Graph[V, E]
      def mapV[T](m: (Id, V) => T): Graph[T, E]
      def mapE[T](m: Edge[V, E] => T): Graph[V, T]

      // Joins ----------------------------------------
      def joinV[T](tbl: RDD[(Id, T)]): Graph[(V, Opt[T]), E]
      def joinE[T](tbl: RDD[(Id, Id, T)]): Graph[V, (E, Opt[T])]

      // Computation ----------------------------------
      def aggregateNeighbors[T](mapF: (Edge[V, E]) => T,
                                reduceF: (T, T) => T,
                                direction: EdgeDir): Graph[T, E]
    }

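Aggregate Neighbors

Map-Reduce for each vertex

[Figure: graph with vertices A to F. For vertex A with neighbors B and C, mapF(A, B) produces a1 and mapF(A, C) produces a2; reduceF(a1, a2) gives the result for A.]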
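The per-vertex map-reduce pictured above can be sketched on plain Scala collections as follows. This is a simplified stand-in, not the GraphX implementation: it returns a Map rather than a Graph[T, E], and the Triplet case class, the EdgeDir objects, and the toy data are assumptions introduced for this example.

```scala
object AggregateNeighborsSketch {
  type Id = String

  // An edge together with both endpoint attributes (hypothetical helper type).
  final case class Triplet[V, E](srcId: Id, srcAttr: V, dstId: Id, dstAttr: V, attr: E)

  sealed trait EdgeDir
  case object InEdges extends EdgeDir
  case object OutEdges extends EdgeDir

  /** For each vertex, apply mapF to its in- or out-edges and combine the
    * results with reduceF. Vertices with no edges in the chosen direction
    * are absent from the result. */
  def aggregateNeighbors[V, E, T](
      triplets: Seq[Triplet[V, E]],
      mapF: Triplet[V, E] => T,
      reduceF: (T, T) => T,
      direction: EdgeDir): Map[Id, T] =
    triplets
      .map { t =>
        val target = direction match {
          case InEdges  => t.dstId // aggregate at the edge's destination
          case OutEdges => t.srcId // aggregate at the edge's source
        }
        (target, mapF(t))
      }
      .groupBy(_._1)
      .map { case (id, vals) => (id, vals.map(_._2).reduce(reduceF)) }

  def main(args: Array[String]): Unit = {
    // Toy graph: B -> A, C -> A, A -> D (vertex attribute is unused here).
    val ts = Seq(
      Triplet("B", 0, "A", 0, "e1"),
      Triplet("C", 0, "A", 0, "e2"),
      Triplet("A", 0, "D", 0, "e3")
    )
    // In-degree: map every in-edge to 1, then sum per vertex.
    println(aggregateNeighbors[Int, String, Int](ts, _ => 1, _ + _, InEdges))
  }
}
```

The explicit type arguments at the call site keep the lambdas type-checkable on both Scala 2 and Scala 3.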

Example: Oldest Follower

What is the age of the oldest follower for each user?

    val followerAge = graph.aggNbrs(
      e => e.src.age,   // MapF
      max(_, _),        // ReduceF
      InEdges).vertices

[Figure: graph with vertices A to F; the vertex ages shown are 23, 42, 30, 19, 75, and 16.]

We can express both Pregel and GraphLab using aggregateNeighbors in 40 lines of code!
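As a rough illustration of how a Pregel-style loop can sit on top of a neighbor aggregation primitive, here is a sketch in plain Scala. It is not the 40-line GraphX implementation referred to above; the function names, the in-memory Map state, and the connected-components example are all assumptions made for this sketch.

```scala
object PregelOnAggregateSketch {
  type Id = Int
  type Edge = (Id, Id) // (src, dst)

  /** Per-vertex map-reduce over in-edges, given the current vertex state. */
  def aggregateInNeighbors[S, M](
      state: Map[Id, S],
      edges: Seq[Edge],
      mapF: (S, S) => M, // (source state, destination state) => message
      reduceF: (M, M) => M): Map[Id, M] =
    edges
      .map { case (src, dst) => (dst, mapF(state(src), state(dst))) }
      .groupBy(_._1)
      .map { case (dst, msgs) => (dst, msgs.map(_._2).reduce(reduceF)) }

  /** Pregel-style loop: aggregate messages, apply the vertex program, and
    * repeat until no vertex changes or maxIterations is reached. */
  def pregel[S, M](initial: Map[Id, S], edges: Seq[Edge], maxIterations: Int)(
      sendMsg: (S, S) => M,
      mergeMsg: (M, M) => M,
      vertexProgram: (S, M) => S): Map[Id, S] = {
    var state = initial
    var changed = true
    var i = 0
    while (changed && i < maxIterations) {
      val inbox = aggregateInNeighbors(state, edges, sendMsg, mergeMsg)
      // Vertices without messages keep their previous state.
      val next = state.map { case (v, s) =>
        (v, inbox.get(v).map(m => vertexProgram(s, m)).getOrElse(s))
      }
      changed = next != state
      state = next
      i += 1
    }
    state
  }

  /** Connected components by propagating the smallest vertex id. */
  def connectedComponents(vertices: Set[Id], undirected: Seq[Edge]): Map[Id, Id] = {
    // Treat each undirected edge as two directed edges.
    val edges = undirected.flatMap { case (u, v) => Seq((u, v), (v, u)) }
    val initial: Map[Id, Id] = vertices.map(v => (v, v)).toMap
    pregel[Id, Id](initial, edges, vertices.size)(
      (srcLabel, _) => srcLabel,
      (a, b) => math.min(a, b),
      (label, msg) => math.min(label, msg)
    )
  }
}
```

For example, connectedComponents(Set(1, 2, 3, 4, 5), Seq((1, 2), (2, 3), (4, 5))) labels vertices 1 to 3 with 1 and vertices 4 and 5 with 4. The loop itself contains no graph-specific logic beyond the aggregation call, which is the point the slide makes about building higher-level abstractions from one primitive.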

Performance Optimizations

Replicate & co-partition vertices with edges

» GraphLab (PowerGraph) style vertex-cut partitioning

» Minimize communication by avoiding edge data movement in JOINs

In-memory hash index for fast joins
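The idea behind vertex-cut partitioning (each edge lives in exactly one partition, and vertices are replicated wherever their edges land) can be sketched as below. This is only an illustration under stated assumptions: the hash-based placement rule, the object name, and the function names are hypothetical, and it does not reproduce the PowerGraph or GraphX placement heuristics.

```scala
object VertexCutSketch {
  type Id = Int
  type Edge = (Id, Id)

  /** Assign every edge to exactly one partition, so edge data never moves. */
  def partitionEdges(edges: Seq[Edge], numPartitions: Int): Map[Int, Seq[Edge]] =
    edges.groupBy { case (src, dst) =>
      // Hypothetical placement rule: a simple hash of the endpoint pair.
      java.lang.Math.floorMod((src, dst).hashCode, numPartitions)
    }

  /** For each vertex, the set of partitions that need a replica of it. */
  def replicas(partitions: Map[Int, Seq[Edge]]): Map[Id, Set[Int]] =
    partitions.toSeq
      .flatMap { case (p, es) =>
        es.flatMap { case (src, dst) => Seq((src, p), (dst, p)) }
      }
      .groupBy(_._1)
      .map { case (v, ps) => (v, ps.map(_._2).toSet) }

  /** Average number of replicas per vertex; lower means less communication. */
  def replicationFactor(r: Map[Id, Set[Int]]): Double =
    r.values.map(_.size).sum.toDouble / r.size

  def main(args: Array[String]): Unit = {
    val edges = Seq((1, 2), (1, 3), (1, 4), (2, 3), (4, 5))
    val parts = partitionEdges(edges, 2)
    println(replicationFactor(replicas(parts)))
  }
}
```

A high-degree vertex ends up replicated across several partitions while its edges stay put, which is the trade-off that co-partitioning vertices with edges is meant to exploit.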

Early Performance

Runtime in seconds, PageRank for 10 iterations:

• GraphLab: 22
• GraphX: 165
• Hadoop: 1340

In Progress Optimizations

Byte-code inspection of user functions

» E.g., if mapF does not need edge data, we can rewrite the query to delay the join

Execution strategies optimizer

» Scan edges, randomly accessing vertices

» Scan vertices, randomly accessing edges

Current Implementation

[Figure: implementation stack. Labels: Spark (relational operators); GraphX; Pregel (20); GraphLab (20); PageRank (5); Connected Comp. (10); Shortest Path (10); ALS (40).]

Demo

Reynold Xin

Summary

1. Graph-parallel primitives on Spark.

2. Currently slower than GraphLab, but

» No need for specialized systems

» Easier ETL, and easier consumption of output

» Interactive graph data mining

3. Future work will bring performance closer to specialized engines.

Status

Currently finalizing the APIs

» Feedback wanted: http://bit.ly/graph-api

Also working on improving system performance

Will be part of Spark 0.9

Questions?

jegonzal@eecs.berkeley.edu

rxin@eecs.berkeley.edu

Backup slides

Vertex Cut Partitioning

aggregateNeighbors

Example: Vertex Degree

A: 5, B: 0, C: 0, D: 0, E: 0, F: 0


Specialized Graph Systems

• Pregel: messaging [PODC’09, SIGMOD’10]
• GraphLab: shared state [UAI’10, VLDB’12]
• Many others: Giraph, Stanford GPS, Signal-Collect, Combinatorial BLAS, BoostPGL, …

