GraphX: Graph Analytics on Spark

Joseph Gonzalez, Reynold Xin, Ion Stoica, Michael Franklin
Developed at the UC Berkeley AMPLab

AMPCamp: August 29, 2013

Graphs are Essential to Data Mining and Machine Learning

Identify influential people and information
Find communities
Understand people’s shared interests
Model complex data dependencies

Predicting Political Bias

Labels: Liberal, Conservative

Conditional Random Field
Belief Propagation

Triangle Counting

Count the triangles passing through each vertex.
Measures “cohesiveness” of local community:
More triangles: stronger community
Fewer triangles: weaker community
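The per-vertex triangle count described above can be sketched on plain Scala collections. This is a minimal illustration, not the GraphX or GraphLab implementation: the object name, the helper names, and the five-edge graph are all assumptions made for the example.

```scala
object TriangleCount {
  // Undirected edges of a small hypothetical graph, one pair per edge.
  val edges: Seq[(Int, Int)] = Seq((1, 2), (2, 3), (1, 3), (3, 4), (4, 5))

  // Adjacency sets: vertex -> set of its neighbors.
  def neighbors(es: Seq[(Int, Int)]): Map[Int, Set[Int]] =
    es.flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

  // For every vertex, count the pairs of its neighbors that are themselves
  // connected; each such pair closes one triangle through the vertex.
  def trianglesPerVertex(es: Seq[(Int, Int)]): Map[Int, Int] = {
    val nbrs = neighbors(es)
    nbrs.map { case (v, ns) =>
      val closedPairs = for {
        a <- ns.toSeq
        b <- ns.toSeq
        if a < b && nbrs(a).contains(b)
      } yield (a, b)
      v -> closedPairs.size
    }
  }

  val counts: Map[Int, Int] = trianglesPerVertex(edges)
}
```

In this toy graph, vertices 1, 2, and 3 form the only triangle, so they belong to the more cohesive part of the graph, while vertices 4 and 5 have a count of zero.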

Collaborative Filtering

[Figure: Ratings, Item]

Many More Graph Algorithms

• Collaborative Filtering
  – Alternating Least Squares
  – Stochastic Gradient Descent
  – Tensor Factorization
  – SVD
• Structured Prediction
  – Loopy Belief Propagation
  – Max-Product Linear Programs
  – Gibbs Sampling
• Semi-supervised ML
  – Graph SSL
  – CoEM
• Graph Analytics
  – PageRank
  – Single Source Shortest Path
  – Triangle-Counting
  – Graph Coloring
  – K-core Decomposition
  – Personalized PageRank
• Classification
  – Neural Networks
  – Lasso
  …

Structure of Computation

[Figure: Data-Parallel versus Graph-Parallel; labels: Dependency Graph, Result, Pregel]

The Graph-Parallel Abstraction

A user-defined Vertex-Program runs on each vertex
Graph constrains interaction along edges
» Using messages (e.g. Pregel [PODC’09, SIGMOD’10])
» Through shared state (e.g., GraphLab [UAI’10, VLDB’12])
Parallelism: run multiple vertex programs simultaneously
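A message-based vertex program can be sketched in a few lines of plain Scala. The example below is an illustration only and uses none of the Pregel or GraphLab APIs. It assumes a label-propagation program that finds connected components, and every name in it (`vertexProgram`, `superstep`, `run`) is hypothetical. Unlike a real Pregel engine, it re-sends messages from every vertex on every superstep instead of letting vertices vote to halt.

```scala
import scala.annotation.tailrec

object VertexProgramSketch {
  type Id = Int

  // Hypothetical undirected graph with two components: {1, 2, 3} and {4, 5}.
  val edges: Seq[(Id, Id)] = Seq((1, 2), (2, 3), (4, 5))
  val vertexIds: Set[Id] = edges.flatMap { case (a, b) => Seq(a, b) }.toSet

  // Vertex program: keep the smallest label seen so far.
  def vertexProgram(current: Id, incoming: Seq[Id]): Id =
    (current +: incoming).min

  // One superstep: every vertex sends its label along its edges, then each
  // vertex runs the vertex program on the messages it received.
  def superstep(labels: Map[Id, Id]): Map[Id, Id] = {
    val messages: Map[Id, Seq[Id]] = edges
      .flatMap { case (a, b) => Seq(b -> labels(a), a -> labels(b)) }
      .groupBy(_._1)
      .map { case (dst, msgs) => dst -> msgs.map(_._2) }
    labels.map { case (v, label) =>
      v -> vertexProgram(label, messages.getOrElse(v, Seq.empty))
    }
  }

  // Repeat supersteps until no label changes.
  @tailrec
  def run(labels: Map[Id, Id]): Map[Id, Id] = {
    val next = superstep(labels)
    if (next == labels) labels else run(next)
  }

  // Each vertex starts with its own id as its label.
  val components: Map[Id, Id] = run(vertexIds.map(v => v -> v).toMap)
}
```

Within one superstep the per-vertex calls are independent of each other, which is what lets an engine run many vertex programs simultaneously.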

By exploiting graph structure, graph-parallel systems can be orders of magnitude faster.

Triangle Counting on Twitter

40M Users, 1.4 Billion Links
Counted: 34.8 Billion Triangles

Hadoop [WWW’11]: 1536 Machines, 423 Minutes
GraphLab: 64 Machines, 15 Seconds
1000 x Faster

S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11

Pregel

Specialized Graph Systems

1. APIs to capture complex graph dependencies

2. Exploit graph structure to reduce communication and computation

Why GraphX?

The Bigger Picture

[Chart: Time Spent in Data Pipeline, 0% to 100%, for GraphLab and Hadoop; segments: Graph Creation, Graph Algorithms, PostProc]

Limitations of Specialized Graph-Parallel Systems

No support for Construction & Post Processing
Not interactive
Requires maintaining multiple platforms

Spark excels at these!

GraphX Unifies Data-Parallel and Graph-Parallel Systems

Spark Table API: RDDs, fault-tolerance, and task scheduling
GraphLab Graph API: graph representation and execution

[Chart: share of pipeline time, 0% to 100%; segments: Graph Construction, Computation, Post-Processing]

One system for the entire graph pipeline

Enable Joining Tables and Graphs

[Figure: Tables and Graphs; labels: User Data, Product Ratings, Friend Graph, Product Rec. Graph, Join, Inf., Prod. Rec.]

The GraphX Resilient Distributed Graph

Vertex table:
Id          Attribute (V)
Rxin        (Stu., Berk.)
Jegonzal    (PstDoc, Berk.)
Franklin    (Prof., Berk)
Istoica     (Prof., Berk)

Edge table:
SrcId       DstId       Attribute (E)
rxin        jegonzal    Friend
franklin    …           Advisor
istoica     franklin    Coworker
franklin    jegonzal    PI

class Graph[V, E] {
  // Table Views -----------------
  def vertices: RDD[(Id, V)]
  def edges: RDD[(Id, Id, E)]
  def triplets: RDD[((Id, V), (Id, V), E)]

  // Transformations ------------------------------
  def reverse: Graph[V, E]
  def filterV(p: (Id, V) => Boolean): Graph[V, E]
  def filterE(p: Edge[V, E] => Boolean): Graph[V, E]
  def mapV[T](m: (Id, V) => T): Graph[T, E]
  def mapE[T](m: Edge[V, E] => T): Graph[V, T]

  // Joins ----------------------------------------
  def joinV[T](tbl: RDD[(Id, T)]): Graph[(V, Opt[T]), E]
  def joinE[T](tbl: RDD[(Id, Id, T)]): Graph[V, (E, Opt[T])]

  // Computation ----------------------------------
  def aggregateNeighbors[T](mapF: (Edge[V, E]) => T,
                            reduceF: (T, T) => T,
                            direction: EdgeDir): Graph[T, E]
}

GraphX API
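To make the table views and joins in this interface concrete, here is a small in-memory stand-in built on ordinary Scala collections. It is a sketch under stated assumptions, not the GraphX implementation: `MiniGraph` is a hypothetical name, `Map` and `Seq` take the place of RDDs, `Option` takes the place of `Opt`, and only a few of the listed methods are mirrored.

```scala
object MiniGraphSketch {
  type Id = String

  // Hypothetical in-memory stand-in for the Graph[V, E] interface.
  final case class MiniGraph[V, E](vertices: Map[Id, V], edges: Seq[(Id, Id, E)]) {
    // Triplets view: each edge joined with the attributes of both endpoints.
    def triplets: Seq[((Id, V), (Id, V), E)] =
      edges.map { case (src, dst, e) =>
        ((src, vertices(src)), (dst, vertices(dst)), e)
      }

    // Flip the direction of every edge.
    def reverse: MiniGraph[V, E] =
      copy(edges = edges.map { case (src, dst, e) => (dst, src, e) })

    // Transform vertex attributes, leaving the structure untouched.
    def mapV[T](m: (Id, V) => T): MiniGraph[T, E] =
      MiniGraph(vertices.map { case (id, v) => id -> m(id, v) }, edges)

    // Left outer join of the vertex table with an external table.
    def joinV[T](tbl: Map[Id, T]): MiniGraph[(V, Option[T]), E] =
      MiniGraph(vertices.map { case (id, v) => id -> ((v, tbl.get(id))) }, edges)
  }

  // Hypothetical data, loosely modeled on the vertex and edge tables above.
  val graph = MiniGraph(
    vertices = Map("a" -> "student", "b" -> "postdoc", "c" -> "professor"),
    edges = Seq(("a", "b", "friend"), ("c", "b", "pi"))
  )

  // Join a hypothetical external table of scores onto the vertices.
  val withScore = graph.joinV(Map("a" -> 3))
}
```

Vertices missing from the joined table keep their attribute and get `None`, which is the behavior the `Opt[T]` in the `joinV` signature suggests.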

Aggregate Neighbors

Map-Reduce for each vertex

[Figure: mapF applied to edges (A, B) and (A, C) yields a1 and a2; reduceF(a1, a2) is aggregated at A]

Example: Oldest Follower

What is the age of the oldest follower for each user?

val followerAge = graph.aggNbrs(
    e => e.src.age,   // MapF
    max(_, _),        // ReduceF
    InEdges).vertices
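The map-then-reduce pattern behind the oldest-follower query can be tried out without Spark. The sketch below is an assumption-laden illustration: `aggregateNeighbors` here is a standalone function over an in-memory triplet list rather than the GraphX method, the `Edge` case class layout is invented for the example, and the ages are made-up data.

```scala
object AggregateNeighborsSketch {
  type Id = String

  sealed trait EdgeDir
  case object InEdges extends EdgeDir
  case object OutEdges extends EdgeDir

  // An edge together with the attributes of its endpoints (a triplet).
  final case class Edge[V, E](srcId: Id, src: V, dstId: Id, dst: V, attr: E)

  // Map each edge to a value, then reduce the values that land on the same vertex.
  def aggregateNeighbors[V, E, T](triplets: Seq[Edge[V, E]],
                                  mapF: Edge[V, E] => T,
                                  reduceF: (T, T) => T,
                                  direction: EdgeDir): Map[Id, T] =
    triplets.foldLeft(Map.empty[Id, T]) { (acc, e) =>
      val target = direction match {
        case InEdges  => e.dstId // aggregate over edges pointing into the vertex
        case OutEdges => e.srcId // aggregate over edges leaving the vertex
      }
      val value = mapF(e)
      acc.updated(target, acc.get(target).map(reduceF(_, value)).getOrElse(value))
    }

  // Hypothetical data: an edge src -> dst means "src follows dst"; V is an age.
  val follows: Seq[Edge[Int, Unit]] = Seq(
    Edge("a", 23, "c", 30, ()),
    Edge("b", 42, "c", 30, ()),
    Edge("c", 30, "a", 23, ())
  )

  // Oldest follower per user: map each in-edge to the follower's age, keep the max.
  val followerAge: Map[Id, Int] =
    aggregateNeighbors[Int, Unit, Int](follows, e => e.src, (x, y) => math.max(x, y), InEdges)
}
```

User "b" has no followers in this data, so it receives no mapped values and does not appear in the result.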

We can express both Pregel and GraphLab using aggregateNeighbors in 40 lines of code!

Performance Optimizations

Replicate & co-partition vertices with edges
» GraphLab (PowerGraph) style vertex-cut partitioning
» Minimize communication by avoiding edge data movement in JOINs

In-memory hash index for fast joins
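The idea of a vertex cut, where each edge lives in exactly one partition and vertices are replicated to wherever their edges are, can be sketched as follows. This is a toy illustration and not the PowerGraph or GraphX partitioner: the assignment rule (sum of endpoint ids modulo the partition count) and the edge list are assumptions chosen to keep the example easy to follow by hand.

```scala
object VertexCutSketch {
  type Id = Int
  val numPartitions = 2

  // Hypothetical edge list; vertex 1 is a high-degree hub.
  val edges: Seq[(Id, Id)] = Seq((1, 2), (1, 3), (1, 4), (1, 5), (2, 3))

  // Vertex cut: every edge is assigned to exactly one partition.
  // The rule used here is a toy choice, not a production strategy.
  def partitionOf(src: Id, dst: Id): Int = (src + dst) % numPartitions

  val edgePartitions: Map[Int, Seq[(Id, Id)]] =
    edges.groupBy { case (s, d) => partitionOf(s, d) }

  // A vertex is replicated to every partition that holds one of its edges,
  // so edge data never has to move when vertex data is joined onto edges.
  val replicas: Map[Id, Set[Int]] =
    edges
      .flatMap { case (s, d) =>
        val p = partitionOf(s, d)
        Seq(s -> p, d -> p)
      }
      .groupBy(_._1)
      .map { case (v, ps) => v -> ps.map(_._2).toSet }
}
```

The hub vertex 1 ends up replicated in both partitions while its edges are split between them, which is the trade the vertex cut makes: a little vertex replication in exchange for never moving or duplicating edges.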

Early Performance

[Chart: Runtime in seconds (axis 0 to 1600), PageRank for 10 iterations; bars: GraphLab, GraphX, Hadoop]

In Progress Optimizations

Byte-code inspection of user functions
» E.g. if mapF does not need edge data, we can rewrite the query to delay the join

Execution strategies optimizer
» Scan edges randomly accessing vertices
» Scan vertices randomly accessing edges

Current Implementation

[Figure: implementation stack; labels: Spark (relational operators), GraphX, Pregel (20), GraphLab (20), PageRank (5), Connected Comp. (10), Shortest Path (10), ALS (40)]
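PageRank, one of the algorithms listed here, fits the same map-and-reduce-over-edges pattern. The sketch below runs on plain Scala collections and is not the GraphX code: the three-vertex cycle, the damping factor of 0.85, and the unnormalized rank formula are assumptions made for illustration.

```scala
object PageRankSketch {
  type Id = Int

  // Hypothetical directed graph: a 3-cycle, so all ranks should stay equal.
  val edges: Seq[(Id, Id)] = Seq((1, 2), (2, 3), (3, 1))
  val vertexIds: Seq[Id] = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
  val outDegree: Map[Id, Int] =
    edges.groupBy(_._1).map { case (v, es) => v -> es.size }

  // One iteration: each edge carries rank(src) / outDegree(src) to its
  // destination (the map step); contributions per destination are summed
  // (the reduce step) and damped.
  def step(ranks: Map[Id, Double], damping: Double = 0.85): Map[Id, Double] = {
    val contributions: Map[Id, Double] = edges
      .map { case (s, d) => d -> ranks(s) / outDegree(s) }
      .groupBy(_._1)
      .map { case (v, cs) => v -> cs.map(_._2).sum }
    vertexIds.map { v =>
      v -> ((1 - damping) + damping * contributions.getOrElse(v, 0.0))
    }.toMap
  }

  // Ten iterations, starting from a rank of 1.0 at every vertex.
  val ranks: Map[Id, Double] =
    (1 to 10).foldLeft(vertexIds.map(_ -> 1.0).toMap)((r, _) => step(r))
}
```

Because the body of `step` is only a map over edges followed by a per-vertex reduce, it is the kind of computation that a single neighbor-aggregation call can express.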

Demo
Reynold Xin

Summary

1. Graph-parallel primitives on Spark.
2. Currently slower than GraphLab, but
   » No need for specialized systems
   » Easier ETL, and easier consumption of output
   » Interactive graph data mining
3. Future work will bring performance closer to specialized engines.

Status

Currently finalizing the APIs
» Feedback wanted: http://bit.ly/graph-api
Also working on improving system performance
Will be part of Spark 0.9

Questions?

jegonzal@eecs.berkeley.edu
rxin@eecs.berkeley.edu

Backup slides

Vertex Cut Partitioning

aggregateNeighbors

Example: Vertex Degree

A: 5, B: 0, C: 0, D: 0, E: 0, F: 0
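Degree counting is the simplest instance of the neighbor-aggregation pattern: map every edge to 1 and reduce with addition. The sketch below assumes in-degree and a hypothetical star graph with five edges pointing at A, chosen because it reproduces the counts shown above. The slide's actual graph is not recoverable from the text.

```scala
object VertexDegreeSketch {
  // Hypothetical star graph: five edges, all pointing at A.
  val vertexIds: Seq[String] = Seq("A", "B", "C", "D", "E", "F")
  val edges: Seq[(String, String)] =
    Seq("B", "C", "D", "E", "F").map(src => src -> "A")

  // mapF sends 1 along each in-edge; reduceF adds the ones up per destination.
  val counted: Map[String, Int] =
    edges
      .map { case (_, dst) => dst -> 1 }
      .groupBy(_._1)
      .map { case (v, ones) => v -> ones.map(_._2).reduce(_ + _) }

  // Vertices with no in-edges receive no values, so default them to 0.
  val inDegree: Map[String, Int] =
    vertexIds.map(v => v -> counted.getOrElse(v, 0)).toMap
}
```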


Specialized Graph Systems

Pregel: Messaging [PODC’09, SIGMOD’10]
GraphLab: Shared State [UAI’10, VLDB’12]
Many Others: Giraph, Stanford GPS, Signal-Collect, Combinatorial BLAS, BoostPGL, …

