Spark GraphX & Pregel: Challenges and Best Practices

Ashutosh Trivedi (IIIT Bangalore), Kaushik Ranjan (IIIT Bangalore)

Sigmoid-Meetup Bangalore

https://github.com/anantasty/SparkAlgorithms

GraphX and Pregel - Apache Spark

Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014

Agenda

• Introduction to GraphX
  – How to describe a graph
  – RDDs to store Graph
  – Algorithms available

• Application in graph algorithms
  – Feedback Vertex Set of a Graph
  – Identifying parallel parts of the solution

• Challenges we faced

• Best practices

GraphX - Representation

Graph Representation

class Graph[V, E] {
  def Graph(vertices: Table[(Id, V)], edges: Table[(Id, Id, E)])
}

• The VertexRDD[A] extends RDD[(VertexID, A)] and adds the additional constraint that each VertexID occurs only once.

• Moreover, VertexRDD[A] represents a set of vertices, each with an attribute of type A.

• The EdgeRDD[ED] extends RDD[Edge[ED]].
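The two tables above can be mimicked in plain Python. This is only a sketch of the idea, not the GraphX API: `Edge` and `build_vertex_table` are hypothetical names, and raising an error on a duplicate ID is a choice made for this example to make the "each VertexID occurs only once" constraint visible.

```python
from typing import Any, NamedTuple


class Edge(NamedTuple):
    """One row of the edge table: source ID, destination ID, edge attribute."""
    src: int
    dst: int
    attr: Any


def build_vertex_table(pairs):
    """Build an id -> attribute table, rejecting duplicate vertex IDs."""
    table = {}
    for vid, attr in pairs:
        if vid in table:
            raise ValueError(f"duplicate vertex id: {vid}")
        table[vid] = attr
    return table


# Hypothetical data: three vertices and two edges.
vertices = build_vertex_table([(1, "A"), (2, "B"), (3, "C")])
edges = [Edge(1, 2, "follows"), Edge(2, 3, "follows")]
```

A dictionary keyed by vertex ID gives the uniqueness constraint for free, while the edge table stays a plain list because parallel edges are allowed.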

GraphX - Representation

Vertex and Edges

[Figure: a vertex (A) labeled "Vertex" and an edge (A to B) labeled "Edge"]

Triplets Join Vertices and Edges

• The triplets operator joins vertices and edges:

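Vertices: A, B, C, D

Edges: (A, B), (A, C), (B, C), (C, D)

Triplets: (A, B), (A, C), (B, C), (C, D), each carrying the attributes of both endpoint vertices together with the edge attribute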
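The join performed by the triplets operator can be sketched in plain Python. This is an illustration rather than the GraphX implementation: the function name `triplets`, the tuple layout, and the attribute values are assumptions made for the example; only the vertex and edge sets come from the slide.

```python
def triplets(vertices, edges):
    """Join each edge with the attributes of its source and destination vertex."""
    return [
        (src, vertices[src], dst, vertices[dst], attr)
        for (src, dst, attr) in edges
    ]


# Vertices and edges from the slide; the attribute values are hypothetical.
vertices = {"A": 1, "B": 2, "C": 3, "D": 4}
edges = [("A", "B", "ab"), ("A", "C", "ac"), ("B", "C", "bc"), ("C", "D", "cd")]

result = triplets(vertices, edges)
for row in result:
    print(row)
```

Each output row has the shape (source ID, source attribute, destination ID, destination attribute, edge attribute), so there is exactly one triplet per edge.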

Triplets elements

Subgraphs

Predicates vpred and epred
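As a rough illustration of how the two predicates interact, here is a plain-Python sketch (not the GraphX call itself). It assumes the usual semantics: a vertex is kept when `vpred` holds, and an edge is kept when `epred` holds and both of its endpoints were kept. The data and the tuple layout passed to `epred` are hypothetical.

```python
def subgraph(vertices, edges, vpred=lambda vid, attr: True, epred=lambda triplet: True):
    """Keep vertices passing vpred, and edges passing epred whose endpoints both survive."""
    kept_vertices = {vid: attr for vid, attr in vertices.items() if vpred(vid, attr)}
    kept_edges = [
        (src, dst, attr)
        for (src, dst, attr) in edges
        if src in kept_vertices
        and dst in kept_vertices
        and epred((src, vertices[src], dst, vertices[dst], attr))
    ]
    return kept_vertices, kept_edges


# Hypothetical data: vertex attribute is an age, edge attribute is a relation.
vertices = {1: 30, 2: 15, 3: 42}
edges = [(1, 2, "follows"), (1, 3, "follows"), (3, 1, "blocked")]

sub_v, sub_e = subgraph(
    vertices,
    edges,
    vpred=lambda vid, age: age >= 18,   # drop vertex 2
    epred=lambda t: t[4] != "blocked",  # drop the "blocked" edge
)
```

Dropping vertex 2 also drops the edge (1, 2), even though that edge passes `epred` on its own.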

Feedback Vertex Set

• A feedback vertex set of a graph is a set of vertices whose removal leaves a graph without cycles.

• Each feedback vertex set contains at least one vertex of any cycle in the graph.

• The feedback vertex set problem is NP-complete.

• A direct approach has to enumerate each simple cycle.
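The definition can be checked mechanically: remove the candidate vertices and test whether the remaining directed graph is acyclic. The sketch below uses Kahn's topological peeling for the acyclicity test, which is a choice made for this example; the function names and the small graph are hypothetical.

```python
from collections import deque


def is_acyclic(vertices, edges):
    """Kahn's algorithm: the graph is acyclic iff every vertex can be peeled off."""
    indegree = {v: 0 for v in vertices}
    successors = {v: [] for v in vertices}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    queue = deque(v for v in vertices if indegree[v] == 0)
    peeled = 0
    while queue:
        v = queue.popleft()
        peeled += 1
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    return peeled == len(indegree)


def is_feedback_vertex_set(vertices, edges, candidate):
    """True if removing `candidate` leaves a graph without cycles."""
    remaining = [v for v in vertices if v not in candidate]
    kept = set(remaining)
    remaining_edges = [(s, d) for s, d in edges if s in kept and d in kept]
    return is_acyclic(remaining, remaining_edges)


# Hypothetical graph with two cycles: 1 -> 2 -> 3 -> 1 and 3 -> 4 -> 3.
vertices = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 3)]
```

In this graph vertex 3 lies on both cycles, so {3} is a feedback vertex set, while {1} is not because the cycle between 3 and 4 survives.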

[Figure: directed graph on vertices 1 to 10]

Strongly Connected Components

Each strongly connected component can be considered in parallel, since no cycle is shared between components.

SC1 – (1)
SC2 – (5)
SC3 – (8)
SC4 – (9)
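To make the decomposition concrete, here is a small plain-Python sketch that computes strongly connected components with Kosaraju's two-pass algorithm. The talk relies on GraphX for this step; the algorithm choice, the function name, and the six-vertex graph below are assumptions for illustration and do not reproduce the slide's figure.

```python
def strongly_connected_components(vertices, edges):
    """Kosaraju's algorithm; returns a list of components, each a set of vertices."""
    succ = {v: [] for v in vertices}
    pred = {v: [] for v in vertices}
    for s, d in edges:
        succ[s].append(d)
        pred[d].append(s)

    # Pass 1: record vertices in order of DFS completion on the original graph.
    visited, order = set(), []

    def visit(v):
        visited.add(v)
        for w in succ[v]:
            if w not in visited:
                visit(w)
        order.append(v)

    for v in vertices:
        if v not in visited:
            visit(v)

    # Pass 2: DFS on the reversed graph, in decreasing completion order.
    assigned, components = set(), []

    def collect(v, component):
        assigned.add(v)
        component.add(v)
        for w in pred[v]:
            if w not in assigned:
                collect(w, component)

    for v in reversed(order):
        if v not in assigned:
            component = set()
            collect(v, component)
            components.append(component)
    return components


# Hypothetical graph: cycle 1 -> 2 -> 3 -> 1, cycle 4 -> 5 -> 4, and a tail vertex 6.
vertices = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4), (5, 6)]
components = strongly_connected_components(vertices, edges)
```

Every cycle lies entirely inside one component, which is why the components can be handed to independent workers. The recursive DFS is fine for small examples; a large graph would need an iterative version.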

FVS Algorithm

# Greedy recursive solution

FVS(G)
    sccGraph = scc(G)
    For each graph in sccGraph
        For each vertex
            remove vertex and again calculate scc
        vertex V = vertex which gives max number of scc
        # which means it kills maximum cycles
        subGraph = subgraph(remove V)
        FVS(subGraph)
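The greedy recursion above can be sketched in plain Python as follows. This is a single-machine illustration under stated assumptions, not the authors' Spark code: `scc` is a Kosaraju implementation written for this example, `induced` is a hypothetical helper, ties are broken by vertex order, and the stopping rule (a component without edges contains no cycle) is an assumption the pseudocode leaves implicit.

```python
def scc(vertices, edges):
    """Kosaraju's algorithm with an iterative DFS; returns a list of vertex sets."""
    succ = {v: [] for v in vertices}
    pred = {v: [] for v in vertices}
    for s, d in edges:
        succ[s].append(d)
        pred[d].append(s)

    def dfs(start, neighbours, seen, out):
        # Appends vertices to `out` in order of DFS completion.
        stack = [(start, iter(neighbours[start]))]
        seen.add(start)
        while stack:
            v, it = stack[-1]
            for w in it:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(neighbours[w])))
                    break
            else:
                stack.pop()
                out.append(v)

    seen, order = set(), []
    for v in vertices:
        if v not in seen:
            dfs(v, succ, seen, order)
    seen, components = set(), []
    for v in reversed(order):
        if v not in seen:
            component = []
            dfs(v, pred, seen, component)
            components.append(set(component))
    return components


def induced(vertices, edges, keep):
    """Restrict the graph to the vertices in `keep`."""
    return ([v for v in vertices if v in keep],
            [(s, d) for s, d in edges if s in keep and d in keep])


def fvs(vertices, edges):
    """Greedy feedback vertex set; not guaranteed to be minimum."""
    result = set()
    for component in scc(vertices, edges):
        comp_v, comp_e = induced(vertices, edges, component)
        if not comp_e:  # a lone vertex without a self-loop has no cycle
            continue

        def scc_count_without(v):
            rest_v, rest_e = induced(comp_v, comp_e, component - {v})
            return len(scc(rest_v, rest_e))

        # Pick the vertex whose removal splits the component into the most SCCs.
        best = max(comp_v, key=scc_count_without)
        result.add(best)
        rest_v, rest_e = induced(comp_v, comp_e, component - {best})
        result |= fvs(rest_v, rest_e)
    return result


# Hypothetical graph: cycles 1 -> 2 -> 3 -> 1 and 3 -> 4 -> 3 share vertex 3.
vertices = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 3)]
answer = fvs(vertices, edges)
```

Removing vertex 3 leaves three single-vertex components, more than any other removal, so the greedy step picks it first. The loop over components is the part that can run in parallel.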

[Figure: table with columns Graph, Iteration and SCC count for a 4-vertex graph (vertices 1 to 4), with rows "Remove 1", "Remove 2" and "Remove 3"]

Feedback Vertex Set: {1, 5, 8, 9}

FVS – Spark Implementation

sccGraph has one more property, sccID, on each vertex; extract it.

FVS – Spark Implementation

sccGraph = scc(G)
For each graph in sccGraph

FVS – Spark Implementation

#Greedy recursive function

FVS – Spark Implementation

For each vertex, remove the vertex and calculate scc again.

# Z is a list of scc counts after removing each vertex

FVS – Spark Implementation

vertex V = vertex which gives max number of scc
# which means it kills maximum cycles

FVS – Spark Implementation

subGraph = subgraph(remove V)
FVS(subGraph)

Pregel

• Graph DB
  – Data Storage
  – Data Mining

• Advantages
  – Large-scale distributed computations
  – Parallel algorithms for graphs on multiple machines
  – Fault tolerance and distributability

Oldest Follower

What is the age of the oldest follower of each user?

val oldestFollowerAge = graph.aggregateMessages(
    # map
    word => (word.dst.id, word.src.age),
    # reduce
    (a, b) => max(a, b)
).vertices

mapReduceTriplets is now aggregateMessages
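The map and reduce steps of the oldest-follower query can be traced in plain Python. This is a sketch of the pattern rather than the GraphX call: the helper `aggregate_messages`, its signature, and the user data are all assumptions made for this example.

```python
def aggregate_messages(vertices, edges, send, merge):
    """For each edge, `send` yields (target_id, message) pairs; `merge` combines them per target."""
    inbox = {}
    for src, dst in edges:
        for target, msg in send(src, vertices[src], dst, vertices[dst]):
            inbox[target] = merge(inbox[target], msg) if target in inbox else msg
    return inbox


# Hypothetical users (attribute = age) and follower edges (follower -> followed).
ages = {"alice": 28, "bob": 45, "carol": 33, "dave": 19}
follows = [("bob", "alice"), ("carol", "alice"), ("dave", "bob"), ("alice", "carol")]

oldest_follower_age = aggregate_messages(
    ages,
    follows,
    # map: send the follower's age to the followed user
    send=lambda src, src_age, dst, dst_age: [(dst, src_age)],
    # reduce: keep the largest age seen so far
    merge=max,
)
```

A user with no followers ("dave" here) receives no message and therefore does not appear in the result.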

New in GraphX

In aggregateMessages:

• An EdgeContext exposes the triplet fields.

• It provides functions to explicitly send messages to the source and destination vertices.

• It requires the user to indicate which fields in the triplet are actually required.
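The context-based style can be illustrated with a plain-Python stand-in. The class below is a hypothetical model written for this example, not the GraphX `EdgeContext`; the method names `send_to_src` and `send_to_dst` are Python-style assumptions, and the declaration of required triplet fields is not modeled.

```python
class EdgeContext:
    """Hypothetical per-edge context: triplet fields plus explicit send functions."""

    def __init__(self, src_id, src_attr, dst_id, dst_attr, attr, inbox, merge):
        self.src_id, self.src_attr = src_id, src_attr
        self.dst_id, self.dst_attr = dst_id, dst_attr
        self.attr = attr
        self._inbox, self._merge = inbox, merge

    def _deliver(self, target, msg):
        # Merge with any message already queued for this vertex.
        if target in self._inbox:
            self._inbox[target] = self._merge(self._inbox[target], msg)
        else:
            self._inbox[target] = msg

    def send_to_src(self, msg):
        self._deliver(self.src_id, msg)

    def send_to_dst(self, msg):
        self._deliver(self.dst_id, msg)


def aggregate_messages(vertices, edges, send, merge):
    """Call `send` once per edge with a context, and return the merged messages."""
    inbox = {}
    for src, dst, attr in edges:
        send(EdgeContext(src, vertices[src], dst, vertices[dst], attr, inbox, merge))
    return inbox


def count_both_ends(ctx):
    # Every edge contributes 1 to the degree of each endpoint.
    ctx.send_to_src(1)
    ctx.send_to_dst(1)


vertices = {1: "a", 2: "b", 3: "c"}
edges = [(1, 2, None), (1, 3, None), (2, 3, None)]
degrees = aggregate_messages(vertices, edges, count_both_ends, lambda a, b: a + b)
```

Compared with returning (target, message) pairs from a map function, the send callback can address either endpoint, both, or neither, without building an intermediate collection.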

Theory – it’s good
How it works – that’s awesome

Graphs are recursive data structures, where the property of a vertex depends on the properties of its neighbors, which in turn depend on the properties of their neighbors.

Architecture

Graph.Pregel( initialMessage ) (

    # message consumption
    ( vertexID, initialProperty, message ) → compute new property,

    # message generation
    triplet → .. code ..
        Iterator( vertexID, message )
        Iterator.empty,

    # message aggregation
    ( existing message set, new message ) → NEW message set
)
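The three functions (message consumption, message generation, message aggregation) fit together in a superstep loop. The plain-Python sketch below is a simplified model, not the GraphX Pregel operator: it scans every edge in each superstep, the names `pregel`, `vprog`, `send_msg` and `merge_msg` are chosen for this example, and the vertex values are hypothetical. It propagates the maximum value around a ring.

```python
def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg, max_iterations=20):
    """Run supersteps until no messages are generated or max_iterations is reached."""
    # Superstep 0: every vertex consumes the initial message.
    state = {vid: vprog(vid, attr, initial_msg) for vid, attr in vertices.items()}
    for _ in range(max_iterations):
        # Message generation and aggregation over all triplets.
        inbox = {}
        for src, dst in edges:
            for target, msg in send_msg(src, state[src], dst, state[dst]):
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break
        # Message consumption: only vertices that received a message are updated.
        for vid, msg in inbox.items():
            state[vid] = vprog(vid, state[vid], msg)
    return state


# Hypothetical ring 1 -> 2 -> 3 -> 4 -> 1 with one value per vertex.
values = {1: 10, 2: 30, 3: 20, 4: 5}
ring = [(1, 2), (2, 3), (3, 4), (4, 1)]

result = pregel(
    values,
    ring,
    initial_msg=float("-inf"),
    vprog=lambda vid, value, msg: max(value, msg),
    # Only send when the destination would actually change.
    send_msg=lambda src, sv, dst, dv: [(dst, sv)] if sv > dv else [],
    merge_msg=max,
)
```

The computation stops on its own once no edge has a source value larger than its destination value, which is the point where every vertex holds the global maximum.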

[Figure: supersteps on a 4-vertex graph (vertices 1 to 4); at each step a vertex aggregates its incoming messages with max, e.g. max[30, 10, 20], max[20], max[10]]

Example - output

[Figure: final vertex values on the 4-vertex graph]

Applications - GIS

• Algorithm – to compute all vertices in a directed graph that can reach a given vertex.

• Can be used for watershed delineation in Geographic Information Systems.

Vertices that can reach E are A and B.

Algorithm

Graph.Pregel( Seq[vertexIDs] ) (

    # message consumption
    if vertex.state == 1
        vertex.state → 2
    else if vertex.state == 0
        if ( vertex.adjacentVertices ∩ Seq[vertexIDs] ) isNotEmpty
            vertex.state → 2

    # message aggregator
    Seq[existing vertex IDs] ∪ Seq[new vertex ID]
)

Algorithm

# message generation
for each triplet
    if destinationVertex.state == 1
        message( sourceVertexID, Seq[destinationVertexID] )
        message( destinationVertexID, Seq[destinationVertexID] )
    else if sourceVertex.state == 1 and destinationVertex.state == 2
        message( sourceVertexID, Seq[destinationVertexID] )
    else
        message( empty )
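The goal of this algorithm, finding every vertex that can reach a given vertex, can be sketched with a simplified message-passing loop in plain Python. Note the simplification: this version tracks only a marked/unmarked flag per vertex instead of the three states on the slides, and the edge list is hypothetical, chosen so that the answer matches the slide's statement that A and B can reach E.

```python
def vertices_reaching(vertices, edges, target):
    """Return all vertices (other than `target`) from which `target` is reachable."""
    reached = {v: (v == target) for v in vertices}
    while True:
        # Message generation: an edge whose destination is marked
        # notifies its still-unmarked source.
        inbox = {src for src, dst in edges if reached[dst] and not reached[src]}
        if not inbox:
            break
        # Message consumption: every notified vertex becomes marked.
        for v in inbox:
            reached[v] = True
    return {v for v, flag in reached.items() if flag and v != target}


# Hypothetical graph: A -> B -> E -> C -> D.
vertices = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "E"), ("E", "C"), ("C", "D")]
upstream = vertices_reaching(vertices, edges, "E")
```

Each pass of the loop corresponds to one superstep: marks spread one hop backwards along the edges, so the number of supersteps is bounded by the longest upstream path. In the watershed reading, the result is the set of cells that drain into the chosen outlet.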

References

• Fork our repository at https://github.com/anantasty/SparkAlgorithms

• Follow us at
  – https://github.com/codeAshu
  – https://github.com/kaushikranjan

• https://spark.apache.org/docs/latest/graphx-programming-guide.html