Spark GraphX & Pregel
Challenges and Best Practices
Ashutosh Trivedi (IIIT Bangalore)
Kaushik Ranjan (IIIT Bangalore)
Sigmoid-Meetup Bangalore
https://github.com/anantasty/SparkAlgorithms
Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014
Agenda
• Introduction to GraphX
– How to describe a graph
– RDDs to store Graph
– Algorithms available
• Application in graph algorithms
– Feedback Vertex Set of a Graph
– Identifying parallel parts of the solution
• Challenges we faced
• Best practices
Graph Representation
class Graph[V, E] {
  def Graph(vertices: Table[(Id, V)], edges: Table[(Id, Id, E)])
}
• The VertexRDD[A] extends RDD[(VertexID, A)] and adds the additional constraint that each VertexID occurs only once.
• Moreover, VertexRDD[A] represents a set of vertices, each with an attribute of type A.
• The EdgeRDD[ED] extends RDD[Edge[ED]].
Triplets Join Vertices and Edges
• The triplets operator joins vertices and edges:
Vertices: A, B, C, D
Edges: A B, A C, B C, C D
Triplets: A B, A C, B C, C D
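The join can be modeled in a few lines of plain Python. This is only a sketch, not GraphX code: the attribute strings and edge labels are made up for illustration, and `triplets` is a hypothetical helper name.

```python
# Plain-Python model of the triplets join; not GraphX code.
# Vertex attributes and edge labels below are made up for illustration.
vertices = {"A": "attr-A", "B": "attr-B", "C": "attr-C", "D": "attr-D"}
edges = [("A", "B", "e1"), ("A", "C", "e2"), ("B", "C", "e3"), ("C", "D", "e4")]

def triplets(vertices, edges):
    """Join every edge with the attributes of its two endpoints."""
    return [
        {"src": s, "src_attr": vertices[s], "attr": e, "dst": d, "dst_attr": vertices[d]}
        for s, d, e in edges
    ]

result = triplets(vertices, edges)
for t in result:
    print(t["src"], t["src_attr"], "-", t["attr"], "->", t["dst"], t["dst_attr"])
```

Each triplet carries the edge attribute together with both endpoint attributes, so a per-edge function can read all three without a further lookup.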
Feedback Vertex Set
• A feedback vertex set of a graph is a set of vertices whose removal leaves a graph without cycles.
• Each feedback vertex set contains at least one vertex of any cycle in the graph.
• The feedback vertex set problem is NP-complete.
• Enumerate each simple cycle.
Strongly Connected Components
[Figure: example directed graph with vertices 1 to 10]
Strongly connected components can be processed in parallel, since no cycle spans two components.
SC1 – (1) SC2 – (5) SC3 – (8) SC4 – (9)
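As a rough illustration of what the SCC step computes, here is Kosaraju's algorithm in plain Python. It is a sketch under assumptions: it is not the GraphX implementation, and the six-vertex example graph is hypothetical rather than the one drawn on the slide.

```python
from collections import defaultdict

def strongly_connected_components(vertices, edges):
    """Kosaraju's algorithm: return a list of components, each a set of vertices."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for s, d in edges:
        fwd[s].append(d)
        rev[d].append(s)

    def dfs(start, adj, seen, out):
        # Iterative DFS that appends vertices to `out` in post-order.
        stack = [(start, iter(adj[start]))]
        seen.add(start)
        while stack:
            node, it = stack[-1]
            nxt = next((n for n in it if n not in seen), None)
            if nxt is None:
                stack.pop()
                out.append(node)
            else:
                seen.add(nxt)
                stack.append((nxt, iter(adj[nxt])))

    # Pass 1: record the finishing order on the forward graph.
    order, seen = [], set()
    for v in vertices:
        if v not in seen:
            dfs(v, fwd, seen, order)

    # Pass 2: walk the reversed graph in reverse finishing order.
    components, seen = [], set()
    for v in reversed(order):
        if v not in seen:
            comp = []
            dfs(v, rev, seen, comp)
            components.append(set(comp))
    return components

# Hypothetical graph: two cycles (1-2-3 and 4-5) plus a lone vertex 6.
example_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4), (5, 6)]
components = strongly_connected_components(range(1, 7), example_edges)
print(components)
```

Because every cycle lies entirely inside one component, each component returned here could be handed to a separate worker.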
FVS Algorithm
#Greedy recursive solution
FVS(G)
  sccGraph = scc(G)
  For each graph in sccGraph
    For each vertex
      remove vertex and again calculate scc
    vertex V = vertex which gives max number of scc
    #which means it kills maximum cycles
    subGraph = subgraph(remove V)
    FVS(subGraph)
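The greedy recursion above can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' Spark code: the helper names (`reachable`, `scc`, `induced`, `fvs`) and the example graph are hypothetical, and the SCC routine is a simple reachability-based one chosen for brevity. Since the procedure is greedy, the set it returns is not guaranteed to be minimum.

```python
def reachable(start, adj):
    """All vertices reachable from `start` (including itself)."""
    seen, stack = {start}, [start]
    while stack:
        for n in adj.get(stack.pop(), ()):
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return seen

def scc(vertices, edges):
    """SCCs as frozensets: v's component is forward-reach(v) & backward-reach(v)."""
    fwd, rev = {}, {}
    for s, d in edges:
        fwd.setdefault(s, []).append(d)
        rev.setdefault(d, []).append(s)
    comps, assigned = [], set()
    for v in sorted(vertices):
        if v not in assigned:
            comp = frozenset(reachable(v, fwd) & reachable(v, rev))
            comps.append(comp)
            assigned |= comp
    return comps

def induced(vertices, edges):
    """Edges with both endpoints inside `vertices`."""
    return [(s, d) for s, d in edges if s in vertices and d in vertices]

def fvs(vertices, edges):
    """Greedy recursive feedback vertex set, following the pseudocode above."""
    vertices = set(vertices)
    edges = induced(vertices, edges)
    result = set()
    for comp in scc(vertices, edges):
        comp_edges = induced(comp, edges)
        if not comp_edges:          # single vertex without a self-loop: no cycle
            continue
        # Z: SCC count after removing each vertex of the component.
        z = {v: len(scc(comp - {v}, induced(comp - {v}, comp_edges))) for v in comp}
        best = max(sorted(z), key=z.get)   # removal that yields the most SCCs
        result.add(best)
        rest = comp - {best}
        result |= fvs(rest, induced(rest, comp_edges))
    return result

# Hypothetical graph: cycles 1-2-3 and 2-4 share vertex 2.
example = [(1, 2), (2, 3), (3, 1), (2, 4), (4, 2)]
print(fvs({1, 2, 3, 4}, example))
```

In this example removing vertex 2 splits the component into three singletons, more than any other removal, so the greedy step picks it first and the recursion finds nothing left to break.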
[Figure: a four-vertex graph (1, 2, 3, 4) and a table with columns Graph, Iteration, and SCC count, showing the graph that remains after Remove 1, Remove 2, and Remove 3.]
FVS – Spark Implementation
sccGraph has one more property, sccID, on each vertex; extract it.
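One way to picture this extraction step is to group vertices by their sccID and keep only the edges that stay inside a component. The sketch below is plain Python with hypothetical data and a hypothetical helper name (`split_by_scc`); it assumes each vertex is labeled with the lowest vertex id of its component, and it is not the code shown on the original slide.

```python
from collections import defaultdict

# Hypothetical output of an SCC pass: vertex id -> sccID.
scc_id = {1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 6}
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4), (5, 6)]

def split_by_scc(scc_id, edges):
    """Return {sccID: (vertex set, edges inside that component)}."""
    members = defaultdict(set)
    for v, cid in scc_id.items():
        members[cid].add(v)
    inner = defaultdict(list)
    for s, d in edges:
        if scc_id[s] == scc_id[d]:      # drop edges that cross components
            inner[scc_id[s]].append((s, d))
    return {cid: (vs, inner[cid]) for cid, vs in members.items()}

subgraphs = split_by_scc(scc_id, edges)
for cid, (vs, es) in sorted(subgraphs.items()):
    print(cid, sorted(vs), es)
```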
sccGraph = scc(G)
For each graph in sccGraph
#Greedy recursive function
For each vertex
  remove vertex and again calculate scc
# Z is a list of scc counts after removing each vertex
vertex V = vertex which gives max number of scc
#which means it kills maximum cycles
subGraph = subgraph(remove V)
FVS(subGraph)
Pregel
• Graph DB
– Data Storage
– Data Mining
• Advantages
– Large-scale distributed computations
– Parallel algorithms for graphs on multiple machines
– Fault tolerance and distributability
Oldest Follower
What is the age of the oldest follower of each user?
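// Assumes graph: Graph[Int, _], where each vertex attribute is the user's age
val oldestFollowerAge = graph.aggregateMessages[Int](
  ctx => ctx.sendToDst(ctx.srcAttr), // map: send the follower's age to the followed user
  (a, b) => math.max(a, b)           // reduce: keep the largest age received
) // the result is already a VertexRDD[Int], so no .vertices call is needed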
mapReduceTriplets is now aggregateMessages
New in GraphX
In aggregateMessages:
• An EdgeContext exposes the triplet fields.
• It also provides functions to explicitly send messages to the source and destination vertices.
• The user is required to indicate which fields in the triplet are actually needed.
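To make the send/merge split concrete, the following plain-Python sketch models the idea behind aggregateMessages for the oldest-follower question. It is an illustration only, not the GraphX API: `aggregate_messages`, the user names, and the ages are all hypothetical, and an edge (a, b) is assumed to mean "a follows b".

```python
def aggregate_messages(vertices, edges, send_msg, merge_msg):
    """Plain-Python model: send_msg emits (target_id, message) pairs per triplet."""
    inbox = {}
    for src, dst in edges:
        for target, msg in send_msg(src, vertices[src], dst, vertices[dst]):
            inbox[target] = msg if target not in inbox else merge_msg(inbox[target], msg)
    return inbox

# Hypothetical data: vertex id -> age; an edge (a, b) means "a follows b".
ages = {"alice": 28, "bob": 41, "carol": 35, "dave": 19}
follows = [("alice", "bob"), ("carol", "bob"), ("dave", "alice"), ("bob", "alice")]

oldest_follower_age = aggregate_messages(
    ages, follows,
    send_msg=lambda src, src_age, dst, dst_age: [(dst, src_age)],  # follower's age goes to the followed user
    merge_msg=max,                                                 # keep the largest age
)
print(oldest_follower_age)
```

Users with no followers receive no message and so do not appear in the result, which mirrors how only vertices that received a message show up in the aggregated output.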
Theory – it’s good
How it works – that’s awesome
Graphs are recursive data structures: the property of a vertex depends on the properties of its neighbors, which in turn depend on the properties of their neighbors.
Architecture
Graph.Pregel( initialMessage ) (
  #message consumption
  ( vertexID, initialProperty, message ) → compute new property,

  #message generation
  triplet → .. code ..
    Iterator( vertexID, message )
    Iterator.empty
  ,

  #message aggregation
  ( existing message set, new message ) → NEW message set
)
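The three callbacks can be seen working together in a small plain-Python model of the loop. It is a sketch under assumptions, not the GraphX implementation: `pregel` and its parameter names are hypothetical, messages are generated for every edge in every superstep, and the example propagates the maximum value around a made-up four-vertex cycle.

```python
def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg, max_iter=20):
    """Plain-Python model of the Pregel loop (vertices: id -> property)."""
    # Superstep 0: every vertex consumes the initial message.
    state = {v: vprog(v, prop, initial_msg) for v, prop in vertices.items()}
    for _ in range(max_iter):
        inbox = {}
        # Message generation and aggregation, one triplet at a time.
        for src, dst in edges:
            for target, msg in send_msg(src, state[src], dst, state[dst]):
                inbox[target] = msg if target not in inbox else merge_msg(inbox[target], msg)
        if not inbox:            # no messages in flight: the computation has converged
            break
        # Message consumption: only vertices that received a message are updated.
        for v, msg in inbox.items():
            state[v] = vprog(v, state[v], msg)
    return state

# Hypothetical example: propagate the maximum value along directed edges.
values = {1: 10, 2: 30, 3: 20, 4: 5}
edges = [(1, 2), (2, 3), (3, 4), (4, 1)]

result = pregel(
    values, edges,
    initial_msg=float("-inf"),
    vprog=lambda vid, prop, msg: max(prop, msg),
    send_msg=lambda s, sp, d, dp: [(d, sp)] if sp > dp else [],
    merge_msg=max,
)
print(result)
```

Returning an empty list from `send_msg` plays the role of `Iterator.empty`: once no edge has anything new to say, the loop stops on its own.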
[Figure: supersteps on a four-vertex graph (1, 2, 3, 4); each vertex keeps the max of the messages it receives, e.g. max [30, 10, 20], max [20], max [10].]
Applications - GIS
• Algorithm – computes all vertices in a directed graph that can reach a given vertex.
• Can be used for watershed delineation in Geographic Information Systems.
Vertices that can reach E are A and B.
Algorithm
Graph.Pregel( Seq[vertexID’s] ) (
  #message consumption
  if vertex.state == 1
    vertex.state → 2
  else if vertex.state == 0
    if ( vertex.adjacentVertices ∩ Seq[ vertexID’s ] ) isNotEmpty
      vertex.state → 2

  #message aggregator
  Seq[existing vertex ID’s] U Seq[new vertex ID]
)
#message generation
for each triplet
  if destinationVertex.state == 1
    message( sourceVertexID, Seq[destinationVertexID] )
    message( destinationVertexID, Seq[destinationVertexID] )
  else if sourceVertex.state == 1 and destinationVertex.state == 2
    message( sourceVertexID, Seq[destinationVertexID] )
  else
    message( empty )
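The goal of this algorithm, finding every vertex that can reach a given vertex, can be illustrated with a simplified plain-Python version. This sketch does not reproduce the three-state scheme of the pseudocode: it tracks a frontier set instead, and both the function name `can_reach` and the example graph are hypothetical (the graph is chosen so that A and B are the vertices that can reach E).

```python
def can_reach(edges, target):
    """Return every vertex with a directed path to `target` (target excluded).

    Simplified message passing: in each superstep, every edge whose destination
    was reached in the previous superstep sends a message to its source.
    """
    reached = {target}
    frontier = {target}
    while frontier:
        # Message generation: one message per triplet whose destination is on the frontier.
        messages = {src for src, dst in edges if dst in frontier}
        # Message consumption: only vertices not reached before change state.
        frontier = messages - reached
        reached |= frontier
    return reached - {target}

# Hypothetical graph: A -> B -> E -> C -> D.
example_edges = [("A", "B"), ("B", "E"), ("E", "C"), ("C", "D")]
print(can_reach(example_edges, "E"))
```

The loop ends when a superstep reaches no new vertex, which also makes it terminate on graphs that contain cycles.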