Vasia Kalavri
Apache Flink PMC, PhD student @KTH
@vkalavri - [email protected] 27, 2015
Gelly: Large-Scale Graph Processing
with Apache Flink
Overview
- Why Graph Processing with Flink?
- Gelly: Flink’s Graph Processing API
- Iterative Graph Processing in Gelly
- What’s next, Gelly?
Why Graph Processing with Flink?
A data analysis pipeline
[Figure: pipeline diagram; step labels: load, clean, create graph, analyze graph, transform, load result]
A more user-friendly pipeline
[Figure: pipeline diagram; step labels: load, load result]
General-purpose or specialized?
general-purpose:
- fast application development and deployment
- easier maintenance
- non-intuitive APIs
specialized:
- time-consuming
- use, configure and integrate different systems
- hard to maintain
- rich APIs and features
what about performance?
Native Iterations
● the runtime is aware of the iterative execution
● no scheduling overhead between iterations
● caching and state maintenance are handled automatically
Caching Loop-invariant Data
Pushing work “out of the loop”
Maintain state as index
Flink Iteration Operators
[Figure: Iterate operator; labels: Input, Iterative Update Function, Result, Replace]
[Figure: IterateDelta operator; labels: Workset, Iterative Update Function, Result, Solution Set, State]
Delta Iterations
Many iterative algorithms exhibit sparse computational dependencies and asymmetric convergence.
Example: Connected Components
Only vertices that change their value are in the workset for the next iteration.
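The workset idea can be sketched in plain Java. This is a minimal illustration and not the Flink API: the class name `DeltaConnectedComponents`, the `run` method and the edge representation are assumptions made for this example. Each vertex starts with its own ID as component ID, and in every iteration only the vertices whose value changed in the previous one propagate their ID to their neighbors.

```java
import java.util.*;

// Plain-Java sketch of a delta iteration (illustrative, not the Flink API).
final class DeltaConnectedComponents {

    // Returns the component ID (smallest vertex ID in the component) per vertex.
    // 'edges' holds undirected edges as {source, target} pairs.
    static Map<Integer, Integer> run(Set<Integer> vertices, int[][] edges) {
        // Loop-invariant data: the adjacency lists are built once, outside the loop.
        Map<Integer, List<Integer>> neighbors = new HashMap<>();
        for (int v : vertices) neighbors.put(v, new ArrayList<>());
        for (int[] e : edges) {
            neighbors.get(e[0]).add(e[1]);
            neighbors.get(e[1]).add(e[0]);
        }

        // Solution set: the current component ID of every vertex, kept as an index.
        Map<Integer, Integer> solution = new HashMap<>();
        for (int v : vertices) solution.put(v, v);

        // Workset: the vertices whose value changed in the previous iteration.
        Set<Integer> workset = new TreeSet<>(vertices);

        while (!workset.isEmpty()) {
            // Candidate IDs proposed by the active vertices to their neighbors.
            Map<Integer, Integer> candidates = new HashMap<>();
            for (int v : workset) {
                int id = solution.get(v);
                for (int n : neighbors.get(v)) candidates.merge(n, id, Integer::min);
            }
            // Apply the candidates; only vertices that improve stay active.
            Set<Integer> next = new TreeSet<>();
            for (Map.Entry<Integer, Integer> c : candidates.entrySet()) {
                if (c.getValue() < solution.get(c.getKey())) {
                    solution.put(c.getKey(), c.getValue());
                    next.add(c.getKey());
                }
            }
            workset = next;
        }
        return solution;
    }

    public static void main(String[] args) {
        Set<Integer> vertices = new TreeSet<>(Arrays.asList(1, 2, 3, 4, 5));
        int[][] edges = { {1, 2}, {2, 3}, {4, 5} };
        System.out.println(run(vertices, edges));
    }
}
```

On the small example graph in `main`, the workset shrinks from five vertices to three and then to one before it becomes empty, which is the asymmetric convergence that delta iterations exploit.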
Beyond Iterations
Performance & Scalability
● own memory management
● own serialization framework
● operate on binary data when possible
Automatic Optimizations
● choose best execution strategy
● cache invariant data
Gelly
the Flink Graph API
Meet Gelly
● Java & Scala Graph APIs on top of Flink
● Ships already with Flink 0.9 (Beta)
● Can be seamlessly mixed with the DataSet Flink API
→ easily implement applications that use both record-based and graph-based analysis
Gelly in the Flink stack
[Figure: the Flink stack; layers shown: Data storage; Runtime and Execution Environments; Java API and Scala API (batch and streaming); Python API, Table API, ML library, Gelly]
[Figure: Gelly components: Transformations and Utilities, Iterative Graph Processing, Graph Library]
Hello, Gelly!
// create a graph from a DataSet of Vertices and a DataSet of Edges
// type parameters: <vertex ID type, vertex value type, edge value type>
Graph<String, Long, Double> graph = Graph.fromDataSet(vertices, edges, env);

// create a graph from a Collection of Edges and initialize the Vertex values
// (edges is a Java collection)
Graph<String, Long, Double> graph = Graph.fromCollection(edges,
    new MapFunction<String, Long>() {
        // the returned value is the assigned vertex value
        public Long map(String vertexId) { return 1L; }
    }, env);
Available Methods
● Graph Properties
  ○ getVertexIds
  ○ getEdgeIds
  ○ numberOfVertices
  ○ numberOfEdges
  ○ getDegrees
  ○ isWeaklyConnected
  ○ ...
● Transformations
  ○ map, filter, join
  ○ subgraph, union, difference
  ○ reverse, undirected
  ○ getTriplets
● Mutations
  ○ add vertex/edge
  ○ remove vertex/edge
Example: mapVertices
// increment each vertex value by one
Graph<Long, Long, Long> updatedGraph = graph.mapVertices(
    new MapFunction<Vertex<Long, Long>, Long>() {
        public Long map(Vertex<Long, Long> value) {
            return value.getValue() + 1;
        }
    });
Example: subgraph
Graph<Long, Long, Long> subgraph = graph.subgraph(
    new FilterFunction<Vertex<Long, Long>>() {
        public boolean filter(Vertex<Long, Long> vertex) {
            // keep only vertices with positive values
            return (vertex.getValue() > 0);
        }
    },
    new FilterFunction<Edge<Long, Long>>() {
        public boolean filter(Edge<Long, Long> edge) {
            // keep only edges with negative values
            return (edge.getValue() < 0);
        }
    });
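The effect of combining a vertex filter with an edge filter can be sketched in plain Java. This is an illustration under assumptions, not the Gelly implementation: `SubgraphSketch`, its `Edge` helper and both methods are hypothetical names, and the rule that an edge survives only when both of its endpoints survive reflects my understanding of how the subgraph transformation behaves.

```java
import java.util.*;
import java.util.function.Predicate;

// Plain-Java sketch of subgraph-style filtering (illustrative, not the Gelly API).
final class SubgraphSketch {

    // A directed edge with a Long value; a hypothetical helper type.
    static final class Edge {
        final long source, target, value;
        Edge(long source, long target, long value) {
            this.source = source;
            this.target = target;
            this.value = value;
        }
    }

    // Returns the IDs of the vertices whose value passes the vertex filter.
    static Set<Long> keepVertices(Map<Long, Long> vertexValues, Predicate<Long> vertexFilter) {
        Set<Long> kept = new TreeSet<>();
        for (Map.Entry<Long, Long> v : vertexValues.entrySet()) {
            if (vertexFilter.test(v.getValue())) kept.add(v.getKey());
        }
        return kept;
    }

    // Keeps an edge only if it passes the edge filter and both endpoints were kept.
    static List<Edge> keepEdges(List<Edge> edges, Set<Long> keptVertices, Predicate<Long> edgeFilter) {
        List<Edge> kept = new ArrayList<>();
        for (Edge e : edges) {
            boolean endpointsKept = keptVertices.contains(e.source) && keptVertices.contains(e.target);
            if (endpointsKept && edgeFilter.test(e.value)) kept.add(e);
        }
        return kept;
    }
}
```

With the filters from the example (`v -> v > 0` for vertices, `w -> w < 0` for edges), an edge with a negative value is still dropped if it touches a vertex with a non-positive value.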
Neighborhood Methods
- Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel
graph.reduceOnNeighbors(new MinValue(), EdgeDirection.OUT);
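What such a neighborhood reduction computes can be sketched sequentially in plain Java. The names `NeighborReduceSketch` and `reduceOnOutNeighbors` are assumptions for this illustration, and passing `Long::min` stands in for the `MinValue` function used on the slide, whose body is not shown. For each vertex, the values of its out-neighbors are combined pairwise with the reduce function.

```java
import java.util.*;
import java.util.function.BinaryOperator;

// Plain-Java sketch of a reduce over out-neighbors (illustrative, not the Gelly API).
final class NeighborReduceSketch {

    // 'edges' holds directed edges as {source, target} pairs; every target is
    // assumed to have an entry in 'vertexValues'.
    // Vertices without outgoing edges produce no entry in the result.
    static Map<Long, Long> reduceOnOutNeighbors(Map<Long, Long> vertexValues,
                                                long[][] edges,
                                                BinaryOperator<Long> reduce) {
        Map<Long, Long> result = new TreeMap<>();
        for (long[] e : edges) {
            Long neighborValue = vertexValues.get(e[1]);
            // Combine the neighbor value with the partial result of the source vertex.
            result.merge(e[0], neighborValue, reduce);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, Long> values = new HashMap<>();
        values.put(1L, 3L);
        values.put(2L, 9L);
        values.put(3L, 7L);
        values.put(4L, 5L);
        long[][] edges = { {1, 2}, {1, 3}, {2, 4}, {3, 4} };
        System.out.println(reduceOnOutNeighbors(values, edges, Long::min));
    }
}
```

Because the reduce function is applied pairwise in no guaranteed order, it should be associative and commutative, as minimum is.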
Iterative Graph Processing
Gelly offers iterative graph processing abstractions on top of Flink’s Delta iterations
● vertex-centric (similar to Pregel, Giraph)
● gather-sum-apply (similar to PowerGraph)
Vertex-centric Iterations
MessagingFunction: what messages to send to other vertices
VertexUpdateFunction: how to update a vertex value, based on the received messages
The workset contains only active vertices (those that received a message in the previous superstep)
Vertex-centric SSSP
shortestPaths = graph.runVertexCentricIteration(
    new DistanceUpdater(), new DistanceMessenger()).getVertices();

DistanceMessenger: MessagingFunction
sendMessages(K key, Double newDist) {
    for (Edge edge : getOutgoingEdges()) {
        sendMessageTo(edge.getTarget(), newDist + edge.getValue());
    }
}

DistanceUpdater: VertexUpdateFunction
updateVertex(Vertex<K, Double> vertex, MessageIterator msgs) {
    Double minDist = Double.MAX_VALUE;
    for (double msg : msgs) {
        if (msg < minDist) minDist = msg;
    }
    if (vertex.getValue() > minDist) setNewVertexValue(minDist);
}
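The superstep structure behind these two functions can be simulated sequentially in plain Java. This is a sketch under assumptions rather than Gelly code: `VertexCentricSssp`, its `Edge` helper and the `maxSupersteps` parameter are names introduced here, and activation is simplified so that a vertex sends messages in the next superstep only if its distance just improved.

```java
import java.util.*;

// Plain-Java simulation of vertex-centric SSSP (illustrative, not the Gelly API).
final class VertexCentricSssp {

    // A weighted directed edge; a hypothetical helper type.
    static final class Edge {
        final int source, target;
        final double weight;
        Edge(int source, int target, double weight) {
            this.source = source;
            this.target = target;
            this.weight = weight;
        }
    }

    static Map<Integer, Double> run(Set<Integer> vertices, List<Edge> edges,
                                    int source, int maxSupersteps) {
        Map<Integer, List<Edge>> outgoing = new HashMap<>();
        for (int v : vertices) outgoing.put(v, new ArrayList<>());
        for (Edge e : edges) outgoing.get(e.source).add(e);

        Map<Integer, Double> dist = new HashMap<>();
        for (int v : vertices) dist.put(v, v == source ? 0.0 : Double.POSITIVE_INFINITY);

        // Initially only the source vertex is active.
        Set<Integer> active = new HashSet<>(Collections.singleton(source));
        for (int step = 0; step < maxSupersteps && !active.isEmpty(); step++) {
            // Messaging phase: active vertices send (distance + edge weight) to targets.
            Map<Integer, List<Double>> inbox = new HashMap<>();
            for (int v : active) {
                for (Edge e : outgoing.get(v)) {
                    inbox.computeIfAbsent(e.target, k -> new ArrayList<>())
                         .add(dist.get(v) + e.weight);
                }
            }
            // Update phase: keep the minimum message if it improves the distance.
            Set<Integer> next = new HashSet<>();
            for (Map.Entry<Integer, List<Double>> entry : inbox.entrySet()) {
                double minDist = Double.MAX_VALUE;
                for (double msg : entry.getValue()) {
                    if (msg < minDist) minDist = msg;
                }
                if (dist.get(entry.getKey()) > minDist) {
                    dist.put(entry.getKey(), minDist);
                    next.add(entry.getKey());
                }
            }
            active = next;
        }
        return dist;
    }
}
```

Unreachable vertices keep the initial distance `Double.POSITIVE_INFINITY`, since they never receive a message.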
Gather-Sum-Apply Iterations
Gather: how to transform each of the edges and neighbors of each vertex
Sum: how to aggregate the partial values of Gather to a single value
Apply: how to update the vertex value, based on the new aggregate and the current value
The workset contains the active vertices (user-defined activation)
Gather-Sum-Apply SSSP
shortestPaths = graph.runGatherSumApplyIteration(
    new CalculateDistances(), new ChooseMinDistance(), new UpdateDistance()).getVertices();

CalculateDistances: Gather
gather(Neighbor<Double, Double> neighbor) {
    return neighbor.getNeighborValue() + neighbor.getEdgeValue();
}

ChooseMinDistance: Sum
sum(Double newValue, Double currentValue) {
    return Math.min(newValue, currentValue);
}

UpdateDistance: Apply
apply(Double newDistance, Double oldDistance) {
    if (newDistance < oldDistance) { setResult(newDistance); }
}
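The same three phases can be simulated sequentially in plain Java. This is a sketch under assumptions, not Gelly code: `GsaSssp`, its `Edge` helper and the `maxIterations` parameter are illustrative names, and a vertex is treated as active when its distance changed in the previous iteration.

```java
import java.util.*;

// Plain-Java simulation of gather-sum-apply SSSP (illustrative, not the Gelly API).
final class GsaSssp {

    // A weighted directed edge; a hypothetical helper type.
    static final class Edge {
        final int source, target;
        final double weight;
        Edge(int source, int target, double weight) {
            this.source = source;
            this.target = target;
            this.weight = weight;
        }
    }

    static Map<Integer, Double> run(Set<Integer> vertices, List<Edge> edges,
                                    int source, int maxIterations) {
        Map<Integer, Double> dist = new HashMap<>();
        for (int v : vertices) dist.put(v, v == source ? 0.0 : Double.POSITIVE_INFINITY);

        Set<Integer> active = new HashSet<>(Collections.singleton(source));
        for (int i = 0; i < maxIterations && !active.isEmpty(); i++) {
            Map<Integer, Double> summed = new HashMap<>();
            for (Edge e : edges) {
                if (!active.contains(e.source)) continue;  // only active neighbors contribute
                // Gather: neighbor value plus edge value, computed per edge.
                double gathered = dist.get(e.source) + e.weight;
                // Sum: combine the partial values of a vertex with min.
                summed.merge(e.target, gathered, Double::min);
            }
            // Apply: update the vertex if the aggregate improves on its current value.
            Set<Integer> next = new HashSet<>();
            for (Map.Entry<Integer, Double> s : summed.entrySet()) {
                double newDistance = s.getValue();
                double oldDistance = dist.get(s.getKey());
                if (newDistance < oldDistance) {
                    dist.put(s.getKey(), newDistance);
                    next.add(s.getKey());
                }
            }
            active = next;
        }
        return dist;
    }
}
```

Unlike the vertex-centric version, no list of messages is materialized per vertex: each gathered value is folded into a single aggregate right away, which only works because min is associative and commutative.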
Iteration Configuration
● Parallelism
● Aggregators
● Broadcast Variables
● Messaging Direction (IN, OUT, ALL)
● Access vertex degrees and number of vertices
Vertex-Centric or GSA?
vertex-centric:
- when all messages are needed for the update
- when a vertex needs to send messages to non-neighbors
GSA:
- when message creation is independent & expensive (parallelized in Gather)
- when the graph is skewed (~ 1.5x faster)
- if update is associative-commutative
Library of Algorithms
● PageRank*
● Single Source Shortest Paths*
● Label Propagation
● Weakly Connected Components*
● Community Detection
Upcoming: Triangle Count, HITS, Affinity Propagation, Graph Summarization
graphWithRanks = inputGraph.run(new PageRank(maxIterations, dampingFactor));
*: both vertex-centric and GSA implementations
What’s next, Gelly?
● Integration with the Flink Streaming API
● Specialized Operators for Skewed Graphs
● Partitioning Methods
● More graph processing models: neighborhood-centric, partition-centric
Check our Roadmap! https://cwiki.apache.org/confluence/display/FLINK/Flink+Gelly
Feeling Gelly? Try it!
● Grab the Flink master: https://github.com/apache/flink
● Read the Gelly programming guide: https://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html
● Follow the Gelly School tutorials (Beta): http://gellyschool.com
● Read our blog post: https://flink.apache.org/news/2015/08/24/introducing-flink-gelly.html
● Come to Flink Forward (October 12-13, Berlin): http://flink-forward.org/