Gelly in Apache Flink Bay Area Meetup

Preview:

Citation preview

Vasia KalavriApache Flink PMC, PhD student @KTH

@vkalavri - vasia@apache.orgAugust 27, 2015

Gelly: Large-Scale Graph Processing

with Apache Flink

Overview

- Why Graph Processing with Flink?

- Gelly: Flink’s Graph Processing API

- Iterative Graph Processing in Gelly

- What’s next, Gelly?

2

Why Graph Processing with Flink?

4

A data analysis pipeline

load cleancreate graph

analyze graph

clean transformload result

clean transformload

5

A more user-friendly pipeline

load

load result

load

6

General-purpose or specialized?

- fast application development and deployment

- easier maintenance- non-intuitive APIs

- time-consuming- use, configure and

integrate different systems

- hard to maintain- rich APIs and features

general-purpose specialized

what about performance?

7

Native Iterations

● the runtime is aware of the iterative execution● no scheduling overhead between iterations● caching and state maintenance are handled automatically

Caching Loop-invariant DataPushing work“out of the loop”

Maintain state as index

8

Flink Iteration Operators

Iterate IterateDelta

9

Input

Iterative Update Function

Result

Rep

lace

Workset

IterativeUpdate Function

Result

Solution Set

State

Many iterative algorithms exhibit sparse computational dependencies and asymmetric convergence

Delta Iterations

10

Example: Connected Components

11

Only vertices that change their value are in the workset for the next iteration.

Beyond Iterations

12

Performance & Scalability● own memory management● own serialization framework● operate on binary data when possibleAutomatic Optimizations● choose best execution strategy● cache invariant data

Gellythe Flink Graph API

● Java & Scala Graph APIs on top of Flink● Ships already with Flink 0.9 (Beta)● Can be seamlessly mixed with the DataSet

Flink API→ easily implement applications that use both record-based and graph-based analysis

Meet Gelly

14

Gelly in the Flink stack

15

Scala API(batch and streaming)

Java API(batch and streaming)

ML library Gelly

Runtime and Execution Environments

Data storage

Table API Python API

Transformations and Utilities

Iterative Graph Processing

Graph Library

Hello, Gelly!

// create a graph from a Dataset of Vertices and a DataSet of Edges

Graph<String, Long, Double> graph = Graph.fromDataSet(vertices, edges, env);

16

// create a graph from a Collection of Edges and initialize the Vertex valuesGraph<String, Long, Double> graph = Graph.fromCollection(edges,

new MapFunction<String, Long>() {

public Long map(String vertexId) { return 1l; }

}, env);

vertex ID type

vertex value type

edge value type

edges is a Java collection

assigned vertex value

● Graph Properties○ getVertexIds○ getEdgeIds○ numberOfVertices○ numberOfEdges○ getDegrees○ isWeaklyConnected○ ...

Available Methods

● Transformations○ map, filter, join○ subgraph, union,

difference○ reverse, undirected○ getTriplets

● Mutations○ add vertex/edge○ remove vertex/edge

17

Example: mapVertices

18

// increment each vertex value by one

Graph<Long, Long, Long> updatedGraph = graph.mapVertices(

new MapFunction<Vertex<Long, Long>, Long>() {

public Long map(Vertex<Long, Long> value) {

return value.getValue() + 1;

}

});

4

28

5

53

17

4

5

Example: subGraph

19

Graph<Long, Long, Long> subgraph = graph.subgraph(

new FilterFunction<Vertex<Long, Long>>() {

public boolean filter(Vertex<Long, Long> vertex) {

// keep only vertices with positive values

return (vertex.getValue() > 0);

}

},

new FilterFunction<Edge<Long, Long>>() {

public boolean filter(Edge<Long, Long> edge) {

// keep only edges with negative values

return (edge.getValue() < 0);

}

})

- Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel

Neighborhood Methods

3

47

4

4

graph.reduceOnNeighbors(new MinValue(), EdgeDirection.OUT);

3

97

4

5

20

Iterative Graph Processing

Gelly offers iterative graph processing abstractions on top of Flink’s Delta iterations● vertex-centric (similar to Pregel, Giraph)● gather-sum-apply (similar to PowerGraph)

21

MessagingFunction: what messages to send to other vertices

VertexUpdateFunction: how to update a vertex value, based on the received messages

The workset contains only active vertices (received a msg in the previous superstep)

Vertex-centric Iterations

22

Vertex-centric SSSPshortestPaths = graph.runVertexCentricIteration(

new DistanceUpdater(), new DistanceMessenger()).getVertices();

DistanceUpdater: VertexUpdateFunction DistanceMessenger: MessagingFunction

23

sendMessages(K key, Double newDist) {

for (Edge edge : getOutgoingEdges()) {

sendMessageTo(edge.getTarget(),

newDist + edge.getValue());

}

updateVertex(Vertex<K, Double> vertex, MessageIterator msgs) {

Double minDist = Double.MAX_VALUE; for (double msg : msgs) { if (msg < minDist) minDist = msg; } if (vertex.getValue() > minDist) setNewVertexValue(minDist);}

Gather: how to transform each of the edges and neighbors of each vertex

Sum: how to aggregate the partial values of Gather to a single value

Apply: how to update the vertex value, based on the new aggregate and the current value

The workset contains the active vertices (user-defined activation)

Gather-Sum-Apply Iterations

24

Gather-Sum-Apply SSSPshortestPaths = graph.runGatherSumApplyIteration(new CalculateDistances(),

new ChooseMinDistance(), new UpdateDistance()).getVertices();

CalculateDistances: Gather

25

gather(Neighbor<Double, Double> neighbor) {return neighbor.getNeighborValue() + neighbor.getEdgeValue();

}

ChooseMinDistance: Sum

ChooseMinDistance: Apply

sum(Double newValue, Double currentValue) {return Math.min(newValue, currentValue);

}

apply(Double newDistance, Double oldDistance) {if (newDistance < oldDistance) { setResult(newDistance); }

}

Iteration Configuration

● Parallelism● Aggregators● Broadcast Variables● Messaging Direction (IN, OUT, ALL)● Access vertex degrees and number of

vertices

26

Vertex-Centric or GSA?

27

vertex-centric GSA

when message creation is independent & expensive (parallelized in Gather)

when the graph is skewed (~ 1.5x faster)

if update is associative-commutative

when all messages are needed for the update

when a vertex needs to send messages to non-neighbors

● PageRank*● Single Source Shortest Paths*● Label Propagation● Weakly Connected Components*● Community DetectionUpcoming: Triangle Count, HITS, Affinity Propagation, Graph Summarization graphWithRanks = inputGraph.run(new PageRank(maxIterations, dampingFactor));

*: both vertex-centric and GSA implementations

Library of Algorithms

28

What’s next, Gelly?

● Integration with the Flink Streaming API● Specialized Operators for Skewed Graphs● Partitioning Methods● More graph processing models: neighborhood-centric,

partition-centric

Check our Roadmap!https://cwiki.apache.org/confluence/display/FLINK/Flink+Gelly

29

Feeling Gelly? Try it!

● Grab the Flink master: https://github.com/apache/flink● Read the Gelly programming guide: https://ci.apache.

org/projects/flink/flink-docs-master/libs/gelly_guide.html● Follow the Gelly School tutorials (Beta): http://gellyschool.com● Read our blog post: https://flink.apache.

org/news/2015/08/24/introducing-flink-gelly.html● Come to Flink Forward (October 12-13, Berlin): http://flink-

forward.org/

30

Recommended