Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015

GRADOOP: Scalable Graph Analytics with Apache Flink

Martin Junghanns University of Leipzig

About the speaker and the team

2011 Bachelor of Engineering Thesis: Partitioning of Dynamic Graphs

2014 Master of Science

Thesis: Graph Database Systems for Business Intelligence

Now: PhD Student, Database Group, University of Leipzig

Distributed Systems Distributed Graph Data Management Graph Theory & Algorithms

Professional Experience: sones GraphDB, SAP

André, PhD Student

Martin, PhD Student

Kevin, M.Sc. Student Niklas, M.Sc. Student

Motivation

𝑮𝑟𝑟𝑟𝑟 = (𝑽𝑒𝑟𝑒𝑒𝑒𝑒𝑒,𝑬𝑑𝑑𝑒𝑒)

“Graphs are everywhere”

𝐺𝑟𝑟𝑟𝑟 = (𝐔𝐔𝐔𝐔𝐔,𝐹𝑟𝑒𝑒𝐹𝑑𝑒𝑟𝑒𝑟𝑒)

Mallory

𝐺𝑟𝑟𝑟𝑟 = (𝐔𝐔𝐔𝐔𝐔,𝐹𝐹𝐹𝐹𝐹𝐹𝑒𝑟𝑒)

Mallory

𝐺𝑟𝑟𝑟𝑟 = (𝐔𝐔𝐔𝐔𝐔,𝐹𝐹𝐹𝐹𝐹𝐹𝑒𝑟𝑒)

Mallory

𝐺𝑟𝑟𝑟𝑟 = (𝐂𝐂𝐂𝐂𝐔𝐔,𝐶𝐹𝐹𝐹𝑒𝑒𝑒𝑒𝐹𝐹𝑒)

Leipzig pop: 544K

Dresden pop: 536K

Berlin pop: 3.5M

Hamburg pop: 1.7M

Munich pop: 1.4M

Chemnitz pop: 243K

Nuremberg pop: 500K

Cologne pop: 1M

World Wide Web ca. 1 billion websites

“Graphs are large”

Facebook ca. 1.49 billion active users ca. 340 friends per user

End-to-End Graph Analytics

Data Integration Graph Analytics Representation

Integrate data from one or more sources into a dedicated graph storage with common graph data model

Definition of analytical workflows from operator algebra

Definition of analytical workflows from operator algebra Result representation in a meaningful way

Graph Data Management Graph Database Systems Neo4j, OrientDB

Graph Processing Systems Pregel, Giraph

Distributed Workflow Systems Flink Gelly, Spark GraphX

Data Model Rich Graph Models

Generic Graph Models Generic Graph Models

Focus Local ACID Operations

Global Graph Operations Global Data and Graph Operations

Query Language Yes No No

Persistency Yes No No

Scalability Vertical Horizontal Horizontal

Workflows No No Yes

Data Integration No No No

Graph Analytics No Yes Yes

Representation Yes No No

Workflows No No Yes

What‘s missing?

An end-to-end framework and research platform for efficient, distributed and domain independent

graph data management and analytics.

What‘s missing?

An end-to-end framework and research platform for efficient, distributed and domain independent

graph data management and analytics.

Gradoop Architecture & Data Model

High Level Architecture

HDFS/YARN Cluster

HBase Distributed Graph Store

Extended Property Graph Model

Flink Operator Implementations

Data Integration

Flink Operator Execution

Workflow Declaration

Visual

GrALa DSL Representation

Data flow

Control flow

Graph Analytics Representation

Workflow Execution

High Level Architecture

HBase Distributed Graph Store

Flink Operator Implementations

Data Integration

Flink Operator Execution

Workflow Declaration

Visual

GrALa DSL Representation

Data flow

Control flow

Graph Analytics Representation

Workflow Execution

HDFS/YARN Cluster

Graph Operators

Operator GrALa notation

Binary

Combination graph.combine(otherGraph) : Graph

Overlap graph.overlap(otherGraph) : Graph

Exclusion graph.exclude(otherGraph) : Graph

Isomorphism graph.isIsomorphicTo(otherGraph) : Boolean

Pattern Matching graph.match(patternGraph,predicate) : Collection

Aggregation graph.aggregate(propertyKey,aggregateFunction) : Graph

Projection graph.project(vertexFunction,edgeFunction) : Graph

Summarization graph.summarize( vertexGroupKeys,vertexAggregateFunction, edgeGroupKeys,edgeAggregateFunction) : Graph

Combination

1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])

Combination

1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])

Graph Operators

Operator GrALa notation

Binary

Combination graph.combine(otherGraph) : Graph

Overlap graph.overlap(otherGraph) : Graph

Exclusion graph.exclude(otherGraph) : Graph

Isomorphism graph.isIsomorphicTo(otherGraph) : Boolean

Pattern Matching graph.match(patternGraph,predicate) : Collection

Aggregation graph.aggregate(propertyKey,aggregateFunction) : Graph

Projection graph.project(vertexFunction,edgeFunction) : Graph

Summarization graph.summarize( vertexGroupKeys,vertexAggregateFunction, edgeGroupKeys,edgeAggregateFunction) : Graph

Summarization

1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2]) 2: vertexGroupingKeys = {:type, “city”} 3: edgeGroupingKeys = {:type} 4: vertexAggFunc = (Vertex vSum, Set vertices => vSum[“count”] = |vertices|) 5: edgeAggFunc = (Edge eSum, Set edges => eSum[“count”] = |edges|) 6: sumGraph = personGraph.summarize(vertexGroupingKeys, vertexAggFunc, edgeGroupingKeys, edgeAggFunc)

Summarization

1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2]) 2: vertexGroupingKeys = {:type, “city”} 3: edgeGroupingKeys = {:type} 4: vertexAggFunc = (Vertex vSum, Set vertices => vSum[“count”] = |vertices|) 5: edgeAggFunc = (Edge eSum, Set edges => eSum[“count”] = |edges|) 6: sumGraph = personGraph.summarize(vertexGroupingKeys, vertexAggFunc, edgeGroupingKeys, edgeAggFunc)

Graph Collection Operators

Operator GrALa notation Collection

Selection collection.select(predicate) : Collection

Distinct collection.distinct() : Collection

Sort by collection.sortBy(key, [:asc|:desc]) : Collection

Top collection.top(limit) : Collection

Union collection.union(otherCollection) : Collection

Intersection collection.intersect(otherCollection) : Collection

Difference collection.difference(otherCollection) : Collection

Auxiliary

Apply collection.apply(unaryGraphOperator) : Collection

Reduce collection.reduce(binaryGraphOperator) : Graph

Call [graph|collection].callFor[Graph|Collection]( algorithm,parameters) : [Graph|Collection]

Selection

1: collection = <db.G[0],db.G[1],db.G[2]> 2: predicate = (Graph g => |g.V| > 3) 3: result = collection.select(predicate)

Selection

1: collection = <db.G[0],db.G[1],db.G[2]> 2: predicate = (Graph g => |g.V| > 3) 3: result = collection.select(predicate)

Graph Collection Operators

Operator GrALa notation Collection

Selection collection.select(predicate) : Collection

Distinct collection.distinct() : Collection

Sort by collection.sortBy(key, [:asc|:desc]) : Collection

Top collection.top(limit) : Collection

Union collection.union(otherCollection) : Collection

Intersection collection.intersect(otherCollection) : Collection

Difference collection.difference(otherCollection) : Collection

Auxiliary

Apply collection.apply(unaryGraphOperator) : Collection

Reduce collection.reduce(binaryGraphOperator) : Graph

Call [graph|collection].callFor[Graph|Collection]( algorithm,parameters) : [Graph|Collection]

Extended Property Graph Model in Flink

ID Label Properties Graphs

ID Label Properties Source Vertex

Target Vertex

Graphs

VertexData

EdgeData

GraphData

ID Label Properties

DataSet<Vertex<ID,VertexData>>

DataSet<Edge<ID,EdgeData>>

DataSet<Subgraph<ID,GraphData>>

Pojo Representation

Extended Property Graph Model in Flink

VertexData

EdgeData

GraphData

DataSet<Vertex<ID,VertexData>>

DataSet<Edge<ID,EdgeData>>

DataSet<Subgraph<ID,GraphData>>

VertexData

EdgeData

GraphData

DataSet<VertexData>

DataSet<EdgeData>

DataSet<GraphData>

Pojo Representation

Tuple Representation

Target Vertex

Graphs

ID Label Properties

Target Vertex

Graphs

ID Label Properties

Summarization in Flink

VID City

EID S T

L [0,1]

D [2,3,4]

VID City Count

VID Rep

ID S T

0,0 [0,1]

0,2 [2]

2,0 [3,6,7]

2,2 [4,5]

5,2 [8,9]

EID S T Count

0 0 1 2

2 0 2 1

3 2 0 3

4 2 2 2

8 5 2 2

join(VID==S)

ℰ’

𝒱′

groupBy(City)

reduceGroup + filter + map

groupBy(S,T)

join(VID==T)

Use Case: Graph Business Intelligence

Business intelligence usually based on relational data warehouses Enterprise data is integrated within dimensional schema Analysis limited to predefined relationships No support for relationship-oriented data mining

Use Case: Graph Business Intelligence

Business intelligence usually based on relational data warehouses Enterprise data is integrated within dimensional schema Analysis limited to predefined relationships No support for relationship-oriented data mining

Graph-based approach Integrate data sources within an instance graph by preserving original

relationships between data objects (transactional and master data) Determine subgraphs (business transaction graphs) related to business

activities Analyze subgraphs or entire graphs with aggregation queries, mining

relationship patterns, etc.

Prerequisites: Data Integration

Business Transaction Graphs

CIT ERP

Employee Name: Dave

Employee Name: Alice

Employee Name: Bob

Employee Name: Carol

Ticket Expense: 500

SalesQuotation

SalesOrder PurchaseOrder

PurchaseOrder

SalesRevenue Revenue: 5,000

PurchaseInvoice Expense: 2,000

sentBy

createdBy

processedBy

createdBy

openedFor

processedBy

basedOn serves

serves

processedBy

CIT ERP

Employee Name: Dave

Employee Name: Bob

Ticket Expense: 500

SalesQuotation

PurchaseOrder

sentBy

createdBy

processedBy

createdBy

openedFor

processedBy

basedOn serves

serves

CIT ERP

Employee Name: Dave

Employee Name: Bob

Ticket Expense: 500

SalesQuotation

PurchaseOrder

sentBy

createdBy

processedBy

createdBy

openedFor

processedBy

basedOn serves

serves

CIT ERP

Employee Name: Dave

Employee Name: Bob

Ticket Expense: 500

SalesQuotation

PurchaseOrder

sentBy

createdBy

processedBy

createdBy

openedFor

processedBy

basedOn serves

serves

CIT ERP

Employee Name: Dave

Employee Name: Bob

Ticket Expense: 500

SalesQuotation

PurchaseOrder

sentBy

createdBy

processedBy

createdBy

openedFor

processedBy

basedOn serves

serves

CIT ERP

Employee Name: Dave

Employee Name: Bob

Ticket Expense: 500

SalesQuotation

PurchaseOrder

sentBy

createdBy

processedBy

createdBy

openedFor

processedBy

basedOn serves

serves

(1) BTG Extraction

// generate base collection btgs = iig.callForCollection( :BusinessTransactionGraphs , {} )

(2) Profit Aggregation

CIT ERP

Employee Name: Dave

Employee Name: Bob

Ticket Expense: 500

SalesQuotation

PurchaseOrder

sentBy

createdBy

processedBy

createdBy

openedFor

processedBy

basedOn serves

serves

// generate base collection btgs = iig.callForCollection( :BusinessTransactionGraphs , {} ) // define profit aggregate function aggFunc = ( Graph g => g.V.values(“Revenue").sum() - g.V.values(“Expense").sum() )

∑ Revenue ∑ Expenses Net Profit

5,000 -3,000 2,000

9,000 -3,000 6,000

2,000 -1,500 500

5,000 -7,000 -2,000

10,000 -15,000 -5,000

… … …

8,000 -4,000 4,000

// generate base collection btgs = iig.callForCollection( :BusinessTransactionGraphs , {} ) // define profit aggregate function aggFunc = ( Graph g => g.V.values(“Revenue").sum() - g.V.values(“Expense").sum() ) // apply aggregate function and store result at new property btgs = btgs.apply( Graph g => g.aggregate( “Profit“ , aggFunc ) )

(3) BTG Clustering

5,000 -3,000 2,000

9,000 -3,000 6,000

2,000 -1,500 500

5,000 -7,000 -2,000

10,000 -15,000 -5,000

… … …

8,000 -4,000 4,000

(3) BTG Clustering

// select profit and loss clusters profitBtgs = btgs.select( Graph g => g[“Profit”] >= 0 ) lossBtgs = btgs.difference(profitBtgs)

(4) Cluster Characteristic Patterns

CIT ERP

Employee Name: Dave

Employee Name: Bob

Ticket Expense: 500

SalesQuotation

PurchaseOrder

sentBy

createdBy

processedBy

createdBy

openedFor

processedBy

basedOn serves

serves

CIT ERP

Employee Name: Dave

Employee Name: Bob

Ticket Expense: 500

SalesQuotation

PurchaseOrder

sentBy

createdBy

processedBy

createdBy

openedFor

processedBy

basedOn serves

serves

5,000 -3,000 2,000

9,000 -3,000 6,000

2,000 -1,500 500

5,000 -7,000 -2,000

10,000 -15,000 -5,000

… … …

8,000 -4,000 4,000

Ticket Alice

processedBy

createdBy

PurchaseOrder

// select profit and loss clusters profitBtgs = btgs.select( Graph g => g[“Profit”] >= 0 ) lossBtgs = btgs.difference(profitBtgs) // apply magic profitFreqPats = profitBtgs.callForCollection( :FrequentSubgraphs , {“Threshold”:0.7} ) lossFreqPats = lossBtgs.callForCollection( :FrequentSubgraphs , {“Threshold”:0.7} ) // determine cluster characteristic patterns trivialPats = profitFreqPats.intersect(lossFreqPats) profitCharPatterns = profitFreqPats.difference(trivialPats) lossCharPatterns = lossFreqPats.difference(trivialPats)

Current State & Future Work

Current State

0.0.1 First Prototype (May 2015) Hadoop MapReduce and Giraph for operator implementations Too much complexity Performance loss through serialization in HDFS/HBase

0.0.2 Using Flink as execution layer (June 2015) Basic operators

Currently 0.0.3-SNAPSHOT Performance improvements More operator implementations

Operator implementations (0.0.3-SNAPSHOT)

Unary Pattern Matching Collection Selection Algorithms LabelPropagation

Aggregation Distinct BTG Extraction

Projection Sort by FSM

Summarization Top

Binary Combination Union

Overlap Intersection

Exclusion Difference

Isomorphism Auxiliary Apply

Reduce

Future Work

Operator integration into Gelly Summarization FLINK-2411 Graph Sampling …

Graph Operations on streams (Flink) Graph Partitioning (maybe together with the Gelly people) Graph Versioning (Storage) Benchmarking GrALa Interpreter / Web UI

Benchmarks Sneak Preview

1 2 4 8 16

Time [s]

# Worker

Summarization (Vertex and Edge Labels)

16x Intel(R) Xeon(R) CPU E5-2430 v2 @ 2.50GHz (12 Cores), 48 GB RAM Hadoop 2.5.2, Flink 0.9.0

slots (per node) 12 jobmanager.heap.mb 2048 taskmanager.heap.mb 40960

Foodbroker Graph (https://github.com/dbs-leipzig/foodbroker) Generates BI process data 858,624,267 Vertices, 4,406,445,007 Edges, 663GB Payload

Web UI Sneak Preview

Contributions welcome

Code Operator implementations Performance Tuning Storage layout

Data! and Use Cases

We are researchers, we assume ... Getting real data (especially BI data) is nearly impossible

People Bachelor / Master / PhD Thesis

Thank you for building Flink!

www.gradoop.com

https://github.com/dbs-leipzig/gradoop http://dbs.uni-leipzig.de/file/GradoopTR.pdf

http://dbs.uni-leipzig.de/file/biiig-vldb2014.pdf

Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015

Data & Analytics

Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015

Distributed Grouping of Property Graphs with GRADOOP · Distributed Grouping of Property Graphs with GRADOOP 3 built on top of Apache Flink [Al14, Ca15b], a scalable dataﬂow framework

Advanced topics in Apache Flink™linc.ucy.ac.cy/.../EIT_iSocial_summerschool/slides/flink-advanced.pdf · Apache Flink™ Maximilian Michels mxm@apache.org @stadtlegende EIT ICT

FastR+Apache Flink

Apache Flink Hands On

Apache flink

Apache Flink Training: System Overview

Large Scale Centrality Measures in Apache Flink and Apache ... · Programming model of Apache Giraph & Apache Flink for iterative graph processing • Apache Giraph, a vertex centric

Apache Flink - Overview

Apache Flink – Distributed Stream Processing

Apache Flink - SICS

Integrating Apache NiFi and Apache Flink

Apache Flink Meetup Berlin #6: Unified Batch & Stream Processing in Apache Flink

Large Scale Centrality Measures in Apache Flink and Apache ... · Large Scale Centrality Measures in Apache Flink and Apache Giraph Submitted by ... Apache Flink (Runtime vs Edges)

Interactive Data Analysis with Apache Flink @ Flink Meetup in Berlin

Apache Flink Stream Processing

Workshop Apache Flink Madrid

Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Apache Flink - tutorialspoint.comApache Flink was founded by Data Artisans company and is now developed under Apache License by Apache Flink Community. This community has over 479

Flink and Apache Spark Fernanda de Camargo Magano Dylan ... · Flink and Apache Spark Fernanda de Camargo Magano Dylan Guedes. About Flink ... Introduction to Apache Flink Book. Use