80
GRADOOP: Scalable Graph Analytics with Apache Flink Martin Junghanns Leipzig University Big Data User Group Dresden / Graph Databases Sachsen December 2015

Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Embed Size (px)

Citation preview

Page 1: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

GRADOOP: Scalable Graph Analytics with Apache Flink

Martin Junghanns Leipzig University

Big Data User Group Dresden / Graph Databases Sachsen December 2015

Page 2: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

About the speaker and the team

André, PhD Student Martin, PhD Student

Kevin, M.Sc. Student Niklas, M.Sc. Student

Prof. Dr. Erhard Rahm Database Chair

Page 3: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Outline

Motivation Gradoop Architecture Extended Property Graph Model (EPGM) Apache Flink EPGM on Apache Flink Business Intelligence Use Case Tooling Current State & Future Work

Page 4: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Motivation

Page 5: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

𝑮𝑟𝑟𝑟𝑟 = (𝑽𝑒𝑟𝑒𝑒𝑒𝑒𝑒,𝑬𝑑𝑑𝑒𝑒)

“Graphs are everywhere”

Page 6: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

𝐺𝑟𝑟𝑟𝑟 = (𝐔𝐔𝐔𝐔𝐔,𝐹𝑟𝑒𝑒𝐹𝑑𝑒𝑟𝑒𝑟𝑒)

“Graphs are everywhere”

Alice

Bob

Eve

Dave

Carol

Mallory

Peggy

Page 7: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

𝐺𝑟𝑟𝑟𝑟 = (𝐔𝐔𝐔𝐔𝐔,𝐹𝑟𝑒𝑒𝐹𝑑𝑒𝑟𝑒𝑟𝑒)

“Graphs are everywhere”

Alice

Bob

Eve

Dave

Carol

Mallory

Peggy

Page 8: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

𝐺𝑟𝑟𝑟𝑟 = (𝐔𝐔𝐔𝐔𝐔,𝐹𝑟𝑒𝑒𝐹𝑑𝑒𝑟𝑒𝑟𝑒)

“Graphs are everywhere”

Alice

Bob

Eve

Dave

Carol

Mallory

Peggy

Page 9: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

𝐺𝑟𝑟𝑟𝑟 = (𝐔𝐔𝐔𝐔𝐔,𝐹𝑟𝑒𝑒𝐹𝑑𝑒𝑟𝑒𝑟𝑒)

“Graphs are everywhere”

Alice

Bob

Eve

Dave

Carol

Mallory

Peggy

Page 10: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

𝐺𝑟𝑟𝑟𝑟 = (𝐔𝐔𝐔𝐔𝐔,𝐹𝐹𝐹𝐹𝐹𝐹𝑒𝑟𝑒)

“Graphs are everywhere”

Alice

Bob

Eve

Dave

Carol

Mallory

Peggy

Trent

Page 11: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

𝐺𝑟𝑟𝑟𝑟 = (𝐔𝐔𝐔𝐔𝐔,𝐹𝐹𝐹𝐹𝐹𝐹𝑒𝑟𝑒)

“Graphs are everywhere”

Alice

Bob

Eve

Dave

Carol

Mallory

Peggy

Trent

Page 12: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

𝐺𝑟𝑟𝑟𝑟 = (𝐂𝐂𝐂𝐂𝐔𝐔,𝐶𝐹𝐹𝐹𝑒𝑒𝑒𝑒𝐹𝐹𝑒)

“Graphs are everywhere”

Leipzig pop: 544K

Dresden pop: 536K

Berlin pop: 3.5M

Hamburg pop: 1.7M

Munich pop: 1.4M

Chemnitz pop: 243K

Nuremberg pop: 500K

Cologne pop: 1M

Page 13: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

World Wide Web ca. 1 billion websites

“Graphs are large”

Facebook ca. 1.49 billion active users ca. 340 friends per user

Page 14: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

End-to-End Graph Analytics

Data Integration Graph Analytics Representation

Integrate data from one or more sources into a dedicated graph storage with common graph data model

Definition of analytical workflows from operator algebra Result representation in a meaningful way

Page 15: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Graph Data Management Graph Database Systems Neo4j, OrientDB

Graph Processing Systems Pregel, Giraph

Distributed Workflow Systems Flink Gelly, Spark GraphX

Data Model Rich Graph Models

Generic Graph Models Generic Graph Models

Focus Local ACID Operations

Global Graph Operations Global Data and Graph Operations

Query Language Yes No No

Persistency Yes No No

Scalability Vertical Horizontal Horizontal

Workflows No No Yes

Data Integration No No No

Graph Analytics No Yes Yes

Representation Yes No No

Page 16: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

What‘s missing?

An end-to-end framework and research platform for efficient, distributed and domain independent

graph data management and analytics.

Page 17: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Gradoop Architecture & Data Model

Page 18: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

High Level Architecture

HDFS/YARN Cluster

HBase Distributed Graph Store

Extended Property Graph Model

Flink Operator Implementations

Data Integration

Flink Operator Execution

Workflow Declaration

Visual

GrALa DSL Representation

Data flow

Control flow

Graph Analytics Representation

Workflow Execution

Page 19: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

Extended Property Graph Model [0] Tag

name : Databases [1] Tag

name : Graphs [2] Tag

name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

0

1

2

3

4

5

6 7 8 9

10

11 12 13 14

15

16

17

18 19 20 21

22

23

knows since : 2014

knows since : 2014

knows since : 2013

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

Page 20: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

[2] Community | interest : Graphs | vertexCount : 4

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

Extended Property Graph Model [0] Tag

name : Databases [1] Tag

name : Graphs [2] Tag

name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

0

1

2

3

4

5

6 7 8 9

10

11 12 13 14

15

16

17

18 19 20 21

22

23

knows since : 2014

knows since : 2014

knows since : 2013

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

Page 21: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Graph Operators and Algorithms Operators

Unary Binary

Grap

h Co

llect

ion

Logi

cal G

raph

Algorithms

Aggregation

Pattern Matching Projection

Summarization Equality Call *

Combination

Overlap Exclusion

Equality

Union

Intersection Difference

Gelly Library

BTG Extraction Label Propagation Graph Forecasting

Frequent Subgraphs

Top

Selection

Distinct Sort

Apply * Reduce *

Call * * auxiliary

Page 22: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Graph Operators and Algorithms Operators

Unary Binary

Grap

h Co

llect

ion

Logi

cal G

raph

Algorithms

Aggregation

Pattern Matching Projection

Summarization Equality Call *

Combination

Overlap Exclusion

Equality

Union

Intersection Difference

Gelly Library

BTG Extraction Label Propagation Graph Forecasting

Frequent Subgraphs

Top

Selection

Distinct Sort

Apply * Reduce *

Call * * auxiliary

Page 23: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Combination 1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])

[2] Community | interest : Graphs| vertexCount : 4

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

[0] Tag name : Databases

[1] Tag name : Graphs

[2] Tag name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

0

1

2

3

4

5

6 7 8 9

10

11 12 13 14

15

16

17

18 19 20 21

22

23

knows since : 2014

knows since : 2014

knows since : 2013

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

DB

Page 24: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

[0] Community | interest : Graphs| vertexCount : 4

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

[0] Tag name : Databases

[1] Tag name : Graphs

[2] Tag name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

10

11 12 13 14

15

16

17

18 19 20 21

22

23

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

DB

Combination 1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])

[4]

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

0

1

2

3

4

5

6 7 8 9

knows since : 2014

knows since : 2014

knows since : 2013

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

Page 25: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Graph Operators and Algorithms Operators

Unary Binary

Grap

h Co

llect

ion

Logi

cal G

raph

Algorithms

Aggregation

Pattern Matching Projection

Summarization Equality Call *

Combination

Overlap Exclusion

Equality

Union

Intersection Difference

Gelly Library

BTG Extraction Label Propagation Graph Forecasting

Frequent Subgraphs

Top

Selection

Distinct Sort

Apply * Reduce *

Call * * auxiliary

Page 26: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

[0] Community | interest : Databases | vertexCount : 3

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

[0] Tag name : Databases

[1] Tag name : Graphs

[2] Tag name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

10

11 12 13 14

15

16

17

18 19 20 21

22

23

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

DB

Combination + Summarization

[4]

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

0

1

2

3

4

5

6 7 8 9

knows since : 2014

knows since : 2014

knows since : 2013

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2]) 2: vertexGroupingKeys = {:LABEL, “city”} 3: edgeGroupingKeys = {:LABEL} 4: vertexAggFunc = (Vertex vSum, Set vertices => vSum[“count”] = |vertices|) 5: edgeAggFunc = (Edge eSum, Set edges => eSum[“count”] = |edges|) 6: sumGraph = personGraph.summarize(vertexGroupingKeys, vertexAggFunc, edgeGroupingKeys, edgeAggFunc)

Page 27: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

[0] Community | interest : Databases | vertexCount : 3

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

[0] Tag name : Databases

[1] Tag name : Graphs

[2] Tag name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

10

11 12 13 14

15

16

17

18 19 20 21

22

23

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

DB

Combination + Summarization

[4]

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

0

1

2

3

4

5

6 7 8 9

knows since : 2014

knows since : 2014

knows since : 2013

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2]) 2: vertexGroupingKeys = {:LABEL, “city”} 3: edgeGroupingKeys = {:LABEL} 4: vertexAggFunc = (Vertex vSum, Set vertices => vSum[“count”] = |vertices|) 5: edgeAggFunc = (Edge eSum, Set edges => eSum[“count”] = |edges|) 6: sumGraph = personGraph.summarize(vertexGroupingKeys, vertexAggFunc, edgeGroupingKeys, edgeAggFunc)

[5]

[11] Person

city : Leipzig count : 2

[12] Person

city : Dresden count : 3

[13] Person

city : Berlin count : 1

24

25

26

27

28

knows count : 3

knows count : 1 knows

count : 2

knows count : 2

knows count : 2

Page 28: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

[0] Community | interest : Databases | vertexCount : 3

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

[0] Tag name : Databases

[1] Tag name : Graphs

[2] Tag name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

10

11 12 13 14

15

16

17

18 19 20 21

22

23

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

DB

Combination + Summarization + Aggregation

[4]

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

0

1

2

3

4

5

6 7 8 9

knows since : 2014

knows since : 2014

knows since : 2013

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2]) 2: vertexGroupingKeys = {:LABEL, “city”} 3: edgeGroupingKeys = {:LABEL} 4: vertexAggFunc = (Vertex vSum, Set vertices => vSum[“count”] = |vertices|) 5: edgeAggFunc = (Edge eSum, Set edges => eSum[“count”] = |edges|) 6: sumGraph = personGraph.summarize(vertexGroupingKeys, vertexAggFunc, edgeGroupingKeys, edgeAggFunc) 7: aggFunc = (Graph g => |g.E|) 8: aggGraph = sumGraph.aggregate(“edgeCount”, aggFunc)

[5]

[11] Person

city : Leipzig count : 2

[12] Person

city : Dresden count : 3

[13] Person

city : Berlin count : 1

24

25

26

27

28

knows count : 3

knows count : 1 knows

count : 2

knows count : 2

knows count : 2

Page 29: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

[0] Community | interest : Databases | vertexCount : 3

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

[0] Tag name : Databases

[1] Tag name : Graphs

[2] Tag name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

10

11 12 13 14

15

16

17

18 19 20 21

22

23

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

DB

Combination + Summarization + Aggregation

[4]

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

0

1

2

3

4

5

6 7 8 9

knows since : 2014

knows since : 2014

knows since : 2013

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2]) 2: vertexGroupingKeys = {:LABEL, “city”} 3: edgeGroupingKeys = {:LABEL} 4: vertexAggFunc = (Vertex vSum, Set vertices => vSum[“count”] = |vertices|) 5: edgeAggFunc = (Edge eSum, Set edges => eSum[“count”] = |edges|) 6: sumGraph = personGraph.summarize(vertexGroupingKeys, vertexAggFunc, edgeGroupingKeys, edgeAggFunc) 7: aggFunc = (Graph g => |g.E|) 8: aggGraph = sumGraph.aggregate(“edgeCount”, aggFunc)

[5] edgeCount : 5

[11] Person

city : Leipzig count : 2

[12] Person

city : Dresden count : 3

[13] Person

city : Berlin count : 1

24

25

26

27

28

knows count : 3

knows count : 1 knows

count : 2

knows count : 2

knows count : 2

Page 30: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Graph Operators and Algorithms Operators

Unary Binary

Grap

h Co

llect

ion

Logi

cal G

raph

Algorithms

Aggregation

Pattern Matching Projection

Summarization Equality Call *

Combination

Overlap Exclusion

Equality

Union

Intersection Difference

Gelly Library

BTG Extraction Label Propagation Graph Forecasting

Frequent Subgraphs

Top

Selection

Distinct Sort

Apply * Reduce *

Call * * auxiliary

Page 31: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Selection 1: resultColl = db.G[0,1,2].select((Graph g => g[“vertexCount”] > 3))

[2] Community | interest : Graphs | vertexCount : 4

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

[0] Tag name : Databases

[1] Tag name : Graphs

[2] Tag name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

0

1

2

3

4

5

6 7 8 9

10

11 12 13 14

15

16

17

18 19 20 21

22

23

knows since : 2014

knows since : 2014

knows since : 2013

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

DB

Page 32: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Selection 1: resultColl = db.G[0,1,2].select((Graph g => g[“vertexCount”] > 3))

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

[0] Tag name : Databases

[1] Tag name : Graphs

[2] Tag name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

6 7 8 9

10

11 12 13 14

15

16

17

18 19 20 21

22

23

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

DB

[2] Community | interest : Graphs | vertexCount : 4

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

0

1

2

3

4

5

knows since : 2014

knows since : 2014

knows since : 2013

knows since : 2013

knows since : 2014

knows since : 2014

Page 33: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Graph Operators and Algorithms Operators

Unary Binary

Grap

h Co

llect

ion

Logi

cal G

raph

Algorithms

Aggregation

Pattern Matching Projection

Summarization Equality Call *

Combination

Overlap Exclusion

Equality

Union

Intersection Difference

Gelly Library

BTG Extraction Label Propagation Graph Forecasting

Frequent Subgraphs

Top

Selection

Distinct Sort

Apply * Reduce *

Call * * auxiliary

Page 34: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Apache Flink

Page 35: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Apache Flink

http://www.slideshare.net/robertmetzger1/apache-flink-meetup-munich-november-2015-flink-overview-architecture-integrations-and-use-case

„Streaming Dataflow Engine that provides • data distribution, • communication, • and fault tolerance for distributed computations over data streams.“

HDFS LocalFS HBase JDBC Kafka RabbitMQ Flume (Neo4j) Embedded Tez Yarn Cluster Local

Streaming Dataflow Runtime

DataSet DataStream

Hado

op M

R

Tabl

e

Gelly

ML

Tabl

e

Zepp

elin

Casc

adin

g

MRQ

L

Data

flow

Stor

m (w

ip)

Data

flow

(wip

)

SAM

OA

Page 36: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Apache Flink – DataSet API

DataSet := Distributed Collection of Data Transformation := Operation applied on DataSet Flink Program := Composition of Transformations

DataSet

DataSet

DataSet

Transformation

Transformation

DataSet

DataSet

Transformation DataSet

Flink Program

Page 37: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Apache Flink – DataSet Transformations

aggregate coGroup cross distinct filter first-N flatMap groupBy

join leftOuterJoin rightOuterJoin fullOuterJoin map mapPartition reduce reduceGroup union

iterate iterateDelta

Page 38: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

The „Hello World“ of Big Data – Word Count

1: ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); 2: 3: DataSet<String> text = env.fromElements( // or env.readTextFile(„hdfs://…“) 4: „He who controls the past controls the future.“, 5: „He who controls the present controls the past.“); 6: 7: DataSet<Tuple2<String, Integer>> wordCounts = text 8: .flatMap(new LineSplitter()) // splits the line and outputs (word, 1) tuples 9: .groupBy(0) 10: .sum(1); 11: 12: wordCounts.print(); // trigger execution

flatMap

„He who controls the past controls the future.“ „He who controls the present controls the past.“

(He,1) (who,1) (controls,1) (the,1) (past,1) // ...

groupBy(0)

[(He,1),(He,1)] [(who,1),(who,1)] [(future,1)] [(past,1),(past,1)] [(present,1)] // ...

sum(1)

(He,2) (who,2) (future,1) (past,2) (present,1) // ...

Page 39: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

EPGM on Apache Flink

Page 40: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

EPGM in Apache Flink – User facing API

LogicalGraph

fromCollections(…) : LogicalGraph fromDataSets(…) : LogicalGraph fromGellyGraph(…) : LogicalGraph getGraphHead() : DataSet<EPGMGraphHead> toGellyGraph() : Graph combine(…) : LogicalGraph intersect(…) : LogicalGraph summarize(…) : LogicalGraph match(…) : GraphCollection // ...

GraphCollection

fromCollections(…) : GraphCollection

fromDataSets(…) : GraphCollection getGraphHeads() : DataSet<EPGMGraphHead> getGraph(…) : LogicalGraph getGraphs(…) : GraphCollection select(…) : GraphCollection union(…) : GraphCollection distinct(…) : GraphCollection sortBy(…) : GraphCollection // ...

GraphBase

getVertices() : DataSet<EPGMVertex> getEdges() : DataSet<EPGMEdge> // ...

graphHeads : DataSet<EPGMGraphHead> vertices : DataSet<EPGMVertex> edges : DataSet<EPGMEdge>

EPGMDatabase

fromCollections(…) : EPGMDatabase

fromJSONFile(…) : EPGMDatabase fromHBase(…) : EPGMDatabase writeAsJSON(…) : void writeToHBase(…) : void getDatabaseGraph() : LogicalGraph // ...

Page 41: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

EPGM in Apache Flink – DataSets

Id Label Properties Graphs

Id Label Properties SourceId TargetId Graphs

EPGMGraphHead

EPGMVertex

EPGMEdge

Id Label Properties POJO

POJO

POJO

DataSet<EPGMGraphHead>

DataSet<EPGMVertex>

DataSet<EPGMEdge>

Id Label Properties Graphs

EPGMVertex

GradoopId := UUID 128-bit

String PropertyList := List<Property> Property := (String, PropertyValue) PropertyValue := byte[]

GradoopIdSet := Set<GradoopId>

(55421132-f45b-40f0-8f6a-50ea13dbf2ea:Person{gender=f,city=Leipzig,name=Alice,age=20} @ [c2c0f288-9f27-4e55-b1c6-7a35e0eabe36, 77b710f9-07c2-49ab-b4bf-51e1a3138822])

Page 42: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

EPGM in Apache Flink – Exclusion

// input: firstGraph (G[0]), secondGraph (G[2]) 1: DataSet<GradoopId> graphId = secondGraph.getGraphHead() 2: .map(new Id<G>()); 3: 4: DataSet<V> newVertices = firstGraph.getVertices() 5: .filter(new NotInGraphBroadCast<V>()) 6: .withBroadcastSet(graphId, GRAPH_ID); 7: 8: DataSet<E> newEdges = firstGraph.getEdges() 9: .filter(new NotInGraphBroadCast<E>()) 10: .withBroadcastSet(graphId, GRAPH_ID) 11: .join(newVertices) 12: .where(new SourceId<E>().equalTo(new Id<V>()) 13: .with(new LeftSide<E, V>()) 14: .join(newVertices) 15: .where(new TargetId<E>().equalTo(new Id<V>()) 16: .with(new LeftSide<E, V>());

db.G[0].exclude(db.G[2])

[2] Community | interest : Graphs| vertexCount : 4

[0] Community | interest : Databases | vertexCount : 3

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

0

1

2

3

4

5

6 7

knows since : 2014

knows since : 2014

knows since : 2013

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2013

Page 43: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

EPGM in Apache Flink – Exclusion

Id Label Properties

2 Community interest: Graphs vertexCount: 4

graphId = secondGraph.getGraphHead()

Id

2

newVertices = firstGraph.getVertices() Id Label Properties Graphs

5 Person name: Alice gender: f … [0, 2]

6 Person name: Bob gender: m … [0, 2]

9 Person name: Eve gender: f … [0]

Id Label Properties Graphs

9 Person name: Eve gender: f … [0]

.map(new Id<G>());

.filter(new NotInGraphBroadCast<V>())

.withBroadcastSet(graphId, GRAPH_ID);

Page 44: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

EPGM in Apache Flink – Exclusion newEdges = firstGraph.getEdges()

Id Label SourceId TargetId Properties Graphs

0 knows 5 6 since: 2014 [0, 2]

1 knows 6 5 since: 2014 [0, 2]

6 knows 9 5 since: 2013 [0]

7 knows 9 6 since: 2015 [0]

Id Label SourceId TargetId Properties Graphs

6 knows 9 5 since: 2013 [0]

7 knows 9 6 since: 2015 [0]

Id Label SourceId TargetId … Id Label …

6 knows 9 5 … 9 Person …

7 knows 9 6 … 9 Person …

Id Label SourceId TargetId …

6 knows 9 5 …

7 knows 9 6 …

Id Label SourceId TargetId … Id Label …

Id Label SourceId TargetId … .with(new LeftSide<E, V>());

.join(newVertices)

.where(new TargetId<E>().equalTo(new Id<V>())

.with(new LeftSide<E, V>())

.join(newVertices)

.where(new SourceId<E>().equalTo(new Id<V>())

.filter(new NotInGraphBroadCast<E>())

.withBroadcastSet(graphId, GRAPH_ID)

Page 45: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Use Case: Graph Business Intelligence

Page 46: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Use Case: Graph Business Intelligence

Business intelligence usually based on relational data warehouses Enterprise data is integrated within dimensional schema Analysis limited to predefined relationships No support for relationship-oriented data mining

Graph-based approach Integrate data sources within an instance graph by preserving original

relationships between data objects (transactional and master data) Determine subgraphs (business transaction graphs) related to business

activities Analyze subgraphs or entire graphs with aggregation queries, mining

relationship patterns, etc.

Facts

Dim 1

Dim 2

Dim 3

Page 47: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Prerequisites: Data Integration

metadata

Data Sources Enterprise Service Bus

Unified Metadata Graph

Domain expert

(1) Metadata aquisition

(2) Graph integration

Integrated Instance Graph

data

Business Transaction Graphs

(3) Subgraph Detection

Page 48: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Business Transaction Graphs

CIT ERP

Employee Name: Dave

Employee Name: Alice

Employee Name: Bob

Employee Name: Carol

Ticket Expense: 500

SalesQuotation

SalesOrder PurchaseOrder

PurchaseOrder

SalesInvoice Revenue: 5,000

PurchaseInvoice Expense: 2,000

PurchaseInvoice Expense: 1,500

sentBy

createdBy

processedBy

createdBy

openedFor

processedBy

basedOn serves

serves

bills

bills

bills

processedBy

Page 49: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Business Transaction Graphs

CIT ERP

Employee Name: Dave

Employee Name: Alice

Employee Name: Bob

Employee Name: Carol

Ticket Expense: 500

SalesQuotation

SalesOrder PurchaseOrder

PurchaseOrder

SalesInvoice Revenue: 5,000

PurchaseInvoice Expense: 2,000

PurchaseInvoice Expense: 1,500

sentBy

createdBy

processedBy

createdBy

openedFor

processedBy

processedBy

basedOn serves

serves

bills

bills

bills

Page 50: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Business Transaction Graphs

CIT ERP

Employee Name: Dave

Employee Name: Alice

Employee Name: Bob

Employee Name: Carol

Ticket Expense: 500

SalesQuotation

SalesOrder PurchaseOrder

PurchaseOrder

SalesInvoice Revenue: 5,000

PurchaseInvoice Expense: 2,000

PurchaseInvoice Expense: 1,500

sentBy

createdBy

processedBy

createdBy

openedFor

processedBy

processedBy

basedOn serves

serves

bills

bills

bills

Page 51: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Business Transaction Graphs

CIT ERP

Employee Name: Dave

Employee Name: Alice

Employee Name: Bob

Employee Name: Carol

Ticket Expense: 500

SalesQuotation

SalesOrder PurchaseOrder

PurchaseOrder

SalesInvoice Revenue: 5,000

PurchaseInvoice Expense: 2,000

PurchaseInvoice Expense: 1,500

sentBy

createdBy

processedBy

createdBy

openedFor

processedBy

processedBy

basedOn serves

serves

bills

bills

bills

Page 52: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Business Transaction Graphs

CIT ERP

Employee Name: Dave

Employee Name: Alice

Employee Name: Bob

Employee Name: Carol

Ticket Expense: 500

SalesQuotation

SalesOrder PurchaseOrder

PurchaseOrder

SalesInvoice Revenue: 5,000

PurchaseInvoice Expense: 2,000

PurchaseInvoice Expense: 1,500

sentBy

createdBy

processedBy

createdBy

openedFor

processedBy

processedBy

basedOn serves

serves

bills

bills

bills

Page 53: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Business Transaction Graphs

CIT ERP

Employee Name: Dave

Employee Name: Alice

Employee Name: Bob

Employee Name: Carol

Ticket Expense: 500

SalesQuotation

SalesOrder PurchaseOrder

PurchaseOrder

SalesInvoice Revenue: 5,000

PurchaseInvoice Expense: 2,000

PurchaseInvoice Expense: 1,500

sentBy

createdBy

processedBy

createdBy

openedFor

processedBy

processedBy

basedOn serves

serves

bills

bills

bills

Page 54: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

BTG 1

(1) BTG Extraction

BTG 2

BTG 3

BTG 4

BTG 5

BTG n

Page 55: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

(1) BTG Extraction

// generate base collection btgs = iig.callForCollection( :BusinessTransactionGraphs , {} )

Page 56: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

(2) Profit Aggregation

CIT ERP

Employee Name: Dave

Employee Name: Alice

Employee Name: Bob

Employee Name: Carol

Ticket Expense: 500

SalesQuotation

SalesOrder PurchaseOrder

PurchaseOrder

SalesInvoice Revenue: 5,000

PurchaseInvoice Expense: 2,000

PurchaseInvoice Expense: 1,500

sentBy

createdBy

processedBy

createdBy

openedFor

processedBy

processedBy

basedOn serves

serves

bills

bills

bills

Page 57: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

(2) Profit Aggregation

// generate base collection btgs = iig.callForCollection( :BusinessTransactionGraphs , {} ) // define profit aggregate function aggFunc = ( Graph g => g.V.values(“Revenue").sum() - g.V.values(“Expense").sum() )

Page 58: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

(2) Profit Aggregation

BTG 1

BTG 2

BTG 3

BTG 4

BTG 5

BTG n

∑ Revenue ∑ Expenses Net Profit

5,000 -3,000 2,000

9,000 -3,000 6,000

2,000 -1,500 500

5,000 -7,000 -2,000

10,000 -15,000 -5,000

… … …

8,000 -4,000 4,000

Page 59: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

(2) Profit Aggregation

// generate base collection btgs = iig.callForCollection( :BusinessTransactionGraphs , {} ) // define profit aggregate function aggFunc = ( Graph g => g.V.values(“Revenue").sum() - g.V.values(“Expense").sum() ) // apply aggregate function and store result at new property btgs = btgs.apply( Graph g => g.aggregate( “Profit“ , aggFunc ) )

Page 60: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

(3) BTG Clustering

BTG 1

BTG 2

BTG 3

BTG 4

BTG 5

BTG n

∑ Revenue ∑ Expenses Net Profit

5,000 -3,000 2,000

9,000 -3,000 6,000

2,000 -1,500 500

5,000 -7,000 -2,000

10,000 -15,000 -5,000

… … …

8,000 -4,000 4,000

Page 61: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

(3) BTG Clustering

// select profit and loss clusters profitBtgs = btgs.select( Graph g => g[“Profit”] >= 0 ) lossBtgs = btgs.difference(profitBtgs)

Page 62: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

(4) Cluster Characteristic Patterns

CIT ERP

Employee Name: Dave

Employee Name: Alice

Employee Name: Bob

Employee Name: Carol

Ticket Expense: 500

SalesQuotation

SalesOrder PurchaseOrder

PurchaseOrder

SalesInvoice Revenue: 5,000

PurchaseInvoice Expense: 2,000

PurchaseInvoice Expense: 1,500

sentBy

createdBy

processedBy

createdBy

openedFor

processedBy

processedBy

basedOn serves

serves

bills

bills

bills

Page 63: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

(4) Cluster Characteristic Patterns

CIT ERP

Employee Name: Dave

Employee Name: Alice

Employee Name: Bob

Employee Name: Carol

Ticket Expense: 500

SalesQuotation

SalesOrder PurchaseOrder

PurchaseOrder

SalesInvoice Revenue: 5,000

PurchaseInvoice Expense: 2,000

PurchaseInvoice Expense: 1,500

sentBy

createdBy

processedBy

createdBy

openedFor

processedBy

processedBy

basedOn serves

serves

bills

bills

bills

Page 64: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

(4) Cluster Characteristic Patterns

BTG 1

BTG 2

BTG 3

BTG 4

BTG 5

BTG n

∑ Revenue ∑ Expenses Net Profit

5,000 -3,000 2,000

9,000 -3,000 6,000

2,000 -1,500 500

5,000 -7,000 -2,000

10,000 -15,000 -5,000

… … …

8,000 -4,000 4,000

Ticket Alice

processedBy

Bob

createdBy

PurchaseOrder

Page 65: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

(4) Cluster Characteristic Patterns

// select profit and loss clusters profitBtgs = btgs.select( Graph g => g[“Profit”] >= 0 ) lossBtgs = btgs.difference(profitBtgs) // apply magic profitFreqPats = profitBtgs.callForCollection( :FrequentSubgraphs , {“Threshold”:0.7} ) lossFreqPats = lossBtgs.callForCollection( :FrequentSubgraphs , {“Threshold”:0.7} ) // determine cluster characteristic patterns trivialPats = profitFreqPats.intersect(lossFreqPats) profitCharPatterns = profitFreqPats.difference(trivialPats) lossCharPatterns = lossFreqPats.difference(trivialPats)

Page 66: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Tooling

Page 67: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Graph Definition Language (Cypher for EPGM)

Unit Testing graph analytical operators can be hard

Page 68: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Graph Definition Language (Cypher for EPGM)

Unit Testing graph analytical operators can be hard

Y U NO MAKE IT DECLARATIVE?

Page 69: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Graph Definition Language (Cypher for EPGM)

Describe expected output in unit test

Page 70: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Graph Definition Language (Cypher for EPGM)

FlinkAsciiGraphLoader Creates LogicalGraphs and GraphCollections based on ASCII graph

Based on Cypher: https://github.com/s1ck/gdl Define vertices

(alice:User {name = "Alice", age = 23}) Define edges

(alice)-[e1:knows {since = 2014}]->(bob) Define paths

(alice)-->(bob)<--(eve)-->(carol)-->(alice) Define graphs

g1:Community {title = "Graphs", memberCount = 3}[ (alice:User)-[:knows]->(bob:User) (bob)-[e:knows]->(eve:User) (eve) ]

Page 71: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Graph Definition Language (Cypher for EPGM)

Page 72: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

LDBC-Flink-Import

Linked Data Benchmark Council MapReduce-based data generator for social network data

http://ldbcouncil.org/

Page 73: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

LDBC-Flink-Import

Makes LDBC output available in Flink DataSets https://github.com/s1ck/ldbc-flink-import

1: LDBCToFlink ldbcToFlink = new LDBCToFlink( 2: "/path/to/ldbc/output", // or "hdfs://..." 3: ExecutionEnvironment.getExecutionEnvironment()); 4: 5: DataSet<LDBCVertex> vertices = ldbcToFlink.getVertices(); 6: DataSet<LDBCEdge> edges = ldbcToFlink.getEdges();

Page 74: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Current State & Future Work

Page 75: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Current State

0.0.1 First Prototype (May 2015) Hadoop MapReduce and Giraph for operator implementations Too much complexity Performance loss through serialization in HDFS/HBase

0.0.2 Using Flink as execution layer (June 2015) Basic operators

0.1 Today Improved ID handling Improved property handling More operator implementations (e.g. Equality, Bool operators) Code refactoring

0.2-SNAPSHOT Graph Pattern Matching Frequent Subgraph Mining

Page 76: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Current State Operators

Unary Binary

Grap

h Co

llect

ion

Logi

cal G

raph

Algorithms

Aggregation

Pattern Matching Projection

Summarization Equality Call *

Combination

Overlap Exclusion

Equality

Union

Intersection Difference

Gelly Library

BTG Extraction Label Propagation Graph Forecasting

Frequent Subgraphs

Top

Selection

Distinct Sort

Apply * Reduce *

Call * * auxiliary

Page 77: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Benchmark Preview

0

200

400

600

800

1000

1200

1400

1 2 4 8 16

Time [s]

# Worker

Summarization (Vertex and Edge Labels)

16x Intel(R) Xeon(R) CPU E5-2430 v2 @ 2.50GHz (12 Cores), 48 GB RAM Hadoop 2.5.2, Flink 0.9.0

slots (per node) 12 jobmanager.heap.mb 2048 taskmanager.heap.mb 40960

Foodbroker Graph (https://github.com/dbs-leipzig/foodbroker) Generates BI process data 858,624,267 Vertices, 4,406,445,007 Edges, 663GB Payload

Page 78: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Web UI Preview

Page 79: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Contributions welcome!

Code Operator implementations Performance Tuning Extend HBase Storage

Data! and Use Cases

We are researchers, we assume ... Getting real data (especially BI data) is nearly impossible

Page 80: Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink

Thank you! www.gradoop.com https://flink.apache.org http://ldbcouncil.org/ http://dbs.uni-leipzig.de/file/GradoopTR.pdf http://dbs.uni-leipzig.de/file/biiig-vldb2014.pdf https://github.com/dbs-leipzig/gradoop https://github.com/s1ck/gdl https://github.com/s1ck/ldbc-flink-import (https://github.com/s1ck/flink-neo4j)