
Page 1: Apache Flink Deep Dive

Apache Flink Deep Dive

Vasia Kalavri, Flink Committer & KTH PhD student

[email protected]

1st Apache Flink Meetup Stockholm, May 11, 2015

Page 2: Apache Flink Deep Dive

Flink Internals

● Job Life-Cycle
  ○ what happens after you submit a Flink job?

● The Batch Optimizer
  ○ how are execution plans chosen?

● Delta Iterations
  ○ how are Flink iterations special for Graph and ML apps?

Page 3: Apache Flink Deep Dive

what happens after you submit a Flink job?

Page 4: Apache Flink Deep Dive

The Flink Stack

[Stack diagram: on top, the libraries and APIs — Python, Gelly, Table, Flink ML, SAMOA, Dataflow — over the DataSet (Java/Scala), DataStream (Java/Scala), and Hadoop M/R APIs; beneath them the Batch Optimizer and the Streaming Optimizer; the Flink Runtime at the core; deployable as Local, Remote, Yarn, Tez, or Embedded.]

*current Flink master + few PRs

Page 5: Apache Flink Deep Dive

DataSet<String> text = env.readTextFile(input);

DataSet<Tuple2<String, Integer>> result = text
    .flatMap((str, out) -> {
        for (String token : str.split("\\W")) {
            out.collect(new Tuple2<>(token, 1));
        }
    })
    .groupBy(0).aggregate(SUM, 1);

Program Life-Cycle

Page 6: Apache Flink Deep Dive

Flink Client & Optimizer → Job Manager → Task Managers

DataSet<String> text = env.readTextFile(input);

DataSet<Tuple2<String, Integer>> result = text
    .flatMap((str, out) -> {
        for (String token : str.split("\\W")) {
            out.collect(new Tuple2<>(token, 1));
        }
    })
    .groupBy(0).aggregate(SUM, 1);

The client creates and submits the job graph; the Job Manager creates the execution graph and deploys tasks; the Task Managers execute the tasks and send status updates.

Example: “O Romeo, Romeo, wherefore art thou Romeo?” → (O, 1), (Romeo, 3), (wherefore, 1), (art, 1), (thou, 1)
“Nor arm, nor face, nor any other part” → (nor, 3), (arm, 1), (face, 1), (any, 1), (other, 1), (part, 1)

Page 7: Apache Flink Deep Dive

Input → Operator X → First → Operator Y → Second

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> input = env.readTextFile(inputPath);

DataSet<String> first = input.filter(str -> str.contains("Apache Flink"));
DataSet<String> second = first.filter(str -> str.length() > 40);

second.print();
env.execute();

Series of Transformations

Page 8: Apache Flink Deep Dive

DataSet Abstraction

Think of it as a collection of data elements that can be produced/recovered in several ways:

… like a Java collection
… like an RDD
… perhaps it is never fully materialized (because the program does not need it to be)
… implicitly updated in an iteration

→ this is transparent to the user

Page 9: Apache Flink Deep Dive

Romeo, Romeo, where art thou Romeo?

Load Log → Grep 1 (Search for str1), Grep 2 (Search for str2), Grep 3 (Search for str3)

Example: grep

Page 10: Apache Flink Deep Dive

Romeo, Romeo, where art thou Romeo?

Load Log → Grep 1 (Search for str1), Grep 2 (Search for str2), Grep 3 (Search for str3)

Stage 1: Create/cache Log
Subsequent stages: Grep log for matches
Caching in memory, spilling to disk if needed

Staged (batch) execution

Page 11: Apache Flink Deep Dive

Romeo, Romeo, where art thou Romeo?

Load Log → Grep 1 (Search for str1), Grep 2 (Search for str2), Grep 3 (Search for str3)

Stage 1: Deploy and start operators
Data transfer in memory, spilling to disk if needed
Note: the Log DataSet is never “created”!

Pipelined execution
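To make the contrast concrete, here is a minimal sketch of the grep program in the Java DataSet API (the input/output paths are placeholders): one log source feeds three filters, and with pipelined execution the records stream through all of them without the Log DataSet ever being materialized.

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class GrepSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // the shared source; "log.txt" is a placeholder path
        DataSet<String> log = env.readTextFile("log.txt");

        // three consumers of the same intermediate result
        DataSet<String> grep1 = log.filter(new Contains("str1"));
        DataSet<String> grep2 = log.filter(new Contains("str2"));
        DataSet<String> grep3 = log.filter(new Contains("str3"));

        grep1.writeAsText("out1");
        grep2.writeAsText("out2");
        grep3.writeAsText("out3");

        // a single job: records are pipelined from the source through all filters
        env.execute("grep");
    }

    // keeps the lines that contain the search string
    public static class Contains implements FilterFunction<String> {
        private final String needle;
        public Contains(String needle) { this.needle = needle; }
        @Override
        public boolean filter(String line) { return line.contains(needle); }
    }
}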

Page 12: Apache Flink Deep Dive


Page 13: Apache Flink Deep Dive

how are execution plans chosen?

Page 14: Apache Flink Deep Dive

Flink Batch Optimizer

Inspired by database optimizers, it creates and selects the execution plan for a user program


Page 15: Apache Flink Deep Dive

DataSet<Tuple5<Integer, String, String, String, Integer>> orders = ...
DataSet<Tuple2<Integer, Double>> lineitems = ...

DataSet<Tuple2<Integer, Integer>> filteredOrders = orders
    .filter(...)
    .project(0, 4).types(Integer.class, Integer.class);

DataSet<Tuple3<Integer, Integer, Double>> lineitemsOfOrders = filteredOrders
    .join(lineitems)
    .where(0).equalTo(0)
    .projectFirst(0, 1).projectSecond(1)
    .types(Integer.class, Integer.class, Double.class);

DataSet<Tuple3<Integer, Integer, Double>> priceSums = lineitemsOfOrders
    .groupBy(0, 1).aggregate(Aggregations.SUM, 2);

priceSums.writeAsCsv(outputPath);

A Simple Program

Page 16: Apache Flink Deep Dive

[Two alternative plans for the same program, both reading orders.tbl (through a Filter/Map) and lineitem.tbl, and joining them with a Hybrid Hash Join. Plan A: broadcast the filtered orders (build side of the hash table), forward lineitem (probe side), then a Combine followed by a sorted GroupReduce on hash-partitioned [0,1]. Plan B: hash-partition both inputs on [0], join, and forward the result into the sorted GroupReduce, reusing the existing partitioning.]

Best plan depends on relative sizes of input files

Alternative Execution Plans

Page 17: Apache Flink Deep Dive


Page 18: Apache Flink Deep Dive

● Evaluates physical execution strategies
  ○ e.g. hash join vs. sort-merge join
● Chooses data shipping strategies
  ○ e.g. broadcast vs. partition
● Reuses partitioning and sort orders
● Decides to cache loop-invariant data in iterations

Optimization Examples
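You can inspect the plan the optimizer picked for your own program: a minimal sketch that dumps the chosen plan as JSON before execution (the JSON can be rendered with the plan visualizer tool that ships with Flink).

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// ... build the program, e.g. the orders/lineitems job above ...
// print the chosen execution plan as JSON instead of running the job
System.out.println(env.getExecutionPlan());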

Page 19: Apache Flink Deep Dive

case class PageVisit(url: String, ip: String, userId: Long)
case class User(id: Long, name: String, email: String, country: String)

// get your data from somewhere
val visits: DataSet[PageVisit] = ...
val users: DataSet[User] = ...

// filter the users data set
val germanUsers = users.filter((u) => u.country.equals("de"))

// join data sets
val germanVisits: DataSet[(PageVisit, User)] =
  // equi-join condition (PageVisit.userId = User.id)
  visits.join(germanUsers).where("userId").equalTo("id")

Example: Distributed Joins

The join operator needs to create all pairs of elements from the two inputs for which the join condition evaluates to true.

Page 20: Apache Flink Deep Dive

Example: Distributed Joins

● Ship Strategy: the input data is distributed across all parallel instances that participate in the join
● Local Strategy: each parallel instance runs a join algorithm on its local partition

For both steps, there are multiple valid strategies that are favorable in different situations.

Page 21: Apache Flink Deep Dive

Repartition-Repartition Strategy

Partitions both inputs using the same partitioning function.

All elements that share the same join key are shipped to the same parallel instance and can be locally joined.


Page 22: Apache Flink Deep Dive

Broadcast-Forward Strategy

Sends one complete data set to each parallel instance that holds a partition of the other data.

The other DataSet remains local and is not shipped at all.


Page 23: Apache Flink Deep Dive

The optimizer computes cost estimates for the candidate execution plans and picks the “cheapest” one, taking into account:
● the amount of data shipped over the network
● whether the data of one input is already partitioned

R-R cost: a full shuffle of both data sets over the network.
B-F cost: depends on the size of the broadcast data set and the number of parallel instances.

Read more: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html

How does the Optimizer choose?
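If you know your inputs better than the optimizer's estimates do, the DataSet API also lets you force a strategy with a join hint. A minimal Java sketch with toy stand-in data sets (the JoinHint values come from org.apache.flink.api.common.operators.base.JoinOperatorBase; availability may depend on your Flink version):

import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class JoinHintSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // tiny stand-ins for the visits and users sets: (userId, payload)
        DataSet<Tuple2<Long, String>> visits =
            env.fromElements(new Tuple2<>(1L, "/home"), new Tuple2<>(2L, "/docs"));
        DataSet<Tuple2<Long, String>> users =
            env.fromElements(new Tuple2<>(1L, "alice"), new Tuple2<>(2L, "bob"));

        // broadcast-forward: ship the (small) second input to every instance
        visits.join(users, JoinHint.BROADCAST_HASH_SECOND)
              .where(0).equalTo(0)
              .print();

        // repartition-repartition: hash-partition both inputs on the join key,
        // building the hash table from the second input
        visits.join(users, JoinHint.REPARTITION_HASH_SECOND)
              .where(0).equalTo(0)
              .print();
    }
}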

Page 24: Apache Flink Deep Dive

how are Flink iterations special?

Page 25: Apache Flink Deep Dive

● a for/while loop in the client submits one job per iteration step
● data reuse by caching in memory and/or on disk

Client drives the loop: Step → Step → Step → Step → Step (see the sketch below)

Iterate by unrolling
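A minimal sketch of what unrolling means in practice (the scratch path and the step function are made up for illustration): every round reloads the previous result and submits a brand-new job, paying scheduling overhead each time.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class UnrolledLoopSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        String scratch = "/tmp/state-";  // placeholder scratch location

        for (int i = 0; i < 5; i++) {
            DataSet<String> state = (i == 0)
                ? env.fromElements("1")                 // initial state
                : env.readTextFile(scratch + (i - 1)); // reload the previous result

            // a toy step function: increment the value
            DataSet<String> next = state.map(new MapFunction<String, String>() {
                @Override
                public String map(String s) {
                    return String.valueOf(Long.parseLong(s) + 1);
                }
            });

            next.writeAsText(scratch + i);
            env.execute("step " + i);  // one full job per iteration step
        }
    }
}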

Page 26: Apache Flink Deep Dive

Native Iterations

● the runtime is aware of the iterative execution
● no scheduling overhead between iterations
● caching and state maintenance are handled automatically

Caching loop-invariant data, pushing work “out of the loop”, maintaining state as an index
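By contrast, a native (bulk) iteration is a single job: the loop lives inside the dataflow and the runtime schedules it once. A minimal sketch with a toy step function:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

public class BulkIterationSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // open an iteration over the initial data set, at most 10 rounds
        IterativeDataSet<Long> loop = env.fromElements(0L).iterate(10);

        // the step function applied in every round; a toy increment here
        DataSet<Long> stepResult = loop.map(new MapFunction<Long, Long>() {
            @Override
            public Long map(Long value) {
                return value + 1;
            }
        });

        // feed the step result back into the loop and close the iteration
        DataSet<Long> result = loop.closeWith(stepResult);
        result.print();
    }
}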

Page 27: Apache Flink Deep Dive

Flink Iteration Operators

[Diagram: the Iterate operator feeds the Input through an Iterative Update Function and replaces the partial solution with the Result in every round. The IterateDelta operator keeps the Solution Set as State next to a Workset; the Iterative Update Function consumes the Workset, and its Result updates the Solution Set and forms the next Workset.]

Page 28: Apache Flink Deep Dive

Delta Iteration

● Not all the elements of the state are updated in each iteration.

● The elements that require an update are stored in the workset.

● The step function is applied only to the workset elements.


Page 29: Apache Flink Deep Dive

Partition a graph into components by iteratively propagating the min vertex ID among neighbors

Example: Connected Components

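A condensed sketch of connected components as a delta iteration, with a toy graph inlined (Flink ships a complete version of this example; for an undirected graph the edges would be added in both directions):

import org.apache.flink.api.common.functions.FlatJoinFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.aggregation.Aggregations;
import org.apache.flink.api.java.operators.DeltaIteration;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class ConnectedComponentsSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // toy graph: two components, {1, 2, 3} and {4, 5}
        DataSet<Tuple2<Long, Long>> edges = env.fromElements(
            new Tuple2<>(1L, 2L), new Tuple2<>(2L, 3L), new Tuple2<>(4L, 5L));

        // initial state: every vertex is its own component (vertexId, componentId)
        DataSet<Tuple2<Long, Long>> vertices = env.fromElements(
            new Tuple2<>(1L, 1L), new Tuple2<>(2L, 2L), new Tuple2<>(3L, 3L),
            new Tuple2<>(4L, 4L), new Tuple2<>(5L, 5L));

        // solution set = vertex states keyed on field 0; initial workset = all vertices
        DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
            vertices.iterateDelta(vertices, 100, 0);

        DataSet<Tuple2<Long, Long>> changes = iteration.getWorkset()
            // send each changed vertex's component id to its neighbors
            .join(edges).where(0).equalTo(0)
            .with(new JoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>>() {
                @Override
                public Tuple2<Long, Long> join(Tuple2<Long, Long> vertex, Tuple2<Long, Long> edge) {
                    return new Tuple2<>(edge.f1, vertex.f1); // (neighborId, candidate id)
                }
            })
            // keep the minimum candidate per vertex
            .groupBy(0).aggregate(Aggregations.MIN, 1)
            // emit only the vertices whose component id actually shrank
            .join(iteration.getSolutionSet()).where(0).equalTo(0)
            .with(new FlatJoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>>() {
                @Override
                public void join(Tuple2<Long, Long> candidate, Tuple2<Long, Long> current,
                                 Collector<Tuple2<Long, Long>> out) {
                    if (candidate.f1 < current.f1) {
                        out.collect(candidate);
                    }
                }
            });

        // the changes update the solution set and form the next workset;
        // the iteration stops when the workset becomes empty
        DataSet<Tuple2<Long, Long>> components = iteration.closeWith(changes, changes);
        components.print();
    }
}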

Page 30: Apache Flink Deep Dive

Delta-Connected Components


Page 31: Apache Flink Deep Dive


Page 32: Apache Flink Deep Dive

Performance


Page 33: Apache Flink Deep Dive

Read the documentation and our blog posts!

● Memory Management
● Serialization and Type Extraction
● Streaming Optimizations
● Fault-Tolerance

Want to learn more?

Page 34: Apache Flink Deep Dive

Apache Flink Deep Dive

Vasia Kalavri, Flink Committer & KTH PhD student

[email protected]

1st Apache Flink Meetup Stockholm, May 11, 2015