81
Apache Flink Meetup Berlin #6 April 29, 2015 Ufuk Celebi [email protected] Unified Batch & Stream Processing in Apache Flink

Apache Flink Meetup Berlin #6: Unified Batch & Stream Processing in Apache Flink

  • Upload
    ucelebi

  • View
    354

  • Download
    4

Embed Size (px)

Citation preview

Apache Flink Meetup Berlin #6 April 29, 2015

!Ufuk Celebi

[email protected]

Unified Batch & Stream Processing

in Apache Flink

2

Batch vs. Stream Processing

Batch Processing

High Latency

Batch Processors

Static Files

Stream Processing

Low Latency

Stream Processors

Event Streams

What's the difference between a Batch and Streaming Runtime?

4

Batch Processing

HDFSO Romeo, Romeo,

wherefore art thou Romeo?

map

Read complete

fileRomeo, 1Romeo, 1

wherefore, 1art, 1

thou, 1Romeo, 1

Write out intermediate

data set

reduce

HDFSRomeo, 3

wherefore, 1art, 1

thou, 1

Write out complete

result

5

Stream Processing

Kafka

map

reduce

Kafka

Romeo, 1Romeo, 1

wherefore, 1art, 1

thou, 1Romeo, 1

Keep data moving

Romeo, 3wherefore, 1

art, 1thou, 1

Write out incremental

updates

O

Read incremental

updates

Romeo Romeowherefore art thou Romeo

6

andBatch vs. Stream

Processing

7

Do we really need two separate systems for this?

8

DataSet (Java/Scala) DataStream (Java/Scala)

Flink Runtime

BatchProcessing

StreamProcessing

Apache Flink

9

Flink Runtime

Operator Intermediate Result

10

Map

Operators represent computations over data.

JoinReduce CoCroup ...User

Sort Hash ...Internal

Operators

11

Map

Join Sum

Operators

12

Logical handle to data produced by operators.

Intermediate Results

Intermediate Result

13

Map

Join Sum

Intermediate Results

Logical handle to data produced by operators.

14

Map-Reduce Example

DataSet<Tuple2<String, Integer>> counts; !counts = input .flatMap(new LineSplitter()) .groupBy(0) .sum(1);

Logical handle to data produced by operators.

Map

Intermediate Result

Reduce

15

Result Characteristics

Pipelined vs. Blocking

Ephemeral vs. Checkpointed

16

Result Characteristics

Pipelined vs. Blocking

Ephemeral vs. Checkpointed

How and when to do data exchange?

Map Blocking Result

17

Blocking Results

110101010100

Map Blocking Result

17

Blocking Results

110101010100

Map Blocking Result

17

Blocking Results

110101010100

Map Blocking Result

17

Blocking Results

11010101

0100

Map Blocking Result

17

Blocking Results

11010101

0100

Map Blocking Result

17

Blocking Results

110101010100

Map Blocking Result

17

Blocking Results

110101010100

Map Blocking Result Reduce

17

Blocking Results

110101010100

Map Blocking Result Reduce

17

Blocking Results

110101010100

Map Blocking Result Reduce

17

Blocking Results

110101010100

Map Blocking Result Reduce

17

Blocking Results

11010101

0100

Map Blocking Result Reduce

17

Blocking Results

110101010100

Map Pipelined Result

18

Pipelined Results

110101010100

Map Pipelined Result

18

Pipelined Results

110101010100

Map Pipelined Result

18

Pipelined Results

110101010100

Map Pipelined Result Reduce

18

Pipelined Results

110101010100

Map Pipelined Result Reduce

18

Pipelined Results

110101010100

Map Pipelined Result Reduce

18

Pipelined Results

11010101

0100

Map Pipelined Result Reduce

18

Pipelined Results

11010101

0100

Map Pipelined Result Reduce

18

Pipelined Results

11010101

0100

Map Pipelined Result Reduce

18

Pipelined Results

11010101

0100

Map Pipelined Result Reduce

18

Pipelined Results

110101010100

Map Pipelined Result Reduce

18

Pipelined Results

11010101

0100

Map Pipelined Result Reduce

18

Pipelined Results

110101010100

19

Result Characteristics

Ephemeral Checkpointed

Pipelined

Blocking

Low-latency Low-latency

Easy to reason aboutresource consumption

Easy to reason aboutresource consumption

20

Result Characteristics

Pipelined vs. Blocking

Ephemeral vs. Checkpointed

How long to keep results around?

Map Ephemeral Result

21

Ephemeral Results

110101010100

Map Ephemeral Result

21

Ephemeral Results

110101010100

Map Ephemeral Result

21

Ephemeral Results

110101010100

Map Ephemeral Result Reduce

21

Ephemeral Results

110101010100

Map Ephemeral Result Reduce

21

Ephemeral Results

110101010100

Map Ephemeral Result Reduce

21

Ephemeral Results

11010101

0100

Map Ephemeral Result Reduce

21

Ephemeral Results

11010101

0100

Map Ephemeral Result Reduce

21

Ephemeral Results

11010101

0100

Map Ephemeral Result Reduce

21

Ephemeral Results

11010101

0100

Map Ephemeral Result Reduce

21

Ephemeral Results

1101

01010100

Map Ephemeral Result Reduce

21

Ephemeral Results

1101

01010100

Map Ephemeral Result Reduce

21

Ephemeral Results

1101

01010100

Map Ephemeral Result Reduce

21

Ephemeral Results

11010101

0100

Map Ephemeral Result Reduce

21

Ephemeral Results

11010101

0100

Map Ephemeral Result Reduce

21

Ephemeral Results

11010101

0100

Map Ephemeral Result Reduce

21

Ephemeral Results

11010101

0100

Map Ephemeral Result Reduce

21

Ephemeral Results

11010101

0100

Map Checkpointed Result Reduce

22

Checkpointed Results

110101010100

Map Checkpointed Result Reduce

22

Checkpointed Results

110101010100

Map Checkpointed Result Reduce

22

Checkpointed Results

110101010100

Map Checkpointed Result Reduce

22

Checkpointed Results

110101010100

1101

Map Checkpointed Result Reduce

22

Checkpointed Results

11010101

0100

1101

Map Checkpointed Result Reduce

22

Checkpointed Results

11010101

0100

1101

Map Checkpointed Result Reduce

22

Checkpointed Results

11010101

0100

11010101

Map Checkpointed Result Reduce

22

Checkpointed Results

11010101010011010101

Map Checkpointed Result Reduce

22

Checkpointed Results

110101010100

11010101

Map Checkpointed Result Reduce

22

Checkpointed Results

110101010100

110101010100

Map Checkpointed Result Reduce

22

Checkpointed Results

110101010100

110101010100

Map Checkpointed Result Reduce

22

Checkpointed Results

110101010100

11010101

0100

Map Checkpointed Result Reduce

22

Checkpointed Results

110101010100

110101010100

23

Result Characteristics

Ephemeral Checkpointed

Pipelined

Blocking

Low-latency Low-latency

Fine-grained fault tolerance

Fine-grained fault tolerance

Easy to reason aboutresource consumption

Easy to reason aboutresource consumption

24

Benefits

• Very flexible design

• Decouples high-level requirements from runtime • Fault tolerance for batch vs. streaming • Different program optimization paths • Iterative programs • Interactive queries

25

Use cases

26

Gra

ph: G

elly

Flin

k M

L

Tabl

e

26

Optimizer Stream Builder

Flink Runtime

DataSet (Java/Scala) DataStream (Java/Scala)

Flink Stack

Tabl

e

27

Current Implementations

Pipelined vs. Blocking

Ephemeral vs. Checkpointed

Backpressure vs. No Backpressure

28

Current Implementations

Pipelined vs. Blocking

Ephemeral vs. Checkpointed

Backpressure vs. No Backpressure

• Default type for both batch and streaming programs: Pipelined

• In batch mode only: use blocking exchange if necessary (e.g. to avoid deadlocks or break up long pipelines)

• More details: https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks

29

Pipelined vs. Blocking in Flink

30

ExecutionConfig

// Set up the execution environmentExecutionEnvironment env = ExecutionEnvironment .getExecutionEnvironment(); ExecutionConfig conf = env.getConfig();

// Remote data exchange is blocking (local is pipelined)conf.setExecutionMode(ExecutionMode.BATCH);

// Remote and local data exchange is blockingconf.setExecutionMode(ExecutionMode.BATCH_FORCED);

// Remote and local data exchange is pipelined except// when necessary to avoid deadlocks etc. [DEFAULT]conf.setExecutionMode(ExecutionMode.PIPELINED);

// Remote and local data exchange is always pipelinedconf.setExecutionMode(ExecutionMode.PIPELINED_FORCED);

flink.apache.org@ApacheFlink