45
APACHE FLINK TM STREAM AND BATCH PROCESSING IN A SINGLE ENGINE Akshaya Kalyanaraman

APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

APACHE FLINKTM STREAM AND BATCH PROCESSING IN

A SINGLE ENGINE

Akshaya Kalyanaraman

Page 2: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

WHAT IS APACHE FLINK?

Batch Processing

process static and historic data

Data Stream Processing

realtime results from data streams

Event-driven Applications

data-driven actions and services

Page 3: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

MOTIVATION

•  In Lambda Architecture: Two separate execution engines for batch and streaming

• Unification of Batch and Stream Processing in a single framework, Flink.

• Apache Flink provides a highly flexible windowing mechanism.

•  Flink supports different notions of time.

Page 4: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

STREAM ANALYTICS

Page 5: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

NOTIONS OF TIME Event Time

Time when event happened. 12:23 am

1:37 pm

Processing Time

Time measured by system clock

Page 6: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

STATEFUL STREAMING Stateless Stream

Processing Stateful Stream

Processing

Op Op

State

Page 7: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

PROCESSING SEMANTICS

At-least once May over-count after

failure

Exactly Once Correct counts after

failures

End-to-end exactly once Correct counts in external system (e.g. DB, file system)

after failure

Page 8: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

PROCESSING SEMANTICS

Flink guarantees exactly once

End-to-end exactly once with specific sources and sinks (e.g. Kafka -> Flink -> HDFS)

Internally, Flink periodically takes consistent snapshots of the state without ever stopping computation

Page 9: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

WINDOWING •  Window configured using assigner and optionally trigger and evictor.

•  Assigner: assigns each record to logical windows.

•  Trigger: defines when the operation associated with the window definition is performed.

•  Evictor: determines which records to retain within each window.

Page 10: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

WINDOWING : EXAMPLE

•  Below is a window definition with a range of 6 seconds that slides every 2 seconds (the assigner).

•  The window results are computed once the watermark passes the end of the window (the trigger).

stream

.window(SlidingTimeWindows.of(Time.of(6,

SECONDS), Time.of(2, SECONDS))

.trigger(EventTimeTrigger.create())

Page 11: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

WINDOWING : EXAMPLE

•  A global window creates a single logical group. •  The example defines a global window (i.e., the assigner)

that invokes the operation on every 1000 events (i.e., the trigger) while keeping the last 100 elements (i.e., the evictor).

stream

.window(GlobalWindow.create())

.trigger(Count.of(1000))

.evict(Count.of(100))

Page 12: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Deployment Local Single JVM

Cluster Standalone,YARN, Mesos

Cloud AWS, Google

Core Runtime

Distributed Streaming Dataflow

DataSet API Batch Processing

FLINK STACK

DataStream API Stream Processing

API &

Libraries

FlinkML Machine Learning

Gelly Graph Processing

Table Relational

CEP Event Processing

Table Relational

Page 13: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

FLINK PROCESS MODEL

JobManager

Master

Client

TaskManager Worker

TaskManager Worker

TaskManager Worker

TaskManager Worker

User System

public class WordCount {

public static void main(String[] args) throws Exception { // Flink’s entry point StreamExecutionEnvironment env = StreamExecutionEnvironment

.getExecutionEnvironment();

DataStream<String> data = env.fromElements( "O Romeo, Romeo! wherefore art thou Romeo?", "Deny thy father and refuse thy name", "Or, if thou wilt not, be but sworn my love,", "And I'll no longer be a Capulet.");

// Split by whitespace to (word, 1) and sum up ones DataStream<Tuple2<String, Integer>> counts = data

.flatMap(new SplitByWhitespace())

.keyBy(0)

.timeWindow(Time.of(10, TimeUnit.SECONDS))

.sum(1);

counts.print();

// Today: What happens now? env.execute();

} }

Submit Program

Schedule

Execute

Page 14: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

CLIENT

Translates the API code to a data flow graph called JobGraph and

submits it to the JobManager.

Source

Transform

Sink

public class WordCount {

public static void main(String[] args) throws Exception { // Flink’s entry point StreamExecutionEnvironment env = StreamExecutionEnvironment

.getExecutionEnvironment();

DataStream<String> data = env.fromElements(

"O Romeo, Romeo! wherefore art thou Romeo?", "Deny thy father and refuse thy name", "Or, if thou wilt not, be but sworn my love,", "And I'll no longer be a Capulet.");

// Split by whitespace to (word, 1) and sum up ones

DataStream<Tuple2<String, Integer>> counts = data .flatMap(new SplitByWhitespace()) .keyBy(0) .timeWindow(Time.of(10, TimeUnit.SECONDS)) .sum(1);

counts.print();

// Today: What happens now? env.execute();

}

}

Translate

Page 15: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

JOBMANAGER • All coordination via JobManager (master):

Scheduling programs for execution Checkpoint coordination Monitoring workers

Actor System

Scheduling

Checkpoint Coordination

Page 16: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

TASK MANAGER

11

Task Slot Task Slot Task Slot Task Slot

Actor System

• All data processing in TaskManager (worker):

•  Communicate with JobManager via Actor messages •  Exchange data between themselves via dedicated data

connections •  Expose task slots for execution

I/O Manager

Memory Manager

Page 17: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

SCHEDULING

12

• • •

Each ExecutionVertex will be executed one or more times The JobManager maps Execution to task slots Pipelined execution in same slot where applicable

p=4 p=4 p=3

All to all Pointwise

TaskManager 1 TaskManager 2

Page 18: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

SAMPLE QUERY

•  dataStream

•  count = input.map {m.split(”1")};

•  .keyBy(count%2); //keys by odd count (1) or even count (0)

•  .window(TumblingEventTimeWindows.of(Time.seconds(3)));

•  .apply (new CoGroupFunction () {...});

•  .reduce(count);

Page 19: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

SCHEDULING

Page 20: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

E X E C U T I O N I N S L O T S

Page 21: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Map Pipelined Result

1101 0101 0100

PIPELINED RESULTS

1101 0101

Page 22: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Map Pipelined Result 1101 0101

0100

PIPELINED RESULTS

1101 0101

Page 23: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Map Pipelined Result

3 0101 0100

PIPELINED RESULTS

1101 0101

Page 24: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Map Pipelined Result KeyBy

3 0101 0100

PIPELINED RESULTS

1101 0101

Page 25: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Map Pipelined Result KeyBy

3 0101 0100

PIPELINED RESULTS

1101 0101

Page 26: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Map Pipelined Result KeyBy

3 0101

0100

PIPELINED RESULTS

1101 0101

Page 27: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Map Pipelined Result KeyBy

Odd (1 record)

0101 0100

PIPELINED RESULTS

1101 0101

Page 28: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Map Pipelined Result KeyBy

2

0100

PIPELINED RESULTS

1101 0101

Odd (1 record)

Page 29: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Map Pipelined Result KeyBy

0100

PIPELINED RESULTS

1101 0101

2 2

Odd (1 record)

Page 30: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Map Pipelined Result Reduce

0100

PIPELINED RESULTS

1101 0101

Odd (1 record)

Even (1 record)

Page 31: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Map Pipelined Result Reduce

1101 0101

1

PIPELINED RESULTS

1101 0101

Odd (1 record)

Even (1 record)

Page 32: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Map Pipelined Result Reduce

1101 0101

1

PIPELINED RESULTS

1101

0101

Odd (1 record)

Even (1 record)

Page 33: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Map Pipelined Result Reduce

1101 0101

PIPELINED RESULTS

1101

0101

Odd (2 records)

Even (1 record)

Page 34: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Map Pipelined Result Reduce

PIPELINED RESULTS

3

0101

Odd (2 records)

Even (1 record)

Page 35: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Map Pipelined Result Reduce

PIPELINED RESULTS

3 0101

Odd (2 records)

Even (1 record)

Page 36: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Map Pipelined Result Reduce

PIPELINED RESULTS

0101 Odd (3 records)

Even (1 record)

Page 37: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Map Pipelined Result Reduce

PIPELINED RESULTS

2

Odd (3 records)

Even (1 record)

Page 38: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

Map Pipelined Result Reduce

PIPELINED RESULTS

Odd (3 records)

Even (2 records)

Page 39: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

LATENCY AND THROUGHPUT •  When a data record is ready on the producer side, it is serialized and split into one or

more buffers.

•  A buffer is sent to a consumer either when it is full or when a timeout condition is reached.

•  High throughput and low latency is achieved.

Page 40: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

LATENCY AND THROUGHPUT

Page 41: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

FAULT TOLERANCE ASYNCHRONOUS BARRIER SNAPSHOTTING •  An operator receives barriers from upstream and first performs an alignment phase.

•  Then, the operator writes its state to durable storage.

•  Once the state has been backed up, the operator forwards the barrier downstream.

•  Eventually, all operators will register a snapshot of their state and a global snapshot will be complete.

Page 42: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

FAULT TOLERANCE

Page 43: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

COMPARISON WITH NAIAD

•  Both Flink and Naiad make use of snapshotting mechanism for fault tolerance.

•  Both Apache Flink and Naiad frameworks combine batch processing and stream processing.

•  Both the frameworks support high throughput and low latency.

•  NAIAD performs iterative and incremental computations, while Flink performs primarily data processing of stream and batch data.

Page 44: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

CONCLUSION

•  Apache Flink is designed to perform both stream and batch analytics.

•  The streaming API provides the means to keep recoverable state and to partition, transform, and aggregate data stream windows.

•  Flink treats batch computations by optimizing their execution using a query optimizer.

Page 45: APACHE FLINK STREAM AND BATCH PROCESSING IN A SINGLE …pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-akshaya-flink.pdf · • Apache Flink is designed to perform both stream and

QUESTIONS?