Streaming Dataflow with Apache Flink

Ufuk Celebi uce@apache.org

HUG London October 15, 2015

Recent History

April '14: Project Incubation
December '14: Top Level Project
April '15

Releases: v0.5, v0.6, v0.7, v0.8, v0.9

Currently moving towards the 0.10 and 1.0 releases.

What is Flink?

Streaming Topologies: Low Latency
(Stream → Time Window → Count)

Long Batch Pipelines: Resource Utilization

Machine Learning: Iterative Algorithms
(Rating Matrix = User Matrix x Item Matrix)

Graph Analysis: Mutable State

Overview

Deployment Local (Single JVM) · Cluster (Standalone, YARN)

DataStream API Unbounded Data

DataSet API Bounded Data

Runtime Distributed Streaming Data Flow

Libraries Machine Learning · Graph Processing · SQL-like API

Stream Processing

Real-world data is unbounded and is pushed to systems.

Batch · Streaming

Stream Platform Architecture

Server Logs

Trxn Logs

Sensor Logs

Downstream Systems

– Analyze and correlate streams
– Create derived streams

– Gather and back up streams
– Offer streams

Cornerstones of Flink

Low Latency for fast results.

High Throughput to handle many events per second.

Exactly-once guarantees for correct results.

Intuitive APIs for productivity.

DataStream API

StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<String> data = env.fromElements(
    "O Romeo, Romeo! wherefore art thou Romeo?", ...);

// DataStream Windowed WordCount
// (SplitByWhitespace is a user-defined FlatMapFunction, not shown on the slide)
DataStream<Tuple2<String, Integer>> counts = data
    .flatMap(new SplitByWhitespace())           // (word, 1)
    .keyBy(0)                                   // [word, [1, 1, …]] for 10 seconds
    .timeWindow(Time.of(10, TimeUnit.SECONDS))
    .sum(1);                                    // sum per word per 10 second window

counts.print();

env.execute();
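The windowing step of the word count above can be illustrated without a cluster. The following is a minimal plain-Java sketch, not Flink API: it assumes every line carries a non-negative timestamp in milliseconds and assigns it to a fixed-size (tumbling) window. The names `TumblingWordCount`, `Event`, and `countPerWindow` are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TumblingWordCount {

    /** A line of text with a timestamp in milliseconds (hypothetical type). */
    static final class Event {
        final long timestampMillis;
        final String line;

        Event(long timestampMillis, String line) {
            this.timestampMillis = timestampMillis;
            this.line = line;
        }
    }

    /**
     * Counts words per tumbling window.
     * Returns: window start (ms) -> (word -> count). Timestamps are assumed to be >= 0.
     */
    static Map<Long, Map<String, Integer>> countPerWindow(List<Event> events, long windowMillis) {
        Map<Long, Map<String, Integer>> result = new TreeMap<>();
        for (Event e : events) {
            // Every timestamp belongs to exactly one window of size windowMillis.
            long windowStart = e.timestampMillis - (e.timestampMillis % windowMillis);
            Map<String, Integer> counts =
                result.computeIfAbsent(windowStart, k -> new TreeMap<>());
            // Split by whitespace, then sum a 1 for each word (punctuation stays attached).
            for (String word : e.line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(new Event(1_000L, "O Romeo, Romeo!"));
        events.add(new Event(9_000L, "wherefore art thou Romeo?"));
        events.add(new Event(12_000L, "Romeo?"));
        // The first two lines fall into the window starting at 0, the third into the next one.
        System.out.println(countPerWindow(events, 10_000L));
    }
}
```

Unlike this sketch, a stream processor does not see all events up front; it emits the result of a window once that window closes.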

Pipelining

Source → Tokenizer → Window Count
(parallel instances: s1 → t1 → w1, s2 → t2 → w2)

Complete pipeline online concurrently.

Chained tasks · Pipelined shuffle


Streaming Fault Tolerance

At Least Once
• Ensure that all operators see all events.

Exactly Once
• Ensure that all operators see all events.
• Do not perform duplicate updates to operator state.

Flink guarantees exactly once processing.
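The difference between the two guarantees can be made concrete with a small simulation. This is a plain-Java sketch under simplifying assumptions, not Flink code: a single summing operator, a replayable source addressed by offset, and exactly one failure. The class and method names are hypothetical. After the failure the source always rewinds to the last checkpointed offset; the operator state is rolled back only in the exactly-once case.

```java
import java.util.Arrays;
import java.util.List;

final class DeliveryGuarantees {

    /**
     * Sums all events while failing once, right after `failAfter` events were processed.
     * A checkpoint (source offset + operator state) is taken every `checkpointInterval` events.
     * On recovery the source rewinds to the checkpointed offset. The operator state is
     * rolled back to the checkpoint only if `restoreState` is true.
     */
    static long runWithOneFailure(List<Integer> events, int checkpointInterval,
                                  int failAfter, boolean restoreState) {
        long state = 0;            // running sum held by the operator
        int checkpointOffset = 0;  // source position stored in the last checkpoint
        long checkpointState = 0;  // operator state stored in the last checkpoint
        boolean failed = false;

        int offset = 0;
        while (offset < events.size()) {
            if (!failed && offset == failAfter) {
                failed = true;
                offset = checkpointOffset;      // replay from the last checkpoint
                if (restoreState) {
                    state = checkpointState;    // exactly once: undo updates made since then
                }
                continue;
            }
            state += events.get(offset);
            offset++;
            if (offset % checkpointInterval == 0) {
                checkpointOffset = offset;
                checkpointState = state;
            }
        }
        return state;
    }

    public static void main(String[] args) {
        List<Integer> events = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        System.out.println("exactly once:  " + runWithOneFailure(events, 4, 6, true));
        System.out.println("at least once: " + runWithOneFailure(events, 4, 6, false));
    }
}
```

With the events 1 to 10, a checkpoint every 4 events, and a failure after 6 events, the exactly-once run yields 55, while the run without state rollback yields 66: events 5 and 6 are replayed and applied to the state a second time.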

Distributed Snapshots

Barriers flow through the topology in line with the data.

[Figure label: Part of snapshot]

[Figure sequence: JobManager (Master), State Backend, and the Checkpoint Data table]

Checkpoint Data
Source 1:    State 1:
Source 2:    State 2:
Source 3:    Sink 1:
Source 4:    Sink 2:

Source offsets: 6791, 7252, 5589, 6843

• JobManager: Start Checkpoint message.
• Sources: emit barriers and acknowledge with position (Source 1: 6791, Source 2: 7252, Source 3: 5589).
• Operator: received barrier at each input.
• Operator (s1): write snapshot of its state.
• Operator: acknowledge with pointer to state (State 1: PTR1, State 2: PTR2).
• Sink: received barrier at each input, acknowledge checkpoint (Sink 1: ACK).
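The "received barrier at each input" step can be sketched as follows. This is a simplified, single-threaded plain-Java illustration of barrier alignment, not Flink's implementation; `BarrierAlignment`, `Element`, and `SummingOperator` are hypothetical names. The operator holds back records from inputs that have already delivered their barrier and snapshots its state once the barrier has arrived on every input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BarrierAlignment {

    /** One element arriving at the operator: a record or a checkpoint barrier. */
    static final class Element {
        final int channel;
        final boolean isBarrier;
        final long value;

        private Element(int channel, boolean isBarrier, long value) {
            this.channel = channel;
            this.isBarrier = isBarrier;
            this.value = value;
        }

        static Element record(int channel, long value) { return new Element(channel, false, value); }
        static Element barrier(int channel) { return new Element(channel, true, 0L); }
    }

    /** Summing operator that snapshots its state once all inputs have delivered a barrier. */
    static final class SummingOperator {
        private final boolean[] blocked;                       // inputs that delivered the barrier
        private final List<Element> buffered = new ArrayList<>();
        private int barriersSeen = 0;
        private long sum = 0;
        final List<Long> snapshots = new ArrayList<>();        // stands in for the state backend

        SummingOperator(int numChannels) {
            this.blocked = new boolean[numChannels];
        }

        void onElement(Element e) {
            if (blocked[e.channel]) {
                // Arrived after this input's barrier: belongs to the next checkpoint, hold it back.
                buffered.add(e);
                return;
            }
            if (!e.isBarrier) {
                sum += e.value;
                return;
            }
            blocked[e.channel] = true;
            barriersSeen++;
            if (barriersSeen == blocked.length) {
                completeCheckpoint();
            }
        }

        private void completeCheckpoint() {
            snapshots.add(sum);                 // state reflects exactly the pre-barrier records
            Arrays.fill(blocked, false);
            barriersSeen = 0;
            List<Element> pending = new ArrayList<>(buffered);
            buffered.clear();
            for (Element p : pending) {
                onElement(p);                   // release held-back elements in arrival order
            }
        }

        long currentSum() { return sum; }
    }

    public static void main(String[] args) {
        SummingOperator op = new SummingOperator(2);
        op.onElement(Element.record(0, 1));
        op.onElement(Element.record(1, 10));
        op.onElement(Element.barrier(0));
        op.onElement(Element.record(0, 2));   // held back until the barrier arrives on input 1
        op.onElement(Element.record(1, 20));
        op.onElement(Element.barrier(1));     // snapshot taken here, then the 2 is released
        System.out.println("snapshots=" + op.snapshots + ", sum=" + op.currentSum());
    }
}
```

In the usage example the snapshot contains 1 + 10 + 20 and excludes the record 2, which arrived on input 0 after that input's barrier.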

Operator State

User-defined state
• Flink's transformations are long-running operators
• Feel free to keep objects around
• Hooks to include them in the system's checkpoint

Windowed streams
• Time, count, and data-driven windows
• Managed by the system

Batch on Streaming

Run a bounded stream (data set) on a stream processor.

Bounded data set · Unbounded data stream

Infinite Streams: Stream Windows · Pipelined Data Exchange
Finite Streams: Global View · Pipelined or Blocking Data Exchange

Batch Pipelines

Data exchange is mostly streamed

Some operators block (e.g. sort, hash table)
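The difference between a streamed and a blocking operator can be shown with plain Java iterators. This is an illustrative sketch, not Flink's runtime; all names are hypothetical. The map-like operator forwards each record as soon as it is read, whereas the sort has to read its entire input before it can emit the first record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

final class PipelinedVsBlocking {

    /** Source that logs every record it hands out. */
    static Iterator<Integer> source(List<Integer> data, List<String> log) {
        final Iterator<Integer> it = data.iterator();
        return new Iterator<Integer>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public Integer next() {
                Integer v = it.next();
                log.add("read " + v);
                return v;
            }
        };
    }

    /** Pipelined operator: transforms and forwards one record per pull. */
    static Iterator<Integer> doubled(final Iterator<Integer> in) {
        return new Iterator<Integer>() {
            @Override public boolean hasNext() { return in.hasNext(); }
            @Override public Integer next() { return in.next() * 2; }
        };
    }

    /** Blocking operator: consumes its whole input before the first record can be emitted. */
    static Iterator<Integer> sorted(Iterator<Integer> in) {
        List<Integer> all = new ArrayList<>();
        while (in.hasNext()) {
            all.add(in.next());
        }
        Collections.sort(all);
        return all.iterator();
    }

    /** Runs source -> doubled [-> sorted] and returns the interleaving of reads and emits. */
    static List<String> run(List<Integer> data, boolean withSort) {
        List<String> log = new ArrayList<>();
        Iterator<Integer> out = doubled(source(data, log));
        if (withSort) {
            out = sorted(out);
        }
        while (out.hasNext()) {
            log.add("emit " + out.next());
        }
        return log;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(3, 1, 2);
        System.out.println("pipelined: " + run(data, false));
        System.out.println("with sort: " + run(data, true));
    }
}
```

In the pipelined run, reads and emits alternate; with the sort in place, all reads happen before the first emit.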

DataSet API

ExecutionEnvironment env =
    ExecutionEnvironment.getExecutionEnvironment();

DataSet<String> data = env.fromElements(
    "O Romeo, Romeo! wherefore art thou Romeo?", ...);

// DataSet WordCount
// (SplitByWhitespace is a user-defined FlatMapFunction, not shown on the slide)
DataSet<Tuple2<String, Integer>> counts = data
    .flatMap(new SplitByWhitespace())   // (word, 1)
    .groupBy(0)                         // [word, [1, 1, …]]
    .sum(1);                            // sum per word for all occurrences

counts.print();

Batch-specific optimizations

Managed memory
• On- and off-heap memory
• Internal operators (e.g. join or sort) with out-of-core support
• Serialization stack for user types

Cost-based optimizer
• Program adapts to changing data size

Getting Started

Project Page: http://flink.apache.org
Quickstarts: Java & Scala API
Docs: Programming Guides
Get Involved: Mailing Lists, Stack Overflow, IRC, …

Blogs http://flink.apache.org/blog http://data-artisans.com/blog

Twitter @ApacheFlink

Mailing lists (news|user|dev)@flink.apache.org
