Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA


Apache Flink™ Deep-Dive: Unified Batch and Stream Processing

Robert Metzger (@rmetzger_)

Hadoop Summit 2015, San Jose, CA

Flink’s Recent History

[Timeline: April 2014 → Dec 2014 → April 2015]
Dec 2014: Top-Level Project graduation
Releases along the way: 0.5, 0.6, 0.7, 0.9-m1, 0.9

What is Flink?

[Diagram: the Flink software stack]
• APIs & libraries: DataSet (Java/Scala), DataStream, Gelly, Table, ML, SAMOA, Hadoop M/R, Dataflow (WiP), MRQL, Cascading (WiP), Zeppelin
• Core: streaming dataflow runtime
• Deployment: Local, Remote, YARN, Tez, Embedded

Program compilation


case class Path(from: Long, to: Long)

val tc = edges.iterate(10) { paths: DataSet[Path] =>
  val next = paths
    .join(edges)
    .where("to")
    .equalTo("from") { (path, edge) => Path(path.from, edge.to) }
    .union(paths)
    .distinct()
  next
}
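A rough, Flink-free sketch of what this transitive-closure program computes, using plain Scala collections (the edge set here is made up for illustration; in the slide it is a Flink DataSet[Path]):

```scala
case class Path(from: Long, to: Long)

// Hypothetical edge set; stands in for the Flink DataSet[Path].
val edges = Set(Path(1L, 2L), Path(2L, 3L), Path(3L, 4L))

// Mirror of edges.iterate(10) { ... }: each round joins the current
// paths with the edges, unions the result in, and de-duplicates.
val tc = (1 to 10).foldLeft(edges) { (paths, _) =>
  paths ++ (for {
    p <- paths
    e <- edges
    if p.to == e.from          // .where("to").equalTo("from")
  } yield Path(p.from, e.to))  // Path(path.from, edge.to)
  // Set union already gives us .union(paths).distinct()
}
```

With three chained edges this converges to six paths well before the tenth round; Flink runs the same join-union-distinct rounds as a native dataflow iteration.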

[Diagram: program compilation and execution]
Pre-flight (client): program → type extraction stack → optimizer → dataflow graph (independent of batch or streaming job)
Master: task scheduling, dataflow metadata; deploys operators, tracks intermediate results
Workers execute the plan (e.g. DataSource orders.tbl → Filter, DataSource lineitem.tbl → Map, Hybrid Hash Join with build HT / probe sides, hash-partitioned [0]; GroupRed with sort)

The layered architecture allows plugging in components.

Native workload support

[Diagram: workloads around the Flink engine]
Workloads: streaming topologies, long batch pipelines, machine learning at scale, graph analysis
Requirements: low latency, mutable state, resource utilization, iterative algorithms

How can an engine natively support all these workloads? And what does "native" mean?

E.g.: Non-native iterations

[Diagram: the client drives step after step as separate jobs]

  for (int i = 0; i < maxIterations; i++) {
    // Execute MapReduce job
  }

Teaching an old elephant new tricks: treat the system as a black box.

E.g.: Non-native streaming

[Diagram: a stream discretizer cuts the data stream into a series of small batch jobs]

  while (true) {
    // get next few records
    // issue batch job
  }

Simulate a stream processor with a batch system.

Native workload support

[Diagram: workloads around the Flink engine, repeated]
Workloads: streaming topologies, long batch pipelines, machine learning at scale, graph analysis
Requirements: low latency, mutable state, resource utilization, iterative algorithms

How can an engine natively support all these workloads? And what does "native" mean?

Ingredients for “native” support

1. Execute everything as streams: pipelined execution, push model
2. Special code paths for batch: automatic job optimization, fault tolerance
3. Allow some iterative (cyclic) dataflows
4. Allow some mutable state
5. Operate on managed memory

Make data processing on the JVM robust


Flink by Use Case


Stream data processing: streaming dataflows

Full talk tomorrow: "Stream processing with Flink", 3:10 PM, Grand Ballroom 220A

Pipelined stream processor

[Diagram: streaming shuffle]
Low latency: operators push data forward.

Expressive APIs

  case class Word(word: String, frequency: Int)

DataStream API (streaming):

  val lines: DataStream[String] = env.fromSocketStream(...)

  lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
       .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
       .groupBy("word").sum("frequency")
       .print()

DataSet API (batch):

  val lines: DataSet[String] = env.readTextFile(...)

  lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
       .groupBy("word").sum("frequency")
       .print()
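The batch word count above can be imitated with plain Scala collections to show what the pipeline produces (input lines invented for the example; no Flink APIs involved):

```scala
case class Word(word: String, frequency: Int)

// Hypothetical input; the slide reads it from a file or socket.
val lines = List("to be or", "not to be")

// Mirrors flatMap -> groupBy("word") -> sum("frequency")
val counts = lines
  .flatMap(line => line.split(" ").map(w => Word(w, 1)))
  .groupBy(_.word)
  .map { case (w, ws) => Word(w, ws.map(_.frequency).sum) }
  .toList
```

The streaming variant runs the same per-word aggregation, just scoped to each sliding window instead of the whole input.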

Checkpointing / Recovery


Chandy-Lamport algorithm for consistent asynchronous distributed snapshots. Pushes checkpoint barriers through the data flow.

[Diagram: a barrier flowing with the data stream]
Before the barrier = part of the snapshot. After the barrier = not in the snapshot (backed up until the next snapshot).

Guarantees exactly-once processing
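A toy illustration of the barrier rule (everything before the barrier belongs to the snapshot, everything after does not); the element types here are invented, not Flink's:

```scala
sealed trait Element
case class Record(value: Int) extends Element
case object Barrier extends Element

// Toy stream with one checkpoint barrier in it.
val stream: List[Element] =
  List(Record(1), Record(2), Barrier, Record(3), Record(4))

val barrierPos = stream.indexOf(Barrier)
// Records before the barrier are covered by this snapshot...
val inSnapshot   = stream.take(barrierPos).collect { case Record(v) => v }
// ...records after it belong to the next one.
val afterBarrier = stream.drop(barrierPos + 1).collect { case Record(v) => v }
```

On recovery, the job is reset to the last completed snapshot and records after its barrier are replayed, which is what yields the exactly-once guarantee.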

Batch processing: Batch on Streaming

Batch on a streaming engine

[Diagram: File in HDFS → Filter → Map → Result 1; Map → Result 2]

Batch program, completely pipelined. Data is never materialized anywhere (in this example).

Batch on a streaming engine

[Diagram 1: a small data source streams through parallel Map operators into data sinks]
[Diagram 2: joining a small and a large input: the small side is streamed into the join's hash table in parallel (build side); once the build side is finished, the large side streams through a Map and probes the hash table in parallel (probe side)]
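A miniature build/probe sketch of that idea with plain Scala maps (the data is invented): the small side is materialized into a hash table first, then the large side streams through it record by record:

```scala
// Build side (small input): key -> payload.
val buildSide = List((1, "a"), (2, "b"))
// Probe side (large input): streamed through after the build finishes.
val probeSide = List((1, 10), (2, 20), (2, 21), (3, 30))

val hashTable = buildSide.toMap                  // build phase
val joined = probeSide.flatMap { case (k, v) =>  // probe phase
  hashTable.get(k).map(b => (k, b, v))           // drop non-matching keys
}
```

Only the build side is ever held in memory; the probe side never needs to be materialized, which is what makes the pipelined execution above possible.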

Batch processing requirements

Get the data processed as fast as possible
• Automatic job optimizer
• Efficient memory management

Robust processing
• Provide fault tolerance
• Again, memory management

Optimizer
• Cost-based optimizer
• Selects the data shipping strategy (forward, partition, broadcast)
• Selects the local execution strategy (sort-merge join / hash join)
• Caches loop-invariant data (iterations)

case class Path(from: Long, to: Long)

val tc = edges.iterate(10) { paths: DataSet[Path] =>
  val next = paths
    .join(edges)
    .where("to")
    .equalTo("from") { (path, edge) => Path(path.from, edge.to) }
    .union(paths)
    .distinct()
  next
}

[Diagram: pre-flight (client) turns the program, via the type extraction stack and the optimizer, into a dataflow graph; the example plan joins orders.tbl (Filter) with lineitem.tbl (Map) in a hybrid hash join (build HT / probe, hash-partitioned [0]) followed by a sorted GroupRed]


Two execution plans

[Diagram: two execution plans for the same program]
Plan A: broadcast one input and forward the other into the hybrid hash join (build HT / probe), combine, then sorted GroupRed.
Plan B: hash-partition both inputs on [0] into the hybrid hash join, then hash-partition on [0,1] into the sorted GroupRed.
The best plan depends on the relative sizes of the input files.
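A toy cost model for that choice; the formulas and numbers are invented, only the shape of the trade-off matches the slide: broadcasting is cheap when one side is tiny, repartitioning wins when both sides are comparable.

```scala
// Broadcasting ships the small side to every parallel instance;
// repartitioning ships both sides across the network once.
def broadcastCost(smallSize: Long, parallelism: Int): Long =
  smallSize * parallelism

def repartitionCost(smallSize: Long, largeSize: Long): Long =
  smallSize + largeSize

def pickShipping(smallSize: Long, largeSize: Long, parallelism: Int): String =
  if (broadcastCost(smallSize, parallelism) < repartitionCost(smallSize, largeSize))
    "broadcast-forward"
  else
    "hash-partition both"
```

Flink's real optimizer estimates these sizes from source statistics, which is why the chosen plan can change when the input files change.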


Memory Management

Operators on managed memory


Smooth out-of-core performance

More at: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html

[Chart: blue bars are in-memory runs, orange bars (partially) out-of-core]

Machine Learning Algorithms: Iterative data flows

Iterate in the Dataflow

• API and runtime support
• Automatic caching of loop-invariant data

  IterationState state = getInitialState();
  while (!terminationCriterion()) {
    state = step(state);
  }
  setFinalState(state);
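The pseudocode above, written out as a runnable loop with made-up step and termination functions (the point of the slide is that Flink embeds this loop in the dataflow instead of driving it from the client):

```scala
// Hypothetical state and step: repeatedly halve until we reach 1.
def step(state: Int): Int = state / 2
def terminationCriterion(state: Int): Boolean = state <= 1

var state = 1024            // getInitialState()
while (!terminationCriterion(state)) {
  state = step(state)       // one iteration of the dataflow
}
// setFinalState(state): state now holds the final result.
```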

Example: Matrix Factorization

Factorizing a matrix with 28 billion ratings for recommendations

More at: http://data-artisans.com/computing-recommendations-with-flink.html

Setups:
• 40 medium instances ("n1-highmem-8": 8 cores, 52 GB)
• 40 large instances ("n1-highmem-16": 16 cores, 104 GB)


Flink ML – Machine Learning: provide a complete toolchain
• scikit-learn style pipelining
• Data pre-processing

Various algorithms:
• Recommendations: ALS
• Supervised learning: Support Vector Machines
• …

ML on streams: SAMOA. We are planning to add streaming support to Flink ML.

Graph Analysis: Stateful Iterations

Graph processing characteristics

[Plot: number of elements updated vs. iteration]

Iterate natively with state/deltas
• Keep state in a controlled way in a partitioned hash map
• Relax the immutability assumption of batch processing
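A toy delta iteration in plain Scala (the graph and distance relaxation are invented for illustration): the solution set is the partitioned state, and only the workset of changed vertices is re-processed each round, so work shrinks as the iteration converges.

```scala
// Tiny directed graph: vertex -> outgoing neighbors.
val edges = Map(1 -> List(2, 3), 2 -> List(4), 3 -> List(4), 4 -> Nil)

var solution = Map(1 -> 0)  // state: vertex -> distance from vertex 1
var workset  = Set(1)       // only changed vertices get re-processed

while (workset.nonEmpty) {
  // Propose new distances only from vertices that changed last round.
  val candidates = for {
    v <- workset.toList
    n <- edges(v)
  } yield n -> (solution(v) + 1)

  // Keep only proposals that improve on the current state.
  val updates = candidates
    .filter { case (n, d) => solution.get(n).forall(_ > d) }
    .toMap

  solution ++= updates          // mutate the partitioned state in place
  workset = updates.keySet      // next round's delta
}
```

This is the access pattern the slide describes: the hash-map state is updated in place instead of re-materializing the whole solution every iteration.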


… fast graph analysis

More at: http://data-artisans.com/data-analysis-with-flink.html


Gelly – Graph Processing API

• Transformations: map, filter, subgraph, union, reverse, undirected
• Mutations: add vertex/edge, remove, …
• Pregel-style vertex-centric iterations
• Library of algorithms
• Utilities: special data types, loading, graph properties

Gelly and Flink ML:

• Available in Flink 0.9 (so far only a beta release)
• Still under heavy development
• Seamlessly integrate with the DataSet abstraction: preprocess data as needed, use results as needed
• Easy entry point for new contributors


Closing


Flink Meetup Groups

• SF Spark and Friends: June 16, San Francisco
• Bay Area Flink Meetup: June 17, Redwood City
• Chicago Flink Meetup: June 30
• Stockholm, Sweden
• Berlin, Germany


Flink Forward registration & call for abstracts is open now

• 12/13 October 2015
• Meet developers and users of Flink!
• With Flink workshops / trainings!

flink.apache.org
@ApacheFlink
