
#Dataflow @martin_gorner

“No one at Google uses MapReduce anymore”

Cloud Dataflow, the Dataflow model and

parallel processing in 2016.

Martin Görner

#Dataflow @martin_gorner

Before we start...

(Quick Google BigQuery demo)

#Dataflow @martin_gorner

FlumeJava (2010), MillWheel (2013), The Dataflow Model (2015)

#Dataflow @martin_gorner

Streaming pipeline

Example: hashtag auto-complete (Tweets in, Predictions out)

read: "#argentina scores", "my #art project", "watching #armenia vs #argentina"

ExtractTags: #argentina #art #armenia #argentina

Count: (argentina, 5M) (art, 9M) (armenia, 2M)

ExpandPrefixes: a->(argentina, 5M) ar->(argentina, 5M) arg->(argentina, 5M) ar->(art, 9M) ...

Top(3): a->[apple, art, argentina] ar->[art, argentina, armenia]

write

Pipeline p = Pipeline.create();

p.apply(PubSub.Read...)
 .apply(Window.into(...))
 .apply(ParDo.of(new ExtractTags()))
 .apply(Count.create())
 .apply(ParDo.of(new ExpandPrefixes()))
 .apply(Top.largestPerKey(3))
 .apply(PubSub.Write...);

p.run();

#Dataflow @martin_gorner

MapReduce?

[Diagram: three Map workers (M) feeding two Reduce workers (R).]


#Dataflow @martin_gorner

Associative example: count

[Diagram: counting as MapReduce: each mapper emits a 1 per element, and because addition is associative each worker can pre-sum its own 1s (partial sums such as 3, 3, 2, 3, 4, 3) in a combined M+R step before the reduce side adds the partial sums into the final counts.]
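Why associativity matters, as a minimal plain-Java sketch (the data and the split across workers are invented for illustration): pre-summing on each map worker and then summing the partial results gives the same answer as summing everything in a single reducer.

import java.util.Arrays;
import java.util.List;

public class AssociativeCount {
    public static void main(String[] args) {
        // Each inner list stands for the 1s emitted by one map worker.
        List<List<Integer>> onesPerWorker = Arrays.asList(
                Arrays.asList(1, 1, 1),
                Arrays.asList(1, 1, 1, 1),
                Arrays.asList(1, 1));

        // Map-side combine: every worker pre-sums its own 1s (partial sums 3, 4, 2)...
        int total = onesPerWorker.stream()
                .mapToInt(ones -> ones.stream().mapToInt(Integer::intValue).sum())
                .sum();  // ...and the reduce side only has to add the partial sums.

        System.out.println(total);  // 9, the same as summing all nine 1s in one place
    }
}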

#Dataflow @martin_gorner

Dataflow programming model

PCollection<T>

ParDo(DoFn)   (see the DoFn sketch below)
  in:  PCollection<T>
  out: PCollection<S>

GroupByKey
  in:  PCollection<KV<K,V>>                // multimap
  out: PCollection<KV<K,Collection<V>>>    // unimap

Combine(CombineFn)
  in:  PCollection<KV<K,Collection<V>>>    // unimap
  out: PCollection<KV<K,V>>                // unimap

Flatten
  in:  PCollection<T>, PCollection<T>, ...
  out: PCollection<T>
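To make ParDo concrete, here is a minimal DoFn in the style of the pre-Beam Cloud Dataflow Java SDK used in the slide code (imports come from com.google.cloud.dataflow.sdk.*; in today's Apache Beam the method would carry an @ProcessElement annotation instead). The hashtag-extraction logic is illustrative, not taken from the talk.

// One tweet in, zero or more hashtags out.
class ExtractTags extends DoFn<String, String> {
    @Override
    public void processElement(ProcessContext c) {
        for (String word : c.element().split("\\s+")) {
            if (word.startsWith("#")) {
                c.output(word);   // each emitted element lands in the output PCollection<String>
            }
        }
    }
}

// Applied with the ParDo primitive:
// PCollection<String> tags = tweets.apply(ParDo.of(new ExtractTags()));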

#Dataflow @martin_gorner

reinventing Count

Count
  in:  PCollection<T>
  out: PCollection<KV<T,Integer>>          // unimap

ParDo: T => KV<T, 1>

GroupByKey: PCollection<KV<T,Integer>> // multimap => PCollection<KV<T,Collection<Integer>>> // unimap

Combine(Sum): => PCollection<KV<T,Integer>>
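In SDK code the same three steps shrink to a ParDo plus a per-key sum; a hedged sketch in the slides' pre-Beam Java SDK style, assuming an existing PCollection<String> called tags:

// Step 1 - ParDo: T => KV<T, 1>
PCollection<KV<String, Integer>> ones = tags.apply(
    ParDo.of(new DoFn<String, KV<String, Integer>>() {
        @Override
        public void processElement(ProcessContext c) {
            c.output(KV.of(c.element(), 1));
        }
    }));

// Steps 2 and 3 - GroupByKey followed by Combine(Sum), packaged as one transform:
PCollection<KV<String, Integer>> counts = ones.apply(Sum.integersPerKey());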

#Dataflow @martin_gorner

reinventing JOIN

join
  in:  two multimaps with the same key type
       PCollection<KV<K,V>>, PCollection<KV<K,W>>
  out: a unimap
       PCollection<KV<K, Pair<Collection<V>,Collection<W>>>>

ParDo: KV<K,V> => KV<K, TaggedUnion<V,W>>
       KV<K,W> => KV<K, TaggedUnion<V,W>>

Flatten: => PCollection<KV<K, TaggedUnion<V,W>>>                 // multimap

GroupByKey: => PCollection<KV<K, Collection<TaggedUnion<V,W>>>>  // unimap

ParDo (final type transform): => PCollection<KV<K, Pair<Collection<V>,Collection<W>>>>
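The SDK packages this recipe as CoGroupByKey, with TupleTags playing the role of the tagged union; a hedged sketch in which the PCollections cartsBySession and viewsBySession and their value types Cart and PageView are invented for illustration:

final TupleTag<Cart> cartTag = new TupleTag<>();
final TupleTag<PageView> viewTag = new TupleTag<>();

// Flatten + GroupByKey over the tagged union, done for us by CoGroupByKey:
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(cartTag, cartsBySession)
                         .and(viewTag, viewsBySession)
                         .apply(CoGroupByKey.<String>create());

// The final ParDo reads both sides of the join for each session id:
// Iterable<Cart> carts     = c.element().getValue().getAll(cartTag);
// Iterable<PageView> views = c.element().getValue().getAll(viewTag);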

#Dataflow @martin_gorner

example: extracting shopping session data

Execution graph

[Diagram: four inputs (payment logs 1, payment logs 2, checkout logs, website logs) flow through ParallelDos A-F plus a flatten and a count, and are joined by session id into a shopping session log (session id, cart contents, payment OK?, page views); the outputs are a formatted checkout log and the page views in each session.]

#Dataflow @martin_gorner

graph optimiser: ParallelDo fusion

Legend: GBK = GroupByKey, + = Combine; the remaining nodes are ParallelDos.

[Diagram: two ParallelDos C and D are fused into a single C+D stage, both when D consumes C's output (consumer-producer fusion) and when C and D read the same input (sibling fusion).]
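What ParallelDo fusion means for user code, as a hedged illustration (LowerCaseTags is a made-up DoFn): two ParDos written as separate logical steps still run as one fused worker stage, so the intermediate PCollection is never materialised.

PCollection<String> tags      = tweets.apply(ParDo.of(new ExtractTags()));
PCollection<String> lowercase = tags.apply(ParDo.of(new LowerCaseTags()));   // hypothetical DoFn

// Consumer-producer fusion: the optimiser runs ExtractTags and LowerCaseTags
// back to back on each element inside a single stage, with no shuffle in between.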

#Dataflow @martin_gorner

graph optimiser: sink Flattens

[Diagram: a Flatten of B's and C's outputs that feeds ParallelDo D is "sunk": a copy of D is applied to each Flatten input instead, and the Flatten moves down to sit directly in front of the GroupByKey, where the shuffle absorbs it.]

#Dataflow @martin_gorner

graph optimiser: identify MR-like blocks

“Map Shuffle Combine Reduce” (MSCR)

[Diagram: map-side ParallelDos M1-M3 read IN1-IN3 and feed GroupByKey + Combine + reduce-side ParallelDos R1 and R2, producing OUT1-OUT3; together they form one MSCR block.]

#Dataflow @martin_gorner

graph optimiser: the shopping-session pipeline

[Diagram: the execution graph from the shopping-session example, with count expanded into its primitives (ParallelDo Cnt, GBK, +) and join expanded into ParallelDos J1 and J2 around a flatten and a GroupByKey.]

#Dataflow @martin_gorner

graph optimiser: sink Flattens

[Diagram: the flatten is pushed down towards the GroupByKey; copies of the ParallelDos that consumed it (D, J1) are applied to each of its inputs instead.]

#Dataflow @martin_gorner

graph optimiser: ParallelDo fusion

[Diagram: chains and siblings of ParallelDos collapse into fused stages A+J1, B+D+J1, C+D+J1, E+J1 and J2+F.]

#Dataflow @martin_gorner

graph optimiser: MapReduce-like blocks

[Diagram: the fused graph decomposes into two MSCR blocks, MSCR1 and MSCR2, one around each of the two GroupByKeys.]

#Dataflow @martin_gorner

graph optimiser

[Diagram: the final parallel plan: many instances of the fused A+B+C+D+E+J1 stage read the input shards IN1-IN4 and feed the shuffles, the Cnt/+ combiners and parallel F+J2 stages, which write OUT1 and OUT2.]

#Dataflow @martin_gorner

Dataflow model

What are you computing?

Where in event time?

When are results emitted?

How do refinements relate?

#Dataflow @martin_gorner

Dataflow model: windowing

Windowing divides data into event-time-based finite chunks.

Often required when doing aggregations over unbounded data.

What Where When How

[Diagram: the three windowing strategies over an event-time axis for keys 1-3: fixed windows, sliding windows and per-key session windows.]
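The three strategies in the diagram correspond to WindowFns in the Java SDK; a hedged sketch with arbitrary example durations, applied to some existing PCollection called input:

// Fixed (tumbling) 2-minute windows:
input.apply(Window.<KV<String, Integer>>into(
        FixedWindows.of(Duration.standardMinutes(2))));

// Sliding 2-minute windows, starting every 30 seconds:
input.apply(Window.<KV<String, Integer>>into(
        SlidingWindows.of(Duration.standardMinutes(2))
                      .every(Duration.standardSeconds(30))));

// Per-key sessions that close after 10 minutes of inactivity:
input.apply(Window.<KV<String, Integer>>into(
        Sessions.withGapDuration(Duration.standardMinutes(10))));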

#Dataflow @martin_gorner

Event-time windowing: out-of-order

What Where When How

[Diagram: the same elements plotted against an event-time axis and a processing-time axis (10:00 to 15:00); input arrives out of order, so output windows are assigned by event time rather than by arrival time.]

#Dataflow @martin_gorner

What Where When How

Watermark: the correct definition of “late data”

[Diagram: event time vs. processing time; the watermark tracks how far event time has progressed, the distance from the ideal line is the skew, and data arriving behind the watermark is late.]

#Dataflow @martin_gorner

What, Where, When, How?

What Where When How

default behaviour:

PCollection<KV<String, Integer>> scores = input        // 'input': the upstream PCollection
    .apply(Sum.integersPerKey());

with windowing, triggering and refinements:

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark().pastEndOfWindow()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1)))
        .withAllowedLateness(Duration.standardHours(1))
        .accumulatingFiredPanes())                      // or .discardingFiredPanes()
    .apply(Sum.integersPerKey());

#Dataflow @martin_gorner

The Dataflow model

What are you computing?      ParDo, GroupByKey, Combine, Flatten

Where in event time?         FixedWindows, SlidingWindows, Sessions

When are results emitted?    triggering: AfterWatermark, AfterProcessingTime, AfterPane.elementCount…
                             withEarlyFirings, withLateFirings, withAllowedLateness

How do refinements relate?   accumulatingFiredPanes, discardingFiredPanes

#Dataflow @martin_gorner

Apache Beam: the “Dataflow model”
Open-source lingua franca for unified batch and streaming data processing

Apache Beam (incubating)

Run it in the cloud: Google Cloud Dataflow

Run it on premise: Apache Flink runner for Beam, Apache Spark runner for Beam

#Dataflow @martin_gorner

Demo time

1 week of data

3M taxi rides

One point every 2s on each ride

Accelerated 8x

20,000 events/s

Streamed from Google Pub/Sub

#Dataflow @martin_gorner

Demo: NYC Taxis

[Architecture: Pub/Sub (event queue) feeds Dataflow; Dataflow publishes results to a second Pub/Sub event queue read by a JavaScript visualisation, and writes to BigQuery for data warehousing and interactive analysis.]
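A hedged sketch of the plumbing such a pipeline would use in the Java SDK of the slides: PubsubIO and BigQueryIO are the SDK's connectors behind the slide's PubSub.Read/Write shorthand, while the topic, table, schema and the FormatRideFn DoFn below are placeholders invented for illustration.

Pipeline p = Pipeline.create(options);            // options: PipelineOptions, assumed built elsewhere

p.apply(PubsubIO.Read.topic("<taxi-rides topic>"))          // placeholder topic name
 .apply(ParDo.of(new FormatRideFn()))                       // hypothetical DoFn: ride event -> TableRow
 .apply(BigQueryIO.Write.to("<project>:<dataset>.<table>")  // placeholder table reference
        .withSchema(rideSchema));                           // TableSchema assumed defined elsewhere

p.run();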

#Dataflow @martin_gorner

cloud.google.com/dataflow
beam.incubator.apache.org

Thank you!

Martin Görner

Google Developer Relations
@martin_gorner