Kenneth Knowles - Apache Beam - A Unified Model for Batch and Streaming Data Processing


Apache Beam (incubating)
A Unified Model for Batch and Streaming Data Processing

Kenneth Knowles
Apache Beam (incubating) PPMC
Software Engineer @ Google
klk@google.com / @KennKnowles

Flink Forward 2016
Slides: https://goo.gl/jzlvD9

What is Apache Beam?

Apache Beam is

a unified programming model

for expressing

efficient and portable

data processing pipelines.

Agenda

1. Big Data: Infinite & Out of Order
2. The Beam Model
3. Beam Project / Technical Vision

1. Big Data: Infinite & Out of Order

[Globe image: https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg]

Unbounded, delayed, out of order

[Timeline diagram: events stamped 8:00 keep arriving throughout the day, hours late and out of order.]

Incoming!

Score per user?

Organizing the stream

[Diagram: grouping the 8:00 events back together.]

Data Processing Tradeoffs

Completeness / Latency / Cost ($$$)

What is important for your application?

[Matrix slides mark each of Completeness, Low Latency, and Low Cost ($$$) as Important or Not Important for three example use cases:]

Monthly Billing

Billing Estimate

Abuse Detection

Choices abound

[Timeline, 2004-2016: MapReduce (paper), FlumeJava (paper), MillWheel (paper), Dataflow Model (paper); Apache Hadoop, Apache Spark, Apache Storm, Apache Samza, Apache Flink, Apache Apex, Apache Gearpump (incubating), Heron, Cloud Dataflow, Apache Beam (incubating).]

See also: Tyler Akidau's Evolution of Massive-Scale Data Processing (goo.gl/VlVAEp)

2. The Beam Model

A Beam Pipeline is a graph of PTransforms operating on PCollections, which may be bounded or unbounded.
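A minimal sketch of how these pieces fit together in the Java SDK (not from the talk; the bucket path is a placeholder):

  Pipeline p = Pipeline.create(options);                    // the Pipeline

  PCollection<String> lines =                               // a PCollection (bounded here,
      p.apply(TextIO.Read.from("gs://my-bucket/*.txt"));    // since it comes from files)

  lines.apply(Count.perElement());                          // PTransforms build up the graph

  p.run();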

The Beam Vision (for users)

Sum Per Key

Java:
  input.apply(Sum.integersPerKey())

Python:
  input | Sum.PerKey()

The same pipeline runs on Apache Flink, Apache Spark, Cloud Dataflow, Apache Gearpump (incubating), Apache Apex, …

What your (Java) Code Looks Like

Pipeline p = Pipeline.create(options);

p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 .apply(FlatMapElements.via(line -> Arrays.asList(line.split("[^a-zA-Z']+"))))
 .apply(Filter.byPredicate(word -> !word.isEmpty()))
 .apply(Count.perElement())
 .apply(MapElements.via(count -> count.getKey() + ": " + count.getValue()))
 .apply(TextIO.Write.to("gs://..."));

p.run();

The Beam Model: Asking the Right Questions

What are you computing? (Aggregations, transformations, ...)
Where in event time?
When in processing time are results produced?
How do refinements relate?

The Beam Model: What are you computing?

Sum Per User


Sum Per Key

Java:
  input.apply(Sum.integersPerKey())
       .apply(BigQueryIO.Write.to(...));

Python:
  input | Sum.PerKey() | Write(BigQuerySink(...))

Kinds of PTransform (a per-element sketch follows below):

Per element (ParDo)
Grouping (Group/Combine Per Key)
Composite
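A minimal sketch of the per-element case (not from the talk; GameEvent and its getters are hypothetical, and depending on SDK version the DoFn method is annotated with @ProcessElement or overrides processElement directly):

  // Per-element transform: emit a (user, score) pair for each event.
  PCollection<KV<String, Integer>> userScores = events.apply(
      ParDo.of(new DoFn<GameEvent, KV<String, Integer>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          GameEvent e = c.element();
          c.output(KV.of(e.getUser(), e.getScore()));
        }
      }));

  // The grouping step from the slides then finishes the job:
  userScores.apply(Sum.integersPerKey());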


The Beam Model: Asking the Right Questions

What are you computing?

Where in event time?

When in processing time are results produced?

How do refinements relate?


Event time windowing


The Beam Model: Where in Event Time?

Processing Time vs Event Time

Event Time = Processing Time ??

[Plot: event time on one axis, processing time on the other. The ideal "realtime" case, where every element is processed the instant it occurs, is the diagonal; this is not possible.]

[Plot: in practice every element arrives with some processing delay, and some elements arrive very delayed.]

Processing Time windows (probably are not what you want)

[Plot: windows drawn along the processing-time axis, grouping elements by when they arrive rather than when they occurred.]

Event Time Windows

[Plot: windows drawn along the event-time axis, grouping elements by when they occurred, no matter when they arrive.]

(Implementing processing-time windows in this model: just throw away your data's timestamps and replace them with "now()".)

The Beam Model: Where in Event Time?

Window Into + Sum Per Key

Java:
  input
      .apply(Window.into(
          FixedWindows.of(Duration.standardHours(1))))
      .apply(Sum.integersPerKey())
      .apply(BigQueryIO.Write.to(...));

Python:
  input | WindowInto(FixedWindows(3600))
        | Sum.PerKey()
        | Write(BigQuerySink(...))

Kinds of event-time windows: Fixed Windows (also called Tumbling), Sliding Windows, User Sessions.

The Beam Model: Where in Event Time?

Windowing does two things (sketches of the other window kinds follow below):
1. Assign each timestamped event to one or more windows
2. Merge those windows according to custom logic
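As a sketch of the other window kinds in the Java SDK (the durations are illustrative, not from the talk):

  // Sliding windows: one hour long, starting every ten minutes
  input.apply(Window.into(
      SlidingWindows.of(Duration.standardHours(1))
                    .every(Duration.standardMinutes(10))));

  // Session windows: events for a key that fall within ten minutes
  // of each other in event time are merged into one session
  input.apply(Window.into(
      Sessions.withGapDuration(Duration.standardMinutes(10))));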

So that's what and where...


The Beam Model: Asking the Right Questions

What are you computing?

Where in event time?

When in processing time are results produced?

How do refinements relate?


Watermarks & Triggers

Event time windows

[Plot: event-time windows on the event-time / processing-time plane. When, in processing time, should each window's result be emitted?]

Fixed cutoff (we can do better)

[Plot: wait a fixed allowed delay past the end of each window before emitting; many concurrent windows are open at once.]

Perfect watermark

[Plot: a perfect watermark tracks exactly how complete the input is in event time; each window can fire as soon as the watermark passes its end.]

Check out Slava's slides from his Strata London 2016 talk on watermarks: https://goo.gl/K4FnqQ

Heuristic Watermark

[Plots, advancing with the current processing time: a heuristic watermark only estimates completeness, so some data can still arrive behind it; that data is late.]

Watermarks measure completeness

✔ Monthly billing
? Running Total
? Abuse Detection

Waiting for the watermark suits monthly billing; a running total or abuse detection needs results sooner.

The Beam Model: When in Processing Time?

Window Into + Trigger + Sum Per Key

Java:
  input
      .apply(Window.into(FixedWindows.of(...))
          .triggering(AfterWatermark.pastEndOfWindow()))
      .apply(Sum.integersPerKey())
      .apply(BigQueryIO.Write.to(...));

Python:
  input | WindowInto(FixedWindows(3600),
                     trigger=AfterWatermark())
        | Sum.PerKey()
        | Write(BigQuerySink(...))

Proc

essi

ng T

ime

Event Time

AfterWatermark.pastEndOfWindow()

48

Current processing time

Proc

essi

ng T

ime

Event Time49

AfterWatermark.pastEndOfWindow()

Proc

essi

ng T

ime

Event Time

Late data

50

Current processing time

AfterWatermark.pastEndOfWindow()

Proc

essi

ng T

ime

Event Time51

High completeness

Potentially high latency

Low cost

AfterWatermark.pastEndOfWindow()

$$$

Repeatedly.forever(AfterPane.elementCountAtLeast(2)): trigger repeatedly on element count

[Plots: each window fires again whenever two more elements have arrived, regardless of the watermark.]

Low completeness, low latency, cost driven by input ($$$).

Build a finely tuned trigger for your use case:

AfterWatermark.pastEndOfWindow()                           // bill at end of month
    .withEarlyFirings(
        AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))     // near real-time estimates
    .withLateFirings(AfterPane.elementCountAtLeast(1))     // immediate corrections
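Plugged into a full windowing declaration this might look as follows; a sketch, with the window size and allowed lateness chosen for illustration (the Java SDK does require an allowed lateness and an accumulation mode once a trigger is set):

  input
      .apply(Window.<KV<String, Integer>>into(
              FixedWindows.of(Duration.standardDays(30)))           // a "month" of billing
          .triggering(AfterWatermark.pastEndOfWindow()
              .withEarlyFirings(AfterProcessingTime
                  .pastFirstElementInPane()
                  .plusDelayOf(Duration.standardMinutes(1)))
              .withLateFirings(AfterPane.elementCountAtLeast(1)))
          .withAllowedLateness(Duration.standardDays(7))            // how long to accept late data
          .accumulatingFiredPanes())                                // refinements replace earlier estimates
      .apply(Sum.integersPerKey());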

.withEarlyFirings(after 1 minute)
.withLateFirings(ASAP after each element)

[Plots: speculative early firings roughly once a minute, an on-time firing when the watermark passes the end of the window, and late output as soon as any late data arrives.]

Early results: low completeness, low latency, low cost driven by time ($$$); the on-time and late firings then fill in completeness and corrections.

Trigger Catalogue

Basic triggers:
  AfterEndOfWindow()
  AfterCount(n)
  AfterProcessingTimeDelay(Δ)

Composite triggers:
  AfterEndOfWindow()
    .withEarlyFirings(A)
    .withLateFirings(B)
  AfterAny(A, B)
  AfterAll(A, B)
  Repeat(A)
  Sequence(A, B)
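In the Java SDK these model-level names correspond roughly to the following classes (my reading of the SDK, not from the talk; n is a count, delay a Duration, and a, b are sub-triggers):

  AfterWatermark.pastEndOfWindow()                 // AfterEndOfWindow()
  AfterPane.elementCountAtLeast(n)                 // AfterCount(n)
  AfterProcessingTime.pastFirstElementInPane()
      .plusDelayOf(delay)                          // AfterProcessingTimeDelay(Δ)

  AfterWatermark.pastEndOfWindow()
      .withEarlyFirings(a)
      .withLateFirings(b)                          // AfterEndOfWindow().withEarlyFirings(A).withLateFirings(B)
  AfterFirst.of(a, b)                              // AfterAny(A, B)
  AfterAll.of(a, b)                                // AfterAll(A, B)
  Repeatedly.forever(a)                            // Repeat(A)
  AfterEach.inOrder(a, b)                          // Sequence(A, B)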

The Beam Model: Asking the Right Questions

What are you computing?

Where in event time?

When in processing time are results produced?

How do refinements relate?


Accumulation Mode

The Beam Model: How do refinements relate?

[Diagram: the same window fires several times as an example stream arrives; the values 5, 7, 14, 25 appear across the successive firings.]

Window.into(...)
  .triggering(...)
  .discardingFiredPanes()     // each pane contains only the elements that arrived since the last firing

Window.into(...)
  .triggering(...)
  .accumulatingFiredPanes()   // each pane contains the running result over everything seen so far
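A concrete sketch of the difference (the values, trigger, and windowing here are hypothetical, not from the talk; scores is assumed to be a PCollection<KV<String, Integer>> of per-user points):

  // Fire after every element so the contrast between modes is visible.
  Window<KV<String, Integer>> windowing =
      Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardHours(1)))
          .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
          .withAllowedLateness(Duration.standardHours(1));

  // Discarding: if a user's scores 3, 4, 5 arrive one at a time,
  // successive panes hold 3, then 4, then 5 (only the new input).
  PCollection<KV<String, Integer>> deltas =
      scores.apply(windowing.discardingFiredPanes())
            .apply(Sum.integersPerKey());

  // Accumulating: the same arrivals produce 3, then 7, then 12;
  // each pane is the running total so far and replaces the previous one.
  PCollection<KV<String, Integer>> runningTotals =
      scores.apply(windowing.accumulatingFiredPanes())
            .apply(Sum.integersPerKey());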

The Beam Model: Asking the Right Questions

What are you computing?

Where in event time?

When in processing time are results produced?

How do refinements relate?


3. Beam Project / Technical Vision

Dataflow → Beam: GoogleCloudPlatform/DataflowJavaSDK, cloudera/spark-dataflow, and dataArtisans/flink-dataflow came together as apache/incubator-beam.

Contributors from: Google, Data Artisans, Cloudera, Talend, Paypal, Spotify, Intel, Twitter, Capital One, DataTorrent, …, <your org here>

The Beam Vision

End users - who want to write pipelines in a language that's familiar.

SDK authors - who want to make Beam concepts available in new languages.

Runner authors - who have a distributed processing environment and want to run Beam pipelines.

Beam Runner API: build and submit a pipeline.
Beam Fn API: invoke user-definable functions.

[Architecture diagram: Beam Java, Beam Python, and other language SDKs target the Runner API; runners execute pipelines on Apache Flink, Apache Spark, Cloud Dataflow, Apache Apex, and Apache Gearpump (incubating), calling user code through the Fn API.]
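For example, choosing a runner is typically just a pipeline option, so the same code runs anywhere a runner exists (a sketch; exact runner names vary by Beam version):

  // Pass e.g. --runner=FlinkRunner, --runner=SparkRunner, or --runner=DataflowRunner
  // on the command line (names shown are illustrative).
  PipelineOptions options =
      PipelineOptionsFactory.fromArgs(args).withValidation().create();

  // The pipeline itself is unchanged from runner to runner.
  Pipeline p = Pipeline.create(options);
  p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
   .apply(Count.perElement());
  p.run();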

Outlook

[Roadmap timeline (Feb 2016 onward): Dataflow Java 1.x → Apache Beam Java 0.x → Apache Beam Java 2.x, with a "we are here" marker and releases classed as bug fix, feature, or breaking change.]

Capability Matrix

http://beam.apache.org/learn/runners/capability-matrix/

Why Apache Beam?

Unified - One model handles batch and streaming use cases.

Portable - Pipelines can be executed on multiple execution environments, avoiding lock-in.

Extensible - Supports user and community driven SDKs, Runners, transformation libraries, and IO connectors.

http://data-artisans.com/why-apache-beam/

"We firmly believe that the Beam model is the correct programming model for streaming and batch data processing."

- Kostas Tzoumas (Data Artisans)

https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective

"We hope it will lead to a healthy ecosystem of sophisticated runners that compete by making users happy, not [via] API lock in."

- Tyler Akidau (Google)

END
