Apache Beam (incubating)
A Unified Model for Batch and Streaming Data Processing

Kenneth Knowles
Apache Beam (incubating) PPMC
Software Engineer @ Google
[email protected] / @KennKnowles

Flink Forward 2016
https://goo.gl/jzlvD9

Kenneth Knowles - Apache Beam - A Unified Model for Batch and Streaming Data Processing


What is Apache Beam?

Apache Beam is a unified programming model for expressing efficient and portable data processing pipelines.

Agenda

1. Big Data: Infinite & Out of Order
2. The Beam Model
3. Beam Project / Technical Vision

1. Big Data: Infinite & Out of Order

Image: https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg

Unbounded, delayed, out of order

[Figure: timeline running from 1:00 to 14:00, with several elements labeled 8:00]

Incoming!

Score per user?

Organizing the stream

[Figure: stream elements organized by time; labels read 8:00]

Data Processing Tradeoffs

Completeness, Latency, Cost ($$$)

What is important for your application?

[Figure: Completeness, Low Latency, and Low Cost ($$$), each rated on a scale from "Not Important" to "Important". The chart is shown for three example applications: Monthly Billing, Billing estimate, and Abuse Detection.]

Choices abound

[Figure: timeline from 2004 to 2016 showing MapReduce (paper), Apache Hadoop, FlumeJava (paper), MillWheel (paper), Dataflow Model (paper), Heron, Apache Spark, Apache Storm, Apache Samza, Apache Flink, Apache Apex, Apache Gearpump (incubating), Cloud Dataflow, and Apache Beam (incubating)]

See also: Tyler Akidau's Evolution of Massive-Scale Data Processing (goo.gl/VlVAEp)

2. The Beam Model

The Beam Model

Pipeline
PTransform
PCollection (bounded or unbounded)

The Beam Vision (for users)

Sum Per Key

Java:
input.apply(Sum.integersPerKey())

Python:
input | Sum.PerKey()

Runs on: Apache Flink, Apache Spark, Cloud Dataflow, Apache Gearpump (incubating), Apache Apex, ...

What your (Java) Code Looks Like

Pipeline p = Pipeline.create(options);

// Read lines, split them into words, drop empty words,
// count each distinct word, format the counts, and write them out.
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 .apply(FlatMapElements.via(line -> Arrays.asList(line.split("[^a-zA-Z']+"))))
 .apply(Filter.byPredicate(word -> !word.isEmpty()))
 .apply(Count.perElement())
 .apply(MapElements.via(count -> count.getKey() + ": " + count.getValue()))
 .apply(TextIO.Write.to("gs://..."));

p.run();
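To make the data flow of the word-count pipeline concrete, here is a minimal sketch of the same steps in plain Python, with no Beam SDK involved. The function name `word_count` and the in-memory input are assumptions made for illustration; a real pipeline would run these steps in parallel on a runner.

```python
import re
from collections import Counter


def word_count(lines):
    """Plain-Python stand-in for the pipeline steps (hypothetical helper)."""
    words = []
    for line in lines:
        # Same splitting rule as the slide: break on anything that is
        # not a letter or an apostrophe.
        words.extend(re.split(r"[^a-zA-Z']+", line))
    # Drop the empty strings that splitting can produce.
    words = [w for w in words if w]
    # Count occurrences of each distinct word.
    counts = Counter(words)
    # Format each count as "word: n"; sorted only to make output stable.
    return sorted(f"{word}: {n}" for word, n in counts.items())


if __name__ == "__main__":
    for row in word_count(["to be, or not to be"]):
        print(row)
```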

The Beam Model: Asking the Right Questions

What are you computing? (Aggregations, transformations, ...)
Where in event time?
When in processing time are results produced?
How do refinements relate?

The Beam Model: What are you computing?

Sum Per User

Sum Per Key

Java:
input.apply(Sum.integersPerKey())
     .apply(BigQueryIO.Write.to(...));

Python:
input | Sum.PerKey() | Write(BigQuerySink(...))
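As a rough illustration of what "Sum Per Key" computes, the sketch below does the same aggregation in plain Python over an in-memory list. The helper name `sum_per_key` and the sample user names are assumptions, not part of any Beam SDK.

```python
from collections import defaultdict


def sum_per_key(pairs):
    """Sum the integer values of (key, value) pairs, grouped by key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    scores = [("ann", 3), ("bob", 4), ("ann", 5)]
    print(sum_per_key(scores))
```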

The Beam Model: What are you computing?

Per element (ParDo)
Grouping (Group/Combine Per Key)
Composite


The Beam Model: Asking the Right Questions

What are you computing?
Where in event time? (Event time windowing)
When in processing time are results produced?
How do refinements relate?

The Beam Model: Where in Event Time?

[Figure: stream elements organized by event time; labels read 8:00]

Processing Time vs Event Time

Event Time = Processing Time ??

[Figure: plot of Processing Time against Event Time, built up over several slides. Labels: "Realtime", "This is not possible", "Processing Delay", "Very delayed"]

Processing Time windows (probably are not what you want)

[Figure: Processing Time vs Event Time plot with processing time windows]

Event Time Windows

[Figure: Processing Time vs Event Time plot with event time windows]

Event Time Windows (implementing processing time windows)

Just throw away your data's timestamps and replace them with "now()".
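The idea of getting processing time windows by overwriting timestamps with "now()" can be sketched in plain Python. Everything here is an assumption for illustration: events are modeled as hypothetical `(value, event_time, arrival_time)` tuples with integer seconds, and the arrival time stands in for "now()".

```python
def fixed_window(timestamp, size):
    """Return the half-open window [start, start + size) containing timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def window_events(events, size, use_processing_time=False):
    """Group values into fixed windows by event time or by arrival time."""
    windows = {}
    for value, event_time, arrival_time in events:
        # Replacing the event timestamp with the arrival time is the
        # "throw away your timestamps" trick from the slide.
        ts = arrival_time if use_processing_time else event_time
        windows.setdefault(fixed_window(ts, size), []).append(value)
    return windows


if __name__ == "__main__":
    # "b" happened at t=8 but only arrived at t=65.
    events = [("a", 5, 5), ("b", 8, 65), ("c", 62, 63)]
    print(window_events(events, 60))
    print(window_events(events, 60, use_processing_time=True))
```

With event time, the delayed element "b" still lands in the window where it happened; with processing time it lands wherever it happened to arrive.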

The Beam Model: Where in Event Time?

Window Into
Sum Per Key

Java:
input.apply(Window.into(FixedWindows.of(Duration.standardHours(1))))
     .apply(Sum.integersPerKey())
     .apply(BigQueryIO.Write.to(...))

Python:
input | WindowInto(FixedWindows(3600))
      | Sum.PerKey()
      | Write(BigQuerySink(...))

The Beam Model: Where in Event Time?

Fixed Windows (also called Tumbling)
Sliding Windows
User Sessions

1. Assign each timestamped event to one or more windows
2. Merge those windows according to custom logic
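The two steps above (assign, then merge) can be sketched in plain Python. This is a simplified model under stated assumptions: timestamps are non-negative integers, windows are half-open `(start, end)` tuples, sessions are computed for a single key, and the function names are hypothetical rather than Beam SDK calls.

```python
def fixed_window(ts, size):
    """Step 1 for fixed windows: exactly one window per timestamp."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts, size, period):
    """Step 1 for sliding windows: every window of length `size`,
    starting at a multiple of `period`, that contains ts."""
    last_start = ts - ts % period
    return sorted((s, s + size) for s in range(last_start, ts - size, -period))


def merge_sessions(timestamps, gap):
    """Steps 1 and 2 for sessions: give each timestamp its own window
    [ts, ts + gap), then merge windows that overlap."""
    merged = []
    for start, end in sorted((ts, ts + gap) for ts in timestamps):
        if merged and start < merged[-1][1]:
            # Overlaps the previous session: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    print(fixed_window(7, 10))
    print(sliding_windows(7, 10, 5))
    print(merge_sessions([1, 3, 20, 4], 5))
```

Fixed and sliding windows need only the assignment step; sessions are the case where the merge step does real work.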

So that's what and where...

The Beam Model: Asking the Right Questions

What are you computing?
Where in event time?
When in processing time are results produced? (Watermarks & Triggers)
How do refinements relate?

Event time windows

[Figure: Processing Time vs Event Time plot with event time windows]

Fixed cutoff (we can do better)

[Figure: Processing Time vs Event Time plot. Labels: "Allowed delay", "Concurrent windows"]

Perfect watermark

[Figure: Processing Time vs Event Time plot with a perfect watermark]

Check out Slava's slides from Strata London 2016 talk on watermarks: https://goo.gl/K4FnqQ

Heuristic Watermark

[Figure: Processing Time vs Event Time plot, built up over several slides. Labels: "Current processing time", "Late data"]

Watermarks measure completeness

? Running Total
✔ Monthly billing
? Abuse Detection
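To show how a heuristic watermark can produce late data, here is a small plain-Python sketch. The heuristic itself is an assumption chosen for illustration (the watermark trails the largest event time seen so far by a fixed `max_skew`); real runners estimate watermarks in source-specific ways.

```python
def classify_arrivals(event_times, max_skew):
    """event_times are given in arrival order.

    Returns (event_time, is_late) pairs, where an element is late if its
    event time is already behind the watermark when it arrives.
    """
    watermark = float("-inf")
    results = []
    for t in event_times:
        results.append((t, t < watermark))
        # Assumed heuristic: everything older than (max seen - max_skew)
        # is believed to have arrived already.
        watermark = max(watermark, t - max_skew)
    return results


if __name__ == "__main__":
    for event_time, is_late in classify_arrivals([10, 12, 11, 20, 14, 19], 5):
        print(event_time, "late" if is_late else "on time")
```

A larger `max_skew` makes fewer elements late but holds results back longer, which is the completeness versus latency tradeoff from earlier in the talk.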

The Beam Model: When in Processing Time?

Window Into (trigger after end of window)
Sum Per Key

Java:
input
    .apply(Window.into(FixedWindows.of(...))
        .triggering(AfterWatermark.pastEndOfWindow()))
    .apply(Sum.integersPerKey())
    .apply(BigQueryIO.Write.to(...))

Python:
input | WindowInto(FixedWindows(3600),
                   trigger=AfterWatermark())
      | Sum.PerKey()
      | Write(BigQuerySink(...))

AfterWatermark.pastEndOfWindow()

[Figure: Processing Time vs Event Time plot, built up over several slides. Labels: "Current processing time", "Late data"]

High completeness
Potentially high latency
Low cost

Repeatedly.forever(AfterPane.elementCountAtLeast(2))

[Figure: Processing Time vs Event Time plot, built up over several slides. Label: "Current processing time"]

Low completeness
Low latency
Cost driven by input
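The behavior of Repeatedly.forever(AfterPane.elementCountAtLeast(2)) can be approximated in plain Python as "emit a pane every time two elements have been buffered". This is a simplified sketch with a hypothetical function name; it fires at exactly n elements, whereas "at least" leaves a real runner free to fire with more.

```python
def fire_every_n(values, n):
    """Return (panes, pending): panes emitted so far, each holding the
    elements that arrived since the previous firing, plus the elements
    still waiting for the next firing."""
    panes, buffer = [], []
    for v in values:
        buffer.append(v)
        if len(buffer) >= n:
            panes.append(buffer)
            buffer = []
    return panes, buffer


if __name__ == "__main__":
    panes, pending = fire_every_n([3, 4, 1, 2, 9], 2)
    print(panes, pending)
```

The number of firings grows with the number of input elements, which is why the cost of this trigger is driven by input.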

Build a finely tuned trigger for your use case

AfterWatermark.pastEndOfWindow()                          // Bill at end of month
    .withEarlyFirings(
        AfterProcessingTime
            .pastFirstElementInPane()
            .plusDuration(Duration.standardMinutes(1)))   // Near real-time estimates
    .withLateFirings(AfterPane.elementCountAtLeast(1))    // Immediate corrections
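The combined trigger can be simulated for a single window in plain Python. This is only a sketch under stated assumptions: the input is a hypothetical chronological list of steps, either `("element", processing_time, value)` or one `("watermark", processing_time)` marking the moment the watermark passes the end of the window, and results are reported as running sums.

```python
def run_trigger(steps, early_delay=60):
    """Return (timing, processing_time, running_sum) for each firing.

    Assumes steps are in processing-time order and contain at most one
    "watermark" step.
    """
    firings = []
    total = 0
    pane_start = None       # arrival time of the first element in the pane
    on_time_fired = False
    for step in steps:
        kind, now = step[0], step[1]
        # Early firing: a timer set early_delay after the pane's first
        # element has come due before this step.
        if pane_start is not None and now >= pane_start + early_delay:
            firings.append(("EARLY", pane_start + early_delay, total))
            pane_start = None
        if kind == "element":
            total += step[2]
            if on_time_fired:
                # Late firing: as soon as each late element arrives.
                firings.append(("LATE", now, total))
            elif pane_start is None:
                pane_start = now
        else:
            # On-time firing: the watermark passed the end of the window.
            firings.append(("ON_TIME", now, total))
            on_time_fired = True
            pane_start = None
    return firings


if __name__ == "__main__":
    steps = [("element", 0, 5), ("element", 30, 2), ("element", 70, 7),
             ("watermark", 100), ("element", 120, 11)]
    for firing in run_trigger(steps):
        print(firing)
```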

.withEarlyFirings(after 1 minute)
.withLateFirings(ASAP after each element)

[Figure: Processing Time vs Event Time plot, built up over several slides. Labels: "Current processing time", "Late output"]

Low completeness
Low latency
Low cost, driven by time

Trigger Catalogue

Basic Triggers
AfterEndOfWindow()
AfterCount(n)
AfterProcessingTimeDelay(Δ)

Composite Triggers
AfterEndOfWindow().withEarlyFirings(A).withLateFirings(B)
AfterAny(A, B)
AfterAll(A, B)
Repeat(A)
Sequence(A, B)

The Beam Model: Asking the Right Questions

What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate? (Accumulation Mode)

The Beam Model: How do refinements relate?

Window.into(...)
    .triggering(...)
    .discardingFiredPanes()
Pane outputs: 5, 2, 7, 11

Window.into(...)
    .triggering(...)
    .accumulatingFiredPanes()
Pane outputs: 5, 7, 14, 25
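The difference between discardingFiredPanes() and accumulatingFiredPanes() can be shown with a few lines of plain Python. The sketch assumes a summing aggregation and takes, as an illustrative input, the sum of the new elements in each successive pane (5, 2, 7, 11); the function name is hypothetical.

```python
from itertools import accumulate


def pane_outputs(new_element_sums, accumulating):
    """new_element_sums: sum of the elements that arrived in each pane.

    Discarding mode emits only what is new in each pane; accumulating
    mode emits the running total of everything seen so far.
    """
    if accumulating:
        return list(accumulate(new_element_sums))
    return list(new_element_sums)


if __name__ == "__main__":
    print(pane_outputs([5, 2, 7, 11], accumulating=False))
    print(pane_outputs([5, 2, 7, 11], accumulating=True))
```

With discarding panes a consumer has to add the outputs together; with accumulating panes each output replaces the previous one.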

The Beam Model: Asking the Right Questions

What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?

3. Beam Project / Technical Vision

Dataflow → Beam

GoogleCloudPlatform/DataflowJavaSDK
cloudera/spark-dataflow
dataArtisans/flink-dataflow

apache/incubator-beam

Contributors [with GitHub badges] from: Google, Data Artisans, Cloudera, Talend, Paypal, Spotify, Intel, Twitter, Capital One, DataTorrent, …, <your org here>

The Beam Vision

End users - who want to write pipelines in a language that’s familiar.
SDK authors - who want to make Beam concepts available in new languages.
Runner authors - who have a distributed processing environment and want to run Beam pipelines.

[Figure: layered diagram. SDKs: Beam Java, Beam Python, Other Languages. Beam Runner API: Build and submit a pipeline. Runners: Apache Flink, Apache Spark, Cloud Dataflow, Apache Apex, Apache Gearpump (incubating). Beam Fn API: Invoke user-definable functions. Execution.]

Outlook

[Figure: release timeline for Dataflow Java 1.x, Apache Beam Java 0.x, and Apache Beam Java 2.x. Legend: Bug Fix, Feature, Breaking Change. Labels: "Feb 2016", "We are here"]


Capability Matrix

http://beam.apache.org/learn/runners/capability-matrix/

Why Apache Beam?

Unified - One model handles batch and streaming use cases.
Portable - Pipelines can be executed on multiple execution environments, avoiding lock-in.
Extensible - Supports user- and community-driven SDKs, Runners, transformation libraries, and IO connectors.

Why Apache Beam?

http://data-artisans.com/why-apache-beam/
"We firmly believe that the Beam model is the correct programming model for streaming and batch data processing."
- Kostas Tzoumas (Data Artisans)

https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
"We hope it will lead to a healthy ecosystem of sophisticated runners that compete by making users happy, not [via] API lock in."
- Tyler Akidau (Google)


END
