Apache Beam (incubating)

Preview:

Citation preview

Apache Beam (incubating)

Kenneth Knowlesklk@google.com@KennKnowles Apache Apex Meetup, 2016-06-

27

https://goo.gl/LTLjKt

Motivation

Beam Model

Beam Project / Technical Vision

Agenda1

2

3

2

3

Motivation1

https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg4

5

Unbounded, delayed, out of order

9:008:00 14:00

13:00

12:00

11:00

10:00

2:001:00 7:006:005:004:003:00

5

8:00

8:008:00

Incoming!

Score per

user?

6

Organizing the stream

7

8:00

8:00

8:00

Completeness

Latency Cost

$$$

Data Processing Tradeoffs

8

What is important for your application?

Completeness Low Latency Low Cost

Important

Not Important

$$$9

Monthly Billing

Completeness Low Latency Low Cost

Important

Not Important

$$$10

Billing estimate

Completeness Low Latency Low Cost

Important

Not Important

$$$11

Abuse Detection

Completeness Low Latency Low Cost

Important

Not Important

$$$12

13

The Beam Model

2

The Beam Model

Pipeline

14

PTransform

PCollection

The Beam Vision (for users)

Sum Per Key

15

input.apply( Sum.integersPerKey())

Java

input | Sum.PerKey()

Python

Apache Flink

Apache Spark

Cloud Dataflow

⋮ ⋮

Apache Apex

Apache Gearpump

(incubating)

Pipeline p = Pipeline.create(options);

p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))

.apply(FlatMapElements.via(line → Arrays.asList(line.split("[^a-zA-Z']+"))))

.apply(Filter.byPredicate(word → !word.isEmpty()))

.apply(Count.perElement())

.apply(MapElements.via(count → count.getKey() + ": " + count.getValue())

.apply(TextIO.Write.to("gs://..."));

p.run();

What your (Java) Code Looks Like

16

The Beam Model: Asking the Right Questions What are you computing?

Where in event time?

When in processing time are results produced?

How do refinements relate?

17

The Beam Model: Asking the Right Questions What are you computing?

Where in event time?

When in processing time are results produced?

How do refinements relate?

18

Aggregations, transformations, ...

The Beam Model: What are you computing?

Sum Per User

19

The Beam Model: What are you computing?

Sum Per Key

20

input.apply(Sum.integersPerKey()) .apply(BigQueryIO.Write.to(...));

Java

input | Sum.PerKey() | Write(BigQuerySink(...))

Python

http://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html

The Beam Model: Asking the Right Questions What are you computing?

Where in event time?

When in processing time are results produced?

How do refinements relate?

21

Event time windowing

22

The Beam Model: Where in Event Time?8:00

8:00

8:00

Processing Time vs Event Time

Event Time = Processing Time ??

23

Processing Time vs Event Time

24

Proc

essi

ng T

ime

Proc

essi

ng T

ime

Processing Time vs Event Time

Realtime

25

This is not possible

Processing Time vs Event Time

26

Processing DelayPr

oces

sing

Tim

e

Processing Time vs Event TimeVery delayed

27

Proc

essi

ng T

ime

Event Time

Processing Time windows(probably are not what you want)

Proc

essi

ng T

ime

Event Time 28

Event Time Windows

29

Proc

essi

ng T

ime

Event Time

Proc

essi

ng T

ime

Event Time

Event Time Windows

30

(implementing processing time windows)

Just throw away your data's timestamps and replace them with "now()"

input |

WindowInto(FixedWindows(3600) | Sum.PerKey() | Write(BigQuerySink(...))

Python

The Beam Model: Where in Event Time?

Sum Per Key

Window Into

31

input.apply(

Window.into( FixedWindows.of( Duration.standardHours(1))) .apply(Sum.integersPerKey()) .apply(BigQueryIO.Write.to(...))

Java

So that's what and where...

32

The Beam Model: Asking the Right Questions What are you computing?

Where in event time?

When in processing time are results produced?

How do refinements relate?

33

Watermarks &

Triggers

Event time windowsPr

oces

sing

Tim

e

34

Event Time

Fixed cutoff (we can do better)Pr

oces

sing

Tim

e

Event Time35

Allowed delay

Concurrent windows

Perfect watermarkPr

oces

sing

Tim

e

36

Event Time

Check out Slava's slides from Strata London 2016 talk on watermarks:https://goo.gl/K4FnqQ

Heuristic WatermarkPr

oces

sing

Tim

e

37

Event Time

Heuristic WatermarkPr

oces

sing

Tim

e

38

Current processing time

Event Time

Heuristic WatermarkPr

oces

sing

Tim

e

39

Current processing time

Event Time

Heuristic WatermarkPr

oces

sing

Tim

e

40

Current processing time

Late data

Event Time

Watermarks measure completeness

41

$$$

$$$

$$$

? Running Total

✔ Monthly billing

? Abuse Detection

The Beam Model: When in Processing Time?

Sum Per Key

Window Into

42

input .apply(Window.into(FixedWindows.of(...))

.triggering( AfterWatermark.pastEndOfWindow())) .apply(Sum.integersPerKey()) .apply(BigQueryIO.Write.to(...))

Java

input | WindowInto(FixedWindows(3600),

trigger=AfterWatermark()) | Sum.PerKey() | Write(BigQuerySink(...))

Python

Trigger after end of window

Proc

essi

ng T

ime

Event Time

AfterWatermark.pastEndOfWindow()

43

Current processing time

Proc

essi

ng T

ime

Event Time44

AfterWatermark.pastEndOfWindow()

Proc

essi

ng T

ime

Event Time

Late data

45

Current processing time

AfterWatermark.pastEndOfWindow()

Proc

essi

ng T

ime

Event Time46

High completeness

Potentially high latency

Low cost

AfterWatermark.pastEndOfWindow()

$$$

Proc

essi

ng T

ime

Event Time

Repeatedly.forever( AfterPane.elementCountAtLeast(2))

47

Proc

essi

ng T

ime

Event Time48

Current processing time

Repeatedly.forever( AfterPane.elementCountAtLeast(2))

Current processing time

Proc

essi

ng T

ime

Event Time49

Repeatedly.forever( AfterPane.elementCountAtLeast(2))

Proc

essi

ng T

ime

Event Time50

Current processing time

Repeatedly.forever( AfterPane.elementCountAtLeast(2))

Current processing time

Proc

essi

ng T

ime

Event Time51

Repeatedly.forever( AfterPane.elementCountAtLeast(2))

Proc

essi

ng T

ime

Event Time52

Repeatedly.forever( AfterPane.elementCountAtLeast(2))

Low completeness

Low latency

Cost driven by input

$$$

Build a finely tuned trigger for your use caseAfterWatermark.pastEndOfWindow()

.withEarlyFirings( AfterProcessingTime .pastFirstElementInPane() .plusDuration(Duration.standardMinutes(1))

.withLateFirings(AfterPane.elementCountAtLeast(1)) 53

Bill at end of month

Near real-time estimates

Immediate corrections

Proc

essi

ng T

ime

Event Time54

.withEarlyFirings(after 1 minute)

.withLateFirings(ASAP after each element)

Proc

essi

ng T

ime

Event Time55

Current processing time

.withEarlyFirings(after 1 minute)

.withLateFirings(ASAP after each element)

Proc

essi

ng T

ime

Event Time56

Current processing time

Low completeness

Low latency

Low cost, driven by time

$$$

.withEarlyFirings(after 1 minute)

.withLateFirings(ASAP after each element)

Current processing time

Proc

essi

ng T

ime

Event Time57

.withEarlyFirings(after 1 minute)

.withLateFirings(ASAP after each element)

Current processing time

Proc

essi

ng T

ime

Event Time

Late output

58

.withEarlyFirings(after 1 minute)

.withLateFirings(ASAP after each element)

Proc

essi

ng T

ime

Event Time

Late output

59

.withEarlyFirings(after 1 minute)

.withLateFirings(ASAP after each element)

Trigger CatalogueComposite TriggersBasic Triggers

60

AfterEndOfWindow()

AfterCount(n)

AfterProcessingTimeDelay(Δ)

AfterEndOfWindow() .withEarlyFirings(A) .withLateFirings(B)

AfterAny(A, B)AfterAll(A, B)Repeat(A)Sequence(A, B)

The Beam Model: Asking the Right Questions What are you computing?

Where in event time?

When in processing time are results produced?

How do refinements relate?

61

Accumulation Mode

The Beam Model: How do refinements relate?

62

input

.apply(Window.into(...).triggering(...).discardingFiredPanes()) .apply(Sum.integersPerKey()) .apply(BigQueryIO.Write.to(...))

vs

1

3 7

4

10

5

1

3 7

4

10

15

discarding accumulating

The Beam Model: Asking the Right Questions What are you computing?

Where in event time?

When in processing time are results produced?

How do refinements relate?

63

64

Beam Project / Technical Vision

3

1. End users: who want to write pipelines in a language that’s familiar.

2. SDK writers: who want to make Beam concepts available in new languages.

3. Runner writers: who have a distributed processing environment and want to run Beam pipelines

Beam Fn API: Invoke user-definable functions

Apache Flink

Apache Spark

Beam Runner API: Build and submit a piepline

OtherLanguagesBeam Java Beam

Python

Execution Execution

Cloud Dataflo

w

Execution

The Beam Vision

Apache Apex

Apache Gearpump (incubatin

g)

Project Setup (vision meets code)GoogleCloudPlatform/DataflowJavaSDK cloudera/spark-dataflow dataArtisans/flink-dataflow

apache/incubator-beam

Direct (on your laptop)Google Cloud DataflowFlinkSparkIn pull request: Apex, Gearpump

Integration tests

Runners

Examples

I/O Connectors

sharing

HDFSKafkaBigQueryGoogle Cloud Storage, Pubsub, Bigtable, DatastoreIn pull request: JMS, CassandraProposed: Sqoop, Parquet, JDBC, SocketStream, ...

SDKs

Committers from Google, Data Artisans, Cloudera, Talend, Paypal● ~40 commits/week● Rigorous code review for every commit

Contributors [with GitHub badges] from: Spotify, Intel, Twitter, Capital One, DataTorrent, …, <your name here>

● Improvements to existing I/O connectors● Improvements to Spark runner● Utility classes for users● Documentation fixes● Bug diagnoses● New I/O connectors● Gearpump runner PoC● Apex runner PoC!

… and it has been awesomeapache/incubator-beam

Java SDK: Transition from Dataflow

Dataflow Java 1.x

Apache Beam Java 0.x

Apache Beam Java 2.xBug Fix

Feature

Breaking Change

We are here

Feb 201

6

Late 2016

Understanding: Capability Matrix

http://beam.incubator.apache.org/capability-matrix/

Why Apache Beam?Unified - One model handles batch and streaming use cases.

Portable - Pipelines can be executed on multiple execution environments, avoiding lock-in.

Extensible - Supports user and community driven SDKs, Runners, transformation libraries, and IO connectors.

Why Apache Beam?http://data-artisans.com/why-apache-beam/

"We firmly believe that the Beam model is the correct programming model for streaming and batch data processing."

- Kostas Tzoumas (Data Artisans)

https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective

"We hope it will lead to a healthy ecosystem of sophisticated runners that compete by making users happy, not [via] API lock in."

- Tyler Akidau (Google)

72

Creating an Apache Beam CommunityCollaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors.

Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem.

We love contributions. Join us!

Apache Beam

http://beam.incubator.apache.org/Why Apache Beam? (from Data Artisans)Why Apache Beam? (from Google)

Programming Model Overviews

Streaming 101Streaming 102The Dataflow Beam Model

Join the community!User discussions - user-subscribe@beam.incubator.apache.orgDevelopment discussions - dev-subscribe@beam.incubator.apache.orgFollow @ApacheBeam on Twitter

Learn More!

73

END

74

Recommended