Apache Beam (incubating)
A Unified Model for Batch and Streaming Data Processing

Kenneth Knowles
Apache Beam (incubating) PPMC
Software Engineer @ Google
[email protected] / @KennKnowles
Flink Forward 2016
Slides: https://goo.gl/jzlvD9
What is Apache Beam?
Apache Beam is a unified programming model for expressing efficient and portable data processing pipelines.
Agenda
1. Big Data: Infinite & Out of Order
2. The Beam Model
3. Beam Project / Technical Vision
4. Outlook
1. Big Data: Infinite & Out of Order

[Image: globe centered in the Atlantic Ocean - https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg]
Unbounded, delayed, out of order
[Chart: events arriving along a 1:00-14:00 timeline; incoming data is unbounded, delayed, and out of order. Score per user?]
Organizing the stream
[Chart: the same stream organized into event-time windows, e.g. the 8:00 window]
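Organizing by event time can be sketched in a few lines of plain Python (an illustrative sketch, not the Beam API; `assign_to_hour_window` is a made-up helper): each event carries its own timestamp, so a delayed element still lands in the window where it belongs.

```python
from collections import defaultdict

def assign_to_hour_window(events):
    """Bucket (event_time_seconds, value) pairs into 1-hour event-time windows.

    Arrival order does not matter: a delayed event still lands in the
    window of the hour in which it actually occurred.
    """
    windows = defaultdict(list)
    for event_time, value in events:
        window_start = (event_time // 3600) * 3600  # floor to the hour
        windows[window_start].append(value)
    return dict(windows)

# Events arrive out of order; 8:00 is second 28800, 9:00 is second 32400.
arrived = [(28805, 3), (32500, 7), (28810, 2)]  # two 8:00 events, one 9:00
print(assign_to_hour_window(arrived))  # {28800: [3, 2], 32400: [7]}
```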
Data Processing Tradeoffs: Completeness vs. Latency vs. Cost ($$$)
What is important for your application?
[Charts: for each use case - Monthly Billing, Billing Estimate, Abuse Detection - Completeness, Low Latency, and Low Cost ($$$) are each marked Important or Not Important]
Choices abound (2004-2016): MapReduce (paper), Apache Hadoop, FlumeJava (paper), MillWheel (paper), Apache Spark, Apache Storm, Apache Samza, Apache Flink, Heron, Apache Gearpump (incubating), Apache Apex, Cloud Dataflow, Dataflow Model (paper), Apache Beam (incubating)

See also: Tyler Akidau's Evolution of Massive-Scale Data Processing (goo.gl/VlVAEp)
2. The Beam Model
The Beam Model
[Diagram: a Pipeline is a graph of PTransforms operating on PCollections (bounded or unbounded)]
The Beam Vision (for users)

Sum Per Key

Java:
input.apply(Sum.integersPerKey())

Python:
input | Sum.PerKey()

Runs on any supported runner: Apache Flink, Apache Spark, Cloud Dataflow, Apache Gearpump (incubating), Apache Apex, ...
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 .apply(FlatMapElements.via(line -> Arrays.asList(line.split("[^a-zA-Z']+"))))
 .apply(Filter.byPredicate(word -> !word.isEmpty()))
 .apply(Count.perElement())
 .apply(MapElements.via(count -> count.getKey() + ": " + count.getValue()))
 .apply(TextIO.Write.to("gs://..."));
p.run();
What your (Java) code looks like
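For intuition, here is what that pipeline computes, sketched in plain Python (`word_count` is a hypothetical stand-in, not Beam code): split on non-letter characters, drop empties, count each word, and format the counts.

```python
import re
from collections import Counter

def word_count(lines):
    """Split lines on non-letter characters, drop empties, count each word."""
    words = (w for line in lines for w in re.split(r"[^a-zA-Z']+", line))
    counts = Counter(w for w in words if w)
    return [f"{word}: {n}" for word, n in sorted(counts.items())]

print(word_count(["to be or", "not to be"]))
# ['be: 2', 'not: 1', 'or: 1', 'to: 2']
```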
The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
Aggregations, transformations, ...
The Beam Model: What are you computing?
Sum Per User
The Beam Model: What are you computing?
Sum Per Key

Java:
input.apply(Sum.integersPerKey())
     .apply(BigQueryIO.Write.to(...));

Python:
input | Sum.PerKey() | Write(BigQuerySink(...))
Three kinds of transforms: per-element (ParDo), grouping (Group/Combine per key), and composite.
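The effect of Sum.integersPerKey() can be sketched in plain Python (illustrative only; `sum_integers_per_key` is a made-up helper, not the Beam API): group values by key, then combine each group with addition.

```python
from collections import defaultdict

def sum_integers_per_key(pairs):
    """Group (key, value) pairs by key and sum the values, like Sum per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

print(sum_integers_per_key([("alice", 5), ("bob", 3), ("alice", 2)]))
# {'alice': 7, 'bob': 3}
```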
The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
Event time windowing

The Beam Model: Where in Event Time?
[Chart: events assigned to event-time windows, e.g. the 8:00 window]
Processing Time vs Event Time

Event Time = Processing Time ??
[Charts: processing time on the vertical axis, event time on the horizontal axis]
[Charts: the "realtime" ideal, where event time equals processing time, is not possible; real elements arrive after a processing delay, sometimes very delayed]
Processing Time windows (probably are not what you want)
[Chart: grouping by processing time scatters events across windows they don't belong to in event time]

Event Time Windows
[Charts: grouping by event time keeps each event in the window where it actually occurred]

(Implementing processing-time windows: just throw away your data's timestamps and replace them with "now()".)
Python:
input | WindowInto(FixedWindows(3600))
      | Sum.PerKey()
      | Write(BigQuerySink(...))
The Beam Model: Where in Event Time?
Window Into → Sum Per Key
Java:
input.apply(
        Window.into(
            FixedWindows.of(
                Duration.standardHours(1))))
    .apply(Sum.integersPerKey())
    .apply(BigQueryIO.Write.to(...));
Fixed Windows (also called tumbling), Sliding Windows, User Sessions

The Beam Model: Where in Event Time?
1. Assign each timestamped event to one or more windows
2. Merge those windows according to custom logic
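This assign-then-merge scheme is what makes session windows expressible. A minimal plain-Python sketch (not the Beam API) with a fixed gap duration: each event is assigned a proto-window [t, t + gap), and overlapping proto-windows merge into one session.

```python
def session_windows(timestamps, gap):
    """Merge per-event proto-windows [t, t + gap) into sessions.

    Step 1 (assign): each event timestamp t gets the window (t, t + gap).
    Step 2 (merge): windows that overlap are merged into one session.
    """
    proto = sorted((t, t + gap) for t in timestamps)
    sessions = []
    for start, end in proto:
        if sessions and start <= sessions[-1][1]:
            # Overlaps the current session: extend it.
            sessions[-1] = (sessions[-1][0], max(sessions[-1][1], end))
        else:
            sessions.append((start, end))
    return sessions

print(session_windows([1, 3, 10, 12], gap=4))  # [(1, 7), (10, 16)]
```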
So that's what and where...
The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
Watermarks & Triggers
[Chart: event-time windows plotted against processing time]
[Chart: a fixed cutoff - wait a fixed allowed delay past the end of each window before emitting (we can do better); many concurrent windows stay open meanwhile]
[Chart: a perfect watermark - emit each window exactly when all of its data has arrived]

Check out Slava's slides on watermarks from his Strata London 2016 talk: https://goo.gl/K4FnqQ
Heuristic Watermark
[Charts: as the current processing time advances, a heuristic watermark estimates how complete each window's data is; data can still arrive behind the watermark - late data]
Watermarks measure completeness
? Running Total
✔ Monthly billing
? Abuse Detection
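A heuristic watermark can be pictured as a moving lower bound on event time; anything arriving behind it is late. A toy sketch in plain Python (the max-minus-slack heuristic and `classify_lateness` are made up for illustration; real watermarks are estimated per source):

```python
def classify_lateness(arrivals):
    """Track a simple watermark and flag late data.

    arrivals: event timestamps in arrival order. The watermark here is
    just the max event time seen so far minus a fixed slack; anything
    arriving behind it is 'late'. (A crude heuristic, for illustration.)
    """
    slack = 2
    watermark = float("-inf")
    labels = []
    for t in arrivals:
        labels.append("late" if t < watermark else "on-time")
        watermark = max(watermark, t - slack)
    return labels

print(classify_lateness([5, 8, 3, 9]))
# ['on-time', 'on-time', 'late', 'on-time']
```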
The Beam Model: When in Processing Time?
Window Into → Sum Per Key

Java:
input
    .apply(Window.into(FixedWindows.of(...))
        .triggering(AfterWatermark.pastEndOfWindow()))
    .apply(Sum.integersPerKey())
    .apply(BigQueryIO.Write.to(...));
Python:
input | WindowInto(FixedWindows(3600),
                   trigger=AfterWatermark())
      | Sum.PerKey()
      | Write(BigQuerySink(...))
Trigger after end of window: AfterWatermark.pastEndOfWindow()
[Charts: each window fires once, when the watermark passes its end; late data arriving behind the watermark misses the pane]

High completeness. Potentially high latency. Low cost ($$$).
Repeatedly.forever(AfterPane.elementCountAtLeast(2))
[Charts: every window fires again each time at least 2 more elements arrive, regardless of the watermark]

Low completeness. Low latency. Cost driven by input ($$$).
Build a finely tuned trigger for your use case:

AfterWatermark.pastEndOfWindow()
    .withEarlyFirings(
        AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))
    .withLateFirings(AfterPane.elementCountAtLeast(1))
Bill at end of month
Near real-time estimates
Immediate corrections
.withEarlyFirings(after 1 minute)
.withLateFirings(ASAP after each element)
[Charts: early panes fire a minute after the first element in each window, the on-time pane fires when the watermark passes the end of the window, and late output corrects the result as each late element arrives]

Low completeness. Low latency. Low cost, driven by time ($$$).
Trigger Catalogue

Basic Triggers:
AfterEndOfWindow()
AfterCount(n)
AfterProcessingTimeDelay(Δ)

Composite Triggers:
AfterEndOfWindow().withEarlyFirings(A).withLateFirings(B)
AfterAny(A, B)
AfterAll(A, B)
Repeat(A)
Sequence(A, B)
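The composite triggers can be understood as combinators over simpler firing conditions. A toy sketch (not the Beam API) where a trigger is a predicate over a pane's element count and processing-time delay:

```python
# Toy trigger combinators (illustrative, not Beam classes): a trigger is a
# predicate over the current pane's state, and composites combine predicates.
def after_count(n):
    return lambda count, delay: count >= n

def after_delay(d):
    return lambda count, delay: delay >= d

def after_any(a, b):  # fires when either child would fire
    return lambda count, delay: a(count, delay) or b(count, delay)

def after_all(a, b):  # fires only when both children would fire
    return lambda count, delay: a(count, delay) and b(count, delay)

t = after_any(after_count(100), after_delay(60))
print(t(3, 75))  # True: 75s elapsed, even though only 3 elements arrived
print(t(3, 10))  # False: neither condition is met yet
```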
The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
Accumulation Mode

The Beam Model: How do refinements relate?

Window.into(...)
    .triggering(...)
    .discardingFiredPanes()

Window.into(...)
    .triggering(...)
    .accumulatingFiredPanes()

[Chart: successive firing panes over the same window, in discarding mode (each pane holds only the new elements) vs. accumulating mode (each pane holds the running aggregate)]
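The two modes can be sketched for a sum-per-window trigger that fires after every element (`fired_panes` is a made-up helper, not Beam code): discarding panes contain only what arrived since the last firing, accumulating panes contain the running aggregate.

```python
def fired_panes(values, accumulating):
    """Emit the pane output after each firing of a sum-per-window trigger.

    discarding: each pane contains only elements since the last firing.
    accumulating: each pane contains the running aggregate so far.
    """
    panes, running = [], 0
    for v in values:
        running += v
        panes.append(running if accumulating else v)
    return panes

print(fired_panes([5, 7, 14], accumulating=False))  # [5, 7, 14]
print(fired_panes([5, 7, 14], accumulating=True))   # [5, 12, 26]
```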
The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
3. Beam Project / Technical Vision
Dataflow → Beam
GoogleCloudPlatform/DataflowJavaSDK, cloudera/spark-dataflow, dataArtisans/flink-dataflow → apache/incubator-beam

Contributors [with GitHub badges] from: Google, Data Artisans, Cloudera, Talend, PayPal, Spotify, Intel, Twitter, Capital One, DataTorrent, …, <your org here>
End users - who want to write pipelines in a language that's familiar.
SDK authors - who want to make Beam concepts available in new languages.
Runner authors - who have a distributed processing environment and want to run Beam pipelines.

The Beam Vision
[Diagram: SDKs (Beam Java, Beam Python, other languages) build and submit a pipeline via the Beam Runner API; runners (Apache Flink, Apache Spark, Cloud Dataflow, Apache Apex, Apache Gearpump (incubating)) execute it, invoking user-definable functions via the Beam Fn API]
Outlook
[Timeline: Dataflow Java 1.x (Feb 2016) → Apache Beam Java 0.x (we are here) → Apache Beam Java 2.x, with bug fixes, features, and breaking changes along the way]
Capability Matrix
http://beam.apache.org/learn/runners/capability-matrix/
Unified - One model handles batch and streaming use cases.
Portable - Pipelines can be executed on multiple execution environments, avoiding lock-in.
Extensible - Supports user and community driven SDKs, Runners, transformation libraries, and IO connectors.
Why Apache Beam?
Why Apache Beam? http://data-artisans.com/why-apache-beam/
"We firmly believe that the Beam model is the correct programming model for streaming and batch data processing."
- Kostas Tzoumas (Data Artisans)
https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
"We hope it will lead to a healthy ecosystem of sophisticated runners that compete by making users happy, not [via] API lock in."
- Tyler Akidau (Google)
Beam here at Flink Forward 2016
I hope you saw:
Beaming Flink to the Cloud @ Netflix - Monal Daxini (Netflix)

And stay in this room for:
Flink and Beam: Current State & Roadmap - Maximilian Michels (data Artisans)
No shard left behind: Dynamic work rebalancing in Apache Beam - Malo Denielou (Google)
Join the community!
http://beam.incubator.apache.org/
User discussions - [email protected]
Development discussions - [email protected]
@ApacheBeam on Twitter

Good Reads
Why Apache Beam? (from Data Artisans)
Why Apache Beam? (from Google)
Streaming 101
Streaming 102
The Dataflow Model

More Beam!
END