Beyond MapReduce, Beyond Lambda
Easy, unified, reliable processing for stream and batch
William Vambenepe
@vambenepe
Lead Product Manager for Big Data on Google Cloud Platform
http://research.google.com/archive/mapreduce.html
http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf
http://research.google.com/pubs/pub41378.html
http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
The Lambda Architecture
[Timeline figure, 2002 to 2013: MapReduce, GFS, Bigtable, Dremel, Pregel, Flume, Colossus, Spanner, MillWheel, Google Cloud Dataflow]
Event Time - When Events Happened
Stream Time - When Events Are Processed
Batch vs Streaming
Batch
[Figure: input processed by MapReduce]
Batch: Fixed Windows
[Figure: MapReduce over fixed one-hour windows, from [10:00 - 11:00) to [23:00 - 0:00)]
Batch: User Sessions
[Figure: MapReduce over the windows [10:00 - 11:00) and [11:00 - 12:00), with sessions for users Joan, Larry, Ingo, Amanda, Cheryl, Arthur]
Streaming
[Figure: timeline, 10:00 to 16:00]
Confounding characteristics of data streams
Unordered
Unbounded
Of varying event time skew
Event Time Skew
[Figure: Stream Time plotted against Event Time, with the Skew marked]
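The two time domains can be made concrete with a minimal sketch. The `Event` record and `skew` helper are hypothetical names for illustration, not part of any SDK; skew is taken here as stream time minus event time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical record carrying both timestamps, in seconds."""
    key: str
    event_time: float   # when the event happened
    stream_time: float  # when the pipeline processed it


def skew(e: Event) -> float:
    """Delay between an event happening and it being processed."""
    return e.stream_time - e.event_time


events = [
    Event("a", event_time=100.0, stream_time=101.0),
    # Happened earlier than "a" but was processed later: out of order.
    Event("b", event_time=90.0, stream_time=130.0),
]
skews = [skew(e) for e in events]
```

Because skew varies per event, ordering by stream time says little about ordering by event time.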
Approaches to reasoning about time
1. Time-Agnostic Processing
2. Approximation
3. Stream Time Windowing
4. Event Time Windowing
1. Time-Agnostic Processing - Filters
[Figure: timeline 10:00 to 16:00, Stream Time axis]
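Filtering needs no notion of time: each element is kept or dropped as it arrives. A minimal sketch, assuming events are plain dictionaries and `filter_stream` is a hypothetical helper:

```python
from typing import Callable, Iterable, Iterator


def filter_stream(events: Iterable[dict],
                  keep: Callable[[dict], bool]) -> Iterator[dict]:
    """Yield matching events one at a time; no buffering, no timestamps."""
    for event in events:
        if keep(event):
            yield event


incoming = [
    {"user": "Joan", "type": "click"},
    {"user": "Larry", "type": "view"},
    {"user": "Ingo", "type": "click"},
]
clicks = list(filter_stream(iter(incoming), lambda e: e["type"] == "click"))
```

The generator works unchanged on an unbounded iterator, since it never needs to see the end of the input.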
1. Time-Agnostic Processing - Hash Join
[Figure: timeline 10:00 to 16:00, Stream Time axis]
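A hash join over two streams can also ignore time: buffer each side by key and emit a match as soon as the other side shows up. The sketch below assumes a single interleaved input of hypothetical `(side, key, value)` tuples, and its buffers grow without bound, which a real system would have to address.

```python
from collections import defaultdict


def hash_join(stream):
    """Join two interleaved streams on key.

    `stream` yields (side, key, value) with side "L" or "R".
    Yields (key, left_value, right_value) for every matching pair.
    """
    seen = {"L": defaultdict(list), "R": defaultdict(list)}
    for side, key, value in stream:
        other = "R" if side == "L" else "L"
        # Pair the new element with everything already buffered on the other side.
        for match in seen[other].get(key, ()):
            yield (key, value, match) if side == "L" else (key, match, value)
        seen[side][key].append(value)


interleaved = [
    ("L", "k1", "a"),
    ("R", "k2", "x"),
    ("R", "k1", "b"),
    ("L", "k2", "y"),
]
joined = list(hash_join(interleaved))
```

Arrival order affects only when a pair is emitted, not which pairs are emitted.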
2. Approximation via Online Algorithms
[Figure: two timelines 10:00 to 16:00, both on the Stream Time axis]
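The slides do not name a particular online algorithm. As one illustration, the sketch below uses the Misra-Gries heavy-hitters summary, which tracks frequent items in bounded memory: with parameter `k` it keeps at most `k - 1` counters, and each reported count undercounts the true count by at most n/k for a stream of n items.

```python
def misra_gries(stream, k):
    """Approximate frequent-item counts using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


items = ["a", "b", "a", "c", "a", "d", "a", "b", "a"]
summary = misra_gries(items, k=3)
true_count_a = items.count("a")
```

Any item occurring more than n/k times is guaranteed to survive in the summary; items below that threshold may or may not.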
3. Windowing by Stream Time
[Figure: timelines 10:00 to 16:00, one on the Event Time axis and one on the Stream Time axis]
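Windowing by stream time assigns each element to the window that is open when it arrives, whatever its event time. A minimal sketch, assuming hypothetical `(stream_time, event_time, value)` tuples with times in minutes since midnight:

```python
from collections import defaultdict


def window_by_stream_time(arrivals, size):
    """Group values by the fixed window containing their arrival time."""
    windows = defaultdict(list)
    for stream_time, _event_time, value in arrivals:
        start = (stream_time // size) * size
        windows[(start, start + size)].append(value)
    return dict(windows)


arrivals = [
    (605, 540, "late"),     # happened at 9:00, arrived at 10:05
    (610, 608, "on-time"),
    (665, 661, "next"),
]
by_arrival = window_by_stream_time(arrivals, size=60)
```

The "late" element is counted in the [10:00 - 11:00) window even though it happened an hour earlier, which is the cost of this approach.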
4. Windowing by Event Time - Fixed Windows
[Figure: timelines 10:00 to 16:00, one on the Event Time axis and one on the Stream Time axis]
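Fixed windows by event time place each element in the half-open interval containing the time it happened, so arrival order does not matter. A minimal sketch with hypothetical helper names and times in minutes since midnight:

```python
from collections import defaultdict


def fixed_window(event_time, size):
    """Return the half-open window [start, start + size) containing event_time."""
    start = event_time - (event_time % size)
    return (start, start + size)


def group_by_event_time(events, size):
    """Group values by the fixed event-time window they belong to."""
    windows = defaultdict(list)
    for event_time, value in events:
        windows[fixed_window(event_time, size)].append(value)
    return dict(windows)


# Elements listed in arrival order, which differs from event-time order.
out_of_order = [(665, "b"), (605, "a"), (659, "c")]
hourly = group_by_event_time(out_of_order, size=60)
```

This sketch sees the whole input before returning. On an unbounded stream, something must decide when a window is complete enough to emit.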
4. Windowing by Event Time - Sessions
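Session windows are data-dependent: per user, events closer together than a gap belong to the same session. The sketch below is a bounded, batch-style illustration with hypothetical names and an assumed `gap` parameter; a streaming version would merge sessions incrementally as elements arrive.

```python
from collections import defaultdict


def sessionize(events, gap):
    """Return {user: [(first_event_time, last_event_time), ...]}.

    `events` is an iterable of (user, event_time). Consecutive events of
    one user that are at most `gap` apart fall into the same session.
    """
    by_user = defaultdict(list)
    for user, event_time in events:
        by_user[user].append(event_time)

    sessions = {}
    for user, times in by_user.items():
        times.sort()
        spans = [[times[0], times[0]]]
        for t in times[1:]:
            if t - spans[-1][1] <= gap:
                spans[-1][1] = t      # extend the current session
            else:
                spans.append([t, t])  # start a new session
        sessions[user] = [tuple(span) for span in spans]
    return sessions


activity = [("Joan", 10), ("Larry", 12), ("Joan", 14), ("Joan", 40), ("Larry", 13)]
user_sessions = sessionize(activity, gap=5)
```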
Dataflow API
What are you computing?
Where in event time?
When in stream time?
What = Aggregation API
Where = Windowing API
When = Watermarks + Triggers API
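The three questions can be illustrated together in a toy sketch. It is not the Dataflow SDK API; `WindowedSum` and its methods are hypothetical. What: a sum. Where: fixed event-time windows. When: a window is emitted once the watermark passes its end. Data arriving after its window was emitted is not handled here.

```python
from collections import defaultdict


class WindowedSum:
    """Toy aggregator: sums per fixed event-time window, emitted on watermark."""

    def __init__(self, size):
        self.size = size
        self.sums = defaultdict(int)      # window start -> running sum
        self.watermark = float("-inf")

    def add(self, event_time, value):
        """What + Where: add value to the window containing event_time."""
        start = event_time - (event_time % self.size)
        self.sums[start] += value

    def advance_watermark(self, watermark):
        """When: emit (start, end, sum) for windows ending at or before the watermark."""
        self.watermark = max(self.watermark, watermark)
        ready = sorted(s for s in self.sums if s + self.size <= self.watermark)
        return [(s, s + self.size, self.sums.pop(s)) for s in ready]


agg = WindowedSum(size=60)
agg.add(5, 1)
agg.add(65, 10)
agg.add(30, 2)                       # out of order, still lands in [0, 60)
first = agg.advance_watermark(60)
second = agg.advance_watermark(120)
```

Separating the three concerns means the aggregation can stay the same while the windowing or the emission policy changes.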
Dataflow improvements over Lambda
Low-latency, approximate results
Complete, correct results as soon as possible
One system: less to manage, fewer resources, one set of bugs
Tools for explicit reasoning about time
= Power + Flexibility + Clarity
And those are just the programming model improvements…
What about the operational model improvements from marrying Dataflow with Cloud?
Cloud Dataflow as a No-ops Cloud service
[Figure: Google Cloud Platform managed service. Components: User Code & SDK, Job Manager, Work Manager, Monitoring UI. Flows: Deploy & Schedule, Progress & Logs]
Putting it all together
[Figure labels: Stream, Batch, Cloud Pub/Sub, Cloud Logs, Analytics Premium, Cloud Storage, App Engine, Cloud Dataflow, BigQuery Storage (tables), Cloud Storage (files), BigQuery Analytics (SQL), Bigtable (NoSQL)]
Optimizing Time To Answer
Typical Data Processing: Programming, Resource provisioning, Performance tuning, Monitoring, Reliability, Deployment & configuration, Handling growing scale, Utilization improvements
Data Processing with Cloud Dataflow: Programming, More time to dig into your data
For more info
Google Cloud Services:
https://cloud.google.com/dataflow/
https://cloud.google.com/bigquery/
https://cloud.google.com/pubsub/
https://cloud.google.com/hadoop/
Contact me:
William Vambenepe
twitter: @vambenepe
email: [email protected]
Dataflow programming model is open-source:
SDK @ github.com/GoogleCloudPlatform/DataflowJavaSDK (Python SDK in progress)
Spark runner @ github.com/cloudera/spark-dataflow
Flink runner @ github.com/dataArtisans/flink-dataflow