Beyond MapReduce, Beyond Lambda Easy, unified, reliable processing for stream and batch William Vambenepe @vambenepe Lead Product Manager for Big Data on Google Cloud Platform


Page 1:

Beyond MapReduce, Beyond Lambda

Easy, unified, reliable processing for stream and batch

William Vambenepe

@vambenepe

Lead Product Manager for Big Data on Google Cloud Platform

Page 2:

http://research.google.com/archive/mapreduce.html

Page 3:

http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf

Page 4:

http://research.google.com/pubs/pub41378.html

Page 5:

http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

Page 6:

The Lambda Architecture

Page 7:

[Timeline figure, 2002 to 2013: GFS, MapReduce, Bigtable, Dremel, Pregel, Flume, Colossus, Spanner, MillWheel, Google Cloud Dataflow]

Page 8:

Event Time - When Events Happened

Stream Time - When Events Are Processed

Page 9:

Batch vs Streaming

Page 10:

MapReduce

Batch

Page 11:

MapReduce

Batch: Fixed Windows

[Figure: input split into fixed one-hour windows, from [10:00 - 11:00) to [23:00 - 0:00)]

Page 12:

MapReduce

Batch: User Sessions

[Figure: sessions for users Joan, Larry, Ingo, Amanda, Cheryl, and Arthur across the [10:00 - 11:00) and [11:00 - 12:00) batches]

Page 13:

Streaming

[Figure: time axis from 10:00 to 16:00]

Page 14:

Confounding characteristics of data streams

Unordered

Unbounded

Of Varying Event Time Skew

Page 15:

Event Time Skew

[Figure: Stream Time axis against Event Time axis, with a region labeled Skew]
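Using the definitions from the earlier slide (event time is when an event happened, stream time is when it is processed), skew can be read as the difference between the two. The sketch below is only an illustration: the function name and the sample timestamps are hypothetical, not taken from the talk.

```python
from datetime import datetime, timedelta

def event_time_skew(event_time: datetime, stream_time: datetime) -> timedelta:
    """Return how long after it happened an event was processed."""
    return stream_time - event_time

# Hypothetical (event_time, stream_time) pairs, listed in arrival order.
observations = [
    (datetime(2015, 1, 1, 10, 0), datetime(2015, 1, 1, 10, 0, 5)),
    (datetime(2015, 1, 1, 10, 1), datetime(2015, 1, 1, 10, 31)),
    (datetime(2015, 1, 1, 9, 58), datetime(2015, 1, 1, 10, 32)),
]

# The skew varies from event to event, and the last event is also out of order.
skews = [event_time_skew(e, s) for e, s in observations]
```

The third record happened before the first one but arrived last, which is the "unordered" and "varying skew" behavior named on the previous slide.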

Page 16:

Approaches

Page 17:

Approaches to reasoning about time

1. Time-Agnostic Processing

2. Approximation

3. Stream Time Windowing

4. Event Time Windowing

Page 18:

1. Time-Agnostic Processing - Filters

[Figure: Stream Time axis from 10:00 to 16:00]
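A filter is time-agnostic because each event is kept or dropped on its own content, whatever its event time or arrival order. A minimal sketch, assuming events are plain dictionaries; the field names and the `stream_filter` helper are hypothetical.

```python
from typing import Callable, Iterable, Iterator

def stream_filter(events: Iterable[dict],
                  keep: Callable[[dict], bool]) -> Iterator[dict]:
    """Yield each event that satisfies the predicate, one at a time."""
    for event in events:
        if keep(event):
            yield event

# Hypothetical click events; no timestamps are needed for this operation.
clicks = [
    {"source": "web", "url": "/home"},
    {"source": "mobile", "url": "/cart"},
    {"source": "web", "url": "/checkout"},
]
web_only = list(stream_filter(clicks, lambda e: e["source"] == "web"))
```

Because the generator holds no state between events, it works the same on a bounded list and on an unbounded stream.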

Page 19:

1. Time-Agnostic Processing - Hash Join

[Figure: Stream Time axis from 10:00 to 16:00]
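One way to join two streams without reasoning about time is a symmetric hash join: buffer each arriving value under its key, then probe the other side's buffer for matches. This is a sketch of that general technique, not necessarily what the slide depicts; the `(side, key, value)` tuple layout is an assumption made for the example.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

def symmetric_hash_join(
    tagged_events: Iterable[Tuple[str, str, int]]
) -> Iterator[Tuple[str, int, int]]:
    """Join 'left' and 'right' events on key, emitting matches as they appear."""
    tables = {"left": defaultdict(list), "right": defaultdict(list)}
    for side, key, value in tagged_events:
        other = "right" if side == "left" else "left"
        tables[side][key].append(value)           # remember this event
        for match in tables[other].get(key, []):  # probe the opposite side
            if side == "left":
                yield (key, value, match)
            else:
                yield (key, match, value)

# Hypothetical interleaving of two streams, in arrival order.
events = [
    ("left", "a", 1),
    ("right", "b", 20),
    ("right", "a", 10),
    ("left", "b", 2),
    ("left", "a", 3),
]
joined = list(symmetric_hash_join(events))
```

Note that the buffers in this sketch grow without bound on an unbounded stream; a real system would need some policy for expiring old entries.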

Page 20:

2. Approximation via Online Algorithms

[Figure: Stream Time axis from 10:00 to 16:00]
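The slide does not name a specific online algorithm. As one illustrative choice, the Misra-Gries summary approximates the most frequent items in a single pass with a fixed number of counters. The sketch below is an assumption about what such an approach can look like, not the method used in the talk.

```python
from typing import Dict, Hashable, Iterable

def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Track at most k - 1 counters; any item occurring more than n/k times
    in a stream of n items is guaranteed to remain in the result."""
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Hypothetical stream of 8 items in which "a" occurs 5 times.
summary = misra_gries(["a", "b", "a", "c", "a", "d", "a", "a"], k=3)
```

The stored counts are underestimates (by at most n/k), which is the trade made for bounded memory and low latency.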

Page 21:

3. Windowing by Stream Time

[Figure: Stream Time axis from 10:00 to 16:00]
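Stream-time windowing groups events by when they arrive, ignoring when they happened. A minimal sketch, assuming times are expressed as minutes since midnight; the record layout and helper name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def window_by_stream_time(
    arrivals: Iterable[Tuple[int, int, str]], size: int
) -> Dict[int, List[str]]:
    """Group values by the window in which they ARRIVED.

    Each record is (stream_minute, event_minute, value); the event time is
    carried along but deliberately not used for grouping.
    """
    windows: Dict[int, List[str]] = defaultdict(list)
    for stream_minute, _event_minute, value in arrivals:
        window_start = stream_minute - stream_minute % size
        windows[window_start].append(value)
    return dict(windows)

# "c" happened at 10:15 (615) but arrived at 11:01 (661).
arrivals = [(601, 600, "a"), (659, 610, "b"), (661, 615, "c")]
by_arrival = window_by_stream_time(arrivals, size=60)
```

Here "c" lands in the 11:00 window even though it happened during the 10:00 hour, which is the limitation that motivates windowing by event time.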

Page 22:

4. Windowing by Event Time - Fixed Windows

[Figure: Event Time axis and Stream Time axis, each from 10:00 to 16:00]
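With event-time fixed windows, each event is assigned to the half-open interval that contains its own timestamp, regardless of arrival order. A sketch under the assumption that times are minutes since midnight; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def fixed_window(event_minute: int, size: int) -> Tuple[int, int]:
    """Return the half-open [start, end) window containing the event time."""
    start = event_minute - event_minute % size
    return (start, start + size)

def window_by_event_time(
    events: Iterable[Tuple[int, str]], size: int
) -> Dict[Tuple[int, int], List[str]]:
    """Group (event_minute, value) records by event-time window."""
    windows: Dict[Tuple[int, int], List[str]] = defaultdict(list)
    for event_minute, value in events:
        windows[fixed_window(event_minute, size)].append(value)
    return dict(windows)

# Arrival order: "c" happened at 10:10 but shows up after the 11:05 event.
events = [(605, "a"), (665, "b"), (610, "c")]
by_event = window_by_event_time(events, size=60)
```

The late record "c" still ends up in [10:00 - 11:00), so results reflect when things happened; the cost is that a window may need to stay open while late data can still arrive.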

Page 23:

4. Windowing by Event Time - Sessions

[Figure: Event Time axis and Stream Time axis, each from 10:00 to 16:00]
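Session windows are commonly defined per key as runs of activity separated by a gap of inactivity. The sketch below assumes that definition, a hypothetical 30-minute gap, and times in minutes since midnight; it processes a finished batch for simplicity rather than merging sessions incrementally.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def sessionize(
    events: Iterable[Tuple[str, int]], gap: int
) -> Dict[str, List[Tuple[int, int]]]:
    """Return, per user, a list of (first_event, last_event) sessions.

    A new session starts when the time since the user's previous event
    exceeds `gap`.
    """
    by_user: Dict[str, List[int]] = defaultdict(list)
    for user, minute in events:
        by_user[user].append(minute)

    sessions: Dict[str, List[Tuple[int, int]]] = {}
    for user, minutes in by_user.items():
        minutes.sort()  # order by event time, not arrival order
        spans = [[minutes[0], minutes[0]]]
        for minute in minutes[1:]:
            if minute - spans[-1][1] <= gap:
                spans[-1][1] = minute            # extend the current session
            else:
                spans.append([minute, minute])   # start a new session
        sessions[user] = [(start, end) for start, end in spans]
    return sessions

# Hypothetical activity; Joan's second session straddles 11:00 (minute 660).
events = [("Joan", 600), ("Larry", 610), ("Joan", 655), ("Joan", 605),
          ("Larry", 700), ("Joan", 612), ("Joan", 665)]
sessions = sessionize(events, gap=30)
```

Unlike fixed windows, session boundaries depend on the data itself, so a session can cross an hour boundary instead of being cut in two by it.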

Page 24:

Dataflow API

Page 25:

What are you computing?

Where in event time?

When in stream time?

Page 26:

What = Aggregation API

Where = Windowing API

When = Watermarks + Triggers API
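The three questions can be illustrated with a toy pipeline in plain Python. This is not the Dataflow SDK API: the function, the watermark heuristic (largest event time seen minus a fixed lag), and the policy of dropping late data are all assumptions made for the sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set, Tuple

def windowed_sums(
    events: Iterable[Tuple[int, int]], size: int, watermark_lag: int
) -> Iterator[Tuple[int, int, int]]:
    """Yield (window_start, window_end, total) for (event_time, value) input."""
    sums: Dict[int, int] = defaultdict(int)   # What: a sum per window
    emitted: Set[int] = set()
    watermark = float("-inf")
    for event_time, value in events:
        start = event_time - event_time % size   # Where: fixed event-time window
        if start in emitted:
            continue  # late data for a closed window is dropped in this sketch
        sums[start] += value
        # Assumed heuristic: the watermark trails the largest event time seen.
        watermark = max(watermark, event_time - watermark_lag)
        for s in sorted(sums):
            if s + size <= watermark:             # When: watermark passed the end
                emitted.add(s)
                yield (s, s + size, sums.pop(s))
    for s in sorted(sums):                        # end of input: flush the rest
        yield (s, s + size, sums[s])

# Hypothetical input in arrival order, times in minutes since midnight.
events = [(605, 1), (650, 2), (665, 3), (655, 4), (675, 5), (658, 6), (730, 7)]
results = list(windowed_sums(events, size=60, watermark_lag=10))
```

In this run the value 4 (event time 10:55) arrives out of order but before the watermark passes 11:00, so it is counted; the value 6 (event time 10:58) arrives after the window has fired and is dropped. Choosing the lag trades latency against completeness.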

Page 27:

Dataflow improvements over Lambda

Low-latency, approximate results

Complete, correct results as soon as possible

One system: less to manage, fewer resources, one set of bugs

Tools for explicit reasoning about time

= Power + Flexibility + Clarity

Page 28:

And those are just the programming model improvements…

What about the operational model improvements from marrying Dataflow with Cloud?

Page 29:

Cloud Dataflow as a NoOps Cloud service

[Diagram labels: Google Cloud Platform, Managed Service, User Code & SDK, Job Manager, Work Manager, Deploy & Schedule, Progress & Logs, Monitoring UI]

Page 30:

Putting it all together

[Diagram with Stream and Batch paths. Labels: Cloud Pub/Sub, Cloud Logs, Google Analytics Premium, Google Cloud Storage, Google App Engine, Cloud Dataflow (shown twice), BigQuery Storage (tables), Cloud Storage (files), BigQuery Analytics (SQL), Bigtable (NoSQL)]

Page 31:

Optimizing Time To Answer

Typical Data Processing: Programming, Resource provisioning, Performance tuning, Monitoring, Reliability, Deployment & configuration, Handling Growing Scale, Utilization improvements

Data Processing with Cloud Dataflow: Programming, More time to dig into your data

Page 32:

For more info

Google Cloud Services:

https://cloud.google.com/dataflow/

https://cloud.google.com/bigquery/

https://cloud.google.com/pubsub/

https://cloud.google.com/hadoop/

Contact me:

William Vambenepe

twitter: @vambenepe

email: [email protected]

Dataflow programming model is open-source:

SDK @ github /GoogleCloudPlatform/DataflowJavaSDK (Python SDK in progress)

Spark runner @ github /cloudera/spark-dataflow

Flink runner @ github /dataArtisans/flink-dataflow