57
Section Slide Template Option 2 Put your subtitle here. Feel free to pick from the handful of pretty Google colors available to you. Make the subtitle something clever. People will think it’s neat. Dataflow A Unified Model for Batch and Streaming Data Processing Vadim Solovey Google Developer Expert & Trainer [email protected] Shahar Frank Cloud Solutions Architect [email protected]

Dataflow - A Unified Model for Batch and Streaming Data Processing

Embed Size (px)

Citation preview

Section Slide Template Option 2

Put your subtitle here. Feel free to pick from the handful of pretty Google colors available to you.

Make the subtitle something clever. People will think it’s neat.

Dataflow A Unified Model for Batch and Streaming Data Processing

Vadim SoloveyGoogle Developer Expert & [email protected]

Shahar FrankCloud Solutions [email protected]

Goals:

Write interesting computations

Run in both batch & streaming

Use custom timestamps

Handle late data

https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg

Data Shapes

Google’s Data Processing Story

The Dataflow Model

Agenda

Demo: Google Cloud Dataflow

1

2

3

4

Data Shapes1

Data...

...can be big...

...really, really big...

TuesdayWednesday

Thursday

...maybe even infinitely big...

9:008:00 14:0013:0012:0011:0010:002:001:00 7:006:005:004:003:00

… with unknown delays.

9:008:00 14:0013:0012:0011:0010:00

8:00

8:008:00

8:00

1 + 1 = 2Completeness Latency Cost

$$$

Data Processing Tradeoffs

Requirements: Billing Pipeline

Completeness Low Latency Low Cost

Important

Not Important

Requirements: Live Cost Estimate Pipeline

Completeness Low Latency Low Cost

Important

Not Important

Requirements: Abuse Detection Pipeline

Completeness Low Latency Low Cost

Important

Not Important

Requirements: Abuse Detection Backfill Pipeline

Completeness Low Latency Low Cost

Important

Not Important

Google’s Data Processing Story2

20122002 2004 2006 2008 2010

MapReduce

GFS Big Table

Dremel

Pregel

FlumeJava

Colossus

Spanner

2014

MillWheel

Dataflow

2016

Google’s Data-Related Papers

(Produce)

MapReduce: Batch Processing

(Prepare)

(Shuffle)

Map

Reduce

FlumeJava: Easy and Efficient MapReduce Pipelines

● Higher-level API with simple data processing abstractions.○ Focus on what you want to do to

your data, not what the underlying system supports.

● A graph of transformations is automatically transformed into an optimized series of MapReduces.

MapReduce

Batch Patterns: Creating Structured Data

MapReduce

Batch Patterns: Repetitive Runs

TuesdayWednesday

Thursday

MapReduce

Tuesday [11:00 - 12:00)

[12:00 - 13:00)

[13:00 - 14:00)

[14:00 - 15:00)

[15:00 - 16:00)

[16:00 - 17:00)

[18:00 - 19:00)

[19:00 - 20:00)

[21:00 - 22:00)

[22:00 - 23:00)

[23:00 - 0:00)

Batch Patterns: Time Based Windows

MapReduce

TuesdayWednesday

Batch Patterns: Sessions

Jose

Lisa

Ingo

Asha

Cheryl

Ari

WednesdayTuesday

MillWheel: Streaming Computations

● Framework for building low-latency data-processing applications

● User provides a DAG of computations to be performed

● System manages state and persistent flow of elements

Streaming Patterns: Element-wise transformations

13:00 14:008:00 9:00 10:00 11:00 12:00 Processing Time

Streaming Patterns: Aggregating Time Based Windows

13:00 14:008:00 9:00 10:00 11:00 12:00 Processing Time

11:0010:00 15:0014:0013:0012:00Event Time

11:0010:00 15:0014:0013:0012:00Processing Time

Input

Output

Streaming Patterns: Event Time Based Windows

Streaming Patterns: Session Windows

Event Time

Processing Time 11:0010:00 15:0014:0013:0012:00

11:0010:00 15:0014:0013:0012:00

Input

Output

Proc

essi

ng T

ime

Event Time

Skew

Event-Time Skew

Watermark Watermarks describe event time progress.

"No timestamp earlier than the watermark will be seen"

Often heuristic-based.

Too Slow? Results are delayed.Too Fast? Some data is late.

Streaming or Batch?

1 + 1 = 2 $$$Completeness Latency Cost

Why not both?

The Dataflow Model3

What are you computing?

Where in event time?

When in processing time?

How do refinements relate?

What are you computing?

• A Pipeline represents a graph of data processing transformations

• PCollections flow through the pipeline

• Optimized and executed as a unit for efficiency

What are you computing? • A PCollection<T> is a collection

of data of type T

• Maybe be bounded or unbounded in size

• Each element has an implicit timestamp

• Initially created from backing data stores

What are you computing?

PTransforms transform PCollections into other PCollections.

What Where When How

Element-Wise Aggregating Composite

Example: Computing Integer Sums

// Collection of raw log linesPCollection<String> raw = ...;

// Element-wise transformation into team/score pairsPCollection<KV<String, Integer>> input = raw.apply(ParDo.of(new ParseFn()))

// Composite transformation containing an aggregationPCollection<KV<String, Integer>> output = input .apply(Sum.integersPerKey());

What Where When How

Example: Computing Integer Sums

What Where When How

What Where When How

Example: Computing Integer Sums

Key 2

Key 1

Key 3

1

Fixed

2

3

4

Key 2

Key 1

Key 3

Sliding

123

54

Key 2

Key 1

Key 3

Sessions

2

43

1

Where in Event Time?

• Windowing divides data into event-time-based finite chunks.

• Required when doing aggregations over unbounded data.

What Where When How

PCollection<KV<String, Integer>> output = input .apply(Window.into(FixedWindows.of(Minutes(2)))) .apply(Sum.integersPerKey());

What Where When How

Example: Fixed 2-minute Windows

What Where When How

Example: Fixed 2-minute Windows

What Where When How

When in Processing Time?

• Triggers control when results are emitted.

• Triggers are often relative to the watermark.Pr

oces

sing

Tim

e

Event Time

Watermark

PCollection<KV<String, Integer>> output = input .apply(Window.into(FixedWindows.of(Minutes(2))) .trigger(AtWatermark()) .apply(Sum.integersPerKey());

What Where When How

Example: Triggering at the Watermark

What Where When How

Example: Triggering at the Watermark

What Where When How

Example: Triggering for Speculative & Late Data

PCollection<KV<String, Integer>> output = input .apply(Window.into(FixedWindows.of(Minutes(2))) .trigger(AtWatermark() .withEarlyFirings(AtPeriod(Minutes(1))) .withLateFirings(AtCount(1)))) .apply(Sum.integersPerKey());

What Where When How

Example: Triggering for Speculative & Late Data

What Where When How

How do Refinements Relate?• How should multiple outputs per window

accumulate?• Appropriate choice depends on consumer.

Firing Elements

Speculative 3

Watermark 5, 1

Late 2

Total Observ 11

Discarding

3

6

2

11

Accumulating

3

9

11

23

Acc. & Retracting

3

9, -3

11, -9

11

PCollection<KV<String, Integer>> output = input .apply(Window.into(Sessions.withGapDuration(Minutes(1))) .trigger(AtWatermark() .withEarlyFirings(AtPeriod(Minutes(1))) .withLateFirings(AtCount(1))) .accumulatingAndRetracting()) .apply(new Sum());

What Where When How

Example: Add Newest, Remove Previous

What Where When How

Example: Add Newest, Remove Previous

1. Classic Batch 2. Batch with Fixed Windows

3. Streaming 5. Streaming with Retractions

4. Streaming with Speculative + Late Data

Customizing What Where When How

What Where When How

Demo: Google Cloud Dataflow4

Cloud DataflowA fully-managed cloud service and programming model for batch and streaming big data processing.

Google Cloud Dataflow

Open Source SDKs

● Used to construct a Dataflow pipeline.

● Java available now. Python in the open beta.

● Pipelines can run…○ On your development machine○ On the Dataflow Service on Google Cloud Platform ○ On third party environments like Spark or Flink.

Fully Managed Dataflow Service

Runs the pipeline on Google Cloud Platform. Includes:

● Graph optimization: Modular code, efficient execution● Smart Workers: Lifecycle management, Autoscaling, and

Smart task rebalancing ● Easy Monitoring: Dataflow UI, Restful API and CLI,

Integration with Cloud Logging, etc.

Demo Time!

user11_BisqueEmu,BisqueEmu,11,1460980305000,2016-04-18 04:59:27.936user0_BisqueQuokka,BisqueQuokka,15,1460980768000,2016-04-18 04:59:28.473user11_ArmyGreenNumbat,ArmyGreenNumbat,5,1460980768000,2016-04-18 04:59:28.473user4_AquaEmu,AquaEmu,19,1460980768000,2016-04-18 04:59:28.473user5_BeigeDingo,BeigeDingo,16,1460980768000,2016-04-18 04:59:28.473user1_ArmyGreenKookaburra,ArmyGreenKookaburra,9,1460980768000,2016-04-18 04:59:28.473user7_AquaDingo,AquaDingo,18,1460980768000,2016-04-18 04:59:28.473user7_AmberEmu,AmberEmu,11,1460980768000,2016-04-18 04:59:28.473user1_BeigeDingo,BeigeDingo,6,1460980768000,2016-04-18 04:59:28.473user2_ArmyGreenKookaburra,ArmyGreenKookaburra,8,1460980768000,2016-04-18 04:59:28.473user10_AzureBandicoot,AzureBandicoot,12,1460980768000,2016-04-18 04:59:28.473user1_AzureBandicoot,AzureBandicoot,12,1460980768000,2016-04-18 04:59:28.473Robot-12,BeigeQuokka,12,1460980768000,2016-04-18 04:59:28.473user3_MagentaAntechinus,MagentaAntechinus,5,1460980768000,2016-04-18 04:59:28.473user6_AmberEmu,AmberEmu,15,1460980768000,2016-04-18 04:59:28.473

Data

Wrote Reusable PTransforms

Ran the PTransform in streaming mode

Used event time, not processing time

Easily adjusted windowing and late data settings

Learn More!

● The Dataflow Model @VLDB 2015 http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

● Dataflow SDK for Javahttps://github.com/GoogleCloudPlatform/DataflowJavaSDK

● Google Cloud Dataflow on Google Cloud Platformhttp://cloud.google.com/dataflow (Free Trial!)