Apache Beam: A unified model for batch and streaming data processing


Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business, and consumers of these datasets have detailed requirements for latency, cost, and completeness.

Apache Beam (incubating) defines a new data processing programming model that evolved from more than a decade of experience building Big Data infrastructure within Google, including MapReduce, FlumeJava, Millwheel, and Cloud Dataflow. Beam handles both batch and streaming use cases and neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtime environments, both open source (e.g., Apache Flink, Apache Spark, et al.), and proprietary (e.g., Google Cloud Dataflow).

This talk will cover the basics of Apache Beam, touch on its evolution, and describe the main concepts in the programming model. During the talk, we'll argue why Beam is unified, efficient, and portable.

Davor Bonaci, Apache Beam PPMC
Software Engineer, Google Inc.
Apache Beam: A Unified Model for Batch and Streaming Data Processing

Hadoop Summit, June 28-30, 2016, San Jose, CA

Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines

What is Apache Beam?

The Beam Model: What / Where / When / How
SDKs for writing Beam pipelines: Java, Python
Runners for existing distributed processing backends: Apache Flink, Apache Spark, Google Cloud Dataflow, and a local runner for testing

[Architecture diagram: Beam Java, Beam Python, and other-language SDKs construct pipelines against the Beam Model; the Fn Runners layer executes them on Apache Flink, Apache Spark, and Google Cloud Dataflow.]

The Evolution of Beam

[Timeline: MapReduce → Flume → Millwheel → Google Cloud Dataflow → Apache Beam, alongside BigTable, Dremel, Colossus, Megastore, Spanner, and PubSub.]

Google published the original paper on MapReduce in 2004, fundamentally changing the way we do distributed processing. Inside Google, we kept innovating, but just published papers. Externally, the open source community created Hadoop, and an entire ecosystem flourished, partially influenced by those Google papers. In 2014 came Google Cloud Dataflow, which included both a new programming model and a fully managed service. We wanted to share this model more broadly, both because it is awesome and because users benefit from a larger ecosystem and portability across multiple runtimes. So Google, along with a handful of partners, donated this programming model to the Apache Software Foundation as the incubating project Apache Beam...

Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines

Data...

Here's some gaming logs: each square represents an event where a user scored some points for their team.

...can be big...

game gets popular

...really, really big... (Tuesday, Wednesday, Thursday)

start organizing it into a repeated structure

maybe infinitely big...

[Timeline: 8:00-14:00.]

The repetitive structure is just a cheap way of representing an infinite data source: game logs are continuous. And distributed systems can cause ambiguity...

...with unknown delays.

[Timeline: 8:00-14:00; three elements stamped 8:00 arrive at different times.]

Let's look at some points that were scored at 8am. The red score happened at 8am and was received quickly. The yellow score also happened at 8am, but was received at 8:30 due to network congestion. The green element was hours late: this was someone playing in airplane mode on a plane, and it had to wait for the plane to land. So now we've got an unordered, infinite data set. How do we process it...

Formalizing Event-Time Skew

[Chart: Processing Time vs. Event Time; Reality vs. the Ideal line, with the gap labeled Skew.]

The blue axis is event time; the green axis is processing time. Ideally there's no delay: elements are processed when they occurred. Reality looks more like that red squiggly line, where processing time is slightly delayed off event time. The variable distance between reality and the ideal is called skew, which we need to track in order to reason about correctness.

Formalizing Event-Time Skew

Watermarks describe event time progress.
"No timestamp earlier than the watermark will be seen."
[Chart: Processing Time vs. Event Time, with the ~Watermark alongside the Ideal line and Skew.]

Often heuristic-based.

Too slow? Results are delayed.
Too fast? Some data is late.

The red line is the watermark: no event times earlier than this point are expected to appear in the future. It's often heuristic-based. Too slow, and we add unnecessary latency. Too fast, and some data comes in late, after we thought we were done for a given time period. So how do we reason about these types of infinite, out-of-order datasets...

What are you computing?
Where in event time?
When in processing time?
How do refinements relate?

not too hard if you know what kinds of questions to ask!

What results are calculated? Sums, joins, histograms, machine learning models?

Where in event time are results calculated? Does the time each event originally occurred affect results? Are results aggregated for all time, in fixed windows, or as user activity sessions?

When in processing time are results materialized? Does the time each element arrives in the system affect results? How do we know when to emit a result? What do we do about data that comes in late from those pesky users playing on transatlantic flights?

And finally, how do refinements relate? If we choose to emit results multiple times, is each result independent and distinct, or do they build upon one another?

Let's dive into how each question contributes when we build a pipeline...

What are you computing? (What / Where / When / How)

Element-Wise / Aggregating / Composite

The first thing to figure out is what you are actually computing. Some transforms process each element independently, similar to the Map function in MapReduce, and are easy to parallelize. Other transformations, like grouping and combining, require inspecting multiple elements at a time. And some operations are really just subgraphs of other, more primitive operations.

Now let's see a code snippet for our gaming example...

What: Computing Integer Sums

// Collection of raw log lines
PCollection raw = IO.read(...);

// Element-wise transformation into team/score pairs
PCollection input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection scores = input.apply(Sum.integersPerKey());

What Where When How

Pseudo-Java for compactness/clarity! We start by reading a collection of raw events, transform it into a more structured collection containing key/value pairs with a team name and the number of points scored during the event, and then use a composite operation to sum up all the points per team. Let's see how this code executes...
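The pseudo-Java above relies on the Beam SDK; the essence of this classic-batch path (parse each line, then sum per key) can be sketched without it. The class name `SumPerKey`, the `parse` helper, and the comma-separated log format are illustrative assumptions, not Beam API:

```java
import java.util.*;

public class SumPerKey {
    // Parse a raw log line like "red,5" into a (team, score) pair.
    static Map.Entry<String, Integer> parse(String line) {
        String[] parts = line.split(",");
        return Map.entry(parts[0], Integer.parseInt(parts[1]));
    }

    // Aggregate scores per team: the essence of Sum.integersPerKey().
    static Map<String, Integer> sumPerKey(List<String> rawLines) {
        Map<String, Integer> totals = new HashMap<>();
        for (String line : rawLines) {
            Map.Entry<String, Integer> kv = parse(line);
            totals.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> raw = List.of("red,3", "blue,4", "red,7");
        Map<String, Integer> scores = sumPerKey(raw);
        System.out.println(scores.get("red"));  // 10
        System.out.println(scores.get("blue")); // 4
    }
}
```

In a real pipeline the runner distributes the parse step and the per-key aggregation; here both happen in one process, purely to show the shape of the computation.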

What: Computing Integer Sums (What / Where / When / How)

We're looking at points scored for a given team: blue axis is event time, green axis is processing time, with the ideal line shown. This score of 3 from just before 12:07 arrives almost immediately. Another is 7 minutes delayed, perhaps an elevator or subway. The graph isn't big enough to show offline mode from a transatlantic flight.

What: Computing Integer Sums (What / Where / When / How)

Processing time is the thick white line. We accumulate the sum into intermediate state and produce the output represented by the blue rectangle. Since all the data is available, the rectangle covers all events, no matter when in time they occurred, and a single final result is emitted when it's all complete. That's pretty standard batch processing. Let's see what happens if we tweak the other questions.

Windowing divides data into event-time-based finite chunks.

Where in event time? (What / Where / When / How)

Often required when doing aggregations over unbounded data.

[Diagram: Fixed windows, Sliding windows (overlapping), and Sessions, unaligned per key (Key 1, Key 2, Key 3) over time.]

Windowing lets us create individual results for different slices of event time. It divides data into finite chunks based on the event time of each element. Common patterns include fixed-time windows (like hourly, daily, monthly); sliding windows (like the last 24 hours' worth of data, every hour), where a single element may be in multiple overlapping windows; and session-based windows that capture bursts of user activity and are unaligned per key. This is very common when trying to do aggregations on infinite data, and it's actually a common pattern in batch too, though historically done using composite keys.

Where: Fixed 2-minute Windows (What / Where / When / How)

PCollection scores = input
  .apply(Window.into(FixedWindows.of(Minutes(2))))
  .apply(Sum.integersPerKey());

fixed windows that are 2 minutes long
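Under the hood, assigning an element to a fixed window is just timestamp arithmetic: each event time maps to the start of the window containing it. A minimal sketch (not the Beam implementation; timestamps are in minutes since midnight for readability):

```java
public class FixedWindowAssign {
    // Start of the fixed window of the given size that contains the
    // timestamp (both expressed in minutes, for simplicity).
    static long windowStart(long eventTimeMinutes, long sizeMinutes) {
        return eventTimeMinutes - (eventTimeMinutes % sizeMinutes);
    }

    public static void main(String[] args) {
        // With 2-minute windows, an event at 12:05 (725 min) lands in the
        // window starting at 12:04 (724 min); one at 12:06 starts a new window.
        System.out.println(windowStart(725, 2)); // 724
        System.out.println(windowStart(726, 2)); // 726
    }
}
```

The runner keeps one piece of aggregation state per (key, window) pair, so this mapping is all that's needed to route an element to the right sum.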

Where: Fixed 2-minute Windows (What / Where / When / How)

We get an independent answer for every two-minute period of event time, but we're still waiting until the entire computation completes to emit any results. That won't work for infinite data! We want to reduce latency...

When in processing time? (What / Where / When / How)

Triggers control when results are emitted.
Triggers are often relative to the watermark.
[Chart: Processing Time vs. Event Time, with the ~Watermark alongside the Ideal line and Skew.]

Triggers define when in processing time to emit results, often relative to the watermark, which is that heuristic about event time progress.

When: Triggering at the Watermark (What / Where / When / How)

PCollection scores = input
  .apply(Window.into(FixedWindows.of(Minutes(2)))
               .triggering(AtWatermark()))
  .apply(Sum.integersPerKey());

We request that results are emitted when we think we've roughly seen all the elements for a given window. This is actually the default; it's just written out for clarity.

When: Triggering at the Watermark (What / Where / When / How)

The left graph shows a perfect watermark, which tracks exactly when all the data for a given event time has arrived; we emit the result from each window as soon as the watermark passes. But the watermark is usually just a heuristic, so reality looks more like the graph on the right, where the 9 is missed. And if the watermark is delayed, like in the first graph, we need to wait a long time for anything; we'd like speculative results. Let's use a more advanced trigger...

When: Early and Late Firings (What / Where / When / How)

PCollection scores = input
  .apply(Window.into(FixedWindows.of(Minutes(2)))
               .triggering(AtWatermark()
                   .withEarlyFirings(AtPeriod(Minutes(1)))
                   .withLateFirings(AtCount(1))))
  .apply(Sum.integersPerKey());

We ask for early, speculative firings every minute, and get updates every time a late element comes in.

When: Early and Late Firings (What / Where / When / How)

In all cases, we're able to get speculative results before the watermark. We now get results when the watermark passes, but still handle the late value 9 even with a heuristic watermark. In this case, we accumulate across the multiple results per window: in the final window, we see and emit 3, but then still include that 3 in the next update of 12. But this behavior around multiple firings is configurable...

How do refinements relate? (What / Where / When / How)

How should multiple outputs per window accumulate?
The appropriate choice depends on the consumer.

Firing:             Speculative [3] | Watermark [5, 1] | Late [2] | Last Observed | Total Observed
Discarding:         3               | 6                | 2        | 2             | 11
Accumulating:       3               | 9                | 11       | 11            | 23
Acc. & Retracting:  3               | 9, -3            | 11, -9   | 11            | 11

(Accumulating & Retracting not yet implemented.)

We fire three times for a window: a speculative firing with 3, the watermark firing with two more values, 5 and 1, and finally a late value of 2. One option is to emit only the new elements that have come in since the last result, which requires the consumer to be able to do the final sum. We could instead produce the running sum every time, but then the consumer may overcount. Or we can produce both the new running sum and a retraction of the old one.
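The discarding and accumulating rows of the table above can be reproduced in a few lines of plain Java. `PaneModes` and its method names are illustrative, not Beam API:

```java
import java.util.*;

public class PaneModes {
    // Discarding mode: each firing emits only the elements seen since
    // the previous firing.
    static List<Integer> discarding(List<List<Integer>> firings) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> pane : firings)
            out.add(pane.stream().mapToInt(Integer::intValue).sum());
        return out;
    }

    // Accumulating mode: each firing emits the running total so far.
    static List<Integer> accumulating(List<List<Integer>> firings) {
        List<Integer> out = new ArrayList<>();
        int total = 0;
        for (List<Integer> pane : firings) {
            total += pane.stream().mapToInt(Integer::intValue).sum();
            out.add(total);
        }
        return out;
    }

    public static void main(String[] args) {
        // Speculative [3], on-time [5, 1], late [2], as in the table.
        List<List<Integer>> firings = List.of(List.of(3), List.of(5, 1), List.of(2));
        System.out.println(discarding(firings));   // [3, 6, 2]
        System.out.println(accumulating(firings)); // [3, 9, 11]
    }
}
```

The retracting variant would pair each accumulating output with the negation of the previous one (3; then 9, -3; then 11, -9), so a downstream consumer can subtract stale results.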

How: Add Newest, Remove Previous (What / Where / When / How)

PCollection scores = input
  .apply(Window.into(FixedWindows.of(Minutes(2)))
               .triggering(AtWatermark()
                   .withEarlyFirings(AtPeriod(Minutes(1)))
                   .withLateFirings(AtCount(1)))
               .accumulatingAndRetractingFiredPanes())
  .apply(Sum.integersPerKey());

use accumulating and retracting.

How: Add Newest, Remove Previous (What / Where / When / How)

We get speculative results, on-time results, and retractions. Now the final window emits 3, then retracts the 3 when emitting 12.

So those are the four questions...

Correctness / Power / Composability / Flexibility / Modularity (What / Where / When / How)

Those are the four key questions. Are they the right questions?

here are 5 reasons...

Correctness / Power / Composability / Flexibility / Modularity (What / Where / When / How)

the results we get are correct

This is not something we've historically gotten with streaming systems.

Distributed Systems are Distributed

Distributed systems are distributed. If the winds had been blowing from the east instead of the west, elements might have arrived in a slightly different order.

Event Time Results are Stable

Aggregating based on event time may produce different intermediate results, but the final results are identical across the two arrival scenarios.

Correctness / Power / Composability / Flexibility / Modularity (What / Where / When / How)

next, the abstractions can represent powerful and complex algorithms.

Sessions (What / Where / When / How)

PCollection scores = input
  .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
               .triggering(AtWatermark()
                   .withEarlyFirings(AtPeriod(Minutes(1)))
                   .withLateFirings(AtCount(1)))
               .accumulatingAndRetractingFiredPanes())
  .apply(Sum.integersPerKey());

Earlier we mentioned session windows, which capture bursts of user activity. It's a simple code change...

Identifying Bursts of User Activity (What / Where / When / How)

We want to identify two groupings of points. In other words, Tyler was playing the game, got distracted by a squirrel, and then resumed his play.
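The session logic can be sketched directly: walk sorted event timestamps and start a new session whenever the gap between consecutive events exceeds the configured duration. This is a simplification of Beam's merging-window machinery, with illustrative names:

```java
import java.util.*;

public class SessionWindows {
    // Group event timestamps (in minutes) into sessions: a gap larger
    // than gapMinutes between consecutive events starts a new session.
    static List<List<Long>> sessions(List<Long> timestamps, long gapMinutes) {
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long t : sorted) {
            if (!current.isEmpty() && t - current.get(current.size() - 1) > gapMinutes) {
                result.add(current);       // close the previous burst
                current = new ArrayList<>();
            }
            current.add(t);
        }
        if (!current.isEmpty()) result.add(current);
        return result;
    }

    public static void main(String[] args) {
        // Two bursts of activity separated by more than the 1-minute gap.
        System.out.println(sessions(List.of(0L, 1L, 5L, 6L), 1));
        // [[0, 1], [5, 6]]
    }
}
```

In a streaming runner the same effect is achieved incrementally: each element opens a tiny proto-window, and overlapping proto-windows are merged as data (including late data) arrives.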

Correctness / Power / Composability / Flexibility / Modularity (What / Where / When / How)

Next, flexibility for covering all sorts of use cases.

1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions
6. Sessions

(What / Where / When / How)

By tuning our what/where/when/how knobs, we've covered everything from classic batch to sessions.

Correctness / Power / Composability / Flexibility / Modularity (What / Where / When / How)

And not only that, we do so with lovely modular code

1. Classic Batch
PCollection scores = input
  .apply(Sum.integersPerKey());

2. Batch with Fixed Windows
PCollection scores = input
  .apply(Window.into(FixedWindows.of(Minutes(2))))
  .apply(Sum.integersPerKey());

3. Streaming
PCollection scores = input
  .apply(Window.into(FixedWindows.of(Minutes(2)))
               .triggering(AtWatermark()))
  .apply(Sum.integersPerKey());

4. Streaming with Speculative + Late Data
PCollection scores = input
  .apply(Window.into(FixedWindows.of(Minutes(2)))
               .triggering(AtWatermark()
                   .withEarlyFirings(AtPeriod(Minutes(1)))
                   .withLateFirings(AtCount(1))))
  .apply(Sum.integersPerKey());

5. Streaming with Retractions
PCollection scores = input
  .apply(Window.into(FixedWindows.of(Minutes(2)))
               .triggering(AtWatermark()
                   .withEarlyFirings(AtPeriod(Minutes(1)))
                   .withLateFirings(AtCount(1)))
               .accumulatingAndRetractingFiredPanes())
  .apply(Sum.integersPerKey());

6. Sessions
PCollection scores = input
  .apply(Window.into(Sessions.withGapDuration(Minutes(2)))
               .triggering(AtWatermark()
                   .withEarlyFirings(AtPeriod(Minutes(1)))
                   .withLateFirings(AtCount(1)))
               .accumulatingAndRetractingFiredPanes())
  .apply(Sum.integersPerKey());

All these use cases, and we never changed our core algorithm. It's just integer summing here, but the same would apply to much more complex algorithms too.

Correctness / Power / Composability / Flexibility / Modularity (What / Where / When / How)

so there you go -- 5 reasons that these 4 questions are awesome

Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines

Workloads vary in pipelines over time.

[Chart: Workload vs. Time.]

Batch pipelines go through stages; streaming pipelines' input varies.

Perils of fixed decisions

[Charts: Workload vs. Time, showing the under-provisioned / average case and the over-provisioned / worst case.]

Ideal case

[Chart: Workload vs. Time.]

Solution: bundles (What / Where / When / How)

class MyDoFn extends DoFn {
  void startBundle(...) { }
  void processElement(...) { }
  void finishBundle(...) { }
}

User code operates on bundles of elements.

Easy parallelization.

Dynamic sizing.

Parallelism decisions are in the runner's hands.
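The bundle lifecycle above can be mimicked in plain Java: per-bundle setup and teardown wrapped around per-element work, with the runner free to choose the bundle size. `BundleRunner` and `trace` are illustrative names, not Beam API:

```java
import java.util.*;

public class BundleRunner {
    // Minimal stand-in for a DoFn: per-bundle hooks around per-element work.
    interface DoFn<T> {
        default void startBundle() {}
        void processElement(T element);
        default void finishBundle() {}
    }

    // Run the DoFn over the elements in bundles of the given size; a real
    // runner picks and adjusts the bundle size dynamically.
    static <T> void run(List<T> elements, int bundleSize, DoFn<T> fn) {
        for (int i = 0; i < elements.size(); i += bundleSize) {
            fn.startBundle();
            for (T e : elements.subList(i, Math.min(i + bundleSize, elements.size())))
                fn.processElement(e);
            fn.finishBundle();
        }
    }

    // Record the lifecycle calls so the bundle structure is visible.
    static List<String> trace(List<Integer> elements, int bundleSize) {
        List<String> out = new ArrayList<>();
        run(elements, bundleSize, new DoFn<Integer>() {
            public void startBundle() { out.add("start"); }
            public void processElement(Integer e) { out.add(e.toString()); }
            public void finishBundle() { out.add("end"); }
        });
        return out;
    }

    public static void main(String[] args) {
        System.out.println(trace(List.of(1, 2, 3, 4, 5), 2));
        // [start, 1, 2, end, start, 3, 4, end, start, 5, end]
    }
}
```

Because the user code only sees bundle boundaries, the runner can make a bundle as small as one element (low latency) or as large as a whole file (low overhead) without any change to the DoFn.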

The Straggler Problem (What / Where / When / How)

Work is unevenly distributed across tasks.

Reasons: the underlying data, and the processing itself.

Effects are multiplied per stage.

[Chart: Worker vs. Time.]

Data: 1 file per task with files of different sizes; a Bigtable key range partitioned lexicographically, assuming a uniform distribution.
Processing: hot shuffle key ranges; data-dependent computation.

Standard workarounds for stragglers (What / Where / When / How)

Split files into equal sizes?
Pre-emptively over-split?
Detect slow workers and re-execute?
Sample extensively and then split?

[Chart: Worker vs. Time.]

Pre-job stage: chunk files into equal sizes. Choice of constant? Does not handle runtime asymmetry.
Pre-emptively over-split. How much is enough? How much is too much? Per-task overheads can dominate.
Detect slow workers and re-execute. Does not handle processing asymmetry.
Sample (maybe extensively) and then split. Overhead, and still does not handle runtime asymmetry.

No amount of upfront heuristic tuning (be it manual or automatic) is enough to guarantee good performance: the system will always hit unpredictable situations at run-time.

A system that's able to dynamically adapt andget out of a bad situation is much more powerfulthan one that heuristically hopes to avoid getting into it.

Solution: Dynamic Work Rebalancing

[Charts: done work, active work, and predicted completion per worker; splitting a straggler's active work at "Now" brings the average completion time forward.]

Solution: Dynamic work rebalancing (What / Where / When / How)

class MyReader extends BoundedReader {
  [...]
  getFractionConsumed() { }
  splitAtFraction(...) { }
}

class MySource extends BoundedSource {
  [...]
  splitIntoBundles() { }
  getEstimatedSizeBytes() { }
}
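The idea behind splitting at a fraction can be sketched on a simple index range: a reader reports how far it has read, and the runner may steal the unread tail for another worker. This is a toy model under assumed semantics, not Beam's actual BoundedReader contract:

```java
public class RangeReader {
    long start, end, position; // reading [start, end), currently at position

    RangeReader(long start, long end) {
        this.start = start;
        this.end = end;
        this.position = start;
    }

    // Fraction of the assigned range consumed so far.
    double getFractionConsumed() {
        return (position - start) / (double) (end - start);
    }

    // Try to give away the part of the range beyond the given fraction.
    // Returns the split point (the new worker reads from there to the old
    // end), or -1 if that point is already behind the current position.
    long splitAtFraction(double fraction) {
        long splitPos = start + Math.round(fraction * (end - start));
        if (splitPos <= position || splitPos >= end) return -1;
        end = splitPos; // this reader keeps [start, splitPos)
        return splitPos;
    }

    public static void main(String[] args) {
        RangeReader r = new RangeReader(0, 100);
        r.position = 40;                             // 40% done
        System.out.println(r.splitAtFraction(0.7));  // 70
        System.out.println(r.end);                   // 70
    }
}
```

The key property is that the split happens while the reader is running: the runner watches progress, predicts stragglers, and carves off their remaining work without any upfront tuning.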

Real world example

400 workers: Read GCS → Parse → GroupByKey → Write

Dynamic bundles + work re-balancing + autoscaling

Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines

Apache Beam Architecture

Write: Choose an SDK to write your pipeline in.

Execute: Choose any runner at execution time.

[Architecture diagram: Beam Java, Beam Python, and other-language SDKs construct pipelines against the Beam Model; the Fn Runners layer executes them on Apache Flink, Apache Spark, and Google Cloud Dataflow.]

Categorizing Runner Capabilities

http://beam.incubator.apache.org/capability-matrix/

The Beam model is attempting to generalize semantics, so it will not align perfectly with all possible runtimes. We've started categorizing the features in the model and the various levels of runner support. This will help users understand mismatches, like using event-time processing in Spark or exactly-once processing with Samza.

Multiple categories of users:

End users, who want to write pipelines or transform libraries in a language that's familiar.

SDK writers, who want to make Beam concepts available in new languages.

Runner writers, who have a distributed processing environment and want to support Beam pipelines.

[Architecture diagram: Beam Java, Beam Python, and other-language SDKs construct pipelines against the Beam Model; the Fn Runners layer executes them on Apache Flink, Apache Spark, and Google Cloud Dataflow.]

We want to fully support three different categories of users. For end users who want to write data processing pipelines, that includes adding value like additional connectors; we've got Kafka! Additionally, we support community-sourced SDKs and runners. Each community has very different sets of goals and needs, and having a vision and reaching it are two different things...

If you have Big Data APIs, write a Beam SDK or DSL or library of transformations.

If you have a distributed processing backend, write a Beam runner!

If you have a data storage or messaging system, write a Beam IO connector!

Growing the Open Source Community

And one of the things we're most excited about is the collaboration opportunities that Beam enables. We've been doing this stuff for a while at Google, in a very hermetic environment, and we're looking forward to incorporating new perspectives to build a truly generalizable solution. We'll be growing the Beam development community over the next few months, whether folks are looking to write transform libraries for end users, new SDKs, or new runners.

Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines

Visions are a Journey

02/01/2016: Enter Apache Incubator
Early 2016: Design for use cases, begin refactoring
Mid 2016: Additional refactoring, non-production uses
Late 2016: Multiple runners execute Beam pipelines
End 2016: Beam pipelines run on many runners in production uses

02/25/2016: 1st commit to ASF repository
06/14/2016: 1st incubating release
June 2016: Python SDK moves to Beam

Beam entered incubation in early February. We quickly did the code donations and began bootstrapping the infrastructure. The initial focus is on stabilizing internal APIs and integrating the additional runners. Part of that is understanding what different runners can do...

Learn More!

Apache Beam (incubating): http://beam.incubator.apache.org
The World Beyond Batch 101 & 102:
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Join the Beam mailing lists! [email protected], @beam.incubator.apache.org
Follow @ApacheBeam on Twitter

Thank you!

Credits?

Extra material


Element-wise transformations

[Timeline: 8:00-14:00, processing time.]

Element-wise transformations work on individual elements: parsing, translating, or filtering, applied as elements flow past. But other transforms, like counting or joining, require combining multiple elements together...

Aggregating via Processing-Time Windows

[Timeline: 8:00-14:00, processing time.]

When doing aggregations, we need to divide the infinite stream of elements into finite-sized chunks that can be processed independently. The simplest way is to use arrival time in fixed time periods, but that can mean elements are processed out of order, and late elements may be aggregated with unrelated elements that arrived at about the same time...

Aggregating via Event-Time Windows

[Diagram: elements shifted from processing-time Input to event-time Output, across windows 10:00-15:00.]

We reorganize data based on when events occurred, not when they arrived. The red element arrived relatively on time and stays in the noon window. The green one arrived at 12:30 but was actually created about 11:30, so it moves up to the 11am window. This requires formalizing the difference between processing time and event time.

Processing Time Results Differ

if we were aggregating based on processing time, this would result in different results for the two orderings.

Identifying Bursts of User Activity (What / Where / When / How)

Now you can see the sessions being built over time. At first we see multiple components in the first session; not until late element 9 comes in do we realize it's one big session.

Correctness / Power / Composability / Flexibility / Modularity (What / Where / When / How)

Next: we've seen what the four questions can do. What if we ask the questions twice?

Calculating Session Lengths (What / Where / When / How)

input
  .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
               .triggering(AtWatermark())
               .discardingFiredPanes())
  .apply(CalculateWindowLength());

code to calculate the length of a user session

What Where When How

Remember that these graphs are always shown per key. Here's the graph calculating session lengths for Frances, and the one for Tyler.

Calculating the Average Session Length (What / Where / When / How)

.apply(Window.into(FixedWindows.of(Minutes(2)))
             .triggering(AtWatermark()
                 .withEarlyFirings(AtPeriod(Minutes(1))))
             .accumulatingFiredPanes())
.apply(Mean.globally());

input
  .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
               .triggering(AtWatermark())
               .discardingFiredPanes())
  .apply(CalculateWindowLength());

Now let's take those session lengths per user and ask the questions again, this time using fixed windows to take the mean across the entire collection...
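The two-stage computation (session lengths, then their mean) can be sketched without the SDK. Here `sessionLengths` and `mean` are illustrative stand-ins for `CalculateWindowLength()` and `Mean.globally()`:

```java
import java.util.*;

public class AverageSessionLength {
    // Stage 1: each session's length is last timestamp minus first
    // (timestamps in minutes, sessions already grouped per user).
    static List<Long> sessionLengths(List<List<Long>> sessions) {
        List<Long> lengths = new ArrayList<>();
        for (List<Long> s : sessions)
            lengths.add(s.get(s.size() - 1) - s.get(0));
        return lengths;
    }

    // Stage 2: mean of the lengths, as Mean.globally() would compute
    // within each fixed window.
    static double mean(List<Long> lengths) {
        long sum = 0;
        for (long l : lengths) sum += l;
        return sum / (double) lengths.size();
    }

    public static void main(String[] args) {
        // Two sessions: one 4 minutes long, one 2 minutes long.
        List<List<Long>> sessions = List.of(List.of(0L, 4L), List.of(10L, 12L));
        System.out.println(mean(sessionLengths(sessions))); // 3.0
    }
}
```

The point of the pipeline version is that each stage carries its own windowing: sessions for the lengths, fixed windows for the mean, so the questions really are asked twice.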

What Where When How

Now we're calculating the average length of all sessions that ended in a given time period. If we rolled out an update to our game, this would let us quickly understand whether it resulted in a change in user behavior; if the change made the game less fun, we could see a sudden drop in how long users play.