Hourglass: a Library for Incremental Processing on Hadoop IEEE BigData 2013 October 9th Matthew Hayes ©2013 LinkedIn Corporation. All Rights Reserved.

Hourglass: a Library for Incremental Processing on Hadoop


DESCRIPTION

Slides from my talk at IEEE BigData 2013 presenting our paper "Hourglass: a Library for Incremental Processing on Hadoop" Abstract: Hadoop enables processing of large data sets through its relatively easy-to-use semantics. However, jobs are often written inefficiently for tasks that could be computed incrementally due to the burdensome incremental state management for the programmer. This paper introduces Hourglass, a library for developing incremental monoid computations on Hadoop. It runs on unmodified Hadoop and provides an accumulator-based interface for programmers to store and use state across successive runs; the framework ensures that only the necessary subcomputations are performed. It is successfully used at LinkedIn, one of the largest online social networks, for many use cases in dashboarding and machine learning. Hourglass is open source and freely available.


Page 1: Hourglass: a Library for Incremental Processing on Hadoop


Hourglass: a Library for Incremental Processing on Hadoop
IEEE BigData 2013
October 9th
Matthew Hayes

Page 2: Hourglass: a Library for Incremental Processing on Hadoop


Matthew Hayes
Staff Software Engineer
www.linkedin.com/in/matthewterencehayes/

• 3+ Years on Applied Data Team at LinkedIn

• Skills

• Endorsements

• DataFu

• White Elephant

Page 3: Hourglass: a Library for Incremental Processing on Hadoop


Agenda

• Motivation
• Design
• Experiments
• Q&A

Page 4: Hourglass: a Library for Incremental Processing on Hadoop


Motivation

Page 5: Hourglass: a Library for Incremental Processing on Hadoop


Event Collection in an Online System

Typically online websites have instrumented services that collect events

Events stored in an offline system (such as Hadoop) for later analysis

Using events, can build dashboards with metrics such as:

– # of page views over last month
– # of active users over last month

Metrics derived from events can also be useful in recommendation pipelines

– e.g. impression discounting

Page 6: Hourglass: a Library for Incremental Processing on Hadoop


Event Storage

Events can be categorized into topics, for example:
– page view
– user login
– ad impression/click

Store events by topic and by day:
– /data/page_view/daily/2013/10/08
– /data/page_view/daily/2013/10/09
– ...
– /data/ad_click/daily/2013/10/08

Now can perform computation over specific time windows
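The date-partitioned layout above makes window selection mechanical: a job just enumerates one path per day in its window. A minimal Python sketch (Hourglass itself is Java; the layout follows the example paths on this slide):

```python
from datetime import date, timedelta

def daily_paths(topic, start, end):
    """Enumerate daily partition paths for [start, end], inclusive."""
    paths = []
    d = start
    while d <= end:
        paths.append("/data/%s/daily/%04d/%02d/%02d"
                     % (topic, d.year, d.month, d.day))
        d += timedelta(days=1)
    return paths

assert daily_paths("page_view", date(2013, 10, 8), date(2013, 10, 9)) == [
    "/data/page_view/daily/2013/10/08",
    "/data/page_view/daily/2013/10/09",
]
```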

Page 7: Hourglass: a Library for Incremental Processing on Hadoop


Computation Over Time Windows

In practice, many of our computations over time windows use either:
– fixed-start windows (start day fixed, end day advances)
– fixed-length windows (a fixed number of trailing days)

Page 8: Hourglass: a Library for Incremental Processing on Hadoop


Recognizing Inefficiencies

But, typically jobs compute these daily
From one day to the next, input changes little
Fixed-start window includes one new day:

Page 9: Hourglass: a Library for Incremental Processing on Hadoop


Recognizing Inefficiencies

Fixed-length window includes one new day, minus oldest day

Page 10: Hourglass: a Library for Incremental Processing on Hadoop


Recognizing Inefficiencies

Repeatedly processing same input data
This wastes cluster resources
Better to process new data only
How can we do better?

Page 11: Hourglass: a Library for Incremental Processing on Hadoop


Hourglass Design

Page 12: Hourglass: a Library for Incremental Processing on Hadoop


Design Goals

Address use cases:
– Fixed-start and fixed-length window computations
– Daily partitioned data

Reduce resource usage
Reduce wall clock time
Run on standard Hadoop

Page 13: Hourglass: a Library for Incremental Processing on Hadoop


Improving Fixed-Start Computations

Suppose we must compute page view counts per member
The job consumes all days of available input, producing one output. We call this a partition-collapsing job.
But, if the job runs tomorrow it has to reprocess the same data.

Page 14: Hourglass: a Library for Incremental Processing on Hadoop


Improving Fixed-Start Computations

Solution: Merge new data with previous output
We can do this because this is an arithmetic operation
Hourglass provides a partition-collapsing job that supports output reuse.
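The arithmetic at the heart of the fixed-start case can be shown in a few lines of conceptual Python (Hourglass itself is Java): because per-member counts are additive, merging yesterday's output with the new day gives the same result as recounting all days.

```python
from collections import Counter

def collapse(days):
    """Non-incremental: count page views per member over all days."""
    total = Counter()
    for day in days:
        total.update(day)
    return total

def incremental(prev_output, new_day):
    """Incremental: merge the previous output with only the new day."""
    return prev_output + Counter(new_day)

days = [{"alice": 2, "bob": 1}, {"alice": 1}]
prev = collapse(days)
new_day = {"bob": 3}
# Same answer, but the incremental path reads only one day of input.
assert incremental(prev, new_day) == collapse(days + [new_day])
```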

Page 15: Hourglass: a Library for Incremental Processing on Hadoop


Partition-Collapsing Job Architecture (Fixed-Start)

When applied to a fixed-start window computation:

Page 16: Hourglass: a Library for Incremental Processing on Hadoop


Improving Fixed-Length Computations

For a fixed-length job, can reuse output using a similar trick:
– Add new day to previous output
– Subtract old day from result

We can subtract the old day since this is arithmetic
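The add-new/subtract-old trick can be sketched the same way (conceptual Python, not the Java API): advancing a fixed-length window touches only two days of data, regardless of the window length.

```python
from collections import Counter

def slide(prev_output, new_day, oldest_day):
    """Advance a fixed-length window: add the new day, then subtract the
    day that fell out of the window. Valid because counting is arithmetic."""
    result = prev_output + Counter(new_day)
    result.subtract(oldest_day)          # may leave zero entries
    return Counter({k: v for k, v in result.items() if v > 0})

window = [{"alice": 2}, {"alice": 1, "bob": 1}, {"bob": 2}]
prev = sum((Counter(d) for d in window), Counter())   # 3-day window total
out = slide(prev, new_day={"carol": 1}, oldest_day=window[0])
assert out == Counter({"bob": 3, "alice": 1, "carol": 1})
```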

Page 17: Hourglass: a Library for Incremental Processing on Hadoop


Partition-Collapsing Job Architecture (Fixed-Length)

When applied to a fixed-length window computation:

Page 18: Hourglass: a Library for Incremental Processing on Hadoop


Improving Fixed-Length Computations

But, for some operations, cannot subtract old data
– example: max(), min()

Cannot reuse previous output, so how do we reduce computation?
Solution: partition-preserving job
Partitioned input data, partitioned output data
Essentially: aggregate the data in advance
Aggregating in advance can be useful even when you can reuse output
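A conceptual Python sketch of the partition-preserving idea for a non-subtractable operation like max(): the per-day aggregates are computed once, and each window run then reduces only those small daily aggregates rather than the raw events.

```python
def daily_max(events_by_day):
    """Partition-preserving step: one aggregate per input day,
    computed once and reused by every subsequent window."""
    return [max(day) for day in events_by_day]

def window_max(daily_aggs, start, length):
    """Collapse a window by reducing daily aggregates, not raw events."""
    return max(daily_aggs[start:start + length])

events = [[3, 7], [1, 2], [9, 4], [5]]
aggs = daily_max(events)
assert window_max(aggs, 0, 3) == 9   # window over days 0-2
assert window_max(aggs, 1, 3) == 9   # sliding by one day reuses the aggregates
```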

Page 19: Hourglass: a Library for Incremental Processing on Hadoop


Partition-Preserving Job Architecture

Page 20: Hourglass: a Library for Incremental Processing on Hadoop


MapReduce in Hourglass

MapReduce is a fairly general programming model
Hourglass requires:
– reduce() must output a (key, value) pair
– reduce() must produce at most one value
– reduce() implemented by an accumulator
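The accumulator restriction can be illustrated in conceptual Python (Hourglass itself exposes a Java interface; the names below just mirror the slide): values are folded in one at a time, and at most one value is emitted per key.

```python
class CountAccumulator:
    """Accumulator-style reduce: fold values one at a time,
    then emit at most one value via get_value()."""
    def __init__(self):
        self.count = 0

    def accumulate(self, value):
        self.count += value

    def get_value(self):
        return self.count

def reduce_with_accumulator(key, values, make_accumulator):
    """Frame-side driver: the framework feeds each value to the
    accumulator and emits exactly one (key, value) pair."""
    acc = make_accumulator()
    for v in values:
        acc.accumulate(v)
    return (key, acc.get_value())

assert reduce_with_accumulator("alice", [1, 2, 3], CountAccumulator) == ("alice", 6)
```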

Page 21: Hourglass: a Library for Incremental Processing on Hadoop


Building Blocks

Two types of jobs:
– Partition-preserving: consume partitioned input data, produce partitioned output data
– Partition-collapsing: consume partitioned input data, produce single output

Must provide to jobs:
– Input and output paths
– Desired time range

Must implement:
– map()
– accumulate()

May implement if necessary:
– merge()
– unmerge()
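The four user-supplied pieces fit together as in this conceptual Python sketch of a per-member page-view count (method names mirror the slide, not the actual Java API):

```python
def map_event(event):
    """map(): emit (member_id, 1) for each page-view event."""
    return (event["member_id"], 1)

def accumulate(partial, value):
    """accumulate(): fold one mapped value into the running aggregate."""
    return partial + value

def merge(prev_output, new_partial):
    """merge(): combine previous output with newly aggregated data,
    enabling output reuse in a partition-collapsing job."""
    return prev_output + new_partial

def unmerge(output, old_partial):
    """unmerge(): subtract data that fell out of a fixed-length window."""
    return output - old_partial

# One member: previous window total 10, new day adds 4, oldest day had 3.
assert unmerge(merge(10, 4), 3) == 11
```

For a count, merge() and unmerge() are just addition and subtraction; for non-invertible aggregates such as max(), unmerge() cannot be implemented, which is exactly when the partition-preserving job is needed instead.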

Page 22: Hourglass: a Library for Incremental Processing on Hadoop


Experiments

Page 23: Hourglass: a Library for Incremental Processing on Hadoop


Metrics for Evaluation

Wall clock time
– Amount of time that elapses until the job completes

Total task time
– Sum of execution times for all tasks
– Represents usage of cluster resources

Compare each against baseline non-incremental job

Page 24: Hourglass: a Library for Incremental Processing on Hadoop


Experiment: Page Views per Member

Goal: Count page views per member over last n days
Chain partition-preserving and partition-collapsing
Can reuse previous output:

Page 25: Hourglass: a Library for Incremental Processing on Hadoop


Experiment: Page Views per Member

Page 26: Hourglass: a Library for Incremental Processing on Hadoop


Member Count Estimation

Goal: Estimate number of members visiting site over past n days
Use HyperLogLog cardinality estimation (trades space for accuracy)
Can't reuse output, but with partition-preserving can save state:
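A toy illustration of why HyperLogLog fits the partition-preserving pattern (conceptual Python; real HLL registers hold rank estimates, not these small integers): sketches union by taking the element-wise max of their registers, so per-day sketches can be merged, but max() is not invertible, so old days can never be subtracted back out.

```python
def merge_registers(a, b):
    """HyperLogLog-style union: element-wise max of two register arrays.
    merge() exists, so per-day sketches saved by a partition-preserving
    job can be combined into any window."""
    return [max(x, y) for x, y in zip(a, b)]

day1 = [2, 0, 5, 1]
day2 = [1, 3, 5, 0]
union = merge_registers(day1, day2)
assert union == [2, 3, 5, 1]
# unmerge() is impossible: given only `union` and `day1`, register 2
# (value 5) cannot tell whether day2 also contained a 5.
```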

Page 27: Hourglass: a Library for Incremental Processing on Hadoop


Member Count Estimation: Results

Page 28: Hourglass: a Library for Incremental Processing on Hadoop


Conclusion

Computations over sliding windows are quite common
Implementations are typically inefficient
Incrementalizing Hadoop jobs can in some cases yield:

– 95-98% reductions in total task time
– 20-40% reductions in wall clock time

Page 29: Hourglass: a Library for Incremental Processing on Hadoop


Learning More

datafu.org