Hourglass: a Library for Incremental Processing on Hadoop IEEE BigData 2013 October 9th Matthew Hayes ©2013 LinkedIn Corporation. All Rights Reserved.

Hourglass: a Library for Incremental Processing on Hadoop


DESCRIPTION

Slides from my talk at IEEE BigData 2013 presenting our paper "Hourglass: a Library for Incremental Processing on Hadoop" Abstract: Hadoop enables processing of large data sets through its relatively easy-to-use semantics. However, jobs are often written inefficiently for tasks that could be computed incrementally due to the burdensome incremental state management for the programmer. This paper introduces Hourglass, a library for developing incremental monoid computations on Hadoop. It runs on unmodified Hadoop and provides an accumulator-based interface for programmers to store and use state across successive runs; the framework ensures that only the necessary subcomputations are performed. It is successfully used at LinkedIn, one of the largest online social networks, for many use cases in dashboarding and machine learning. Hourglass is open source and freely available.


Page 1: Hourglass: a Library for Incremental Processing on Hadoop


Hourglass: a Library for Incremental Processing on Hadoop
IEEE BigData 2013
October 9th
Matthew Hayes

Page 2: Hourglass: a Library for Incremental Processing on Hadoop


Matthew Hayes
Staff Software Engineer
www.linkedin.com/in/matthewterencehayes/

• 3+ Years on Applied Data Team at LinkedIn

• Skills

• Endorsements

• DataFu

• White Elephant

Page 3: Hourglass: a Library for Incremental Processing on Hadoop


Agenda

• Motivation
• Design
• Experiments
• Q&A

Page 4: Hourglass: a Library for Incremental Processing on Hadoop


Motivation

Page 5: Hourglass: a Library for Incremental Processing on Hadoop


Event Collection in an Online System

Typically online websites have instrumented services that collect events

Events stored in an offline system (such as Hadoop) for later analysis

Using events, can build dashboards with metrics such as:

– # of page views over last month
– # of active users over last month

Metrics derived from events can also be useful in recommendation pipelines

– e.g. impression discounting

Page 6: Hourglass: a Library for Incremental Processing on Hadoop


Event Storage

Events can be categorized into topics, for example:
– page view
– user login
– ad impression/click

Store events by topic and by day:
– /data/page_view/daily/2013/10/08
– /data/page_view/daily/2013/10/09
– ...
– /data/ad_click/daily/2013/10/08

Now can perform computation over specific time windows
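The date-partitioned layout above makes window selection mechanical: a job just enumerates one path per day in its window. A minimal Python sketch (Hourglass itself is Java; the layout follows the example paths on this slide):

```python
from datetime import date, timedelta

def daily_paths(topic, start, end):
    """Enumerate daily partition paths for [start, end], inclusive."""
    paths = []
    d = start
    while d <= end:
        paths.append("/data/%s/daily/%04d/%02d/%02d"
                     % (topic, d.year, d.month, d.day))
        d += timedelta(days=1)
    return paths

assert daily_paths("page_view", date(2013, 10, 8), date(2013, 10, 9)) == [
    "/data/page_view/daily/2013/10/08",
    "/data/page_view/daily/2013/10/09",
]
```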

Page 7: Hourglass: a Library for Incremental Processing on Hadoop


Computation Over Time Windows

In practice, many of our computations over time windows use either:
– fixed-start windows (start day fixed, end day advances)
– fixed-length windows (a fixed number of trailing days)

Page 8: Hourglass: a Library for Incremental Processing on Hadoop


Recognizing Inefficiencies

But, typically jobs compute these daily
From one day to the next, input changes little
Fixed-start window includes one new day:

Page 9: Hourglass: a Library for Incremental Processing on Hadoop


Recognizing Inefficiencies

Fixed-length window includes one new day, minus oldest day

Page 10: Hourglass: a Library for Incremental Processing on Hadoop


Recognizing Inefficiencies

Repeatedly processing same input data
This wastes cluster resources
Better to process new data only
How can we do better?

Page 11: Hourglass: a Library for Incremental Processing on Hadoop


Hourglass Design

Page 12: Hourglass: a Library for Incremental Processing on Hadoop


Design Goals

Address use cases:
– Fixed-start and fixed-length window computations
– Daily partitioned data

Reduce resource usage
Reduce wall clock time
Run on standard Hadoop

Page 13: Hourglass: a Library for Incremental Processing on Hadoop


Improving Fixed-Start Computations

Suppose we must compute page view counts per member
The job consumes all days of available input, producing one output. We call this a partition-collapsing job.
But, if the job runs tomorrow it has to reprocess the same data.

Page 14: Hourglass: a Library for Incremental Processing on Hadoop


Improving Fixed-Start Computations

Solution: Merge new data with previous output
We can do this because this is an arithmetic operation
Hourglass provides a partition-collapsing job that supports output reuse.
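The arithmetic at the heart of the fixed-start case can be shown in a few lines of conceptual Python (Hourglass itself is Java): because per-member counts are additive, merging yesterday's output with the new day gives the same result as recounting all days.

```python
from collections import Counter

def collapse(days):
    """Non-incremental: count page views per member over all days."""
    total = Counter()
    for day in days:
        total.update(day)
    return total

def incremental(prev_output, new_day):
    """Incremental: merge the previous output with only the new day."""
    return prev_output + Counter(new_day)

days = [{"alice": 2, "bob": 1}, {"alice": 1}]
prev = collapse(days)
new_day = {"bob": 3}
# Same answer, but the incremental path reads only one day of input.
assert incremental(prev, new_day) == collapse(days + [new_day])
```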

Page 15: Hourglass: a Library for Incremental Processing on Hadoop


Partition-Collapsing Job Architecture (Fixed-Start)

When applied to a fixed-start window computation:

Page 16: Hourglass: a Library for Incremental Processing on Hadoop


Improving Fixed-Length Computations

For a fixed-length job, can reuse output using a similar trick:
– Add new day to previous output
– Subtract old day from result

We can subtract the old day since this is arithmetic
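The add-new/subtract-old trick can be sketched the same way (conceptual Python, not the Java API): advancing a fixed-length window touches only two days of data, regardless of the window length.

```python
from collections import Counter

def slide(prev_output, new_day, oldest_day):
    """Advance a fixed-length window: add the new day, then subtract the
    day that fell out of the window. Valid because counting is arithmetic."""
    result = prev_output + Counter(new_day)
    result.subtract(oldest_day)          # may leave zero entries
    return Counter({k: v for k, v in result.items() if v > 0})

window = [{"alice": 2}, {"alice": 1, "bob": 1}, {"bob": 2}]
prev = sum((Counter(d) for d in window), Counter())   # 3-day window total
out = slide(prev, new_day={"carol": 1}, oldest_day=window[0])
assert out == Counter({"bob": 3, "alice": 1, "carol": 1})
```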

Page 17: Hourglass: a Library for Incremental Processing on Hadoop


Partition-Collapsing Job Architecture (Fixed-Length)

When applied to a fixed-length window computation:

Page 18: Hourglass: a Library for Incremental Processing on Hadoop


Improving Fixed-Length Computations

But, for some operations, cannot subtract old data
– example: max(), min()

Cannot reuse previous output, so how do we reduce computation?
Solution: partition-preserving job
Partitioned input data, partitioned output data
Essentially: aggregate the data in advance
Aggregating in advance can be useful even when you can reuse output
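A conceptual Python sketch of the partition-preserving idea for a non-subtractable operation like max(): the per-day aggregates are computed once, and each window run then reduces only those small daily aggregates rather than the raw events.

```python
def daily_max(events_by_day):
    """Partition-preserving step: one aggregate per input day,
    computed once and reused by every subsequent window."""
    return [max(day) for day in events_by_day]

def window_max(daily_aggs, start, length):
    """Collapse a window by reducing daily aggregates, not raw events."""
    return max(daily_aggs[start:start + length])

events = [[3, 7], [1, 2], [9, 4], [5]]
aggs = daily_max(events)
assert window_max(aggs, 0, 3) == 9   # window over days 0-2
assert window_max(aggs, 1, 3) == 9   # sliding by one day reuses the aggregates
```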

Page 19: Hourglass: a Library for Incremental Processing on Hadoop


Partition-Preserving Job Architecture

Page 20: Hourglass: a Library for Incremental Processing on Hadoop


MapReduce in Hourglass

MapReduce is a fairly general programming model
Hourglass requires:
– reduce() must output a (key, value) pair
– reduce() must produce at most one value
– reduce() implemented by an accumulator
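The accumulator restriction can be illustrated in conceptual Python (Hourglass itself exposes a Java interface; the names below just mirror the slide): values are folded in one at a time, and at most one value is emitted per key.

```python
class CountAccumulator:
    """Accumulator-style reduce: fold values one at a time,
    then emit at most one value via get_value()."""
    def __init__(self):
        self.count = 0

    def accumulate(self, value):
        self.count += value

    def get_value(self):
        return self.count

def reduce_with_accumulator(key, values, make_accumulator):
    """Frame-side driver: the framework feeds each value to the
    accumulator and emits exactly one (key, value) pair."""
    acc = make_accumulator()
    for v in values:
        acc.accumulate(v)
    return (key, acc.get_value())

assert reduce_with_accumulator("alice", [1, 2, 3], CountAccumulator) == ("alice", 6)
```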

Page 21: Hourglass: a Library for Incremental Processing on Hadoop


Building Blocks

Two types of jobs:
– Partition-preserving: consume partitioned input data, produce partitioned output data
– Partition-collapsing: consume partitioned input data, produce single output

Must provide to jobs:
– Input and output paths
– Desired time range

Must implement:
– map()
– accumulate()

May implement if necessary:
– merge()
– unmerge()
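The four user-supplied pieces fit together as in this conceptual Python sketch of a per-member page-view count (method names mirror the slide, not the actual Java API):

```python
def map_event(event):
    """map(): emit (member_id, 1) for each page-view event."""
    return (event["member_id"], 1)

def accumulate(partial, value):
    """accumulate(): fold one mapped value into the running aggregate."""
    return partial + value

def merge(prev_output, new_partial):
    """merge(): combine previous output with newly aggregated data,
    enabling output reuse in a partition-collapsing job."""
    return prev_output + new_partial

def unmerge(output, old_partial):
    """unmerge(): subtract data that fell out of a fixed-length window."""
    return output - old_partial

# One member: previous window total 10, new day adds 4, oldest day had 3.
assert unmerge(merge(10, 4), 3) == 11
```

For a count, merge() and unmerge() are just addition and subtraction; for non-invertible aggregates such as max(), unmerge() cannot be implemented, which is exactly when the partition-preserving job is needed instead.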

Page 22: Hourglass: a Library for Incremental Processing on Hadoop


Experiments

Page 23: Hourglass: a Library for Incremental Processing on Hadoop


Metrics for Evaluation

Wall clock time
– Amount of time that elapses until the job completes

Total task time
– Sum of execution times for all tasks
– Represents usage of cluster resources

Compare each against baseline non-incremental job

Page 24: Hourglass: a Library for Incremental Processing on Hadoop


Experiment: Page Views per Member

Goal: Count page views per member over last n days
Chain partition-preserving and partition-collapsing
Can reuse previous output:

Page 25: Hourglass: a Library for Incremental Processing on Hadoop


Experiment: Page Views per Member

Page 26: Hourglass: a Library for Incremental Processing on Hadoop


Member Count Estimation

Goal: Estimate number of members visiting site over past n days
Use HyperLogLog cardinality estimation (trades space for accuracy)
Can't reuse output, but with partition-preserving can save state:
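A toy illustration of why HyperLogLog fits the partition-preserving pattern (conceptual Python; real HLL registers hold rank estimates, not these small integers): sketches union by taking the element-wise max of their registers, so per-day sketches can be merged, but max() is not invertible, so old days can never be subtracted back out.

```python
def merge_registers(a, b):
    """HyperLogLog-style union: element-wise max of two register arrays.
    merge() exists, so per-day sketches saved by a partition-preserving
    job can be combined into any window."""
    return [max(x, y) for x, y in zip(a, b)]

day1 = [2, 0, 5, 1]
day2 = [1, 3, 5, 0]
union = merge_registers(day1, day2)
assert union == [2, 3, 5, 1]
# unmerge() is impossible: given only `union` and `day1`, register 2
# (value 5) cannot tell whether day2 also contained a 5.
```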

Page 27: Hourglass: a Library for Incremental Processing on Hadoop


Member Count Estimation: Results

Page 28: Hourglass: a Library for Incremental Processing on Hadoop


Conclusion

Computations over sliding windows are quite common
Implementations are typically inefficient
Incrementalizing Hadoop jobs can in some cases yield:

– 95-98% reductions in total task time
– 20-40% reductions in wall clock time

Page 29: Hourglass: a Library for Incremental Processing on Hadoop


Learning More

datafu.org