Cascading Webinar

Learn from HomeAway Hadoop Development and Operations Best Practices

Page 1: Learn from HomeAway Hadoop Development and Operations Best Practices

Cascading Webinar

Page 2: Learn from HomeAway Hadoop Development and Operations Best Practices

HomeAway

The world leader for vacation rentals

Over a million listings worldwide and growing!

Page 3: Learn from HomeAway Hadoop Development and Operations Best Practices

Hadoop is changing

You …

• Need faster ROI
• Need compelling use cases
• Need more with less
• Need to leverage existing talent

Page 4: Learn from HomeAway Hadoop Development and Operations Best Practices

Harnessing the power of Hadoop MapReduce

MapReduce divides a big problem into smaller problems, then assembles the smaller answers into the answer to the bigger problem.

MapReduce
• Can be hard to learn
• Verbose; tedious
• Historically slow

New engine options
• Apache Tez
• Apache Spark
• Apache Flink

Page 5: Learn from HomeAway Hadoop Development and Operations Best Practices

Problem at HomeAway

Cascading

Page 6: Learn from HomeAway Hadoop Development and Operations Best Practices

Speaker Panel

• Austin Tobin - Software Engineer

File Storage Quotas :: Introduction to Cascading

• Michael McAllister - Staff Data Warehouse Engineer

Supplier Analytics :: Phoenix, HBase and Driven

• Francois Forster - Architect

User Analytics :: A/B Test Readouts

Page 7: Learn from HomeAway Hadoop Development and Operations Best Practices

File Storage Quotas :: Introduction to Cascading

© Copyright 2015 HomeAway, Inc.

Page 8: Learn from HomeAway Hadoop Development and Operations Best Practices

Introduction

1. What are we trying to solve?

2. What is Cascading?

3. How we applied Cascading to solve this problem

Page 9: Learn from HomeAway Hadoop Development and Operations Best Practices

What is Mesa? What is the problem with Mesa?

• Mesa is an internal file system
• Divided up into buckets; each bucket has a quota
• Each bucket maintains a statistics file, locked on write and delete
• As usage increases, this locking creates performance bottlenecks

Page 10: Learn from HomeAway Hadoop Development and Operations Best Practices

Key Technologies

• Kafka
  – High-performance messaging technology
  – Used to insert a high volume of consistent log messages very quickly

• Avro
  – Compressible file format; binary and highly portable

• Hadoop
  – Distributed file store and processing framework
  – Enables near-infinite horizontal scalability for storage and processing

• Cascading...

Page 11: Learn from HomeAway Hadoop Development and Operations Best Practices

Cascading

• Taps
  – Taps can be either sources or sinks
  – Sources are data inputs, and sinks are data outputs
  – They require a scheme, which is a set of column names (the fields of each tuple), and a text delimiter
  – The sink of one flow can be the source of another flow

• Pipes
  – Abstractions to perform functions or transformations
  – Functions include split, merge, expression, and filter
  – The output of one pipe may feed another pipe, so pipes chain together to perform sequences of transformations

• Flows
  – Connect sources to sinks via pipes into a flow
  – Multiple flows can be connected together into a CASCADE
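To make the tap / pipe / flow vocabulary concrete, here is a minimal plain-Java analogy. It deliberately does not use the Cascading API: `FlowAnalogy`, `Pipe`, `filter`, `each`, and `runFlow` are hypothetical names invented for this sketch, and rows are modeled as simple string arrays. It only illustrates the shape of the idea: a source feeds a chain of pipes, the result lands in a sink, and that sink can serve as the source of the next flow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class FlowAnalogy {
    /** A "pipe" in this analogy: a function from rows to rows (not the Cascading Pipe class). */
    public interface Pipe extends Function<List<String[]>, List<String[]>> {}

    /** Keeps only the rows that match the predicate, like a filter step. */
    public static Pipe filter(Predicate<String[]> keep) {
        return rows -> rows.stream().filter(keep).collect(Collectors.toList());
    }

    /** Transforms every row, like a function applied to each tuple. */
    public static Pipe each(Function<String[], String[]> fn) {
        return rows -> rows.stream().map(fn).collect(Collectors.toList());
    }

    /** A "flow": read the source, apply the pipes in order, append the result to the sink. */
    public static void runFlow(List<String[]> source, List<Pipe> pipes, List<String[]> sink) {
        List<String[]> current = source;
        for (Pipe pipe : pipes) {
            current = pipe.apply(current);
        }
        sink.addAll(current);
    }

    public static void main(String[] args) {
        // Hypothetical rows: application name, action, size.
        List<String[]> events = new ArrayList<>();
        events.add(new String[] {"mesa", "put", "42"});
        events.add(new String[] {"other", "get", "7"});

        // First flow: keep only rows whose first column is "mesa".
        List<String[]> cleaned = new ArrayList<>();
        runFlow(events, Arrays.asList(filter(row -> row[0].equals("mesa"))), cleaned);

        // Second flow: the sink of the first flow is the source of the second.
        List<String[]> sizes = new ArrayList<>();
        runFlow(cleaned, Arrays.asList(each(row -> new String[] {row[2]})), sizes);

        for (String[] row : sizes) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Running the two flows back to back is the analogy's version of a cascade: the second flow depends on the first only through the data the first one wrote.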

Page 12: Learn from HomeAway Hadoop Development and Operations Best Practices

Cascading

The Cascading Archetype is a project that makes it very easy to get started with Cascading applications. It is currently an internal project, which uses Spring to make defining taps and flows very easy.

1. Define your Taps
2. Build your Flows
3. Cascade!

Cascading Archetype

Page 13: Learn from HomeAway Hadoop Development and Operations Best Practices

Mesa Stats - The Big Picture

[Diagram labels: Mesa, Mesa Metadata, Log Events, Hadoop, Mesa Stats Job. Old Catalog + Log Events → New Catalog + Statistics]

Page 14: Learn from HomeAway Hadoop Development and Operations Best Practices

Flow Def - Create the New Catalog

[Diagram labels: OLD CATALOG TAP, EVENT TAP, Clean Events Pipe, Build New Catalog Pipe, NEW CATALOG SINK]

Page 15: Learn from HomeAway Hadoop Development and Operations Best Practices

Cascading

Old Catalog Tap

Page 16: Learn from HomeAway Hadoop Development and Operations Best Practices

Pipe - Clean the Events

1. Filter Non Mesa Events
2. Split the Message Field into multiple Fields
3. Remove Extraneous Fields
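The three cleaning steps can be sketched in plain Java. This is only an illustration of the logic, not the Cascading pipe from the talk: the record layout is an assumption (a hypothetical `app` / `host` / `message` raw event, with a pipe-delimited message of the form `action|bucket|path|size|timestamp`), and the names `CleanEvents`, `RawEvent`, and `MesaEvent` are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class CleanEvents {
    /** Raw log event; the field names are hypothetical. */
    public static final class RawEvent {
        public final String app;
        public final String host;
        public final String message;

        public RawEvent(String app, String host, String message) {
            this.app = app;
            this.host = host;
            this.message = message;
        }
    }

    /** Cleaned event holding only the fields the later steps need. */
    public static final class MesaEvent {
        public final String action;
        public final String bucket;
        public final String path;
        public final long size;
        public final long timestamp;

        public MesaEvent(String action, String bucket, String path, long size, long timestamp) {
            this.action = action;
            this.bucket = bucket;
            this.path = path;
            this.size = size;
            this.timestamp = timestamp;
        }
    }

    public static List<MesaEvent> clean(List<RawEvent> raw) {
        List<MesaEvent> cleaned = new ArrayList<>();
        for (RawEvent event : raw) {
            // Step 1: filter out events that did not come from Mesa.
            if (!"mesa".equals(event.app)) {
                continue;
            }
            // Step 2: split the single message field into multiple fields.
            String[] parts = event.message.split("\\|");
            if (parts.length != 5) {
                continue; // skip messages that do not match the assumed layout
            }
            // Step 3: drop extraneous fields (app and host are not carried over).
            cleaned.add(new MesaEvent(parts[0], parts[1], parts[2],
                    Long.parseLong(parts[3]), Long.parseLong(parts[4])));
        }
        return cleaned;
    }

    public static void main(String[] args) {
        List<RawEvent> raw = new ArrayList<>();
        raw.add(new RawEvent("mesa", "host-1", "PUT|photos|a.jpg|1024|1000"));
        raw.add(new RawEvent("web", "host-2", "GET /index.html"));
        for (MesaEvent e : clean(raw)) {
            System.out.println(e.action + " " + e.bucket + "/" + e.path + " " + e.size);
        }
    }
}
```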

Page 17: Learn from HomeAway Hadoop Development and Operations Best Practices

Cascading

Pipe - Clean the Events

Page 18: Learn from HomeAway Hadoop Development and Operations Best Practices

Pipe - Build the New Catalog

Inputs: Cleaned Event Pipe, Catalog Pipe

• Sort Events by Latest Desc
• Take Top 1 Event
• Remove Deleted Events
• Merge Events With Catalog Pipe
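As a rough plain-Java sketch of the catalog update (not the Cascading code shown in the talk), the steps can be read as: keep only the latest event per file, let that event override the old catalog entry, and drop the file when its latest event is a delete. The exact ordering of the steps in the real pipe is not recoverable from the slide, so that reading is an assumption, as are the `NewCatalog` and `Event` names, the `"DELETE"` action value, and the file-to-size catalog shape.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NewCatalog {
    /** Cleaned event; the field names and the "DELETE" action value are hypothetical. */
    public static final class Event {
        public final String file;
        public final String action;
        public final long size;
        public final long timestamp;

        public Event(String file, String action, long size, long timestamp) {
            this.file = file;
            this.action = action;
            this.size = size;
            this.timestamp = timestamp;
        }
    }

    /** Builds the new catalog (file -> size) from the old catalog and the cleaned events. */
    public static Map<String, Long> build(Map<String, Long> oldCatalog, List<Event> events) {
        // Equivalent to sorting each file's events by timestamp descending and taking the top one.
        Map<String, Event> latest = new HashMap<>();
        for (Event event : events) {
            Event seen = latest.get(event.file);
            if (seen == null || event.timestamp > seen.timestamp) {
                latest.put(event.file, event);
            }
        }
        // Merge with the old catalog: the latest event wins over the old entry.
        Map<String, Long> result = new HashMap<>(oldCatalog);
        for (Event event : latest.values()) {
            if ("DELETE".equals(event.action)) {
                result.remove(event.file); // deleted files do not appear in the new catalog
            } else {
                result.put(event.file, event.size);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> old = new HashMap<>();
        old.put("photos/a.jpg", 100L);
        old.put("photos/b.jpg", 200L);

        List<Event> events = new ArrayList<>();
        events.add(new Event("photos/a.jpg", "PUT", 150L, 10L));
        events.add(new Event("photos/b.jpg", "DELETE", 0L, 11L));

        System.out.println(build(old, events));
    }
}
```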

Page 19: Learn from HomeAway Hadoop Development and Operations Best Practices

Cascading

Pipe - Build the Catalog

Page 20: Learn from HomeAway Hadoop Development and Operations Best Practices

Cascading

Update Catalog Flow Def - Revisited

Page 21: Learn from HomeAway Hadoop Development and Operations Best Practices

Flow Def - Calculate the Statistics

Sources: NEW CATALOG TAP, MESA QUOTA TAP

• Sum File Sizes Per Bucket
• Merge on Bucket Names
• Divide Bucket File Sizes By Quota

Sink: STATISTICS SINK
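The arithmetic behind the statistics flow is small enough to show directly. The sketch below is plain Java rather than Cascading, and its names (`BucketStats`, `FileEntry`, `usage`) and data shapes are assumptions: the catalog is a list of (bucket, size) entries, quotas are a bucket-to-bytes map, and the output is the fraction of each quota in use. It assumes the merge is driven by the quota list, so a bucket with a quota but no files reports 0.0.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BucketStats {
    /** One catalog entry; the field names are hypothetical. */
    public static final class FileEntry {
        public final String bucket;
        public final long sizeBytes;

        public FileEntry(String bucket, long sizeBytes) {
            this.bucket = bucket;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Returns bucket -> fraction of quota used (1.0 means the bucket is exactly full). */
    public static Map<String, Double> usage(List<FileEntry> catalog, Map<String, Long> quotas) {
        // Sum file sizes per bucket.
        Map<String, Long> usedBytes = new HashMap<>();
        for (FileEntry entry : catalog) {
            usedBytes.merge(entry.bucket, entry.sizeBytes, Long::sum);
        }
        // Merge on bucket name, then divide the summed size by the quota.
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Long> quota : quotas.entrySet()) {
            if (quota.getValue() <= 0) {
                continue; // skip non-positive quotas to avoid dividing by zero
            }
            long used = usedBytes.getOrDefault(quota.getKey(), 0L);
            result.put(quota.getKey(), (double) used / quota.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<FileEntry> catalog = new ArrayList<>();
        catalog.add(new FileEntry("photos", 30L));
        catalog.add(new FileEntry("photos", 20L));
        catalog.add(new FileEntry("docs", 10L));

        Map<String, Long> quotas = new HashMap<>();
        quotas.put("photos", 100L);
        quotas.put("docs", 40L);

        System.out.println(usage(catalog, quotas));
    }
}
```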

Page 22: Learn from HomeAway Hadoop Development and Operations Best Practices

Cascading

Pipe - Sum and Merge

Page 23: Learn from HomeAway Hadoop Development and Operations Best Practices

Cascading

Flow Def - Calculate the Statistics

Page 24: Learn from HomeAway Hadoop Development and Operations Best Practices

Cascading

Flow Def - Statistics Revisited

Page 25: Learn from HomeAway Hadoop Development and Operations Best Practices

Thank you all!

• Cascading For the Impatient

Page 26: Learn from HomeAway Hadoop Development and Operations Best Practices

Supplier Analytics :: Phoenix, HBase and Driven

Page 27: Learn from HomeAway Hadoop Development and Operations Best Practices

The goal

Expose our EDW analytics to suppliers.

But ...

• More users of analytics = requirement to horizontally scale
• SQL Server EDW + Managed Storage = Expensive to horizontally scale

Page 28: Learn from HomeAway Hadoop Development and Operations Best Practices

The solution

Use Cascading with HBase / Phoenix

• Cascading for ETL
• Apache Phoenix as an abstraction layer over HBase
• HomeAway created the Cascading Phoenix Tap to simplify use of Phoenix

Page 29: Learn from HomeAway Hadoop Development and Operations Best Practices

What does our Cascading ETL look like?

• Daily jobs scheduled in Oozie
• Each job runs a Cascading ETL developed as a Java program
• Examples:
  – ETL listings that have changed since yesterday from EDW to HBase
  – ETL listing metrics from the current periodic snapshot fact partition over to HBase
  – ETL market group metrics from the current periodic snapshot fact partition over to HBase

Page 30: Learn from HomeAway Hadoop Development and Operations Best Practices

What does our Cascading ETL look like?

• Extract - SQL statement issued against a SQL Server JDBC tap

• Transform
  – Simple - do it in your SQL statement
  – Complex - do it in your pipes: filters, cogroups, user-defined functions, etc.

• Load - sink tap bound to the Apache Phoenix Cascading tap
  – This tap is in essence an HBase table

Page 31: Learn from HomeAway Hadoop Development and Operations Best Practices

How Driven simplifies using Cascading

Page 32: Learn from HomeAway Hadoop Development and Operations Best Practices

How Driven simplifies using Cascading

Page 33: Learn from HomeAway Hadoop Development and Operations Best Practices

How Driven simplifies using Cascading

Page 34: Learn from HomeAway Hadoop Development and Operations Best Practices

How Driven simplifies using Cascading

Page 35: Learn from HomeAway Hadoop Development and Operations Best Practices

A real simple Cascading flow definition

Page 36: Learn from HomeAway Hadoop Development and Operations Best Practices

User Analytics :: A/B Test Readouts

Page 37: Learn from HomeAway Hadoop Development and Operations Best Practices

A/B Test Readouts

• We’re always running many A/B tests concurrently on our sites
• Daily Cascading job performs the A/B test readout
  – Readout for all running A/B tests at once
  – Rolling 3-week window
• Sliced and diced by site, by day, by test, as well as various roll-ups
• Multiple conversion metrics
• Millions of daily test exposures and conversions
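A rolling readout of this kind can be sketched in a few lines of plain Java. Nothing below comes from HomeAway's job: `AbReadout`, `DailyCount`, the per-day input shape, the single conversion metric, and the reading of "3 weeks" as the 21 days ending on the readout date are all assumptions made for illustration.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class AbReadout {
    /** Pre-aggregated daily counts for one test variant; the field names are hypothetical. */
    public static final class DailyCount {
        public final LocalDate day;
        public final String test;
        public final String variant;
        public final long exposures;
        public final long conversions;

        public DailyCount(LocalDate day, String test, String variant, long exposures, long conversions) {
            this.day = day;
            this.test = test;
            this.variant = variant;
            this.exposures = exposures;
            this.conversions = conversions;
        }
    }

    /** Returns "test/variant" -> conversion rate over the 21 days ending on asOf (inclusive). */
    public static Map<String, Double> readout(List<DailyCount> rows, LocalDate asOf) {
        LocalDate windowStart = asOf.minusDays(20);
        Map<String, long[]> totals = new TreeMap<>(); // value: {exposures, conversions}
        for (DailyCount row : rows) {
            if (row.day.isBefore(windowStart) || row.day.isAfter(asOf)) {
                continue; // outside the rolling window
            }
            long[] total = totals.computeIfAbsent(row.test + "/" + row.variant, key -> new long[2]);
            total[0] += row.exposures;
            total[1] += row.conversions;
        }
        Map<String, Double> rates = new TreeMap<>();
        for (Map.Entry<String, long[]> entry : totals.entrySet()) {
            long[] total = entry.getValue();
            if (total[0] > 0) {
                rates.put(entry.getKey(), (double) total[1] / total[0]);
            }
        }
        return rates;
    }

    public static void main(String[] args) {
        LocalDate asOf = LocalDate.of(2015, 6, 30);
        List<DailyCount> rows = new ArrayList<>();
        rows.add(new DailyCount(asOf, "checkout-button", "A", 100, 10));
        rows.add(new DailyCount(asOf, "checkout-button", "B", 100, 25));
        rows.add(new DailyCount(asOf.minusDays(30), "checkout-button", "A", 100, 90));
        System.out.println(readout(rows, asOf));
    }
}
```

Because the input is already one row per day, yesterday's daily files can be reused unchanged when the window slides forward.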

Page 38: Learn from HomeAway Hadoop Development and Operations Best Practices

A/B Test Readout Flow

Not The Full Cascade!

Page 39: Learn from HomeAway Hadoop Development and Operations Best Practices

A/B Test Readout Cascade

• Includes Daily Intermediate Files
  – cascade.setFlowSkipStrategy(new FlowSkipIfSinkExists());
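The idea behind skipping a flow when its sink already exists can be illustrated without Cascading. The sketch below is a plain-Java stand-in, not the library's implementation: `SkipIfSinkExists`, `runUnlessSinkExists`, and `writeMarker` are hypothetical names, and a local file plays the role of a daily intermediate sink.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

public class SkipIfSinkExists {
    /**
     * Runs the step only when its sink does not exist yet.
     * Returns true if the step ran, false if it was skipped.
     */
    public static boolean runUnlessSinkExists(Path sink, Consumer<Path> step) {
        if (Files.exists(sink)) {
            return false; // the intermediate result is already there, so do not recompute it
        }
        step.accept(sink);
        return true;
    }

    /** Stand-in for real work: writes a small marker file at the sink path. */
    public static void writeMarker(Path sink) {
        try {
            Files.write(sink, "done".getBytes());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("intermediate");
        Path daily = dir.resolve("2015-06-01.out");
        // The first call does the work; the second finds the sink and skips it.
        System.out.println(runUnlessSinkExists(daily, SkipIfSinkExists::writeMarker));
        System.out.println(runUnlessSinkExists(daily, SkipIfSinkExists::writeMarker));
    }
}
```

With one sink per day, a daily job over a rolling window only has to compute the newest day; every older day is skipped because its output is already present.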

Page 40: Learn from HomeAway Hadoop Development and Operations Best Practices

Using Driven For Performance Tuning

• Driven makes it easy to look at the time it takes to execute
  – Including the number of mappers or reducers
  – Increase if needed: pipe.getStepConfigDef().setProperty("mapreduce.job.reduces", "20");

Page 41: Learn from HomeAway Hadoop Development and Operations Best Practices

Cascading Tips

• Store intermediate files to avoid re-processing the same data over and over again
  – When running frequent jobs on a rolling window

• Break up your complex flows

• Use Driven to tweak the number of reducers at various points

Page 42: Learn from HomeAway Hadoop Development and Operations Best Practices

Deployment / Operational Issues

Page 43: Learn from HomeAway Hadoop Development and Operations Best Practices

HomeAway CI/CD Pipeline

cascading-archetype

job-A

job-B

oozie-job-deployer

Page 44: Learn from HomeAway Hadoop Development and Operations Best Practices

HomeAway

#wholevacation

Thank you!