Cascading Webinar
HomeAway: The world leader for vacation rentals
Over a million listings worldwide and growing!
Hadoop is changing
You …
• Need faster ROI
• Need compelling use cases
• Need more with less
• Need to leverage existing talent
Harnessing the power of Hadoop MapReduce
• Divides a problem into smaller problems
• Assembles the smaller answers into the answer to the bigger problem
MapReduce
• Can be hard to learn
• Verbose
• Tedious
• Historically slow
New Engine Options
• Apache Tez
• Apache Spark
• Apache Flink
Problem at HomeAway
Cascading
Speaker Panel
• Austin Tobin - Software Engineer
File Storage Quotas :: Introduction to Cascading
• Michael McAllister - Staff Data Warehouse Engineer
Supplier Analytics :: Phoenix, HBase and Driven
• Francois Forster - Architect
User Analytics :: A/B Test Readouts
File Storage Quotas :: Introduction to Cascading
© Copyright 2015 HomeAway, Inc.
Introduction
1. What are we trying to solve?
2. What is Cascading?
3. How we applied Cascading to solve this problem
What is Mesa? What is the problem with Mesa?
• Mesa is an internal file system
• It is divided into buckets, and each bucket has a quota
• Each bucket maintains a statistics file, which is locked on write and delete
• As usage increases, this locking creates performance bottlenecks
Key Technologies
• Kafka
– High-performance messaging technology
– Used to ingest a high volume of consistent log messages very quickly
• Avro
– Compressible binary file format, highly portable
• Hadoop
– Distributed file store and processing framework
– Enables near-infinite horizontal scalability for storage and processing
• Cascading...
Cascading
• Taps
– Can be either sources or sinks
– Sources are data inputs, and sinks are data outputs
– They require a scheme, which defines the column names (fields) and, for text data, the delimiter
– The sink of one flow can be the source of another flow
• Pipes
– Abstractions to perform functions or transformations
– Functions include split, merge, expression, and filter
– The output of one pipe may feed another pipe, so pipes chain together to perform sequences of transformations
• Flows
– Connect sources to sinks via pipes into a flow
– Multiple flows can be connected together into a cascade
Cascading Archetype
The Cascading Archetype is a project that makes it easy to get started with Cascading applications. It is currently an internal project and uses Spring to simplify defining taps and flows.
1. Define your taps.
2. Build your flows.
3. Cascade!
Mesa Stats - The Big Picture
[Diagram labels: Mesa, Log Events, Hadoop, Mesa Stats Job, Mesa Metadata. Old Catalog + Log Events → New Catalog + Statistics]
Flow Def - Create the New Catalog
• Old Catalog Tap
• Event Tap
• Clean Events Pipe
• Build New Catalog Pipe
• New Catalog Sink
Pipe - Clean the Events
• Old Catalog Tap
• Filter non-Mesa events
• Split the Message field into multiple fields
• Remove extraneous fields
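The cleaning steps named on this slide (filter, split, drop fields) can be illustrated outside of Cascading. The following is a minimal plain-Java sketch, not the Cascading pipe from the talk. The `RawEvent` and `CleanEvent` shapes, the `"mesa"` source value, and the `|`-separated message layout are all assumptions made for illustration; the slides do not show the real event format.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of the three cleaning steps; this is NOT Cascading code.
// Field names and the message layout below are hypothetical.
class EventCleanerSketch {

    /** Hypothetical raw log event as it might arrive from the log pipeline. */
    static final class RawEvent {
        final String source;   // which system emitted the event
        final String message;  // assumed layout: bucket|path|action|sizeBytes|timestamp
        final String host;     // example of an extraneous field that gets dropped

        RawEvent(String source, String message, String host) {
            this.source = source;
            this.message = message;
            this.host = host;
        }
    }

    /** Event after cleaning: only the fields later steps would need. */
    static final class CleanEvent {
        final String bucket;
        final String path;
        final String action;
        final long sizeBytes;
        final long timestamp;

        CleanEvent(String bucket, String path, String action, long sizeBytes, long timestamp) {
            this.bucket = bucket;
            this.path = path;
            this.action = action;
            this.sizeBytes = sizeBytes;
            this.timestamp = timestamp;
        }
    }

    static List<CleanEvent> clean(List<RawEvent> events) {
        List<CleanEvent> cleaned = new ArrayList<>();
        for (RawEvent event : events) {
            // Step 1: filter out non-Mesa events.
            if (!"mesa".equals(event.source)) {
                continue;
            }
            // Step 2: split the message field into multiple fields.
            String[] parts = event.message.split("\\|");
            if (parts.length != 5) {
                continue; // skip messages that do not match the assumed layout
            }
            // Step 3: keep only the needed fields (host is dropped here).
            cleaned.add(new CleanEvent(parts[0], parts[1], parts[2],
                    Long.parseLong(parts[3]), Long.parseLong(parts[4])));
        }
        return cleaned;
    }
}
```

In a real Cascading job each step would more likely be a separate operation on the pipe, so that the planner can see and schedule them individually.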
Pipe - Build the New Catalog
• Cleaned Event Pipe
• Catalog Pipe
• Sort events by latest, descending
• Take the top 1 event
• Remove deleted events
• Merge events with Catalog Pipe
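The catalog-building steps (latest event per file, drop deletes, merge with the old catalog) can be sketched in plain Java as follows. This is one plausible reading of the slide, not the code from the talk. The `FileEvent` type, the `"DELETE"` action value, and the catalog as a map from file key to size are assumptions. The slides do not show how a delete for a file already in the old catalog is handled; this sketch assumes it removes the entry.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of building the new catalog; this is NOT Cascading code.
class CatalogBuilderSketch {

    /** Hypothetical cleaned event for a single file. */
    static final class FileEvent {
        final String fileKey;  // assumed unique key, e.g. bucket plus path
        final String action;   // assumed values: "WRITE" or "DELETE"
        final long sizeBytes;
        final long timestamp;

        FileEvent(String fileKey, String action, long sizeBytes, long timestamp) {
            this.fileKey = fileKey;
            this.action = action;
            this.sizeBytes = sizeBytes;
            this.timestamp = timestamp;
        }
    }

    /** Returns a new catalog (file key -> size in bytes); the old catalog is not modified. */
    static Map<String, Long> buildNewCatalog(Map<String, Long> oldCatalog, List<FileEvent> events) {
        // "Sort by latest desc, take top 1": keep only the newest event per file.
        Map<String, FileEvent> latestPerFile = new HashMap<>();
        for (FileEvent event : events) {
            FileEvent seen = latestPerFile.get(event.fileKey);
            if (seen == null || event.timestamp > seen.timestamp) {
                latestPerFile.put(event.fileKey, event);
            }
        }

        // "Remove deleted events" and "merge with catalog": start from the old
        // catalog, then apply the latest event for each file.
        Map<String, Long> newCatalog = new HashMap<>(oldCatalog);
        for (FileEvent event : latestPerFile.values()) {
            if ("DELETE".equals(event.action)) {
                newCatalog.remove(event.fileKey); // assumption: a delete drops the entry
            } else {
                newCatalog.put(event.fileKey, event.sizeBytes);
            }
        }
        return newCatalog;
    }
}
```

Because only the newest event per file survives, the job never needs the per-bucket lock that caused the bottleneck described earlier; it recomputes the catalog from the event log instead.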
Update Catalog Flow Def - Revisited
Flow Def - Calculate the Statistics
• New Catalog Tap
• Mesa Quota Tap
• Sum file sizes per bucket
• Merge on bucket names
• Divide bucket file sizes by quota
• Statistics Sink
Pipe - Sum and Merge
Flow Def - Statistics Revisited
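The statistics calculation (sum file sizes per bucket, merge on bucket name, divide by quota) comes down to simple arithmetic, sketched below in plain Java rather than Cascading. The `CatalogEntry` type and the quota map are assumptions. Skipping buckets that have no quota, in the manner of an inner join, is also an assumption, since the slides do not say how such buckets are treated.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of the quota statistics; this is NOT Cascading code.
class QuotaStatsSketch {

    /** Hypothetical catalog row: one file and the bucket it lives in. */
    static final class CatalogEntry {
        final String bucket;
        final long sizeBytes;

        CatalogEntry(String bucket, long sizeBytes) {
            this.bucket = bucket;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Returns bucket name -> fraction of quota used (1.0 means the quota is full). */
    static Map<String, Double> usageFraction(List<CatalogEntry> catalog,
                                             Map<String, Long> quotaBytes) {
        // Sum file sizes per bucket.
        Map<String, Long> usedBytes = new HashMap<>();
        for (CatalogEntry entry : catalog) {
            usedBytes.merge(entry.bucket, entry.sizeBytes, Long::sum);
        }

        // Merge on bucket name and divide usage by quota.
        Map<String, Double> stats = new HashMap<>();
        for (Map.Entry<String, Long> used : usedBytes.entrySet()) {
            Long quota = quotaBytes.get(used.getKey());
            if (quota == null || quota <= 0) {
                continue; // assumption: buckets without a usable quota are skipped
            }
            stats.put(used.getKey(), used.getValue().doubleValue() / quota);
        }
        return stats;
    }
}
```

For example, a bucket holding files of 300 and 200 bytes against a 1000-byte quota yields a usage fraction of 0.5.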
Thank you all!
• Cascading for the Impatient
Supplier Analytics :: Phoenix, HBase and Driven
The goal
• Expose our EDW analytics to suppliers.
• But ... more users of analytics = requirement to scale horizontally
• SQL Server EDW + managed storage = expensive to scale horizontally
The solution
Use Cascading with HBase / Phoenix
• Cascading for ETL
• Apache Phoenix as an abstraction layer over HBase
• HomeAway created a Cascading Phoenix Tap to simplify use of Phoenix
What does our Cascading ETL look like?
• Daily jobs scheduled in Oozie
• Each job runs Cascading ETL developed as Java programs
• Examples:
– ETL listings that have changed since yesterday from EDW to HBase
– ETL listing metrics from the current periodic snapshot fact partition over to HBase
– ETL market group metrics from the current periodic snapshot fact partition over to HBase
• Extract: SQL statement issued against a SQL Server JDBC tap
• Transform
– Simple: do it in your SQL statement
– Complex: do it in your pipes (filters, cogroups, user-defined functions, etc.)
• Load: sink tap bound to the Apache Phoenix Cascading tap; this tap is in essence an HBase table
How Driven simplifies using Cascading
A real simple Cascading flow definition
User Analytics :: A/B Test Readouts
A/B Test Readouts
• We’re always running many A/B tests concurrently on our sites
• Daily Cascading job performs the A/B test readout
– Readout for all running A/B tests at once
– Rolling 3-week window
• Sliced and diced by site, by day, and by test, as well as various roll-ups
• Multiple conversion metrics
• Millions of daily test exposures and conversions
A/B Test Readout Flow
Not The Full Cascade!
A/B Test Readout Cascade
• Includes daily intermediate files
– cascade.setFlowSkipStrategy(new FlowSkipIfSinkExists());
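The idea behind skipping a flow whose sink already exists can be shown without Cascading. The sketch below is a plain-Java analogy, not Cascading's implementation of `FlowSkipIfSinkExists`: a step writes its output to a file, and is skipped on later runs when that file is already present. The class name, method name, and the use of local files are assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;

// Plain-Java analogy for "skip the flow if its sink exists"; NOT Cascading code.
class SkipIfSinkExistsSketch {

    /**
     * Runs the step and writes its result to the sink only when the sink is missing.
     * Returns true if the step ran, false if it was skipped.
     */
    static boolean runUnlessSinkExists(Path sink, Supplier<byte[]> step) {
        if (Files.exists(sink)) {
            return false; // the intermediate output is already there, so reuse it
        }
        try {
            Files.write(sink, step.get());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }
}
```

With one intermediate file per day, a job over a rolling window only has to compute the newest day; the files for earlier days already exist and their steps are skipped.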
Using Driven For Performance Tuning
• Driven makes it easy to look at the time it takes to execute
– Including the number of mappers or reducers
– Increase if needed: pipe.getStepConfigDef().setProperty("mapreduce.job.reduces", "20");
Cascading Tips
• Store intermediate files to avoid re-processing the same data over and over again
– When running frequent jobs on a rolling window
• Break up your complex flows
• Use Driven to tweak the number of reducers at various points
Deployment / Operational Issues
HomeAway CI/CD Pipeline
cascading-archetype
job-A
job-B
oozie-job-deployer
HomeAway
#wholevacation
Thank you!