Hadoop at Meebo: Lessons in the Real World

Page 1: Hadoop at Meebo: Lessons in the Real World

Hadoop at Meebo
Lessons learned in the real world

Vikram Oberoi
August 2010
Hadoop Day, Seattle

Page 2: Hadoop at Meebo: Lessons in the Real World

About me

• SDE Intern at Amazon, '07
  – R&D on item-to-item similarities

• Data Engineer Intern at Meebo, '08
  – Built an A/B testing system

• CS at Stanford, '09
  – Senior project: Ext3 and XFS under Hadoop MapReduce workloads

• Data Engineer at Meebo, '09–present
  – Data infrastructure, analytics

Page 3: Hadoop at Meebo: Lessons in the Real World

About Meebo

• Products
  – Browser-based IM client (www.meebo.com)
  – Mobile chat clients
  – Social widgets (the Meebo Bar)

• Company
  – Founded 2005
  – Over 100 employees, 30 engineers

• Engineering
  – Strong engineering culture
  – Contributions to CouchDB, Lounge, Hadoop components

Page 4: Hadoop at Meebo: Lessons in the Real World

The Problem

• Hadoop is powerful technology
  – Meets today's demand for big data

• But it's still a young platform
  – Evolving components and best practices

• With many challenges in real-world usage
  – Day-to-day operational headaches
  – Missing ecosystem features (e.g. recurring jobs?)
  – Lots of re-inventing the wheel to solve these

Page 5: Hadoop at Meebo: Lessons in the Real World

Purpose of this talk

1. Discuss some real problems we've seen
2. Explain our solutions
3. Propose best practices so you can avoid them

Page 6: Hadoop at Meebo: Lessons in the Real World

What will I talk about?

Background:
• Meebo's data processing needs
• Meebo's pre- and post-Hadoop data pipelines

Lessons:
• Better workflow management
  – Scheduling, reporting, monitoring, etc.
  – A look at Azkaban
• Get wiser about data serialization
  – Protocol Buffers (or Avro, or Thrift)

Page 7: Hadoop at Meebo: Lessons in the Real World

Meebo’s Data Processing Needs

Page 8: Hadoop at Meebo: Lessons in the Real World

What do we use Hadoop for?

• ETL
• Analytics
• Behavioral targeting
• Ad hoc data analysis, research
• Data produced helps power:
  – internal/external dashboards
  – our ad server

Page 9: Hadoop at Meebo: Lessons in the Real World

What kind of data do we have?

• Log data from all our products
  – The Meebo Bar
  – Meebo Messenger (www.meebo.com)
  – Android/iPhone/Mobile Web clients
  – Rooms
  – Meebo Me
  – Meebo notifier
  – Firefox extension

Page 10: Hadoop at Meebo: Lessons in the Real World

How much data?

• 150MM uniques/month from the Meebo Bar
• Around 200 GB of uncompressed daily logs
• We process a subset of our logs

Page 11: Hadoop at Meebo: Lessons in the Real World

Meebo's Data Pipeline
Pre and Post Hadoop

Page 12: Hadoop at Meebo: Lessons in the Real World

A data pipeline in general

1. Data Collection

2. Data Processing

3. Data Storage

4. Workflow Management

Page 13: Hadoop at Meebo: Lessons in the Real World

Our data pipeline, pre-Hadoop

Servers → Python/shell scripts pull log data → Python/shell scripts process data → MySQL, CouchDB, flat files

Cron and wrapper shell scripts glue everything together.

Page 14: Hadoop at Meebo: Lessons in the Real World

Our data pipeline, post-Hadoop

Servers → push logs to HDFS → Pig scripts process data → MySQL, CouchDB, flat files

Azkaban, a workflow management system, glues everything together.

Page 15: Hadoop at Meebo: Lessons in the Real World

Our transition to using Hadoop

• Deployed early '09
  – Motivation: processing data took aaaages!
  – Catalyst: Hadoop Summit

• Turbulent, time consuming
  – New tools, new paradigms, pitfalls

• Totally worth it
  – Processing a day's logs went from 24 hours to under an hour
  – Leap in our ability to analyze data
  – Basis for new core product features

Page 16: Hadoop at Meebo: Lessons in the Real World

Workflow Management

Page 17: Hadoop at Meebo: Lessons in the Real World

What is workflow management?

Page 18: Hadoop at Meebo: Lessons in the Real World

What is workflow management?

It's the glue that binds your data pipeline together: scheduling, monitoring, reporting, etc.

• Most people use scripts and cron
• But they end up spending too much time managing them
• We need a better way

Page 19: Hadoop at Meebo: Lessons in the Real World

Workflow management consists of:

• Executing jobs with arbitrarily complex dependency chains

Page 20: Hadoop at Meebo: Lessons in the Real World

Split up your jobs into discrete chunks with dependencies

• Minimize impact when chunks fail

• Allow engineers to work on chunks separately

• Monolithic scripts are no fun
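The decomposed flow the deck recommends is on the next slide; by way of contrast, here is a hypothetical sketch (not from the talk) of the monolithic wrapper being discouraged: one Python script, typically fired from cron, that does every step in order, so a failure anywhere forces re-running everything and nobody can work on a step in isolation.

# The antipattern: every step in one script, glued together by cron.
# If the final export fails, the whole pipeline re-runs from the top.
import subprocess
import sys

def run(cmd):
    # Any non-zero exit aborts the entire pipeline.
    subprocess.check_call(cmd, shell=True)

def main():
    run("python clean_log_a.py")      # hypothetical step scripts
    run("python process_log_b.py")
    run("python join_and_train.py")
    run("python archive_output.py")
    run("python export_to_db.py")

if __name__ == "__main__":
    sys.exit(main())

monolithic_pipeline.py (hypothetical)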

Page 21: Hadoop at Meebo: Lessons in the Real World

Clean up data from log A

Process data from log B

Join data, train a classifier

Archive output

Post-processing

Export to DB somewhere

Page 22: Hadoop at Meebo: Lessons in the Real World

Workflow management consists of:

• Executing jobs with arbitrarily complex dependency chains

• Scheduling recurring jobs to run at a given time

Page 23: Hadoop at Meebo: Lessons in the Real World

Workflow management consists of:

• Executing jobs with arbitrarily complex dependency chains

• Scheduling recurring jobs to run at a given time
• Monitoring job progress

Page 24: Hadoop at Meebo: Lessons in the Real World

Workflow management consists of:

• Executing jobs with arbitrarily complex dependency chains

• Scheduling recurring jobs to run at a given time
• Monitoring job progress
• Reporting when jobs fail and how long jobs take

Page 25: Hadoop at Meebo: Lessons in the Real World

Workflow management consists of:

• Executing jobs with arbitrarily complex dependency chains

• Scheduling recurring jobs to run at a given time
• Monitoring job progress
• Reporting when jobs fail and how long jobs take
• Logging job execution and exposing logs so that engineers can deal with failures swiftly

Page 26: Hadoop at Meebo: Lessons in the Real World

Workflow management consists of:

• Executing jobs with arbitrarily complex dependency chains

• Scheduling recurring jobs to run at a given time
• Monitoring job progress
• Reporting when jobs fail and how long jobs take
• Logging job execution and exposing logs so that engineers can deal with failures swiftly
• Providing resource management capabilities

Page 27: Hadoop at Meebo: Lessons in the Real World

Several "Export to DB somewhere" jobs, all hitting the same DB at once.

Don't DoS yourself

Page 28: Hadoop at Meebo: Lessons in the Real World

The same "Export to DB somewhere" jobs, now requesting permits from a Permit Manager before touching the DB, so only a limited number run at a time.
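Azkaban's permit bookkeeping isn't spelled out in the deck; purely as an illustration of the idea, a minimal in-process permit manager might look like this hypothetical Python sketch, where export jobs must hold a permit before touching the DB, capping how many hit it concurrently.

# Illustrative only -- not Azkaban's implementation.
import threading

class PermitManager(object):
    # Hands out a fixed number of permits for a shared resource (e.g. a DB).
    def __init__(self, permits):
        self._sem = threading.BoundedSemaphore(permits)

    def __enter__(self):
        self._sem.acquire()   # block until a permit is free
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self._sem.release()   # always hand the permit back

db_permits = PermitManager(permits=2)

def export_to_db(rows):
    with db_permits:          # at most 2 exports hit the DB at once
        pass                  # ... do the actual inserts here

In a workflow manager the permits live in the manager rather than in one process, but the contract is the same: a job waits for a permit, runs, and releases it.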

Page 29: Hadoop at Meebo: Lessons in the Real World

Don’t roll your own scheduler!

• Building a good scheduling framework is hard
  – Myriad of small requirements, precise bookkeeping with many edge cases

• Many roll their own
  – It's usually inadequate
  – So much repeated effort!

• Mold an existing framework to your requirements and contribute

Page 30: Hadoop at Meebo: Lessons in the Real World

Two emerging frameworks

• Oozie
  – Built at Yahoo
  – Open-sourced at Hadoop Summit '10
  – Used in production for [don't know]
  – Packaged by Cloudera

• Azkaban
  – Built at LinkedIn
  – Open-sourced in March '10
  – Used in production for over nine months as of March '10
  – Now in use at Meebo

Page 31: Hadoop at Meebo: Lessons in the Real World

Azkaban

Page 32: Hadoop at Meebo: Lessons in the Real World
Page 33: Hadoop at Meebo: Lessons in the Real World
Page 34: Hadoop at Meebo: Lessons in the Real World
Page 35: Hadoop at Meebo: Lessons in the Real World

Azkaban jobs are bundles of configuration and code

Page 36: Hadoop at Meebo: Lessons in the Real World

Configuring a job

type=command
command=python process_logs.py
failure.emails=datateam@whereiwork.com

process_log_data.job

import os
import sys

# Do useful things…

process_logs.py

Page 37: Hadoop at Meebo: Lessons in the Real World

Deploying a job
Step 1: Shove your config and code into a zip archive.

process_log_data.zip

.job .py

Page 38: Hadoop at Meebo: Lessons in the Real World

Deploying a job
Step 2: Upload to Azkaban

process_log_data.zip

.job .py

Page 39: Hadoop at Meebo: Lessons in the Real World

Scheduling a job
The Azkaban front-end:

Page 40: Hadoop at Meebo: Lessons in the Real World

What about dependencies?

Page 41: Hadoop at Meebo: Lessons in the Real World

get_users_widgets

process_widgets.job process_users.job

join_users_widgets.job

export_to_db.job

Page 42: Hadoop at Meebo: Lessons in the Real World

type=command
command=python process_widgets.py
failure.emails=datateam@whereiwork.com

process_widgets.job

type=command
command=python process_users.py
failure.emails=datateam@whereiwork.com

process_users.job

get_users_widgets

Page 43: Hadoop at Meebo: Lessons in the Real World

type=command
command=python join_users_widgets.py
failure.emails=datateam@whereiwork.com
dependencies=process_widgets,process_users

join_users_widgets.job

type=command
command=python export_to_db.py
failure.emails=datateam@whereiwork.com
dependencies=join_users_widgets

export_to_db.job

get_users_widgets

Page 44: Hadoop at Meebo: Lessons in the Real World

get_users_widgets

get_users_widgets.zip
(contains all four .job files and their .py scripts)

Page 45: Hadoop at Meebo: Lessons in the Real World

You deploy and schedule a job flow as you would a single job.

Page 46: Hadoop at Meebo: Lessons in the Real World
Page 47: Hadoop at Meebo: Lessons in the Real World

Hierarchical configuration

type=command
command=python process_users.py
failure.emails=datateam@whereiwork.com

type=command
command=python process_widgets.py
failure.emails=datateam@whereiwork.com

process_users.job

process_widgets.job

This is silly. Can't I specify failure.emails globally?

Page 48: Hadoop at Meebo: Lessons in the Real World

azkaban-job-dir/
  system.properties
  get_users_widgets/
    process_widgets.job
    process_users.job
    join_users_widgets.job
    export_to_db.job
  some-other-job/
  …

Page 49: Hadoop at Meebo: Lessons in the Real World

Hierarchical configuration

failure.emails=datateam@whereiwork.com
…=foo.whereiwork.com
archive.dir=/var/whereiwork/archive

system.properties

Page 50: Hadoop at Meebo: Lessons in the Real World

What is type=command?

• Azkaban supports a few ways to execute jobs
  – command
    • Unix command in a separate process
  – javaprocess
    • Wrapper to kick off Java programs
  – java
    • Wrapper to kick off Runnable Java classes
    • Can hook into Azkaban in useful ways
  – pig
    • Wrapper to run Pig scripts through Grunt

Page 51: Hadoop at Meebo: Lessons in the Real World

What’s missing?

• Scheduling and executing multiple instances of the same job at the same time.

Page 52: Hadoop at Meebo: Lessons in the Real World

FOO at 3:00 PM and 4:00 PM

• Runs hourly
• The 3:00 PM run took longer than expected

Page 53: Hadoop at Meebo: Lessons in the Real World

FOO at 3:00 PM, 4:00 PM, and 5:00 PM

• Runs hourly
• The 3:00 PM run failed and was restarted at 4:25 PM

Page 54: Hadoop at Meebo: Lessons in the Real World

What’s missing?

• Scheduling and executing multiple instances of the same job at the same time.
  – AZK-49, AZK-47
  – Stay tuned for complete, reviewed patch branches: www.github.com/voberoi/azkaban

Page 55: Hadoop at Meebo: Lessons in the Real World

What’s missing?

• Scheduling and executing multiple instances of the same job at the same time.
  – AZK-49, AZK-47
  – Stay tuned for complete, reviewed patch branches: www.github.com/voberoi/azkaban

• Passing arguments between jobs.
  – Write a library used by your jobs (a sketch follows below)
  – Put your arguments anywhere you want
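One hypothetical shape for such a library, assuming jobs in a flow can agree on a shared directory (the names here are made up, not Azkaban's): upstream jobs publish their parameters as JSON, downstream jobs read them.

# jobargs.py -- hypothetical helper shared by all jobs in a flow.
import json
import os

ARGS_DIR = os.environ.get("FLOW_ARGS_DIR", "/var/whereiwork/flow-args")

def put_args(flow, job, args):
    # Called by an upstream job to publish values for its dependents.
    path = os.path.join(ARGS_DIR, flow)
    if not os.path.exists(path):
        os.makedirs(path)
    with open(os.path.join(path, job + ".json"), "w") as f:
        json.dump(args, f)

def get_args(flow, job):
    # Called by a downstream job to read what a dependency published.
    with open(os.path.join(ARGS_DIR, flow, job + ".json")) as f:
        return json.load(f)

# process_users.py (upstream):
#   put_args("get_users_widgets", "process_users", {"output": "/data/users/2010-08-01"})
# join_users_widgets.py (downstream):
#   users_path = get_args("get_users_widgets", "process_users")["output"]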

Page 56: Hadoop at Meebo: Lessons in the Real World

What did we get out of it?

• No more monolithic wrapper scripts
• Massively reduced job setup time
  – It's configuration, not code!
• More code reuse, less hair pulling
• Still porting over jobs
  – It's time consuming

Page 57: Hadoop at Meebo: Lessons in the Real World

Data Serialization

Page 58: Hadoop at Meebo: Lessons in the Real World

What’s the problem?

• Serializing data in simple formats is convenient
  – CSV, XML, etc.

• Problems arise when your data changes
• You need backwards compatibility

Does this really matter? Let’s discuss.

Page 59: Hadoop at Meebo: Lessons in the Real World

v1

Username: Password:

Go!

clickabutton.com

Page 60: Hadoop at Meebo: Lessons in the Real World

“Click a Button” Analytics PRD

• We want to know the number of unique users who clicked on the button.
  – Over an arbitrary range of time.
  – Broken down by whether they're logged in or not.
  – With hour granularity.

Page 61: Hadoop at Meebo: Lessons in the Real World

“I KNOW!”

Every hour, process logs and dump lines that look like this to HDFS with Pig:

unique_id,logged_in,clicked

Page 62: Hadoop at Meebo: Lessons in the Real World

“I KNOW!”

-- 'clicked' and 'logged_in' are either 0 or 1
LOAD '$IN' USING PigStorage(',') AS (
  unique_id:chararray,
  logged_in:int,
  clicked:int
);

-- Munge data according to the PRD…

Page 63: Hadoop at Meebo: Lessons in the Real World

v2

Username: Password:

Go!

clickabutton.com

Page 64: Hadoop at Meebo: Lessons in the Real World

“Click a Button” Analytics PRD

Break users down by which button they clicked, too.

Page 65: Hadoop at Meebo: Lessons in the Real World

“I KNOW!”

Every hour, process logs and dump lines that look like this to HDFS with Pig:

unique_id,logged_in,red_click,green_click

Page 66: Hadoop at Meebo: Lessons in the Real World

“I KNOW!”

-- 'red_clicked', 'green_clicked' and 'logged_in' are either 0 or 1
LOAD '$IN' USING PigStorage(',') AS (
  unique_id:chararray,
  logged_in:int,
  red_clicked:int,
  green_clicked:int
);

-- Munge data according to the PRD…

Page 67: Hadoop at Meebo: Lessons in the Real World

v3

Username: Password:

Go!

clickabutton.com

Page 68: Hadoop at Meebo: Lessons in the Real World

"Hmm." (v3 drops the red button)

Page 69: Hadoop at Meebo: Lessons in the Real World

Bad Solution 1

Remove red_click

unique_id,logged_in,red_click,green_click

unique_id,logged_in,green_click

Page 70: Hadoop at Meebo: Lessons in the Real World

Why it’s bad

LOAD '$IN' USING PigStorage(',') AS (
  unique_id:chararray,
  logged_in:int,
  red_clicked:int,
  green_clicked:int
);

-- Munge data according to the PRD…

Your script thinks green clicks are red clicks.

Page 71: Hadoop at Meebo: Lessons in the Real World

Why it’s bad

LOAD '$IN' USING PigStorage(',') AS (
  unique_id:chararray,
  logged_in:int,
  green_clicked:int
);

-- Munge data according to the PRD…

Now your script won’t work for all the data you’ve collected so far.

Page 72: Hadoop at Meebo: Lessons in the Real World

“I’ll keep multiple scripts lying around”

Page 73: Hadoop at Meebo: Lessons in the Real World

LOAD '$IN' USING PigStorage(',') AS (
  unique_id:chararray,
  logged_in:int,
  green_clicked:int
);

LOAD '$IN' USING PigStorage(',') AS (
  unique_id:chararray,
  logged_in:int,
  orange_clicked:int
);

My data has three fields. Which one do I use?

Page 74: Hadoop at Meebo: Lessons in the Real World

Bad Solution 2

Assign a sentinel value to red_click when it should be ignored, e.g. -1.

unique_id,logged_in,red_click,green_click

Page 75: Hadoop at Meebo: Lessons in the Real World

Why it’s bad

It’s a waste of space.

Page 76: Hadoop at Meebo: Lessons in the Real World

Why it’s bad

Sticking logic in your data is iffy.

Page 77: Hadoop at Meebo: Lessons in the Real World

The Preferable Solution

Serialize your data using backwards-compatible data structures!

Protocol Buffers and Elephant Bird

Page 78: Hadoop at Meebo: Lessons in the Real World

Protocol Buffers

• Serialization system
  – Alternatives: Avro, Thrift

• Compiles interfaces to language modules that let you:
  – Construct a data structure
  – Access it (in a backwards-compatible way)
  – Serialize/deserialize the data structure in a standard, compact, binary format

Page 79: Hadoop at Meebo: Lessons in the Real World

message UniqueUser {
  optional string id = 1;
  optional int32 logged_in = 2;
  optional int32 red_clicked = 3;
}

uniqueuser.proto

→ compiles to .java, .py, and .h/.cc modules
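To make the compile step concrete: protoc turns the .proto file into a module for each target language, and the generated code handles construction, field access, and serialization. A minimal Python sketch follows; the module name uniqueuser_pb2 follows protoc's usual naming convention and is an assumption here, not something shown in the talk.

# Generate the Python module once:
#   protoc --python_out=. uniqueuser.proto
import uniqueuser_pb2  # produced by protoc

# Construct and populate the data structure.
user = uniqueuser_pb2.UniqueUser()
user.id = "bak49jsn"
user.logged_in = 0
user.red_clicked = 1

# Serialize to the standard, compact binary format...
blob = user.SerializeToString()

# ...and parse it back, e.g. in another process or language.
same_user = uniqueuser_pb2.UniqueUser()
same_user.ParseFromString(blob)
assert same_user.red_clicked == 1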

Page 80: Hadoop at Meebo: Lessons in the Real World

Elephant Bird

• Generates protobuf-based Pig load/store functions + lots more
• Developed at Twitter
• Blog post
  – http://engineering.twitter.com/2010/04/hadoop-at-twitter.html
• Available at:
  – http://www.github.com/kevinweil/elephant-bird

Page 81: Hadoop at Meebo: Lessons in the Real World

message UniqueUser {
  optional string id = 1;
  optional int32 logged_in = 2;
  optional int32 red_clicked = 3;
}

uniqueuser.proto

*.pig.load.UniqueUserLzoProtobufB64LinePigLoader
*.pig.store.UniqueUserLzoProtobufB64LinePigStorage

Page 82: Hadoop at Meebo: Lessons in the Real World

LzoProtobufB64?

Page 83: Hadoop at Meebo: Lessons in the Real World

LzoProtobufB64 Serialization

(bak49jsn, 0, 1)
  ↓
Protobuf binary blob
  ↓
Base64-encoded protobuf binary blob
  ↓
LZO-compressed, Base64-encoded protobuf binary blob

Page 84: Hadoop at Meebo: Lessons in the Real World

LzoProtobufB64 Deserialization

LZO-compressed, Base64-encoded protobuf binary blob
  ↓
Base64-encoded protobuf binary blob
  ↓
Protobuf binary blob
  ↓
(bak49jsn, 0, 1)
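A rough Python sketch of the layering described above (an illustration, not Elephant Bird's code): each record is a protobuf blob, Base64-encoded so it can sit one per line in a text file; the LZO compression is applied to the whole file by Hadoop's LZO codec rather than per record.

import base64
import uniqueuser_pb2  # generated from uniqueuser.proto, as before

def encode_line(user):
    # (bak49jsn, 0, 1) -> protobuf binary blob -> Base64 text line
    return base64.b64encode(user.SerializeToString())

def decode_line(line):
    # Base64 text line -> protobuf binary blob -> UniqueUser message
    user = uniqueuser_pb2.UniqueUser()
    user.ParseFromString(base64.b64decode(line))
    return user

u = uniqueuser_pb2.UniqueUser()
u.id = "bak49jsn"
u.logged_in = 0
u.red_clicked = 1

line = encode_line(u)   # write lines like this out, then LZO-compress the file
assert decode_line(line).id == "bak49jsn"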

Page 85: Hadoop at Meebo: Lessons in the Real World

Setting it up

• Prereqs
  – Protocol Buffers 2.3+
  – LZO codec for Hadoop

• Check out the docs
  – http://www.github.com/kevinweil/elephant-bird

Page 86: Hadoop at Meebo: Lessons in the Real World

Time to revisit

Page 87: Hadoop at Meebo: Lessons in the Real World

v1

Username: Password:

Go!

clickabutton.com

Page 88: Hadoop at Meebo: Lessons in the Real World

message UniqueUser {
  optional string id = 1;
  optional int32 logged_in = 2;
  optional int32 red_clicked = 3;
}

uniqueuser.proto

Every hour, process logs and dump lines to HDFS that use this protobuf interface:

Page 89: Hadoop at Meebo: Lessons in the Real World

-- 'red_clicked' and 'logged_in' are either 0 or 1
LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader AS (
  unique_id:chararray,
  logged_in:int,
  red_clicked:int
);

-- Munge data according to the PRD…

Page 90: Hadoop at Meebo: Lessons in the Real World

v2

Username: Password:

Go!

clickabutton.com

Page 91: Hadoop at Meebo: Lessons in the Real World

message UniqueUser {
  optional string id = 1;
  optional int32 logged_in = 2;
  optional int32 red_clicked = 3;
  optional int32 green_clicked = 4;
}

uniqueuser.proto

Every hour, process logs and dump lines to HDFS that use this protobuf interface:

Page 92: Hadoop at Meebo: Lessons in the Real World

-- 'red_clicked', 'green_clicked' and 'logged_in' are either 0 or 1
LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader AS (
  unique_id:chararray,
  logged_in:int,
  red_clicked:int,
  green_clicked:int
);

-- Munge data according to the PRD…

Page 93: Hadoop at Meebo: Lessons in the Real World

v3

Username: Password:

Go!

clickabutton.com

Page 94: Hadoop at Meebo: Lessons in the Real World

No need to change your scripts.

They'll work on old and new data! Records written before a field existed just come back with that field unset.

Page 95: Hadoop at Meebo: Lessons in the Real World

Bonus!

http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter

Page 96: Hadoop at Meebo: Lessons in the Real World

Conclusion

• Workflow management
  – Use Azkaban, Oozie, or another framework.
  – Don't use shell scripts and cron.
  – Do this from day one! Transitioning is expensive.

• Data serialization
  – Use Protocol Buffers, Avro, Thrift, or something else.
  – Do this from day one, before it bites you.

Page 97: Hadoop at Meebo: Lessons in the Real World

Questions?


@voberoi on Twitter

We’re hiring!