HADOOP WORKFLOW MANAGEMENT WITH CELOS

Alexander Konopko

AI&BigData Lab. Alexander Konopko: "Celos: Orchestrating and Testing Hadoop Jobs."



WHO AM I

Hello. I am Alexander Konopko from Sigma.

I have been doing software engineering for 8 years so far.

For the last 2 years I have worked for Collective, primarily on the Celos project.


WHAT IS ALL THIS ABOUT?

Hadoop. Hadoop workflows. Hadoop workflow management.

CRON IS INSUFFICIENT

No reruns

No monitoring

Only time-based schedules

crontab is easy to break and hard to maintain

WHAT WE WOULD LIKE TO SEE IN A WORKFLOW ENGINE

Easy to monitor and manage

Easy to create a set of workflows

Flexible schedules (data-based, dependency-based, time-based)

Easy to solve possible problems (good community and/or easy to hack on)

Meta-workflows such as notifications, data cleaning, etc.

Security

Continuous deployment

Testing

WHAT DO WE HAVE ON THE MARKET?

Apache Oozie

LinkedIn Azkaban

Airbnb Chronos

Spotify Luigi

...Collective Celos!

AIRBNB CHRONOS

Doesn't really fit Hadoop workflows; it's rather a general-purpose scheduler

cron on top of Mesos

Hard to see historical data in the UI

Uses shell to run jobs

Only time-based schedules

LINKEDIN AZKABAN

GOOD:

Has a very good UI

Previously had an awful design, but v2 is much better

Has most of Oozie's features now

BAD:

No data-based schedules

Workflows are somewhat limited and restrictive compared to Luigi and even Oozie

Support is not as good as Oozie's, and the product is less mature

SPOTIFY LUIGI

GOOD:

Flexible (workflows are Python tasks)

6k LOC

Used at Foursquare, Spotify and several others

BAD:

Community is not that big

Only Unix security

Uses the Hadoop CLI, which has to start a JVM each time and can be slow

Saves state only when it shuts down gracefully; otherwise all jobs must be restarted

Was designed with no job history; history is now an experimental feature, and there is none in the UI

APACHE OOZIE

GOOD:

Most widely used and mature

Very strong support from the Hadoop community

Integration with Hue and Cloudera Manager

Great documentation

Default support for Pig, Hive, Java, FileSystem

Hadoop security, Kerberos

Scalable

BAD:

Stateful coordinators

XML... tons of XML

Rigid

Poor UI

OOZIE WORLD: TO RESTART A BUNDLE…

1. Clone the repo, cd into the root directory and make sure that all buildr dependencies are satisfied and RVM is happy.

2. If it's a pristine repo, edit bundle-conf.xml in the root directory and fill in values for all the passwords. All those values can be grabbed from any workflow instance's Configuration settings.

3. Run buildr package, which puts ready-to-deploy bundle files into target/workflow.

4. Before killing the old bundle instance, make sure there are no workflow instances running. The best time is at the end of an hour. The only current exception is the ready-check jobs, which always run and are safe to leave running while restarting the bundle.

5. Grep the bundle id with oozie jobs -jobtype bundle -filter status=RUNNING | grep TheWorkflow

6. Kill the bundle with oozie job -kill <bundle-id>

7. Remove the old installation of the bundle: hadoop fs -rm -r /deploy/TheWorkflow/*

8. Deploy the new installation: hadoop fs -copyFromLocal target/workflow/* /deploy/TheWorkflow/

9. Start the bundle: oozie job -submit -config ./bundle-conf.xml

10. Check that it's running as in (5), and make sure there are no errors in the log with: oozie job -log <new-bundle-id>

COLLECTIVE CELOS

CELOS IS A TOOL FOR RUNNING, TESTING, AND MAINTAINING HADOOP DATA APPLICATIONS THAT IS DESIGNED TO BE SIMPLE AND FLEXIBLE.

Configurable — It’s your job to make it usable.

Elegant — The only use case is making me feel smart.

Lightweight — I don’t understand the use-cases the alternatives solve.

Opinionated — I don’t believe that your use case exists.

Simple — It solves my use case.

(from the Devil's Dictionary of Programming)

GENERAL IDEA

A workflow definition is a JS file that utilizes an Oozie workflow.xml (more like actions). Move the complexity to JS.

[Diagram: WORKFLOW DEFINITION (JS FILE) utilizes OOZIE WORKFLOW.XML]

WHAT IS A SLOT?

A SLOT IS A WORKFLOW ID + DATE.

[Diagram: a workflow definition as a row of slots, all in state WAIT WAIT WAIT WAIT]

WHAT IS A TRIGGER?

A TRIGGER SAYS IF THE WORKFLOW FOR A DATE CAN BE STARTED.

[Diagram: a workflow definition with one trigger per slot: TRIGGER TRIGGER TRIGGER TRIGGER over WAIT WAIT WAIT WAIT]

WHAT DOES A TRIGGER DO?

THE SCHEDULER ITERATES EVERY ONCE IN A WHILE (1 MIN) AND CHECKS THE TRIGGERS.

WHEN A TRIGGER RETURNS TRUE, THE SLOT IS READY!

[Diagram: a workflow definition with slots WAIT READY WAIT READY]
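Conceptually, the scheduler's loop looks roughly like the sketch below. This is an illustrative sketch only, not Celos's actual (Java) implementation; the slot, trigger and strategy objects are hypothetical stand-ins for the concepts on these slides.

// Hypothetical sketch of one scheduler iteration (not Celos source code).
function schedulerStep(workflows, now) {
    workflows.forEach(function (wf) {
        // The schedule decides which slots (dates) exist up to "now".
        var slots = wf.schedule.slotsUpTo(now);
        slots.forEach(function (slot) {
            if (slot.state === "WAIT" && wf.trigger.isReady(slot.date)) {
                slot.state = "READY"; // the trigger fired: this slot can start
            }
        });
        // The scheduling strategy picks which READY slots to submit (e.g. serially).
        var ready = slots.filter(function (s) { return s.state === "READY"; });
        wf.schedulingStrategy.pick(ready).forEach(function (slot) {
            slot.oozieId = wf.externalService.submit(slot.date);
            slot.state = "RUN";
        });
    });
}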

WORKFLOW STARTS

THE SCHEDULER STARTS THE OOZIE WORKFLOW AND PROVIDES THE REQUIRED PARAMETERS. APACHE OOZIE RUNS IT ON THE HADOOP CLUSTER AND RETURNS AN OOZIE WORKFLOW ID.

[Diagram: a workflow definition with slots WAIT RUN WAIT RUN]

SCHEDULER UPDATES SLOT STATUSES

A FAILED SLOT CAN BE RESTARTED, AND YOU HAVE OOZIE POWERS TO DEBUG THIS GUY.

[Diagram: a workflow definition with slots WAIT FAIL WAIT RUN]

IF YOU RESTART…

THE RESTARTED SLOT GOES BACK TO THE WAIT STATE.

[Diagram: a workflow definition with slots WAIT WAIT WAIT WAIT]

HOW TO CREATE A WORKFLOW

workflow.js:

addWorkflow({
    "id": "wordcount",
    "maxRetryCount": 5,
    "startTime": "2014-03-10T00:00Z",
    "schedule": hourlySchedule(),
    "schedulingStrategy": serialSchedulingStrategy(),
    "trigger": hdfsCheckTrigger(
        "/user/celos/samples/wordcount/input/${year}-${month}-${day}T${hour}00.txt"
    ),
    "externalService": oozieExternalService({
        "oozie.wf.application.path": "/user/celos/wordcount/workflow/workflow.xml",
        "inputDir": "/user/celos/samples/wordcount/input",
        "outputDir": "/user/celos/samples/wordcount/output"
    })
});
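For intuition, the ${year}/${month}/${day}/${hour} placeholders in the trigger path are filled in from each slot's date. A minimal hypothetical illustration of that substitution (the real expansion is done by Celos itself):

// Hypothetical illustration of how the path template resolves for one slot.
var slotDate = { year: "2014", month: "03", day: "10", hour: "05" };
var template = "/user/celos/samples/wordcount/input/${year}-${month}-${day}T${hour}00.txt";
var path = template.replace(/\$\{(\w+)\}/g, function (m, name) { return slotDate[name]; });
// path is now "/user/celos/samples/wordcount/input/2014-03-10T0500.txt"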

HOW TO CREATE SHARED UTILITY FUNCTIONS

defaults.js:

function emailAlert(workflowName, attachedWorkflowName, timeout /* in hours */) {
    addWorkflow({
        "id": workflowName,
        "trigger": andTrigger(delayTrigger(60 * 60 * timeout),
                              notTrigger(successTrigger(attachedWorkflowName))),
        "schedule": dependentSchedule(attachedWorkflowName),
        "schedulingStrategy": serialSchedulingStrategy(),
        "externalService": oozieExternalService({
            "oozie.wf.application.path": "/user/celos/app/ftas/workflow.xml",
            "sendAlerts": "[email protected],[email protected]",
            "workflowName": attachedWorkflowName
        })
    });
}

CELOS TRIGGERS

alwaysTrigger()

successTrigger("workflow-1")

orTrigger(trigger1, trigger2)

notTrigger(trigger)

hdfsCheckTrigger("/my/path")

delayTrigger(time_ms)

offsetTrigger(time_ms, trigger)

Combinations: wait for today's and yesterday's data:

andTrigger(hdfsCheckTrigger(...), offsetTrigger(-ONE_DAY, hdfsCheckTrigger(...)))

CELOS SCHEDULES

hourlySchedule()

minutelySchedule()

cronSchedule("0 0 * * * ?") – run every hour

dependentSchedule("workflow-1")

TBD: iso8601Schedule for a better human experience
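As a sketch of how schedules and triggers combine, here is a hypothetical daily workflow that waits for both today's and yesterday's input. Only the helper functions come from the slides above; the paths, the workflow id and the ONE_DAY constant are assumptions for illustration.

// Hypothetical daily workflow; paths and ONE_DAY are illustrative assumptions.
var ONE_DAY = 24 * 60 * 60 * 1000; // offsetTrigger takes milliseconds

addWorkflow({
    "id": "daily-report",
    "schedule": cronSchedule("0 0 0 * * ?"), // run once a day at midnight
    "schedulingStrategy": serialSchedulingStrategy(),
    "trigger": andTrigger(
        hdfsCheckTrigger("/data/${year}-${month}-${day}/input.txt"),
        offsetTrigger(-ONE_DAY, hdfsCheckTrigger("/data/${year}-${month}-${day}/input.txt"))
    ),
    "externalService": oozieExternalService({
        "oozie.wf.application.path": "/user/celos/daily-report/workflow.xml"
    })
});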

CELOS POWER!

function emailAlert(workflowName, attachedWorkflowName, teamToSendAlerts, schedule, timeout /* in hours */) {
    addWorkflow({
        "id": workflowName,
        "trigger": andTrigger(delayTrigger(60 * 60 * timeout),
                              notTrigger(successTrigger(attachedWorkflowName))),
        "schedule": schedule,
        "schedulingStrategy": serialSchedulingStrategy(),
        "externalService": oozieExternalService({
            "oozie.wf.application.path": "/user/celos/app/ftas/workflow.xml",
            "sendAlerts": teamToSendAlerts,
            "workflowName": attachedWorkflowName
        })
    });
}
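A hypothetical invocation of the helper above; the workflow names, the alert address and the 3-hour timeout are made up for illustration:

// Alert the team if "wordcount" hasn't succeeded within 3 hours.
emailAlert("wordcount-alert", "wordcount", "[email protected]",
           dependentSchedule("wordcount"), 3);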

CASE STUDY: PROJECT SAWBILL

PROJECT:

4 workflows per region, 2 regions (with slightly different parameters per region)

BEFORE:

A ~150 LOC bash script generates 2 ~200 LOC bundle XMLs; each references 4 ~50 LOC coordinator XMLs; each has its own workflow.xml (~100 LOC)

~1750 LOC in 19 files

AFTER:

4 workflow.xml files + ~250 LOC of JS in a single file

~650 LOC in 5 files
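The JS side of the "after" setup presumably boils down to a loop like the hypothetical sketch below. The region and step names, parameters and paths are invented for illustration; only addWorkflow and the helpers shown earlier come from the slides.

// Hypothetical sketch: generating 4 workflows per region for 2 regions in plain JS.
["us", "eu"].forEach(function (region) {
    ["import", "transform", "aggregate", "export"].forEach(function (step) {
        addWorkflow({
            "id": "sawbill-" + region + "-" + step,
            "schedule": hourlySchedule(),
            "schedulingStrategy": serialSchedulingStrategy(),
            "trigger": hdfsCheckTrigger("/sawbill/" + region + "/" + step + "/${year}-${month}-${day}"),
            "externalService": oozieExternalService({
                "oozie.wf.application.path": "/deploy/sawbill/" + step + "/workflow.xml",
                "region": region // the slightly different per-region parameter
            })
        });
    });
});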

USER INTERFACE

[UI screenshot, 2015-05-19T08:14:08Z: workflows such as "Google Integration" and "Sibyl S2S" against a two-hourly timeline (20:00 to 08:00), each cell showing a slot status (WAIT, RDY, RUN, FAIL, SUCCESS); e.g. 2015-05-19T03:00:00.000Z: SUCCESS]

USER INTERFACE

[UI screenshot, 2015-05-19T08:26:21Z: workflows such as "hive-monitoring", "ad transaction log copy" and "Monitoring" against a half-hourly timeline (04:30 to 08:00), with slots in WAIT and RUN states]

ANYTHING ELSE?

MEET CELOS-CI PROJECT

Continuous Deployment

Testing

On top of Celos

DEPLOY PROCESS

Celos CI takes a target.json and a deploy dir (workflow.js, defaults.js, workflow.xml, libs) and deploys the workflow to the Hadoop cluster.

[Diagram: DEPLOY DIR + TARGET.JSON → CELOS CI → WORKFLOW on the HADOOP CLUSTER]

CONTINUOUS DEPLOYMENT

COMMIT TO DEVELOP BRANCH ON GITHUB → JENKINS BUILD WITH TESTS → GITHUB NOTIFICATION → MERGE TO MASTER → DEPLOY TO CLUSTER

TESTING: GENERAL IDEA

The regular steps to test something:

1. Prepare input data

2. Prepare the expected output data

3. Load the input fixtures

4. Run the workflow on the input data

5. Compare the actual output with the output fixtures

Can we test Hadoop workflows like this?

STEPS 3, 5. TEST CASE DESCRIPTION

Describe your test in test.js:

ci.addTestCase({
    name: "wordcount test case 1",
    sampleTimeStart: "2013-12-20T16:00Z",
    sampleTimeEnd: "2013-12-20T17:00Z",
    inputs: [
        ci.hdfsInput(ci.fixDirFromResource("test-1/input"), "/input/wordcount")
    ],
    outputs: [
        ci.plainCompare(ci.fixDirFromResource("test-1/output"), "/output/wordcount")
    ]
});

STEP 4. RUN WORKFLOW ON INPUT DATA

Augment workflow.js: mark all input and output paths with hdfsPath():

addWorkflow({
    "id": "wordcount",
    "maxRetryCount": 5,
    "startTime": "2014-03-10T00:00Z",
    "schedule": hourlySchedule(),
    "schedulingStrategy": serialSchedulingStrategy(),
    "trigger": hdfsCheckTrigger(
        hdfsPath("/user/celos/samples/wordcount/input/${year}-${month}-${day}T${hour}00.txt")
    ),
    "externalService": oozieExternalService({
        "oozie.wf.application.path": hdfsPath("/user/celos/wordcount/workflow/workflow.xml"),
        "inputDir": hdfsPath("/user/celos/samples/wordcount/input"),
        "outputDir": hdfsPath("/user/celos/samples/wordcount/output")
    })
});

SANDBOXED PATHS

CELOS CI TEST CASE

1. Create a sandboxed environment in the Hadoop cluster: Celos CI reads target.json and starts a Celos server (on Jetty) under a fresh UUID.

2. Load input data into the sandbox: Celos CI copies the input fixtures from test.js into the Hadoop cluster.

3. Run the Celos server against the sandboxed inputs: the augmented workflow runs on the Hadoop cluster under the UUID.

4. Compare outputs in the sandbox with the expected fixtures: Celos CI fetches the actual results from the cluster and compares them with the output fixtures from test.js.

ANY REAL EXAMPLES FOR TESTING?

ci.addTestCase({
    name: "Accordion test case 1",
    sampleTimeStart: "2014-12-01T00:00Z",
    sampleTimeEnd: "2014-12-01T00:00Z",
    inputs: [
        ci.hdfsInput(ci.fixDirFromResource("test-1/logs"), "/logs"),
        ci.hdfsInput(ci.fixDirFromResource("test-1/warehouse"), "/user/hive/warehouse"),
        ci.hiveInput("db_name")
    ],
    outputs: [
        ci.jsonCompare(
            ci.fixFileFromResource("test-1/output/accordion/2014-12-01/00/preprocessor.json"),
            ci.expandJson(ci.tableToJson(ci.hiveTable("db_name", "preprocessor")), ["preprocessor.change"])
        )
    ]
});

CONTINUOUS DEPLOYMENT WITH TEST CLUSTER

COMMIT TO DEVELOP BRANCH ON GITHUB → JENKINS BUILD WITH TESTS → RUN TESTS ON TEST CLUSTER → GITHUB NOTIFICATION → MERGE TO MASTER → DEPLOY TO PRODUCTION CLUSTER

WHAT DO WE HAVE IN CELOS

All Oozie benefits

The flexibility of a scripting language

Easy to monitor and manage

Easy to create a set of workflows

Flexible schedules (data-based, dependency-based, time-based)

Easy to solve possible problems and add new features (1.6k LOC in Java)

Reliable & robust

Bug-free software (1 minor bug in a year of production with > 100 workflows)

Meta-workflows such as data retention are just several lines of JS

Understandable by DevOps and data scientists

Continuous deployment

Testing

PERFECT? ALMOST…

Only Oozie is supported for now

No workflow dependency graph in the UI

If you change a schedule, you can't see its history

The test fixture setup experience could be better

A test run may be harmful if you do it wrong

No logs in the UI

No HA

RESOURCES

Learn about workflow engines:

http://nerds.airbnb.com/introducing-chronos/

http://data.linkedin.com/opensource/azkaban

https://github.com/spotify/luigi

http://falcon.apache.org/

"Workflow Engines for Hadoop" by Joe Crobak: https://youtu.be/PTwi89n72Mc

Learn about the Celos project:

https://github.com/collectivemedia/celos

Contact: [email protected], [email protected]