WHO AM I
Hello
I am Alexander Konopko from Sigma
I've been doing Software Engineering for 8 years so far
For the last 2 years I have worked for Collective,
primarily on the Celos project
CRON IS INSUFFICIENT
No reruns
No monitoring
Only time-based schedules
crontab is easy to break and hard to maintain
WHAT WE WOULD LIKE TO SEE IN A WORKFLOW ENGINE
Easy to monitor and manage
Easy to create a set of workflows
Flexible schedules (data-based, dependency-based, time-based)
Easy to solve possible problems (good community and/or easy to hack on)
Meta-workflows such as notifications, data cleaning, etc.
Security
Continuous deployment
Testing
WHAT DO WE HAVE ON THE MARKET
Apache Oozie
LinkedIn Azkaban
Airbnb Chronos
Spotify Luigi
...Collective Celos!
AIRBNB CHRONOS
Doesn't really fit Hadoop workflows; it's rather a general-purpose scheduler
cron on top of Mesos
Hard to see historical data in the UI
Uses SH to run jobs
Only time-based schedules
LINKEDIN AZKABAN
GOOD:
Has a very good UI
Previously had an awful design, but v2 is much better
Has most of Oozie's features now
BAD:
No data-based schedules
Workflow model is somewhat limited and restrictive compared to Luigi and even Oozie
Support is not as good as Oozie's, and the product is less mature
SPOTIFY LUIGI
GOOD:
Flexible (workflows are Python tasks)
6k LOC
Used at Foursquare, Spotify and several others
BAD:
Community is not that big
Only Unix security
Uses the Hadoop CLI, which has to start a JVM each time and can be slow
Saves state only when it shuts down gracefully; otherwise all jobs must be restarted
Was designed with no job history; history is now an experimental feature. No history in the UI
APACHE OOZIE
GOOD:
Most widely used and mature
Very strong support from Hadoop
community
Integration with HUE, Cloudera
Manager
Great documentation
Default support for Pig, Hive, Java,
FileSystem
Hadoop Security, Kerberos
Scalable
BAD:
Stateful Coordinators
XML... Tons of XML
Rigid
Poor UI
OOZIE WORLD: TO RESTART A BUNDLE…
1. Clone the repo, cd into the root directory, and make sure that all buildr dependencies are satisfied and RVM is happy
2. If it's a pristine repo, edit bundle-conf.xml in the root directory and fill in values for all the passwords. All those values can be grabbed from any workflow instance's Configuration settings.
3. Run buildr package, which puts ready-to-deploy bundle files into target/workflow.
4. Before killing the old bundle instance, make sure there are no workflow instances running. The best time would be at the end of an hour. The only current exception is the ready-check jobs, which always run and are safe to leave running while restarting the bundle.
5. Grep the bundle id with: oozie jobs -jobtype bundle -filter status=RUNNING | grep TheWorkflow
6. Kill the bundle with: oozie job -kill <bundle-id>
7. Remove the old installation of the bundle: hadoop fs -rm -r /deploy/TheWorkflow/*
8. Deploy the new installation: hadoop fs -copyFromLocal target/workflow/* /deploy/TheWorkflow/
9. Start the bundle: oozie job -submit -config ./bundle-conf.xml
10. Check that it's running as in (5), and make sure there are no errors in the log with: oozie job -log <new-bundle-id>
COLLECTIVE CELOS
CELOS IS A TOOL FOR RUNNING, TESTING, AND MAINTAINING HADOOP DATA
APPLICATIONS THAT IS DESIGNED TO BE SIMPLE AND FLEXIBLE.
Configurable — It’s your job to make it usable.
Elegant — The only use case is making me feel smart.
Lightweight — I don’t understand the use-cases the alternatives solve.
Opinionated — I don’t believe that your use case exists.
Simple — It solves my use case.
(from the Devil's Dictionary of Programming)
GENERAL IDEA
(Diagram: a workflow definition in a JS file utilizes an Oozie workflow.xml, which is more like actions. Move the complexity to JS.)
WHAT IS A TRIGGER
A trigger says whether the workflow for a given date can be started.
(Diagram: a workflow definition with one trigger per slot; all slots in WAIT.)
WHAT DOES A TRIGGER DO
The scheduler iterates over the triggers every once in a while (every 1 minute) and checks them.
When a trigger returns true, the workflow for that slot is ready.
(Diagram: slots move from WAIT to READY.)
WORKFLOW STARTS
The scheduler starts the Oozie workflow on the Hadoop cluster and provides the required parameters; it gets back the Oozie workflow id.
(Diagram: slots move from READY to RUN.)
WORKFLOW STARTS
The scheduler updates the slot statuses. A failed slot can be restarted, and you have full Oozie powers to debug it.
(Diagram: slot statuses WAIT / FAIL / WAIT / RUN.)
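The scheduling loop from the last few slides can be sketched in plain JavaScript. This is only an illustrative model: `makeSlots`, `schedulerStep` and the trigger predicate are hypothetical names, not Celos internals.

```javascript
// Illustrative model of the Celos scheduling loop (not real Celos code).
// A slot is one (workflow, date) pair with a status.
function makeSlots(dates) {
  return dates.map(date => ({ date, status: "WAIT" }));
}

// One scheduler iteration (in Celos: roughly once a minute). Each WAIT
// slot's trigger is checked; if it fires, the slot becomes READY and the
// external service (Oozie) is asked to run it with the slot's date.
function schedulerStep(slots, trigger, submitToOozie) {
  for (const slot of slots) {
    if (slot.status === "WAIT" && trigger(slot.date)) {
      slot.status = "READY";
      slot.oozieId = submitToOozie(slot.date);
      slot.status = "RUN";
    }
  }
}

const slots = makeSlots(["2015-05-19T06:00Z", "2015-05-19T07:00Z"]);
// Stand-in for an HDFS check: only the 06:00 slot's data has arrived.
const dataReady = date => date === "2015-05-19T06:00Z";
schedulerStep(slots, dataReady, date => "oozie-wf-" + date);
// slots[0] is now RUN (with an Oozie workflow id); slots[1] keeps WAITing
```

Because the state lives per slot, a FAIL slot can simply be flipped back to WAIT and picked up by the next iteration, which is what "can be restarted" means above.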
HOW TO CREATE A WORKFLOW
workflow.js:
addWorkflow({
    "id": "wordcount",
    "maxRetryCount": 5,
    "startTime": "2014-03-10T00:00Z",
    "schedule": hourlySchedule(),
    "schedulingStrategy": serialSchedulingStrategy(),
    "trigger": hdfsCheckTrigger(
        "/user/celos/samples/wordcount/input/${year}-${month}-${day}T${hour}00.txt"
    ),
    "externalService": oozieExternalService({
        "oozie.wf.application.path": "/user/celos/wordcount/workflow/workflow.xml",
        "inputDir": "/user/celos/samples/wordcount/input",
        "outputDir": "/user/celos/samples/wordcount/output"
    })
});
HOW TO CREATE SHARED UTILITY FUNCTIONS
defaults.js:
function emailAlert(workflowName, attachedWorkflowName, timeout /* in hours */) {
    addWorkflow({
        "id": workflowName,
        "trigger": andTrigger(delayTrigger(60*60*timeout),
                              notTrigger(successTrigger(attachedWorkflowName))),
        "schedule": dependentSchedule(attachedWorkflowName),
        "schedulingStrategy": serialSchedulingStrategy(),
        "externalService": oozieExternalService({
            "oozie.wf.application.path": "/user/celos/app/ftas/workflow.xml",
            "sendAlerts": "[email protected],[email protected]",
            "workflowName": attachedWorkflowName
        })
    });
}
CELOS TRIGGERS
alwaysTrigger()
successTrigger("workflow-1")
orTrigger(trigger1, trigger2)
notTrigger(trigger)
hdfsCheckTrigger("/my/path")
delayTrigger(time_ms)
offsetTrigger(time_ms, trigger)
Combinations: wait for today's and yesterday's data
andTrigger(hdfsCheckTrigger(...), offsetTrigger(-ONE_DAY, hdfsCheckTrigger(...)))
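Since each trigger just answers "can the workflow for this date start?", the combinators compose naturally. A minimal sketch, modeling a trigger as a predicate over the slot date; the in-memory fake HDFS and the path helper are assumptions for illustration, not the real Celos implementations.

```javascript
// Triggers modeled as predicates over the slot date (illustrative only).
const alwaysTrigger = () => date => true;
const notTrigger = t => date => !t(date);
const andTrigger = (...ts) => date => ts.every(t => t(date));
const orTrigger = (...ts) => date => ts.some(t => t(date));

const ONE_DAY = 24 * 60 * 60 * 1000;
// Re-evaluate a trigger as of date + offsetMs (dates kept as YYYY-MM-DD).
const offsetTrigger = (offsetMs, t) => date =>
  t(new Date(Date.parse(date) + offsetMs).toISOString().slice(0, 10));

// Fake in-memory "HDFS": only these paths exist.
const fakeHdfs = new Set(["/data/2015-05-19", "/data/2015-05-18"]);
const hdfsCheckTrigger = pathFor => date => fakeHdfs.has(pathFor(date));

// The slide's combination: wait for today's and yesterday's data.
const dayPath = d => "/data/" + d;
const bothDays = andTrigger(
  hdfsCheckTrigger(dayPath),
  offsetTrigger(-ONE_DAY, hdfsCheckTrigger(dayPath))
);
// bothDays("2015-05-19") fires; bothDays("2015-05-18") does not,
// because /data/2015-05-17 is missing.
```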
CELOS SCHEDULES
hourlySchedule()
minutelySchedule()
cronSchedule("0 0 * * * ?") – run every hour
dependentSchedule("workflow-1")
TBD: iso8601Schedule for a better human experience
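A schedule's job in this model is to enumerate the slot times between the workflow's startTime and "now". The hourly case can be sketched as follows; `hourlySlotTimes` is a hypothetical helper, not the real hourlySchedule() implementation.

```javascript
// Sketch: enumerate the hourly slot times from startTime up to "now"
// (hypothetical helper, not the real Celos hourlySchedule).
function hourlySlotTimes(startIso, nowIso) {
  const HOUR = 60 * 60 * 1000;
  const slots = [];
  for (let t = Date.parse(startIso); t <= Date.parse(nowIso); t += HOUR) {
    // Keep the same "YYYY-MM-DDTHH:MMZ" shape used elsewhere in the deck.
    slots.push(new Date(t).toISOString().slice(0, 16) + "Z");
  }
  return slots;
}

const times = hourlySlotTimes("2014-03-10T00:00Z", "2014-03-10T03:00Z");
// four slots: 00:00, 01:00, 02:00 and 03:00
```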
CELOS POWER!
function emailAlert(workflowName, attachedWorkflowName, teamToSendAlerts, schedule, timeout /* in hours */) {
    addWorkflow({
        "id": workflowName,
        "trigger": andTrigger(delayTrigger(60*60*timeout),
                              notTrigger(successTrigger(attachedWorkflowName))),
        "schedule": schedule,
        "schedulingStrategy": serialSchedulingStrategy(),
        "externalService": oozieExternalService({
            "oozie.wf.application.path": "/user/celos/app/ftas/workflow.xml",
            "sendAlerts": teamToSendAlerts,
            "workflowName": attachedWorkflowName
        })
    });
}
CASE STUDY: PROJECT SAWBILL
PROJECT:
4 workflows per region, 2 regions
(with slightly different parameters per region)
BEFORE:
a ~150 LOC bash script generates
2 ~200 LOC bundle XMLs,
each referencing 4 ~50 LOC coordinator XMLs,
each with its own workflow.xml (~100 LOC)
~1750 LOC in 19 files
AFTER:
4 workflow.xml files + ~250 LOC of JS in a single file
~650 LOC in 5 files
USER INTERFACE
(Screenshot of the slot table as of 2015-05-19T08:14:08Z: rows are workflows such as "Google Integration" and "Sibyl S2S", columns are two-hourly slots from 20:00 to 08:00, and each cell shows a status such as WAIT, RDY, RUN, FAIL, or, on hover, "2015-05-19T03:00:00.000Z: SUCCESS".)
USER INTERFACE
(Screenshot as of 2015-05-19T08:26:21Z: half-hourly slots from 04:30 to 08:00 for workflows such as "hive-monitoring", "ad transaction log copy" and "Monitoring", with statuses WAIT and RUN.)
DEPLOY PROCESS
(Diagram: Celos CI reads target.json and copies the deploy dir (workflow.js, defaults.js, workflow.xml, libs) to the workflow deploy dir on the Hadoop cluster.)
CONTINUOUS DEPLOYMENT
COMMIT TO DEVELOP BRANCH ON GITHUB
→ JENKINS BUILD WITH TESTS
→ GITHUB NOTIFICATION
→ MERGE TO MASTER
→ DEPLOY TO CLUSTER
TESTING: GENERAL IDEA
The regular steps to test something:
1. Prepare input data
2. Prepare expected output data
3. Load input fixtures
4. Run the workflow on the input data
5. Compare the actual output with the output fixtures
Can we test Hadoop workflows like this?
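The five steps are the ordinary fixture-based testing pattern. A sketch with an in-memory map standing in for HDFS; all names here are illustrative, and the hard part for real Hadoop workflows is of course sandboxing the paths on a live cluster.

```javascript
// Generic fixture runner for the five steps above (illustrative).
// An in-memory Map stands in for HDFS.
function runFixtureTest(workflow, inputFixtures, expectedOutputs) {
  const fs = new Map();
  // Step 3: load input fixtures
  for (const [path, data] of Object.entries(inputFixtures)) fs.set(path, data);
  // Step 4: run the workflow on the input data
  workflow(fs);
  // Step 5: compare actual output with output fixtures
  return Object.entries(expectedOutputs).every(
    ([path, data]) => fs.get(path) === data
  );
}

// Toy "wordcount" workflow: counts the words in /input.
const wordcount = fs =>
  fs.set("/output", String(fs.get("/input").trim().split(/\s+/).length));

const ok = runFixtureTest(
  wordcount,
  { "/input": "to be or not to be" }, // steps 1-2: prepared fixtures
  { "/output": "6" }
);
```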
STEPS 3, 5. TEST CASE DESCRIPTION
Describe your test in test.js:
ci.addTestCase({
    name: "wordcount test case 1",
    sampleTimeStart: "2013-12-20T16:00Z",
    sampleTimeEnd: "2013-12-20T17:00Z",
    inputs: [
        ci.hdfsInput(ci.fixDirFromResource("test-1/input"), "/input/wordcount")
    ],
    outputs: [
        ci.plainCompare(ci.fixDirFromResource("test-1/output"), "/output/wordcount")
    ]
});
STEP 4. RUN WORKFLOW ON INPUT DATA
Augment workflow.js: mark all input paths.
addWorkflow({
    "id": "wordcount",
    "maxRetryCount": 5,
    "startTime": "2014-03-10T00:00Z",
    "schedule": hourlySchedule(),
    "schedulingStrategy": serialSchedulingStrategy(),
    "trigger": hdfsCheckTrigger(
        hdfsPath("/user/celos/samples/wordcount/input/${year}-${month}-${day}T${hour}00.txt")
    ),
    "externalService": oozieExternalService({
        "oozie.wf.application.path": hdfsPath("/user/celos/wordcount/workflow/workflow.xml"),
        "inputDir": hdfsPath("/user/celos/samples/wordcount/input"),
        "outputDir": hdfsPath("/user/celos/samples/wordcount/output")
    })
});
SANDBOXED PATHS
CELOS CI TEST CASE
1. Create a sandboxed environment in the Hadoop cluster
2. Load input data into the sandbox
3. Run the Celos server against the sandboxed inputs
4. Compare the outputs in the sandbox with the expected fixtures
(Diagram, built up over four slides: Celos CI reads target.json and initializes a Celos server under Jetty with a per-run UUID; it loads the input fixtures from test.js into the Hadoop cluster, runs the augmented workflow, then fetches the actual results and compares them with the output fixtures from test.js.)
ANY REAL EXAMPLES FOR TESTING?
ci.addTestCase({
    name: "Accordion test case 1",
    sampleTimeStart: "2014-12-01T00:00Z",
    sampleTimeEnd: "2014-12-01T00:00Z",
    inputs: [
        ci.hdfsInput(ci.fixDirFromResource("test-1/logs"), "/logs"),
        ci.hdfsInput(ci.fixDirFromResource("test-1/warehouse"), "/user/hive/warehouse"),
        ci.hiveInput("db_name")
    ],
    outputs: [
        ci.jsonCompare(
            ci.fixFileFromResource("test-1/output/accordion/2014-12-01/00/preprocessor.json"),
            ci.expandJson(ci.tableToJson(ci.hiveTable("db_name", "preprocessor")), ["preprocessor.change"])
        )
    ]
});
CONTINUOUS DEPLOYMENT WITH TEST CLUSTER
COMMIT TO DEVELOP BRANCH ON GITHUB
→ JENKINS BUILD WITH TESTS
→ RUN TESTS ON TEST CLUSTER
→ GITHUB NOTIFICATION
→ MERGE TO MASTER
→ DEPLOY TO PRODUCTION CLUSTER
WHAT DO WE HAVE IN CELOS
All Oozie benefits
Flexibility of a scripting language
Easy to monitor and manage
Easy to create a set of workflows
Flexible schedules (data-based, dependency-based, time-based)
Easy to solve possible problems and add new features (1.6k LOC in Java)
Reliable & robust
Bug-free software (1 minor bug in a year of production with > 100 workflows)
Meta-workflows such as data retention are just a few lines of JS
Understandable by DevOps and data scientists
Continuous deployment
Testing
PERFECT? ALMOST…
Only Oozie is supported for now
No workflow dependency graph in the UI
If you change a schedule, you can't see its history
Test fixture setup experience could be better
A test run may be harmful if you do it wrong
No logs in the UI
No HA
RESOURCES
Learn about Workflow engines
http://nerds.airbnb.com/introducing-chronos/
http://data.linkedin.com/opensource/azkaban
https://github.com/spotify/luigi
http://falcon.apache.org/
Workflow Engines for Hadoop by Joe Crobak
https://youtu.be/PTwi89n72Mc
Learn about Celos project
https://github.com/collectivemedia/celos