Oozie: Scheduling Workflows On the Grid Mohammad K Islam kamrul@yahoo‐inc.com

Oozie Summit 2011


DESCRIPTION

Presented at Hadoop Summit


Page 1: Oozie Summit 2011

Oozie: Scheduling Workflows on the Grid

Mohammad K Islam
kamrul@yahoo‐inc.com

Page 2: Oozie Summit 2011

Agenda
•  Oozie Overview
•  Oozie 3.x features:
   –  Bundle
   –  Scalability
   –  Usability
•  Challenges
•  Future Plan
•  Q&A

Page 3: Oozie Summit 2011

Overview: Workflow
•  Oozie executes workflows defined as a DAG of jobs.
•  Job types include Map-Reduce, Pipes, Streaming, Pig, custom Java code, etc.
•  Introduced in Oozie 1.x.

[Figure: example workflow DAG. Nodes: start, M/R job, M/R streaming job, decision (branches MORE and ENOUGH), fork, Pig job, M/R job, join, Java, FS job, end.]
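To make the "DAG of jobs" idea concrete, here is a minimal Python sketch that orders the nodes of a workflow so each one runs only after its predecessors. The node names follow the diagram, but the edges are an assumption, and the sketch treats the decision node as a plain step rather than choosing between the MORE and ENOUGH branches. It is an illustration, not how Oozie itself schedules actions.

```python
from graphlib import TopologicalSorter

# Hypothetical wiring: each key maps to the set of nodes it depends on.
# Node names come from the slide; the edges are assumed for illustration.
workflow = {
    "mr_job": {"start"},
    "mr_streaming_job": {"mr_job"},
    "decision": {"mr_streaming_job"},
    "fork": {"decision"},
    "pig_job": {"fork"},
    "mr_job_2": {"fork"},
    "join": {"pig_job", "mr_job_2"},
    "fs_job": {"join"},
    "end": {"fs_job"},
}

def execution_waves(dag):
    """Return lists of nodes that may run in parallel, in dependency order."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # nodes whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = execution_waves(workflow)
```

The two jobs after the fork land in the same wave, which is the point of the fork/join pair: they can run concurrently, and the join waits for both.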

Page 4: Oozie Summit 2011

Overview: Coordinator
•  Oozie executes workflows based on:
   –  Time dependency (frequency)
   –  Data dependency
•  Introduced in Oozie 2.x.

[Figure: Oozie architecture. Labels: Oozie Client, WS API, Oozie Server (Oozie Coordinator, Oozie Workflow), Check Data Availability, Hadoop.]
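The two dependency types can be sketched in Python as follows: a frequency generates one nominal time per action, and an action becomes ready only when its nominal time has passed and its input data exists. The function names, the daily path layout, and the in-memory set of available paths are all assumptions made for illustration; they are not Oozie's API.

```python
from datetime import datetime, timedelta

def nominal_times(start, end, frequency):
    """Yield the nominal time of each coordinator action in [start, end)."""
    current = start
    while current < end:
        yield current
        current += frequency

def ready_actions(times, dataset_path, available_paths, now):
    """Return nominal times whose time and data dependencies are both met."""
    ready = []
    for t in times:
        time_ok = t <= now                               # time dependency
        data_ok = dataset_path(t) in available_paths     # data dependency
        if time_ok and data_ok:
            ready.append(t)
    return ready

def daily_path(t):
    """Hypothetical dataset layout: one directory per day."""
    return t.strftime("/data/%Y/%m/%d")

times = list(nominal_times(datetime(2011, 6, 1), datetime(2011, 6, 4), timedelta(days=1)))
available = {"/data/2011/06/01", "/data/2011/06/03"}
ready = ready_actions(times, daily_path, available, now=datetime(2011, 6, 2, 12))
```

In this example only the June 1 action is ready: June 2 is missing its data, and June 3 has data but its nominal time has not arrived yet.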

Page 5: Oozie Summit 2011

Oozie 3.x: Bundle
•  Users can define and execute a bunch of coordinator applications.
•  Users can start/stop/suspend/resume/rerun at the bundle level.
•  Benefits: Large data pipeline applications become easy to maintain and control for the Service Engineering team.

[Figure: Oozie architecture with bundles. Labels: Oozie Client, WS API, Oozie Server (Bundle, Coordinator, Workflow), Check Data Availability, Hadoop.]
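The bundle-level control described above amounts to fanning one command out to every coordinator in the group. A minimal Python sketch, assuming hypothetical `Coordinator` and `Bundle` classes and simplified status strings (none of this is Oozie's actual object model):

```python
class Coordinator:
    """Hypothetical stand-in for one coordinator application."""
    def __init__(self, name):
        self.name = name
        self.status = "PREP"

class Bundle:
    """Groups coordinators so that one command applies to all of them."""
    def __init__(self, coordinators):
        self.coordinators = list(coordinators)

    def _set_all(self, status):
        for coord in self.coordinators:
            coord.status = status

    def start(self):
        self._set_all("RUNNING")

    def suspend(self):
        self._set_all("SUSPENDED")

    def resume(self):
        self._set_all("RUNNING")

pipeline = Bundle([Coordinator("ingest"), Coordinator("aggregate")])
pipeline.start()
pipeline.suspend()
```

One call on the bundle replaces a separate call per coordinator, which is the maintenance benefit the slide points to.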

Page 6: Oozie Summit 2011

Oozie Abstraction Layers

[Figure: three abstraction layers. Layer 1: Bundle. Layer 2: Coord Job 1 and Coord Job 2, each with Coord Action 1 and Coord Action 2. Layer 3: WF jobs, made up of M/R, Pig, and FS jobs.]

Page 7: Oozie Summit 2011

Enhanced Stability and Scalability

•  Issue:
   –  At very high load, Oozie becomes slow.
   –  This accounts for 90% of all Oozie support incidents.
•  Reason:
   –  Many active but non-progressing jobs.
   –  The Oozie internal queue is full.
•  Resolution:
   –  Throttle the number of active jobs per coordinator.
   –  Put the job into a timeout state.
   –  Enforce uniqueness of Oozie queue elements.
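Two of the resolutions above, uniqueness of queue elements and throttling, can be sketched in a few lines of Python. The class and function names, the key scheme, and the queue-size limit are assumptions for illustration; Oozie's internal queue is not shown in the slides.

```python
from collections import deque

class UniqueQueue:
    """FIFO queue that rejects an element whose key is already queued."""
    def __init__(self, max_size):
        self.max_size = max_size
        self._items = deque()
        self._keys = set()

    def offer(self, key, item):
        """Queue item under key; return False if it is a duplicate or the queue is full."""
        if key in self._keys or len(self._items) >= self.max_size:
            return False
        self._keys.add(key)
        self._items.append((key, item))
        return True

    def poll(self):
        """Remove and return the oldest item, or None if the queue is empty."""
        if not self._items:
            return None
        key, item = self._items.popleft()
        self._keys.discard(key)
        return item

def can_materialize(active_actions, throttle):
    """Throttle: allow a new action only while the coordinator is below its cap."""
    return active_actions < throttle

queue = UniqueQueue(max_size=2)
first = queue.offer("check:job-1", "check job-1")
duplicate = queue.offer("check:job-1", "check job-1 again")
```

Rejecting duplicates keeps a stuck job from filling the queue with copies of the same command, and the throttle caps how many actions one coordinator can hold active at a time.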

Page 8: Oozie Summit 2011

Improved Usability

•  Issue:
   –  The coordinator job status is not intuitive and confuses Oozie users.
•  Reason:
   –  Status SUCCEEDED doesn't mean the job was successful!
   –  Status PREMATER is for Oozie internal use only, but it was exposed to users.
•  Resolution:
   –  Redesign the coordinator status.

Page 9: Oozie Summit 2011

Coordinator Status Redesign

[Figure: coordinator status state diagrams. Current: PREP, PREMATER, RUNNING, KILLED, SUCCEEDED, FAILED, SUSPENDED. New: PREP, RUNNING, KILLED, SUCCEEDED, FAILED, DONE_WITH_ERROR, SUSPENDED, PAUSED.]
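One way to read the redesign is that a finished coordinator job should report whether its actions actually worked. The Python sketch below lists the new statuses and derives a terminal status from action outcomes. The derivation rule (all succeeded, none succeeded, or mixed) is an assumption made for illustration; the slides do not spell out the exact semantics.

```python
from enum import Enum

class CoordStatus(Enum):
    PREP = "PREP"
    RUNNING = "RUNNING"
    SUSPENDED = "SUSPENDED"
    PAUSED = "PAUSED"
    KILLED = "KILLED"
    FAILED = "FAILED"
    SUCCEEDED = "SUCCEEDED"
    DONE_WITH_ERROR = "DONE_WITH_ERROR"

def terminal_status(action_results):
    """Map finished action results ('OK' or 'ERROR') to a job status.

    Assumed rule: every action OK -> SUCCEEDED, no action OK -> FAILED,
    anything in between -> DONE_WITH_ERROR.
    """
    if not action_results:
        raise ValueError("no finished actions")
    ok_count = sum(1 for result in action_results if result == "OK")
    if ok_count == len(action_results):
        return CoordStatus.SUCCEEDED
    if ok_count == 0:
        return CoordStatus.FAILED
    return CoordStatus.DONE_WITH_ERROR

mixed = terminal_status(["OK", "ERROR", "OK"])
```

Under this rule a job with some failed actions no longer reports SUCCEEDED, which addresses the confusion described on the previous slide.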

Page 10: Oozie Summit 2011

The Second Year...
•  Number of releases:
   –  Feature releases: 3
   –  Patches: 9
•  Backward compatibility is strongly maintained.
•  No need to resubmit jobs if Oozie is restarted.
•  Code overhaul:
   –  Redesigned the command pattern to avoid DB connection leaks and to improve DB connection usage.

Page 11: Oozie Summit 2011

Oozie Usages
•  Y! internal usage:
   –  Total number of users: 377
   –  Total number of processed jobs ≈ 600K/month
•  External downloads:
   –  1500+ in the last 8 months from GitHub
   –  A large number of downloads maintained by 3rd-party packaging.

Page 12: Oozie Summit 2011

Oozie Usages Cont.

•  User community:
   –  Membership:
      •  Y! internal: 265
      •  External: 163
   –  Messages (approximate):
      •  Y! internal: 9/day
      •  External: 7/day

Page 13: Oozie Summit 2011

Challenges 1: Data Availability Check

•  Issue:
   –  Currently checks the directory every minute (polling based).
   –  Increases NN overhead and does not scale well.
•  Reason: No metadata system with an appropriate notification mechanism.
•  Planned resolution: Integrate with the HCatalog metadata system.
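The difference between polling and notification can be sketched in Python. The first function estimates how many checks polling generates per day; the class shows the notification alternative, where a waiting action is called back once when its data is published. `DatasetNotifier` and its methods are hypothetical and do not represent HCatalog's API.

```python
from collections import defaultdict

def polling_checks_per_day(num_dependencies, interval_minutes=1):
    """Number of directory checks per day when each dependency is polled."""
    return num_dependencies * (24 * 60 // interval_minutes)

class DatasetNotifier:
    """Hypothetical notification registry: callbacks fire when a path is published."""
    def __init__(self):
        self._waiters = defaultdict(list)

    def wait_for(self, path, callback):
        """Register a callback to run once when path becomes available."""
        self._waiters[path].append(callback)

    def publish(self, path):
        """Announce that path is available and notify everyone waiting on it."""
        for callback in self._waiters.pop(path, []):
            callback(path)

notified = []
notifier = DatasetNotifier()
notifier.wait_for("/data/2011/06/01", notified.append)
notifier.publish("/data/2011/06/01")
```

With one-minute polling, 1,000 outstanding dependencies cost 1,440,000 checks per day, whereas the notification model does work only when data actually arrives.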

Page 14: Oozie Summit 2011

Challenges 2: Adaptability to Hadoop

•  Issue: If the Hadoop NN or JT is down, Oozie still submits the job, which then fails. User intervention is required once the Hadoop server is back.
•  Impact: Inconvenient for Oozie users. For example, if Hadoop is restarted on Friday night, the job will not run until the next Monday.
•  Planned resolution: Graceful handling of Hadoop downtime:
   –  If Hadoop is down, block submission.
   –  When Hadoop becomes available:
      •  Submit the blocked jobs.
      •  Auto-resubmit the untraced jobs.
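The planned block-and-release behavior can be sketched in Python as follows. `Submitter`, its health-check callable, and its submit callable are hypothetical names; the sketch covers only blocked submissions, not the auto-resubmission of untraced jobs.

```python
class Submitter:
    """Holds job submissions while Hadoop is down and releases them afterwards."""
    def __init__(self, is_hadoop_up, submit):
        self.is_hadoop_up = is_hadoop_up  # callable returning True when Hadoop is reachable
        self.submit = submit              # callable that submits one job
        self.blocked = []

    def request(self, job):
        """Submit immediately if Hadoop is up; otherwise block the job."""
        if self.is_hadoop_up():
            self.submit(job)
        else:
            self.blocked.append(job)

    def on_hadoop_available(self):
        """Submit every blocked job, in the order it was requested."""
        pending, self.blocked = self.blocked, []
        for job in pending:
            self.submit(job)

cluster = {"up": False}
submitted = []
submitter = Submitter(lambda: cluster["up"], submitted.append)
submitter.request("job-1")   # Hadoop is down, so the job is held
cluster["up"] = True
submitter.on_hadoop_available()
```

Because the job is held instead of submitted and failed, no user has to step in after the Friday-night restart in the slide's example.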

Page 15: Oozie Summit 2011

Challenges 3: Horizontally Scalable

•  Issue: One instance of Oozie cannot efficiently handle a very large number of jobs (say 100K/hour). In addition, Oozie doesn't support load balancing.
•  Reason: The Oozie internal task queue is not synchronized across multiple Oozie instances.
•  Planned resolution: Use ZooKeeper for coordination.
•  Benefits: As the load increases, add extra Oozie servers.

Page 16: Oozie Summit 2011

Future Plan

•  Automatic failover: Using ZooKeeper.
•  Monitoring: Rich WS API for application monitoring/alerting.
•  Improved usability:
   –  DistCp action
   –  Hive action
•  Asynchronous data processing.
•  Incremental data processing.
•  Apache migration: Work initiated.

Page 17: Oozie Summit 2011

Q&A

Mohammad K Islam
kamrul@yahoo‐inc.com

•  GitHub link: http://yahoo.github.com/oozie
•  Mailing list: [email protected]

Page 18: Oozie Summit 2011

Backup Slides

Page 19: Oozie Summit 2011

Oozie Workflow Application
•  Contents:
   –  A workflow.xml file
   –  Resource files, config files, and Pig scripts
   –  All necessary JAR and native library files
•  Parameters:
   –  The workflow.xml is parameterized; parameters can be propagated to map-reduce, pig, and ssh jobs.
•  Deployment:
   –  In a directory in the HDFS of the Hadoop cluster where the Hadoop and Pig jobs will run.


Page 20: Oozie Summit 2011

Oozie cmd: Running a Workflow Job

Workflow Application Deployment

$ hadoop fs -mkdir hdfs://usr/tucu/wordcount-wf
$ hadoop fs -mkdir hdfs://usr/tucu/wordcount-wf/lib
$ hadoop fs -copyFromLocal workflow.xml wordcount.xml hdfs://usr/tucu/wordcount-wf
$ hadoop fs -copyFromLocal hadoop-examples.jar hdfs://usr/tucu/wordcount-wf/lib
$

Workflow Job Execution

$ oozie run -o http://foo.corp:8080/oozie \
    -a hdfs://bar.corp:9000/usr/tucu/wordcount-wf \
    input=/data/2008/input output=/data/2008/output
Workflow job id [1234567890-wordcount-wf]
$

Workflow Job Status

$ oozie status -o http://foo.corp:8080/oozie -j 1234567890-wordcount-wf
Workflow job status [RUNNING]
...
$
