Oozie Meetup Bowen Zhang

Oozie meetup Hadoop Summit 2014



by Bowen Zhang (Hortonworks)


Page 1: Oozie meetup Hadoop Summit 2014

Oozie Meetup
Bowen Zhang

Page 2: Oozie meetup Hadoop Summit 2014

Agenda

● Oozie Log HA
● Oozie cron scheduling
  ○ Use cases
  ○ Troubleshooting

● Prospective 4.1 release

Page 3: Oozie meetup Hadoop Summit 2014

Oozie Log HA

● HA implemented already
  ○ Server
  ○ HCat
  ○ SLA
  ○ Sharelib
● Remaining piece
  ○ Log streaming: if a node is down, all oozie.log content on that node is inaccessible

Page 4: Oozie meetup Hadoop Summit 2014

Proposed Solution

● YARN faced the same issue when it comes to log streaming and retrieval

● Currently, YARN copies container logs to HDFS when log aggregation is enabled

● Oozie can duplicate its logs onto HDFS
  ○ Log directly to HDFS
  ○ Copy the log onto HDFS during log rotation

Page 5: Oozie meetup Hadoop Summit 2014

Direct logging to HDFS

● Pros
  ○ Complete log HA with 100% accuracy of the content
● Cons
  ○ Could be hard to implement and may require significant changes to the Oozie logging mechanism
  ○ Introduces a strict dependency on HDFS
  ○ Potential server performance issues

Page 6: Oozie meetup Hadoop Summit 2014

Copy log rotation

● Pros
  ○ Easy to implement without significant changes to the Oozie logging structure
  ○ Fewer performance issues
● Cons
  ○ The most recent log content (up to one hour's worth) is always unavailable until the next rotation
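The copy-on-rotation idea can be sketched as follows. This is only an illustration, not the proposed implementation: the rotated-file naming pattern (`oozie.log-YYYY-MM-DD-HH`), the helper names, and the directories are assumptions, and the `hdfs dfs -put` command is built but not executed.

```python
import re

# Assumed naming scheme for hourly-rotated files; adjust to the server's log4j setup.
ROTATED = re.compile(r"^oozie\.log-\d{4}-\d{2}-\d{2}-\d{2}$")


def rotated_logs(filenames):
    """Return rotated log files, which are safe to copy; the active oozie.log is skipped."""
    return sorted(name for name in filenames if ROTATED.match(name))


def put_command(local_dir, name, hdfs_dir):
    """Build (but do not run) an `hdfs dfs -put` command for one rotated file."""
    return ["hdfs", "dfs", "-put", "-f", f"{local_dir}/{name}", f"{hdfs_dir}/{name}"]
```

The active `oozie.log` is never selected, which is exactly why the latest hour of content stays unavailable if the node goes down.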

Page 7: Oozie meetup Hadoop Summit 2014

Other ideas?

Other ideas are always welcome.
E.g. putting the logs into a DB?
E.g. integrating ZooKeeper to solve this problem?

Page 8: Oozie meetup Hadoop Summit 2014

Coordinator Cron Scheduling

Various cron syntaxes exist, and Unix cron syntax is only one of them.
● Oozie cron has 5 fields, since Oozie operates at per-minute granularity.
● Weekdays start at 2, which is Monday (1 is Sunday)
● Complicated overflowing ranges are discouraged
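To make the day-of-week numbering concrete, here is a small sketch that maps a calendar date to the 1-7 numbering described above (1 = Sunday, 2 = Monday). The function name is hypothetical and only for illustration.

```python
from datetime import date

OOZIE_DAY_NAMES = ["SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT"]


def oozie_day_of_week(d):
    """Map a date to the 1-7 numbering where 1 = Sunday and 2 = Monday."""
    # isoweekday() is 1 for Monday through 7 for Sunday, so shift by one and wrap.
    return d.isoweekday() % 7 + 1


def oozie_day_name(d):
    """Return the three-letter name that matches the numeric value."""
    return OOZIE_DAY_NAMES[oozie_day_of_week(d) - 1]
```

Under this numbering the working week is 2-6, not 1-5 as in Unix cron.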

Page 9: Oozie meetup Hadoop Summit 2014

Use cases

● A job running at 9am every weekday
  ○ frequency="0 9 * * 2-6" or "0 9 * * MON-FRI"
  ○ Notice that the first expression uses 2-6 instead of 1-5
● A job running every 15 minutes from 9-11am every day
  ○ frequency="0/15 9,10 * * *" or "0,15,30,45 9,10 * * *"
  ○ Notice that the hour field should be 9,10 instead of 9-11
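The reason the hour field is 9,10 rather than 9-11 becomes clear once the fields are expanded. The sketch below is a toy expander, not Oozie's parser: it handles only comma lists, a-b ranges, and start/step forms, and the function names are made up for this example.

```python
def expand_field(field, lo, hi):
    """Expand a cron field made of comma lists, a-b ranges, and start/step parts."""
    values = set()
    for part in field.split(","):
        if "/" in part:
            start, step = part.split("/")
            values.update(range(int(start), hi + 1, int(step)))
        elif "-" in part:
            first, last = part.split("-")
            values.update(range(int(first), int(last) + 1))
        else:
            values.add(int(part))
    return sorted(v for v in values if lo <= v <= hi)


def fire_times(minute_field, hour_field):
    """List the HH:MM firing times in one day for the given minute and hour fields."""
    return [
        f"{hour:02d}:{minute:02d}"
        for hour in expand_field(hour_field, 0, 23)
        for minute in expand_field(minute_field, 0, 59)
    ]
```

With hours 9,10 the last firing is 10:45, so the job stops before 11am; with 9-11 it would keep firing through 11:45.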

Page 10: Oozie meetup Hadoop Summit 2014

Use cases continued

● A job running at 9am on the last Friday of every month
  ○ frequency = "0 9 * * 6L"
● A job running at 9am on the 2nd Friday of every month
  ○ frequency = "0 9 * * 6#2"
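As a sanity check on what "6L" and "6#2" are meant to select, the following sketch computes the last and the n-th Friday of a month with the standard library. It does not parse cron expressions; the helper names are hypothetical.

```python
import calendar


def fridays(year, month):
    """Return the day-of-month numbers of all Fridays in the given month."""
    cal = calendar.Calendar()
    return [
        d.day
        for d in cal.itermonthdates(year, month)
        if d.month == month and d.weekday() == calendar.FRIDAY
    ]


def last_friday(year, month):
    """Day of month that "6L" is intended to pick."""
    return fridays(year, month)[-1]


def nth_friday(year, month, n):
    """Day of month that "6#n" is intended to pick (n starts at 1)."""
    return fridays(year, month)[n - 1]
```

For June 2014, for instance, the Fridays fall on the 6th, 13th, 20th, and 27th.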

Page 11: Oozie meetup Hadoop Summit 2014

General mistakes

● The Oozie timezone is UTC by default, so your cron expression must account for the timezone difference
  ○ If you live in LA and want to run a job at 9am every day:
    ■ frequency = "0 9 * * *" is wrong!!!
    ■ frequency = "0 16 * * *" is the right one (during daylight saving time, when LA is UTC-7)
  ○ We understand the inconvenience, but that’s the Oozie server timezone.
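The hour arithmetic can be checked with a short sketch. It uses fixed UTC offsets as a simplifying assumption (UTC-7 for Pacific daylight time, UTC-8 for Pacific standard time) rather than a timezone database, and the date inside the function is arbitrary.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assumed for illustration; real zones switch between these during the year.
PDT = timezone(timedelta(hours=-7))
PST = timezone(timedelta(hours=-8))


def utc_hour(local_hour, tz):
    """Return the UTC hour that corresponds to local_hour:00 in the given fixed offset."""
    local = datetime(2014, 6, 3, local_hour, 0, tzinfo=tz)
    return local.astimezone(timezone.utc).hour
```

9am at UTC-7 is hour 16 in UTC, while 9am at UTC-8 is hour 17, so a fixed UTC expression drifts by an hour in local time when daylight saving time changes.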

Page 12: Oozie meetup Hadoop Summit 2014

General mistakes continued

● Day of month and day of week are a union, not an intersection
  ○ "0 10 12-18 * 2-6" DOES NOT MEAN running a job at 10 am on weekdays between the 12th and 18th of the month
  ○ It MEANS running a job at 10 am on weekdays AND on the 12th-18th of the month. Only one of the two needs to be satisfied for the job to fire.
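The difference between the two readings of "0 10 12-18 * 2-6" can be sketched with two predicates. This models only the day selection described above, not Oozie's scheduler, and the function names are illustrative.

```python
from datetime import date


def oozie_dow(d):
    """Day of week in the 1-7 numbering where 1 = Sunday and 2 = Monday."""
    return d.isoweekday() % 7 + 1


def fires_union(d):
    """Actual behaviour described above: either condition is enough."""
    return 12 <= d.day <= 18 or 2 <= oozie_dow(d) <= 6


def fires_intersection(d):
    """The common misreading: both conditions required."""
    return 12 <= d.day <= 18 and 2 <= oozie_dow(d) <= 6
```

Saturday 14 June 2014 fires under the union rule because it falls within the 12th-18th, and Tuesday 3 June 2014 fires because it is a weekday; neither would fire under the intersection reading.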

Page 13: Oozie meetup Hadoop Summit 2014

General Mistakes Continued

● An overflow range goes wild
  ○ "0 23-1 * * *" is reasonable
  ○ "0 23-1 * DEC-MAR FRI-MON". What does this mean?
    ■ Cron can produce many different edge cases in the above scenario. Do it at your own peril!
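One plausible reading of an overflow range is a wrap-around past the field's maximum. The sketch below expands ranges under that assumption only; it is not confirmed to match how Oozie evaluates every such expression, which is the point of the warning above.

```python
def expand_overflow(start, end, lo, hi):
    """Expand start-end within [lo, hi], wrapping around when start > end."""
    if start <= end:
        return list(range(start, end + 1))
    # Overflow: run to the top of the field, then continue from the bottom.
    return list(range(start, hi + 1)) + list(range(lo, end + 1))
```

Hours 23-1 expand to 23, 0, 1, which is easy to reason about. Combining that with DEC-MAR (12, 1, 2, 3) and FRI-MON (6, 7, 1, 2) multiplies the wrap-arounds across hours, days, and year boundaries, and the result is much harder to predict.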

Page 14: Oozie meetup Hadoop Summit 2014

4.1 Release
We have around 200 patches for this release and it’s been a while!
● All HA-related work
● Cron scheduling for coordinator jobs
● Consolidation of JPAExecutors (not user facing)
● Introduction of SharelibService (some user-facing backward incompatibility issues)

Page 15: Oozie meetup Hadoop Summit 2014

4.1 Release continued

● Better integration with YARN RM HA and restart

● Oozie Sqoop CLI functionality
● More versatile coordinator job functionality
● Major overhaul of coordinator job execution order