Everything that you ever wanted to know about Oozie, but were afraid to ask B Lublinsky, A Yakubovich

Everything you wanted to know, but were afraid to ask about Oozie

DESCRIPTION

Boris Lublinsky and Alexey Yakubovich give us an overview of using Oozie. This presentation was given on December 13th, 2012 at the Nokia offices in Chicago, IL. View the HD video of this talk here: http://vimeo.com/chug/oozie-overview

Page 1: Everything you wanted to know, but were afraid to ask about Oozie

Everything that you ever wanted to know about Oozie, but were afraid to ask

B Lublinsky, A Yakubovich

Page 2: Everything you wanted to know, but were afraid to ask about Oozie

Apache Oozie

• Oozie is a workflow/coordination system to manage Apache Hadoop jobs.
• A single Oozie server implements all four functional Oozie components:
  – Oozie workflow
  – Oozie coordinator
  – Oozie bundle
  – Oozie SLA

Page 3: Everything you wanted to know, but were afraid to ask about Oozie

Main components

[Architecture diagram. Clients (the Oozie Command Line Interface and 3rd party applications) use the WS API for job submission and monitoring. The Oozie Server hosts bundles, coordinators (time condition monitoring, data condition monitoring) and workflows (wf logic, actions), and keeps definitions and states in a data store. Actions run on Hadoop (HDFS, MapReduce) and use the Oozie shared libraries.]

Page 4: Everything you wanted to know, but were afraid to ask about Oozie

Oozie workflow

Page 5: Everything you wanted to know, but were afraid to ask about Oozie

Workflow Language

Flow-control node | XML element type | Description
Decision | workflow:DECISION | expresses “switch-case” logic
Fork | workflow:FORK | splits one path of execution into multiple concurrent paths
Join | workflow:JOIN | waits until every concurrent execution path of a previous fork node arrives at it
Kill | workflow:kill | forces a workflow job to kill (abort) itself

Action node | XML element type | Description
java | workflow:JAVA | invokes the main() method of the specified java class
fs | workflow:FS | manipulates files and directories in HDFS; supports commands: move, delete, mkdir
MapReduce | workflow:MAP-REDUCE | starts a Hadoop map/reduce job; this can be a java MR job, a streaming job or a pipes job
Pig | workflow:pig | runs a Pig job
Sub workflow | workflow:SUB-WORKFLOW | runs a child workflow job
Hive * | workflow:HIVE | runs a Hive job
Shell * | workflow:SHELL | runs a shell command
ssh * | workflow:SSH | starts a shell command on a remote machine as a remote secure shell
Sqoop * | workflow:SQOOP | runs a Sqoop job
Email * | workflow:EMAIL | sends emails from an Oozie workflow application
Distcp ? | | Under development (Yahoo)

Page 6: Everything you wanted to know, but were afraid to ask about Oozie

Workflow actions

[Sequence diagram. Participants: ActionStartCommand, JavaActionExecutor, WorkflowStore, Services, JobClient, ActionExecutorContext]

1: workflow := getWorkflow()
2: action := getAction()
3: context := init<>()
4: executor := get()
5: start()
6: submitLauncher()
7: jobClient := get()
8: runningJob := submit()
9: setStartData()

• Oozie workflow supports two types of actions:
  – Synchronous, executed inside the Oozie runtime
  – Asynchronous, executed as a Map Reduce job

Page 7: Everything you wanted to know, but were afraid to ask about Oozie

Workflow lifecycle

[State diagram. States: PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED]
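
The arrows of the lifecycle diagram did not survive in this transcript. As a rough sketch, the states can be modelled as a small state machine. The transitions below are an assumption based on the Oozie workflow specification, not on the slide, and `WfState` is an illustrative name, not an Oozie class.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative model of the workflow job states; not an Oozie class. */
enum WfState {
    PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED;

    /** States reachable from this one (assumed from the Oozie workflow specification). */
    Set<WfState> next() {
        switch (this) {
            case PREP:      return EnumSet.of(RUNNING, KILLED);
            case RUNNING:   return EnumSet.of(SUSPENDED, SUCCEEDED, KILLED, FAILED);
            case SUSPENDED: return EnumSet.of(RUNNING, KILLED);
            default:        return EnumSet.noneOf(WfState.class); // terminal states
        }
    }

    /** True if a job in this state may move directly to the target state. */
    boolean canMoveTo(WfState target) {
        return next().contains(target);
    }

    /** True for states that have no outgoing transitions. */
    boolean isTerminal() {
        return next().isEmpty();
    }
}
```

A monitoring client could use `isTerminal()` to decide when to stop polling a job.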

Page 8: Everything you wanted to know, but were afraid to ask about Oozie

Oozie execution console

Page 9: Everything you wanted to know, but were afraid to ask about Oozie

Extending Oozie workflow

• Oozie provides a “minimal” workflow language, which contains only a handful of control and action nodes.
• Oozie supports a very elegant extensibility mechanism: custom action nodes. Custom action nodes make it possible to extend Oozie's language with additional actions (verbs).
• Creating a custom action requires the following:
  – A Java action implementation, which extends the ActionExecutor class
  – An XML schema for the action, defining the action's configuration parameters
  – Packaging of the Java implementation and the configuration schema into an action jar, which has to be added to the Oozie war
  – Extending oozie-site.xml to register information about the custom executor with the Oozie runtime

Page 10: Everything you wanted to know, but were afraid to ask about Oozie

Oozie Workflow Client

• Oozie provides an easy way to integrate with enterprise applications through the Oozie client APIs. It provides two types of APIs:
• REST HTTP API, a number of HTTP requests:
  – Info requests (job status, job configuration)
  – Job management (submit, start, suspend, resume, kill)
  Example: job definition info request
  GET /oozie/v0/job/job-ID?show=definition
• Java API, package org.apache.oozie.client:
  – OozieClient: start(), submit(), run(), reRunXXX(), resume(), kill(), suspend()
  – WorkflowJob, WorkflowAction
  – CoordinatorJob, CoordinatorAction
  – SLAEvent
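
As a minimal sketch of the REST info request shown above, the helper below only builds the request URI and does not contact a server. `OozieRestUrls` is an illustrative name, not part of the Oozie client, and the host and port in the usage note are assumptions.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Illustrative helper (not part of the Oozie client) that builds job info request URIs. */
final class OozieRestUrls {
    private OozieRestUrls() { }

    /**
     * Builds the URI for GET /oozie/v0/job/{jobId}?show={show}.
     * baseUrl is the server root; any trailing slash is dropped.
     */
    static URI jobInfo(String baseUrl, String jobId, String show) {
        String root = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        // Encode the variable parts so unexpected characters cannot break the URI.
        String id = URLEncoder.encode(jobId, StandardCharsets.UTF_8);
        String what = URLEncoder.encode(show, StandardCharsets.UTF_8);
        return URI.create(root + "/oozie/v0/job/" + id + "?show=" + what);
    }
}
```

For example, `OozieRestUrls.jobInfo("http://localhost:11000", jobId, "definition")` yields the definition request for a given job ID on a hypothetical local server; the resulting URI can be passed to any HTTP client.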

Page 11: Everything you wanted to know, but were afraid to ask about Oozie

Oozie workflow good, bad and ugly

• Good
  – Nice integration with the Hadoop ecosystem, making it easy to build processes encompassing synchronized execution of multiple Map Reduce, Hive, Pig, etc. jobs
  – Nice UI for tracking execution progress
  – Simple APIs for integration with other applications
  – Simple extensibility APIs
• Bad
  – Process has to be expressed directly in hPDL with no visual support
  – No support for uber jars (but we added our own)
• Ugly
  – Static forking (but you can regenerate the workflow and invoke it on the fly)
  – No support for loops

Page 12: Everything you wanted to know, but were afraid to ask about Oozie

Oozie Coordinator

Page 13: Everything you wanted to know, but were afraid to ask about Oozie

Coordinator language

Element type | Description | Attributes and sub-elements
coordinator-app | top-level element in a coordinator instance | frequency, start, end
controls | specify the execution policy for the coordinator and its elements (workflow actions) | timeout (actions), concurrency (actions), execution order (workflow instances)
action | required singular element specifying the associated workflow; the jobs specified in the workflow consume and produce dataset instances | workflow name
datasets | collection of data referred to by a logical name; datasets serve to specify data dependencies between workflow instances |
input event | specifies the input conditions (in the form of present datasets) that are required in order to execute a coordinator action |
output event | specifies the dataset that should be produced by the coordinator action |

Page 14: Everything you wanted to know, but were afraid to ask about Oozie

Coordinator lifecycle

Page 15: Everything you wanted to know, but were afraid to ask about Oozie

Oozie Bundle

Page 16: Everything you wanted to know, but were afraid to ask about Oozie

Bundle lifecycle

[State diagram. States: PREP, PREPSUSPENDED, PREPPAUSED, RUNNING, SUSPENDED, PAUSED, SUCCEEDED, KILLED, FAILED]

Page 17: Everything you wanted to know, but were afraid to ask about Oozie

Oozie SLA

Page 18: Everything you wanted to know, but were afraid to ask about Oozie

SLA Navigation

[Table relationship diagram. Tables and the columns shown:]

SLA_EVENT: event_id, alert_contact, alert_frequency, …, sla_id, ...
COORD_JOBS: id, app_name, app_path, …
COORD_ACTIONS: id, action_number, action_xml, …, external_id, ...
WF_ACTIONS: id, conf, console_url, …
WF_JOBS: id, app_name, app_path, …

Page 19: Everything you wanted to know, but were afraid to ask about Oozie
Page 20: Everything you wanted to know, but were afraid to ask about Oozie

Using Probes to analyze/monitor Places

• Select probe data for specified time/location
• Validate – Filter – Transform probe data
• Calculate statistics on available probe data
• Distribute data per geo-tiles
• Calculate place statistics (e.g. attendance index)
-------------------------------------------------------------
If an exception condition happens, report failure
If all steps succeeded, report success

Page 21: Everything you wanted to know, but were afraid to ask about Oozie

Workflow as acyclic graph

Page 22: Everything you wanted to know, but were afraid to ask about Oozie

Workflow – fragment 1

Page 23: Everything you wanted to know, but were afraid to ask about Oozie

Workflow – fragment 2

Page 24: Everything you wanted to know, but were afraid to ask about Oozie

Oozie tips and tricks

Page 25: Everything you wanted to know, but were afraid to ask about Oozie

Configuring workflow

• Oozie provides 3 overlapping mechanisms to configure a workflow: config-default.xml, the job properties file, and job arguments that can be passed to Oozie as part of command line invocations.
• Oozie processes these three sets of parameters as follows:
  – Use all of the parameters from the command line invocation
  – For remaining unresolved parameters, use the job properties
  – Use config-default.xml for everything else
• Although the documentation does not clearly describe when to use which, the overall recommendation is as follows:
  – Use config-default.xml for parameters that never change for a given workflow
  – Use the job properties for parameters that are common to a given deployment of a workflow
  – Use command line arguments for parameters that are specific to a given workflow invocation
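
The lookup order described above can be sketched with plain `java.util.Properties`. This is only an illustration of the precedence rule; `LayeredConfig` is a hypothetical class and says nothing about how Oozie implements the merge internally.

```java
import java.util.Properties;

/** Illustrative resolver; mirrors the lookup order described above, not Oozie's own code. */
final class LayeredConfig {
    private final Properties merged = new Properties();

    LayeredConfig(Properties configDefault, Properties jobProperties, Properties commandLine) {
        // Apply the lowest-priority layer first so that later layers override it.
        merged.putAll(configDefault);
        merged.putAll(jobProperties);
        merged.putAll(commandLine);
    }

    /** Returns the effective value of a parameter, or null if no layer defines it. */
    String get(String key) {
        return merged.getProperty(key);
    }
}
```

With this arrangement a value given on the command line wins over the job properties file, which in turn wins over config-default.xml.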

Page 26: Everything you wanted to know, but were afraid to ask about Oozie

Accessing and storing process variables

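• Accessing
  – Through the arguments in java main
• Storing

    import java.io.*;
    import java.util.Properties;

    // Oozie passes the location of the action's output file in this system property
    // (the java action has to declare <capture-output/> for the values to be picked up).
    String ooziePropFileName = System.getProperty("oozie.action.output.properties");
    Properties props = new Properties();
    props.setProperty(key, value);
    // try-with-resources closes the stream even if store() fails
    try (OutputStream os = new FileOutputStream(new File(ooziePropFileName))) {
        props.store(os, "");
    }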

Page 27: Everything you wanted to know, but were afraid to ask about Oozie

Validating data presence

• Oozie provides two possible approaches for validating the presence of resource file(s):
  – Using the Oozie coordinator's input events based on the dataset. Technically this is the simplest implementation approach, but it does not provide the more complex decision support that might be required: it either runs the corresponding workflow or it does not.
  – Using a custom java node inside the Oozie workflow. This allows the decision logic to be extended by sending notifications about data absence, running execution on partial data under certain timing conditions, etc.
• Additional configuration parameters for the Oozie coordinator, for example the ability to wait for files to arrive, can expand the usage of the Oozie coordinator.
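
The decision logic of such a custom java node might look roughly like the sketch below. Everything in it is hypothetical: the class, the three outcomes and the wait budget are illustrative, and the `exists` predicate stands in for a real HDFS existence check.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative decision logic for a custom java node; names and thresholds are hypothetical. */
final class DataPresenceCheck {
    enum Decision { RUN_FULL, RUN_PARTIAL, NOTIFY_AND_WAIT }

    private DataPresenceCheck() { }

    /**
     * @param expected paths the workflow needs
     * @param exists   existence check (a real node would query HDFS here)
     * @param waited   how long the workflow has already waited for the data
     * @param maxWait  after this long, run on whatever data is present
     */
    static Decision decide(List<String> expected, Predicate<String> exists,
                           Duration waited, Duration maxWait) {
        long present = expected.stream().filter(exists).count();
        if (present == expected.size()) {
            return Decision.RUN_FULL;
        }
        // Some data is missing: run on partial data only once the wait budget is spent.
        if (present > 0 && waited.compareTo(maxWait) >= 0) {
            return Decision.RUN_PARTIAL;
        }
        return Decision.NOTIFY_AND_WAIT;
    }
}
```

The returned value could then drive a decision node in the workflow, which is the kind of flexibility the coordinator's input events alone do not offer.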

Page 28: Everything you wanted to know, but were afraid to ask about Oozie

Invoking Map Reduce jobs

• Oozie provides two different ways of invoking a Map Reduce job: the MapReduce action and the java action.
• Invoking a Map Reduce job with the java action is somewhat similar to invoking the job with the Hadoop command line from the edge node. You specify a driver as the class for the java action and Oozie invokes the driver. This approach has two main advantages:
  – The same driver class can be used both for running the Map Reduce job from an edge node and as a java action in an Oozie process.
  – A driver provides a convenient place for executing additional code, for example clean-up required for Map Reduce execution.
• The driver requires a proper shutdown hook to ensure that there are no lingering Map Reduce jobs.
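
A shutdown hook of this kind could be sketched as follows. `KillableJob` is a hypothetical stand-in for the handle a driver gets back when it submits a job (for example Hadoop's job object), and `DriverShutdownHook` is an illustrative name; only `Runtime.addShutdownHook` and `Runtime.removeShutdownHook` are real JDK calls.

```java
/** Hypothetical stand-in for a submitted Map Reduce job handle. */
interface KillableJob {
    boolean isComplete();
    void kill();
}

/** Illustrative driver helper: kills a still-running job when the JVM shuts down. */
final class DriverShutdownHook {
    private DriverShutdownHook() { }

    /** Registers the hook with the JVM and returns it so it can be removed after a clean finish. */
    static Thread register(KillableJob job) {
        Thread hook = new Thread(() -> {
            if (!job.isComplete()) {
                job.kill(); // avoid leaving a lingering job on the cluster
            }
        }, "mr-driver-shutdown-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    /** Call once the job has finished normally; the hook is no longer needed. */
    static void unregister(Thread hook) {
        Runtime.getRuntime().removeShutdownHook(hook);
    }
}
```

A driver would call `register` right after submitting the job and `unregister` after waiting for completion, so the hook only fires when the action is killed while the job is still running.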

Page 29: Everything you wanted to know, but were afraid to ask about Oozie

Implementing predefined looping and forking

• hPDL is an XML document with a well-defined schema.

• This means that the actual workflow can be easily manipulated using JAXB objects, which can be generated from the hPDL schema using the xjc compiler.

• This means that we can create the complete workflow programmatically, based on a calculated number of fork branches, or implement loops as repeated actions.

• The other option is to create a template process and modify it based on calculated parameters.
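
To illustrate generating a fork with a calculated number of branches, the sketch below emits the hPDL fork and join elements as text. It deliberately swaps the JAXB approach described above for plain string building to stay self-contained; `ForkJoinGenerator` and the action naming scheme are assumptions, and node names are assumed to be plain identifiers that need no XML escaping.

```java
/** Illustrative generator; builds hPDL text directly instead of using JAXB classes. */
final class ForkJoinGenerator {
    private ForkJoinGenerator() { }

    /**
     * Returns a fork element with one path per branch, followed by the matching join.
     * Branch actions are named actionPrefix-1 .. actionPrefix-N; the action nodes themselves
     * (each transitioning to joinName on success) are assumed to be generated elsewhere.
     */
    static String forkJoin(String forkName, String joinName, String actionPrefix,
                           int branches, String nextNode) {
        if (branches < 1) {
            throw new IllegalArgumentException("branches must be >= 1");
        }
        StringBuilder xml = new StringBuilder();
        xml.append("<fork name=\"").append(forkName).append("\">\n");
        for (int i = 1; i <= branches; i++) {
            xml.append("  <path start=\"").append(actionPrefix).append('-').append(i)
               .append("\"/>\n");
        }
        xml.append("</fork>\n");
        xml.append("<join name=\"").append(joinName)
           .append("\" to=\"").append(nextNode).append("\"/>\n");
        return xml.toString();
    }
}
```

The generated fragment would be spliced into a workflow definition before submission, which is how a "static" fork can still reflect a branch count computed at run time.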

Page 30: Everything you wanted to know, but were afraid to ask about Oozie

Oozie client security (or lack of)

• By default the Oozie client reads the client's identity from the local machine's OS and passes it to the Oozie server, which uses this identity to invoke MR jobs.
• Impersonation can be implemented by overriding the OozieClient class's createConfiguration method, where client variables can be set through a new constructor.

    public Properties createConfiguration() {
        Properties conf = new Properties();
        // Fall back to the OS user when no explicit user was supplied through the constructor
        if (user == null) {
            conf.setProperty(USER_NAME, System.getProperty("user.name"));
        } else {
            conf.setProperty(USER_NAME, user);
        }
        return conf;
    }

Page 31: Everything you wanted to know, but were afraid to ask about Oozie

Uber jars with Oozie

An uber jar contains resources: other jars, so libraries, zip files

    <java>
      …
      <main-class>${wfUberLauncher}</main-class>
      <arg>-appStart=${wfAppMain}</arg>
      …
    </java>

[Diagram. The Oozie server starts the launcher java action, which:]
  – unpacks resources to the current uber jar dir
  – sets an inverse classloader
  – invokes the MR driver, passing arguments
  – sets a shutdown hook and waits for completion

[Uber jar contents: classes (Launcher), jars, so, zip, mappers]
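
The authors' launcher is not shown in the slides. The fragment below is only a sketch of how the `-appStart=` argument might be separated from the driver's own arguments and the driver invoked by reflection. `UberLauncherSketch` is a hypothetical name; unpacking resources and building the inverse classloader are left out, and the `loader` parameter is where such a classloader would be passed in.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

/** Illustrative launcher fragment; not the launcher described in the talk. */
final class UberLauncherSketch {
    private static final String APP_START = "-appStart=";

    private UberLauncherSketch() { }

    /** Returns the class name passed as -appStart=..., or null if the argument is absent. */
    static String appMain(String[] args) {
        for (String arg : args) {
            if (arg.startsWith(APP_START)) {
                return arg.substring(APP_START.length());
            }
        }
        return null;
    }

    /** Returns all arguments except -appStart=..., in their original order. */
    static String[] driverArgs(String[] args) {
        List<String> rest = new ArrayList<>();
        for (String arg : args) {
            if (!arg.startsWith(APP_START)) {
                rest.add(arg);
            }
        }
        return rest.toArray(new String[0]);
    }

    /** Loads the driver class through the given class loader and calls its static main(String[]). */
    static void invoke(String mainClass, String[] driverArgs, ClassLoader loader) throws Exception {
        Class<?> driver = Class.forName(mainClass, true, loader);
        Method main = driver.getMethod("main", String[].class);
        // Cast to Object so the array is passed as one argument rather than spread as varargs.
        main.invoke(null, (Object) driverArgs);
    }
}
```

In a launcher's `main`, `appMain(args)` would name the MR driver and `driverArgs(args)` would be handed to it unchanged.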