Page 1: BPEL4Job:  a Fault-handling Design for  Job Flow Management

IBM Research Division

IBM | 2007

BPEL4Job: a Fault-handling Design for Job Flow Management

Wei Tan (1), Liana Fong (2), Norman Bobroff (2)

(1) Dept. Automation, Tsinghua University, Beijing, China
(2) IBM T. J. Watson Research Center, Hawthorne, USA

[email protected]
[email protected], [email protected]

Agenda

1 Introduction

2 BPEL4Job: a fault-handling design for job flow management

3 Integrating fault-handling policies with job flow modeling

4 Fault-handling at the flow execution layer

5 Implementation and sample application

6 Conclusion and ongoing & future work

1 Introduction: Motivation

Job flow is especially relevant in orchestrating batch jobs
– Enforce the job execution sequence
– Manage the job execution trace
– Handle run-time faults at the flow level

Various languages & systems have been devised
– DAGMan/Condor, Taverna/myGrid, Job Stream/Tivoli Workload Scheduler, JobCommand/Tivoli LoadLeveler

BPEL-based job flow management is attracting more attention
– Resources and applications are becoming service-oriented
– Requirement to combine business processes (including human tasks) with back-end batch jobs
– BPEL provides a framework for flow orchestration, data manipulation, and fault handling, and can be extended or enhanced
– BPEL is supported by industry and the open source community

1 Introduction: Challenges

The use of BPEL for job flow is not without technical challenges

– Defining a job entity
  • BPEL does not support JSDL or other job specification languages
– Supporting data flow and dependencies
  • Data staging in/out
– Incorporating the asynchronous interaction with schedulers
  • Job schedulers usually report job status asynchronously
– Incorporating fault tolerance and recovery strategies in the job flow
  • Job flows have special fault-handling requirements, such as re-try and re-submit
– Supporting dynamic changes of flow instances
  • Needed when the flow execution logic cannot be fully anticipated in advance

1 Introduction: BPEL4Job

The goal of BPEL4Job

– A BPEL-based job flow system with fault-handling capability

Challenges addressed

– How to communicate with job schedulers?
  • A generic job proxy to facilitate asynchronous job submission and job status notification
– How to model a job flow with fault-handling capability?
  • A policy-based, two-stage approach
– How to enforce various fault-handling policies at run-time?
  • A set of fundamental fault-handling schemes, notably including instance migration between flow engines

2 BPEL4Job: a fault-handling design for job flow management

Flow modeling layer

– Stage 1: define the base flow, the job definitions, and the fault-handling policies.
– Stage 2: generate the expanded flow.

Flow execution layer

– Flow engine
– Job proxy
– Fault-handling service

Job scheduling layer

– Job schedulers

[Figure: BPEL4Job architecture in three layers. Flow modeling: the base flow, job definition, and fault-handling policy are transformed into the expanded flow (BPEL elements shown include partnerlinks, variables, correlationsets, faulthandlers, compensationhandler, terminationhandler, eventhandlers, and scope). Flow execution: the flow engine invokes the job proxy and gets notifications, alongside the fault-handling service. Job scheduling: A. submit, B. return job EPR, C. job status notification, exchanged with the scheduler.]

3 Integrating fault-handling policies with job flow modeling

BPEL4Job considers three kinds of policies

– Cleanup
  • Generate a fault report and delete the instance data in the flow engine.
– Re-try
  • Re-execute the job in the same engine.
– Re-submit
  • Export the flow instance state.
  • Restore the flow instance in a different engine, so that the flow can resume from the failed job.
– More policies could be defined and implemented on top of these three fundamental policies
  • Rollback, alternate job, etc.

3 Integrating fault-handling policies with job flow modeling

[Figures: the base flow with policies embedded; the re-try policy; the re-submit policy]

3 Integrating fault-handling policies with job flow modeling

The transformation to implement the re-try policy of Job1

[Figure: base flow and expanded flow]
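
As a rough illustration of the two-stage idea, the sketch below expands a base flow into an expanded flow in which any job carrying a re-try policy is wrapped in a retry step. The data model (`Step`, `RetryStep`, `max_attempts`, `expand_flow`) is hypothetical and is not the BPEL generated by BPEL4Job; in BPEL terms the wrapper would correspond roughly to a scope with a fault handler around the job invocation.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Step:
    """A plain job invocation in the flow."""
    job: str


@dataclass(frozen=True)
class RetryStep:
    """A job invocation wrapped with a re-try handler (hypothetical model)."""
    job: str
    max_attempts: int  # assumption: the slides do not state a retry limit


FlowStep = Union[Step, RetryStep]


def expand_flow(base_flow: List[str], retry_policies: Dict[str, int]) -> List[FlowStep]:
    """Stage 2: generate the expanded flow from the base flow and its policies.

    retry_policies maps a job name to the number of attempts allowed;
    jobs without an entry are copied unchanged.
    """
    expanded: List[FlowStep] = []
    for job in base_flow:
        if job in retry_policies:
            expanded.append(RetryStep(job, retry_policies[job]))
        else:
            expanded.append(Step(job))
    return expanded


base = ["Job1", "Job2", "Job3"]
expanded = expand_flow(base, {"Job1": 3})
print(expanded)
```

Keeping the policies outside the base flow means the base flow stays readable, and the fault-handling logic is only materialized when the expanded flow is generated.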

4 Fault-handling at the flow execution layer

We leverage
– BPEL fault-handling constructs: catch, catchAll

We enhance
– Specific capabilities to recognize job failures and to handle faults according to the defined policies

Components in this layer
– The generic job proxy, for job submission and job status notification
– The fault-handling service, to enforce the policies defined in the flow modeling layer

[Figure: flow execution layer. The flow engine invokes the job proxy and gets notifications, alongside the fault-handling service.]

The generic job proxy

– Receives a job submission request.
– Forwards the request to a scheduler and starts listening for job state notifications from it.
– When a notification indicates job success or failure, forwards it to the flow engine and returns; otherwise it continues listening.

[Figure: job proxy flowchart. Receive job request → submit job to scheduler → receive job state notification → [succeeded] return success; [failed] return failure; [other] wait for the next notification.]
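
The proxy loop described above can be sketched as follows. This is a simplified, synchronous stand-in: the real proxy receives notifications asynchronously, whereas here an iterator of states simulates the scheduler. The names `run_job_proxy`, `fake_scheduler`, and the state strings other than "succeeded" and "failed" are assumptions made for illustration.

```python
from typing import Callable, Iterable

TERMINAL_STATES = {"succeeded", "failed"}


def run_job_proxy(
    job_request: str,
    submit: Callable[[str], Iterable[str]],
    notify_engine: Callable[[str, str], None],
) -> str:
    """Submit a job and forward only its terminal state to the flow engine.

    `submit` stands in for a scheduler client: it takes the job request and
    returns an iterable of state notifications for that job.
    """
    for state in submit(job_request):
        if state in TERMINAL_STATES:
            notify_engine(job_request, state)
            return state
        # Any other state: keep listening for the next notification.
    raise RuntimeError("scheduler stopped notifying before the job finished")


def fake_scheduler(job_request: str) -> Iterable[str]:
    """Hypothetical scheduler that reports intermediate states, then a failure."""
    yield from ["submitted", "running", "failed"]


received = []
result = run_job_proxy(
    "Job2", fake_scheduler, lambda job, state: received.append((job, state))
)
print(result, received)
```

Because the proxy filters out intermediate states, the flow engine only ever sees one notification per job, which keeps the flow definition independent of any particular scheduler's state model.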

Fault-handling schemes in flow execution

[Figure: fault-handling flowchart. From the ready state a job is submitted, and its completion status is then checked. On [success] the flow navigates to the next job. On [failure] the implemented policy selects the branch:
– [re-try]: find the re-try entry
– [re-submit]: suspend the instance, export the instance data and delete it, re-submit, find the re-submit entry, resume the instance
– [cleanup]: generate a report and delete the instance]
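
A minimal sketch of the policy dispatch performed on job failure, assuming a simple mapping from job name to policy. `Policy`, `handle_failure`, and the action strings are hypothetical; the ordering of actions inside each branch follows the policy descriptions in this deck rather than a confirmed reading of the diagram, and the fallback to cleanup for jobs without a policy is an assumption.

```python
from enum import Enum
from typing import Dict, List


class Policy(Enum):
    CLEANUP = "cleanup"
    RETRY = "re-try"
    RESUBMIT = "re-submit"


def handle_failure(
    instance_id: str, failed_job: str, policies: Dict[str, Policy]
) -> List[str]:
    """Return the ordered actions a fault-handling service might take.

    Jobs without a policy fall back to cleanup (an assumption of this sketch).
    """
    policy = policies.get(failed_job, Policy.CLEANUP)
    if policy is Policy.RETRY:
        return [
            f"find re-try entry for {failed_job}",
            f"re-execute {failed_job} in the same engine",
        ]
    if policy is Policy.RESUBMIT:
        return [
            f"suspend instance {instance_id}",
            f"export instance data of {instance_id} and delete it",
            f"re-submit {instance_id} to another engine",
            f"resume from the re-submit entry for {failed_job}",
        ]
    return [
        f"generate fault report for {instance_id}",
        f"delete instance {instance_id}",
    ]


actions = handle_failure("inst-1", "mAdd1", {"mAdd1": Policy.RESUBMIT})
for action in actions:
    print(action)
```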

Flow re-submission and instance migration

Extract all the information related to a BPEL instance.

Re-shape the instance data and migrate it into another WPS engine.

[Figure: flow engine 1 with DB 1, flow engine 2 with DB 2]

1. Job2 fails due to resource unavailability
2. Suspend instance in engine 1
3. Instance data exported to XML, and instance deleted in DB1
4. Instance data imported to DB2
5. Instance resumed in engine 2
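
The migration steps can be sketched with two in-memory dictionaries standing in for DB1 and DB2. This is only an illustration of the export, delete, import, resume sequence: the XML layout, the `export_instance` and `import_instance` helpers, and the instance fields are assumptions and do not reflect the actual WPS instance schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict

Database = Dict[str, Dict[str, str]]  # instance id -> instance state


def export_instance(db: Database, instance_id: str) -> str:
    """Export the instance state to XML and delete it from the source database."""
    state = db.pop(instance_id)
    root = ET.Element("instance", id=instance_id)
    for key, value in state.items():
        ET.SubElement(root, "field", name=key).text = value
    return ET.tostring(root, encoding="unicode")


def import_instance(db: Database, xml_text: str) -> str:
    """Import an exported instance into the target database and return its id."""
    root = ET.fromstring(xml_text)
    instance_id = root.attrib["id"]
    db[instance_id] = {
        field.attrib["name"]: field.text or "" for field in root.findall("field")
    }
    return instance_id


# Step 1: Job2 has failed in engine 1.
db1: Database = {"montage-1": {"status": "running", "failed_job": "Job2"}}
db2: Database = {}

db1["montage-1"]["status"] = "suspended"      # step 2: suspend in engine 1
xml_text = export_instance(db1, "montage-1")  # step 3: export to XML, delete from DB1
migrated = import_instance(db2, xml_text)     # step 4: import into DB2
db2[migrated]["status"] = "running"           # step 5: resume in engine 2
print(db1, db2)
```

Deleting the instance from the source database as part of the export avoids having the same instance live in two engines at once.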

Implementation

[Figure: the BPEL4Job architecture annotated with the products used in each layer]

– Flow modeling: WebSphere Integration Developer (WID)
– Flow execution: WebSphere Process Server (WPS)
– Job scheduling: Tivoli Dynamic Workload Broker (ITDWB)

Sample Montage Job Flow

[Figure: Montage job flow. Raw images → generate image table → image projection in parallel → generate image table → generate mosaic → transform to JPEG]

Montage: a toolkit for assembling raw astronomy images into custom mosaics.

– Developed by NASA & California Institute of Technology.
– The assembly process is usually expressed as a job flow.
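
To show what "expressed as a job flow" means in practice, the sketch below models the stages named in the figure as a dependency graph and derives an execution order with Python's standard `graphlib` module (Python 3.9+). The stage identifiers and the edges are assumptions inferred from the order of the figure labels, and the two projection jobs are an arbitrary choice to illustrate the parallel step.

```python
from graphlib import TopologicalSorter

# Each key depends on the stages in its value set (assumed edges).
dependencies = {
    "generate_image_table_1": set(),
    "project_image_a": {"generate_image_table_1"},
    "project_image_b": {"generate_image_table_1"},
    "generate_image_table_2": {"project_image_a", "project_image_b"},
    "generate_mosaic": {"generate_image_table_2"},
    "transform_to_jpeg": {"generate_mosaic"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # jobs whose predecessors are all done
    waves.append(ready)
    sorter.done(*ready)

for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Each wave contains jobs that could run concurrently; the two projection jobs land in the same wave, which matches the "image projection in parallel" step.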

Montage job flow and the re-start policy

Policy: re-submit from mImgtbl1 when mAdd1 fails

[Figure: base flow and expanded flow (partial)]

Instance migration from saba10 to weitan

(a) Montage initiated & failed at saba10
(b) Montage migrated to weitan
(c) Montage re-started and completed at weitan

Conclusion

BPEL4Job: an exploration of using BPEL as a job flow language

– A two-stage approach for job flow modeling with fault-handling policies
– A generic job proxy to handle asynchronous job submission and job status notification
– A set of fundamental fault-handling schemes, including instance migration between flow engines

Future work

– Support more complicated fault-handling policies
  • Involving human tasks, expressed as business rules, etc.
– Apply the instance migration technique to
  • Load balancing between flow engines
  • Instance migration to a newer version

Thank you for your attention.

Please contact me at: Dept. Automation, Tsinghua Univ, Beijing, China
http://[email protected]