IBM Research Division
IBM | 2007
BPEL4Job: a Fault-handling Design for Job Flow Management
Wei Tan (1), Liana Fong (2), Norman Bobroff (2)
(1) Dept. of Automation, Tsinghua University, Beijing, China
(2) IBM T. J. Watson Research Center, Hawthorne, USA
[email protected]
Agenda
1 Introduction
2 BPEL4Job: a fault-handling design for job flow management
3 Integrating fault-handling policies with job flow modeling
4 Fault-handling at the flow execution layer
5 Implementation and sample application
6 Conclusion and ongoing & future work
1 Introduction: Motivation
Job flow is especially relevant in orchestrating batch jobs
– Enforce job execution sequence
– Manage job execution trace
– Handle run-time faults at the flow level
Various languages & systems have been devised
– DAGMan/Condor, Taverna/myGrid, Job Stream/Tivoli Workload Scheduler, JobCommand/Tivoli LoadLeveler
BPEL-based job flow management is attracting more attention
– Resources and applications are becoming service-oriented
– Requirement to combine business processes (including human tasks) with back-end batch jobs
– BPEL is a framework for flow orchestration, data manipulation, and fault handling, and can be extended or enhanced
– BPEL is supported by industry and the open source community
1 Introduction: Challenges
The use of BPEL for job flow is not without technical challenges
– Defining a job entity
• BPEL does not support using JSDL or other job specification languages
– Supporting data flow and dependencies
• Data staging in/out
– Incorporating asynchronous interaction with schedulers
• A job scheduler usually reports job status asynchronously
– Incorporating fault tolerance and recovery strategies in job flow
• Job flow has special requirements on fault handling, such as re-try and re-submit
– Supporting dynamic changes of flow instances
• For cases where the flow execution logic cannot be fully anticipated in advance
1 Introduction: BPEL4Job
The goal of BPEL4Job
– A BPEL-based job flow system with fault-handling capability
Challenges addressed
– How to communicate with job schedulers?
• A generic job proxy to facilitate asynchronous job submission and job status notification
– How to model a job flow with fault-handling capability?
• A policy-based, two-stage approach
– How to enforce various fault-handling policies at run-time?
• A set of fundamental fault-handling schemes, including, in particular, instance migration between flow engines
2 BPEL4Job: a fault-handling design for job flow management
Flow modeling layer
– Stage 1: define the base flow, the job definitions, and the fault-handling policies.
– Stage 2: generate the expanded flow.
Flow execution layer
– Flow engine
– Job proxy
– Fault-handling service
Job scheduling layer
– Job schedulers
[Figure: three-layer architecture. Flow modeling: the base flow, job definitions, and fault-handling policies are transformed into the expanded flow (a BPEL scope with variables, partnerlinks, correlationsets, faulthandlers, compensationhandler, terminationhandler, and eventhandlers). Flow execution: flow engine, job proxy ("Invoke job proxy", "Get notification"), and fault-handling service. Job scheduling: scheduler, with interactions A. submit, B. return Job EPR, C. job status notification.]
3 Integrating fault-handling policies with job flow modeling
BPEL4Job considers three kinds of policies
– Cleanup
• Generate a fault report and delete the instance data in the flow engine
– Re-try
• Re-execute the job in the same engine
– Re-submit
• Export the flow instance state
• Restore the flow instance in a different engine, so that the flow can resume from the failed job
– More policies could be defined and implemented based on the three fundamental policies
• Rollback, alternate job, etc.
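As a rough illustration only, the three fundamental policies can be written down as a small lookup table. The sketch below is in Python; the names `Policy`, `PolicyAction`, `ACTIONS`, and `describe` are hypothetical and are not part of BPEL4Job, and the step wording is simply taken from the bullets above.

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    """The three fundamental fault-handling policies named on the slide."""
    CLEANUP = "cleanup"
    RETRY = "re-try"
    RESUBMIT = "re-submit"


@dataclass(frozen=True)
class PolicyAction:
    keeps_instance: bool  # does the flow instance survive the fault?
    migrates: bool        # does the instance move to a different engine?
    steps: tuple          # ordered steps, worded as on the slide


# Hypothetical lookup table (an assumption made for illustration).
ACTIONS = {
    Policy.CLEANUP: PolicyAction(
        keeps_instance=False, migrates=False,
        steps=("generate fault report", "delete instance data")),
    Policy.RETRY: PolicyAction(
        keeps_instance=True, migrates=False,
        steps=("re-execute the failed job",)),
    Policy.RESUBMIT: PolicyAction(
        keeps_instance=True, migrates=True,
        steps=("export instance state",
               "restore instance in a different engine",
               "resume from the failed job")),
}


def describe(policy: Policy) -> str:
    """Render a policy and its steps as one readable line."""
    return f"{policy.value}: " + " -> ".join(ACTIONS[policy].steps)


print(describe(Policy.RESUBMIT))
```

Derived policies such as rollback or alternate job would, under this reading, be further entries composed from the same kinds of steps.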
3 Integrating fault-handling policies with job flow modeling
[Figure: the base flow with policies embedded, showing the re-try policy and the re-submit policy.]
3 Integrating fault-handling policies with job flow modeling
The transformation to implement the re-try policy of Job1
[Figure: the base flow and the resulting expanded flow.]
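The slide does not spell out the transformation rules, so the following Python sketch is only one plausible reading: each job that carries a policy is wrapped in a scope whose catchAll handler calls a fault-handling operation. The element names `sequence`, `scope`, `faultHandlers`, `catchAll`, and `invoke` are standard BPEL; the function `expand_flow`, the `_scope` and `_fault` naming, and the `operation` values are assumptions, and namespaces and partner links are omitted.

```python
import xml.etree.ElementTree as ET


def expand_flow(jobs, policies):
    """Build a simplified, BPEL-like expanded flow.

    jobs: ordered list of job names (the base flow).
    policies: dict mapping a job name to a policy name such as "re-try".
    """
    sequence = ET.Element("sequence")
    for job in jobs:
        policy = policies.get(job)
        if policy is None:
            # No policy: the job stays a plain invoke, as in the base flow.
            ET.SubElement(sequence, "invoke", name=job)
            continue
        # Policy present: wrap the job in a scope whose catchAll handler
        # calls a (hypothetical) fault-handling operation for that policy.
        scope = ET.SubElement(sequence, "scope", name=f"{job}_scope")
        handlers = ET.SubElement(scope, "faultHandlers")
        catch_all = ET.SubElement(handlers, "catchAll")
        ET.SubElement(catch_all, "invoke", name=f"{job}_fault", operation=policy)
        ET.SubElement(scope, "invoke", name=job)
    return sequence


expanded = expand_flow(["Job1", "Job2"], {"Job1": "re-try"})
print(ET.tostring(expanded, encoding="unicode"))
```

Here only Job1 is expanded, which matches the slide's example of a re-try policy attached to Job1; Job2 passes through unchanged.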
4 Fault-handling at the flow execution layer
We leverage
– BPEL fault-handling constructs: catch, catchAll
We add
– Specific capabilities to recognize job failures and to handle faults according to the defined policies
Components in this layer
– The generic job proxy for job submission and job status notification
– The fault-handling service to enforce the policies defined in the flow modeling layer
[Figure: flow execution layer, with the flow engine, the job proxy ("Invoke job proxy", "Get notification"), and the fault-handling service.]
The generic job proxy
Generic job proxy
– Receives a job submission request.
– Forwards the request to a scheduler and starts listening for job state notifications from it.
– For a notification indicating job success or failure, forwards it to the flow engine and returns; otherwise continues listening.
[Figure: job proxy flowchart. Receive job request → Submit job to scheduler → Receive job state notification → [succeeded] Return success; [failed] Return failure; [other] keep listening.]
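The proxy behaviour described above can be sketched as a simple loop in Python. This is a minimal illustration, not the actual proxy: `run_job_proxy`, the `submit` callable, and the state strings other than "succeeded" and "failed" are assumptions, and the asynchronous notification channel is modelled as a plain iterable.

```python
def run_job_proxy(job_request, submit, notifications):
    """Submit a job, then relay only the terminal state to the caller.

    submit: callable that forwards the request to a scheduler (hypothetical).
    notifications: iterable of job-state strings reported by the scheduler.
    """
    submit(job_request)
    for state in notifications:
        if state == "succeeded":
            return "success"
        if state == "failed":
            return "failure"
        # Any other state (for example queued or running): keep listening.
    raise RuntimeError("notification stream ended before a terminal state")


submitted = []
result = run_job_proxy("job-1", submitted.append,
                       iter(["queued", "running", "failed"]))
print(result)
```

Because intermediate states are swallowed by the proxy, the flow engine only ever sees one reply per job, which is what lets a standard BPEL invoke plus fault handler treat a job like any other service call.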
Fault-handling schemes in flow execution
[Figure: state diagram of fault handling during flow execution. Ready → Submit job → submitted → "What completion status?": [success] navigate to next job; [failure] suspend instance (instance suspended) → "Which policy is implemented?": [cleanup] generate report & delete instance (instance deleted); [re-try] find retry entry (instance resumed); [re-submit] export instance data & delete it (instance deleted), find re-submit entry, re-submit.]
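A minimal Python sketch of this dispatch is given below. It is not the fault-handling service itself: `execute_flow`, the `run_job` callable, the `max_retries` limit, the default of cleanup for jobs without a policy, and the fall-back to cleanup once retries are exhausted are all assumptions added to keep the example self-contained.

```python
def execute_flow(jobs, run_job, policies, max_retries=1):
    """Run jobs in order and apply a per-job policy when one fails.

    run_job(job) returns True on success and False on failure (hypothetical).
    Returns a list of (job, event) tuples as an execution trace.
    """
    trace = []
    for job in jobs:
        attempts = 0
        while True:
            attempts += 1
            if run_job(job):
                trace.append((job, "success"))
                break  # navigate to the next job
            policy = policies.get(job, "cleanup")  # assumed default
            if policy == "re-try" and attempts <= max_retries:
                trace.append((job, "re-try"))
                continue  # re-execute the job in the same engine
            if policy == "re-submit":
                trace.append((job, "re-submit"))
                return trace  # instance is exported; another engine takes over
            trace.append((job, "cleanup"))
            return trace  # report generated, instance deleted
    return trace


outcomes = iter([True, False, True])
trace = execute_flow(["Job1", "Job2"], lambda job: next(outcomes),
                     {"Job2": "re-try"})
print(trace)
```

In the usage example Job2 fails once, is retried in place, and then succeeds, so the flow completes without leaving its engine.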
Flow re-submission and instance migration
Extract all the information related to a BPEL instance.
Re-shape the instance data and migrate it into another WPS engine.
[Figure: flow engine 1 with DB 1 and flow engine 2 with DB 2; steps 3 and 4 move the instance data between them.]
1. Job2 fails due to resource unavailability
2. Suspend instance in engine 1
3. Instance data exported to XML, and instance deleted in DB1
4. Instance data imported to DB2
5. Instance resumed in engine 2
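The export and import steps can be illustrated with a toy Python sketch in which each engine's database is an in-memory dictionary. The XML layout, the record fields, and the names `export_instance`, `import_instance`, `flow-42`, and `raw.fits` are all hypothetical; the real instance data in a flow engine is far richer, and the suspend and resume calls are left out.

```python
import xml.etree.ElementTree as ET


def export_instance(db, instance_id):
    """Serialize an instance to XML and delete it from the source store."""
    record = db.pop(instance_id)  # mirrors "instance deleted in DB1"
    root = ET.Element("instance", id=instance_id,
                      failedJob=record["failed_job"])
    for name, value in record["variables"].items():
        ET.SubElement(root, "variable", name=name).text = value
    return ET.tostring(root, encoding="unicode")


def import_instance(db, xml_text):
    """Rebuild the instance in the target store and return its resume point."""
    root = ET.fromstring(xml_text)
    db[root.get("id")] = {
        "failed_job": root.get("failedJob"),
        "variables": {v.get("name"): v.text for v in root.findall("variable")},
    }
    return root.get("failedJob")


db1 = {"flow-42": {"failed_job": "Job2", "variables": {"input": "raw.fits"}}}
db2 = {}
resume_from = import_instance(db2, export_instance(db1, "flow-42"))
print(resume_from)
```

Deleting the record during export keeps a single authoritative copy of the instance, so the flow cannot be resumed in both engines at once.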
Implementation
[Figure: the three-layer BPEL4Job architecture, annotated with the product used for each layer.]
– Flow modeling: WebSphere Integration Developer (WID)
– Flow execution: WebSphere Process Server (WPS)
– Job scheduling: Tivoli Dynamic Workload Broker (ITDWB)
Sample Montage Job Flow
[Figure: Montage job flow. Raw images → Generate image table → Image projection in parallel → Generate image table → Generate mosaic → Transform to JPEG.]
Montage: a toolkit for assembling raw astronomy images into custom mosaics.
– Developed by NASA & California Institute of Technology.
– The assembling process is usually expressed as a job flow.
Montage job flow and the re-start policy
Policy says: re-submit from mImgtbl1 when mAdd1 fails
[Figure: base flow and expanded flow (partial).]
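To make the policy concrete, the Python sketch below computes which jobs would run again in the new engine when the restart point precedes the failed job. Only `mImgtbl1` and `mAdd1` come from the slide; the other job names, the linear ordering (the real flow projects images in parallel), and the function `jobs_to_rerun` are assumptions for illustration.

```python
def jobs_to_rerun(flow, failed_job, restart_from):
    """Return the jobs to execute after migration, in flow order."""
    if failed_job not in flow or restart_from not in flow:
        raise ValueError("policy refers to a job that is not in the flow")
    start = flow.index(restart_from)
    if start > flow.index(failed_job):
        raise ValueError("restart point must not come after the failed job")
    return flow[start:]


# Hypothetical, linearized job list; only mImgtbl1 and mAdd1 are from the slide.
flow = ["mProject1", "mImgtbl1", "mAdd1", "mJPEG1"]
print(jobs_to_rerun(flow, failed_job="mAdd1", restart_from="mImgtbl1"))
```

Restarting from an earlier job makes sense when the failed job depends on intermediate output that has to be regenerated on the new engine.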
Instance migration from saba10 to weitan
(a) Montage initiated & failed at saba10
(b) Montage migrated to weitan
(c) Montage re-started and completed at weitan
Conclusion
BPEL4Job: an exploration of using BPEL as a job flow language
– A two-stage approach for job flow modeling with fault-handling policies
– A generic job proxy to accommodate the asynchronous nature of job submission and job status notification
– A set of fundamental fault-handling schemes, including instance migration between flow engines
Future work
– Support more complicated fault-handling policies
• Involving human tasks, expressed as business rules, etc.
– Apply the instance migration technique to
• Load balancing between flow engines
• Instance migration to a newer version
Thank you for your attention.
Please contact me at: Dept. of Automation, Tsinghua Univ., Beijing, China
http://[email protected]