A Proposal of Application Failure Detectionand Recovery in the Grid
Marian Bubak1,2, Tomasz Szepieniec2, Marcin Radecki2
1 Institute of Computer Science, AGH 2 Academic Computer Centre -- CYFRONET
Institute of Computer
Science AGH
Institute of Computer
Science AGH
Outline
Motivation & introduction
Services useful in fault recovery approach
Overview of our proposal
Problems & workflow approach
Summary
Institute of Computer
Science AGH
Motivation
Fault tolerance problembecomes important in the Grid
Environment and application size increase steadily
Reliability of single component does not raise considerably
Risk that application crashes is higher
Crash is more expensive for large application
Institute of Computer
Science AGH
Demands vs. Reality
Minimal overhead
Automatic, quick recovery
Scalability
Transparent
Porting to any kind of
application
Checkpointing is costly
Often restarting whole application
Many global operations
Additional developer’s effort
is required
Application-specific methods
Institute of Computer
Science AGH
Two classes of FT approaches
Application Built-in FT Algorithm/structure profile can be exploited, FT activity can by done more efficiently, e.g. checkpointing Naturally Fault Tolerant problem class, e.g. genetic alg. Fault Tolerant-MPI but... all must be done by developer
FT realized by external services automatic middleware services no developer effort required but... limited functionality
It would be beneficial to combine this two
Institute of Computer
Science AGH
Services useful in FT approach
Monitoring services For fault detection in hardware and software e.g. Check if process is still running,
Checkpointing, logging, redundancy services For preparing recovery e.g. Store the current state of application
Recovery services In case of failure e.g. Rollback from last checkpointing,
Scheduler and resource broker For knowledge about started application For re-scheduling, re-brokering job or it’s part
Institute of Computer
Science AGH
How to make it work together?
The component that manages this services is needed part of middleware job companion co-ordinate actions of FT services
Recovery action taken is more appropriate, because: whole job state is considered the most suitable of available
services could be used
CheckpointingServices
ApplicationMon. Services
RecoveryServices
SchedulerServices
Infrastructure Mon.Services
Fault Tolerant Manager
Institute of Computer
Science AGH
FT Manager – Architecture
Job Supervisor
Recovery ScenarioExecutor
Fault Tolerant Manager
CheckpointingServices
ApplicationMon. Services
RecoveryServices
SchedulerServices
Infrastructure Mon.Services
Infra
structu
re
Ap
plica
tion
ApplicationMonitoring
Check-pointing
Recovery
Decision Maker
Institute of Computer
Science AGH
Job Supervisor (1)
Main functionality: Monitors job execution Manages (or stores information about)
checkpointing When something is wrong generates
Fault Alarm
Fault Alarm contains not only the information what is wrong, but also the status of job (e.g. last checkpoint)
Job Supervisor can be asked to perform more checking by Decision Maker
Decision Maker
Recovery ScenarioExecutor
Fault Tolerant Manager
Job Supervisor
Fault Alarm
Institute of Computer
Science AGH
Job Supervisor (2) – Faults
Typical examples of fault: process crash node is not responding lost connection (link is down)
Extended fault characteristics: Occurring and duration
characteristics Severity for application,
E.g. Master fault is more dangerous than slave fault
Fault is not only when connection is lost, but also when performance dramatically decreases
Sophisticated performance monitoring is required
Decision Maker
Recovery ScenarioExecutor
Fault Tolerant Manager
Fault Alarm
Job Supervisor
Institute of Computer
Science AGH
Decision Maker
Main functionality: Analyzes the situation, when gets fault
alarm Prepares recovery scenarios and sends
the best of them for execution
Issues to be considered: What is possible The cost of each recovery scenario Do-nothing or wait scenario is always
possible and sometimes beneficial E.g. in case of problem with network
link when only recovery is to restart the whole application
Historical data and probabilistic methods should be used
Decision Maker
Job Supervisor
Recovery ScenarioExecutor
Fault Tolerant Manager
Fault Alarm
Recovery Scenario
Institute of Computer
Science AGH
Recovery Scenario Executor
Main functionality: Executes actions from scenario Supervises recovery process
Recovery Scenario contains several actions that could be performed by different recovery services
In case of failure in scenario execution, Decision Maker is alarmed
Decision Maker
Job Supervisor
Recovery ScenarioExecutor
Fault Tolerant Manager
Recovery Scenario
Institute of Computer
Science AGH
Problems
Many class of services to cooperate with
Many interfaces
How to obtain information about application?
Which services are available?
Semantic specification for monitoring and recovery services is needed
Institute of Computer
Science AGH
Feasibility – WorkFlows
Grid-Services-based approach could help to solve our problems
Knowledge about application architecture is accessible Workflow description details are welcomed
Exchange of single component is better that restart the whole application
Directives for FT Manager could be included in job description
Interfaces are unified
Institute of Computer
Science AGH
Summary
Fault tolerance issues become more and more important in the Grid
A service for fault tolerance management has been proposed
...which enables more sophisticated fault tolerance for Grid
Workflow-based framework facilites the task
But, this is a proposal only...You are invited for commenting and remarking!