A Proposal of Application Failure Detection and Recovery in the Grid Marian Bubak 1,2, Tomasz Szepieniec 2, Marcin Radecki 2 1 Institute of Computer Science,

A Proposal of Application Failure Detectionand Recovery in the Grid

Marian Bubak1,2, Tomasz Szepieniec2, Marcin Radecki2

1 Institute of Computer Science, AGH 2 Academic Computer Centre -- CYFRONET

Institute of Computer

Science AGH


Science AGH

Outline

Motivation & introduction

Services useful in fault recovery approach

Overview of our proposal

Problems & workflow approach

Summary


Science AGH

Motivation

Fault tolerance problembecomes important in the Grid

Environment and application size increase steadily

Reliability of single component does not raise considerably

Risk that application crashes is higher

Crash is more expensive for large application


Science AGH

Demands vs. Reality

Minimal overhead

Automatic, quick recovery

Scalability

Transparent

Porting to any kind of

application

Checkpointing is costly

Often restarting whole application

Many global operations

Additional developer’s effort

is required

Application-specific methods


Science AGH

Two classes of FT approaches

Application Built-in FT Algorithm/structure profile can be exploited, FT activity can by done more efficiently, e.g. checkpointing Naturally Fault Tolerant problem class, e.g. genetic alg. Fault Tolerant-MPI but... all must be done by developer

FT realized by external services automatic middleware services no developer effort required but... limited functionality

It would be beneficial to combine this two


Science AGH

Services useful in FT approach

Monitoring services For fault detection in hardware and software e.g. Check if process is still running,

Checkpointing, logging, redundancy services For preparing recovery e.g. Store the current state of application

Recovery services In case of failure e.g. Rollback from last checkpointing,

Scheduler and resource broker For knowledge about started application For re-scheduling, re-brokering job or it’s part


Science AGH

How to make it work together?

The component that manages this services is needed part of middleware job companion co-ordinate actions of FT services

Recovery action taken is more appropriate, because: whole job state is considered the most suitable of available

services could be used

CheckpointingServices

ApplicationMon. Services

RecoveryServices

SchedulerServices

Infrastructure Mon.Services

Fault Tolerant Manager


Science AGH

FT Manager – Architecture

Job Supervisor

Recovery ScenarioExecutor


CheckpointingServices

ApplicationMon. Services

RecoveryServices

SchedulerServices

Infrastructure Mon.Services

Infra

structu

re

Ap

plica

tion

ApplicationMonitoring

Check-pointing

Recovery

Decision Maker


Science AGH

Job Supervisor (1)

Main functionality: Monitors job execution Manages (or stores information about)

checkpointing When something is wrong generates

Fault Alarm

Fault Alarm contains not only the information what is wrong, but also the status of job (e.g. last checkpoint)

Job Supervisor can be asked to perform more checking by Decision Maker

Decision Maker



Job Supervisor

Fault Alarm


Science AGH

Job Supervisor (2) – Faults

Typical examples of fault: process crash node is not responding lost connection (link is down)

Extended fault characteristics: Occurring and duration

characteristics Severity for application,

E.g. Master fault is more dangerous than slave fault

Fault is not only when connection is lost, but also when performance dramatically decreases

Sophisticated performance monitoring is required

Decision Maker



Fault Alarm

Job Supervisor


Science AGH

Decision Maker

Main functionality: Analyzes the situation, when gets fault

alarm Prepares recovery scenarios and sends

the best of them for execution

Issues to be considered: What is possible The cost of each recovery scenario Do-nothing or wait scenario is always

possible and sometimes beneficial E.g. in case of problem with network

link when only recovery is to restart the whole application

Historical data and probabilistic methods should be used

Decision Maker

Job Supervisor



Fault Alarm

Recovery Scenario


Science AGH

Recovery Scenario Executor

Main functionality: Executes actions from scenario Supervises recovery process

Recovery Scenario contains several actions that could be performed by different recovery services

In case of failure in scenario execution, Decision Maker is alarmed

Decision Maker

Job Supervisor



Recovery Scenario


Science AGH

Problems

Many class of services to cooperate with

Many interfaces

How to obtain information about application?

Which services are available?

Semantic specification for monitoring and recovery services is needed


Science AGH

Feasibility – WorkFlows

Grid-Services-based approach could help to solve our problems

Knowledge about application architecture is accessible Workflow description details are welcomed

Exchange of single component is better that restart the whole application

Directives for FT Manager could be included in job description

Interfaces are unified


Science AGH

Summary

Fault tolerance issues become more and more important in the Grid

A service for fault tolerance management has been proposed

...which enables more sophisticated fault tolerance for Grid

Workflow-based framework facilites the task

But, this is a proposal only...You are invited for commenting and remarking!

Documents

A Proposal of Application Failure Detection and Recovery in the Grid Marian Bubak 1,2, Tomasz Szepieniec 2, Marcin Radecki 2 1 Institute of Computer Science,