Grid Checkpointing Architecture

Radosław Januszewski

CoreGrid Summer School 2007

European Research Network on Foundations, Software Infrastructures and Applications for large scale distributed, GRID and Peer-to-Peer Technologies

motivation

- Grids are complex and therefore prone to errors.

- The distributed nature of the Grid makes scheduling of system maintenance hard.

- Each uncoordinated power-down or failure results in the loss of the currently running applications.

- Loss of computation time means additional cost!

goal

To enhance the reliability, fault-tolerance and robustness of the Grid computing environment.

European Research Network on Foundations, Software Infrastructures and Applications for large scale distributed, GRID and Peer-to-Peer Technologies 4

the solution

Grid Checkpoint Architecture (GCA): a proposal for the placement, functionality and interaction schemes of checkpointing services in the Grid environment

grid model

[Figure: the Grid model in three tiers (User Tier, GRID Tier, Cluster Tier). Elements shown: User Interface, Grid Broker, Globally Accessible Storage (Data Management), Local Resource Manager, and Computing Nodes, each with an Operating System.]

GCA in the Grid

[Figure: the Grid model (User Tier, GRID Tier, Cluster Tier) extended with the GCA components: a Grid Checkpoint Service, a Checkpoint Translation Service (CTS), and a Core Service alongside each Operating System on the Computing Nodes. The User Interface, Grid Broker, Globally Accessible Storage (Data Management) and Local Resource Manager are shown as well.]

Proof of concept – the goals

• check whether the GCA survives contact with reality

• prepare the PoC on the basis of a real-life installation

• the Grid with the GCA should provide additional value compared with the "traditional" approach

GCA proof of concept installation

[Figure: the proof-of-concept installation across the User Tier, GRID Tier and Cluster Tier. Elements shown: Command Line Client, GridSphere interface, Migrating Desktop Client, GRMS, WS GRAM, PBS JobManager, Torque/PBS Pro, Checkpoint script, NFS shared space, and three Computing Nodes running Linux with SGIckpt.]

involved elements

• GUI: command line, GridSphere, Migrating Desktop

• Broker: GRMS

• Local Resource Manager: Globus + TORQUE

• Core service: SGIckpt

Bottom-up approach

How to make the checkpointer work with the local resource manager?

pbs/torque special features

action checkpoint

action restart

action checkpoint_abort

config

$action checkpoint 0 !/usr/pbs/bin/pbs-mom-checkpoint.sh %globid %jobid %sid %taskid %path

$action restart 0 !/usr/pbs/bin/pbs_restart_test.sh %path %taskid

$restart_transmogrify true

$action checkpoint_abort 0 !/usr/pbs/bin/pbs-mom-checkpoint-and-stop.sh %globid %jobid %sid %taskid %path

A detailed description is available at http://checkpointing.psnc.pl
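To make the layout of the three $action lines easier to read, here is a minimal Python sketch that splits them into their parts. It is only an illustration, not part of PBS/Torque or of the PoC: the names MomAction and parse_actions are hypothetical, and reading the numeric field as a timeout is an assumption about the PBS configuration syntax.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MomAction:
    event: str       # checkpoint, restart or checkpoint_abort
    timeout: int     # numeric field; assumed here to be a timeout
    script: str      # script path that follows the "!" marker
    args: List[str]  # %-placeholders handed over to the script


def parse_actions(config_text: str) -> Dict[str, MomAction]:
    """Collect the $action lines of a mom config, keyed by event name."""
    actions: Dict[str, MomAction] = {}
    for line in config_text.splitlines():
        fields = line.split()
        # Skip blank lines and directives such as $restart_transmogrify.
        if len(fields) < 4 or fields[0] != "$action":
            continue
        event, timeout, script = fields[1], int(fields[2]), fields[3]
        if not script.startswith("!"):
            continue
        actions[event] = MomAction(event, timeout, script[1:], fields[4:])
    return actions


CONFIG = """\
$action checkpoint 0 !/usr/pbs/bin/pbs-mom-checkpoint.sh %globid %jobid %sid %taskid %path
$action restart 0 !/usr/pbs/bin/pbs_restart_test.sh %path %taskid
$restart_transmogrify true
$action checkpoint_abort 0 !/usr/pbs/bin/pbs-mom-checkpoint-and-stop.sh %globid %jobid %sid %taskid %path
"""

actions = parse_actions(CONFIG)
for event, action in actions.items():
    print(event, "->", action.script, action.args)
```

Each of the three events from the "special features" slide ends up bound to one script, and the restart script receives fewer placeholders than the two checkpoint scripts.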

Broker – local RM connectivity

problem

The checkpointer: a service or a resource?

<grmsJob appid="matrix_demo_submit">
  <task taskid="matrix" persistent="true" crucial="true">
    <resource>
      <localrmname>pbs</localrmname>
    </resource>
    <executable type="multiple" count="1">
      <execfile name="matrixi">
        <url>gsiftp://xxx.xxx.xxx.xxxl//home/user/povray</url>
      </execfile>
    </executable>
    <other>
      <grms_id>${JOB_ID}</grms_id>
      <checkpointable>true</checkpointable>
      <period>1</period>
    </other>
  </task>
</grmsJob>

job description with checkpointing

the end-user point of view

manual scenario

manual scenario - restart

[Figure: the proof-of-concept installation diagram with an application failure marked ("Application", "Failure!").]

<grmsJob appid="matrix_demo_resume">
  <task taskid="matrix" persistent="true" crucial="true">
    <resource>
      <hostname>node-03.checkpointing.psnc.pl</hostname>
      <localrmname>pbs</localrmname>
    </resource>
    <executable type="multiple" count="1">
      <execfile name="matrix_long">
        <url>gsiftp://xxx.xxx.xxx.xxx//home/xxxxxx/test_apps/matrix_long</url>
      </execfile>
    </executable>
    <other>
      <grms_id>${JOB_ID}</grms_id>
      <recovery>true</recovery>
      <ckpt_id>1179315947518_matrix_demo_submit_0459</ckpt_id>
      <checkpointable>true</checkpointable>
      <period>1</period>
    </other>
  </task>
</grmsJob>
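Compared with the submit description, the resume description adds the recovery and ckpt_id elements (it also pins a hostname). As a rough sketch of that difference, the Python below derives a resume description from a trimmed copy of the submit description using only the standard library. The function name make_resume_description is hypothetical, this is plain XML manipulation rather than any GRMS API, and the hostname and executable changes are deliberately left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the submit description (executable section omitted).
SUBMIT = """<grmsJob appid="matrix_demo_submit">
  <task taskid="matrix" persistent="true" crucial="true">
    <resource><localrmname>pbs</localrmname></resource>
    <other>
      <grms_id>${JOB_ID}</grms_id>
      <checkpointable>true</checkpointable>
      <period>1</period>
    </other>
  </task>
</grmsJob>"""


def make_resume_description(submit_xml: str, ckpt_id: str, appid: str) -> str:
    """Return a copy of submit_xml that asks for recovery from ckpt_id."""
    root = ET.fromstring(submit_xml)
    root.set("appid", appid)
    for other in root.iter("other"):
        recovery = ET.Element("recovery")
        recovery.text = "true"
        ckpt = ET.Element("ckpt_id")
        ckpt.text = ckpt_id
        # Place the new elements right after <grms_id>, matching the
        # element order of the resume description shown above.
        children = list(other)
        pos = next((i + 1 for i, c in enumerate(children)
                    if c.tag == "grms_id"), 0)
        other.insert(pos, recovery)
        other.insert(pos + 1, ckpt)
    return ET.tostring(root, encoding="unicode")


resume_xml = make_resume_description(
    SUBMIT,
    ckpt_id="1179315947518_matrix_demo_submit_0459",
    appid="matrix_demo_resume",
)
print(resume_xml)
```

In the manual scenario the user has to supply the checkpoint identifier by hand, which is the step the workflow-based approach later removes.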

failure – end-user view

problem

This semi-automatic solution is not optimal.

How can automatic job-failure handling be introduced without adding new functionality to the Broker?

Use workflows!

the workflow

- submit job description

- job finished successfully? yes: send results to user; no: submit "restart scenario" job

- "restart scenario" job finished successfully? yes: send results to user; no: return error description

Problem: using this broker we are not able to model loops
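The control flow of this workflow can be sketched in a few lines of Python. This is an illustration of the logic only, not broker code: JobResult, run_with_one_restart and the submit callable are hypothetical stand-ins for whatever the broker actually does when it runs a job description.

```python
from typing import Callable, NamedTuple


class JobResult(NamedTuple):
    ok: bool      # True when the job finished successfully
    output: str   # results on success, error description on failure


def run_with_one_restart(
    submit: Callable[[str], JobResult],
    job_description: str,
    restart_description: str,
) -> str:
    """Submit the job; on failure submit the restart scenario exactly once."""
    first = submit(job_description)
    if first.ok:
        return first.output          # send results to user
    second = submit(restart_description)
    if second.ok:
        return second.output         # send results to user
    # No loop is available, so a second failure ends the workflow.
    raise RuntimeError("job failed after restart: " + second.output)


# Usage with a fake submit function that fails once, then succeeds.
calls = []


def fake_submit(description: str) -> JobResult:
    calls.append(description)
    if len(calls) == 1:
        return JobResult(False, "node failure")
    return JobResult(True, "matrix result")


result = run_with_one_restart(fake_submit, "submit.xml", "restart.xml")
print(result)
```

Because the retry is written out once rather than in a loop, the sketch has the same limitation as the workflow: a failure of the restarted job is reported to the user instead of triggering another restart.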

automatic scenario

[Figure: the proof-of-concept installation diagram with an application failure marked ("Application", "Failure!").]

end-user point of view

the benefits

user: more robust and fault-tolerant Grid environment

sysadmin: much easier system management thanks to the automatic checkpoint and recovery mechanism

Thank you!
