Arun Babu Nagarajan, Frank Mueller
North Carolina State University
Proactive Fault Tolerance for HPC
using Xen Virtualization
2
Problem Statement
Trends in HPC: high-end systems with thousands of processors
- increased probability of a node failure: MTBF becomes shorter
MPI widely accepted in scientific computing
- problem with MPI: the standard provides no recovery from faults
FT schemes exist today, but...
- only reactive: process checkpoint/restart
- must restart the entire job, which is inefficient if only one node (or a few) fails
- overhead due to redoing some of the work
- open issue: at what frequency should one checkpoint?
- a 100-hour job will run for an additional 150 hours on a petaflop machine (even without a failure) [I. Philp, 2005]
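The checkpoint-frequency question has a classical first-order answer: Young's approximation balances checkpoint cost against the expected rework after a failure. A minimal sketch (the 5-minute checkpoint cost and 8-hour MTBF are invented, illustrative numbers):

```python
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    # Young's first-order optimum: T_opt = sqrt(2 * C * MTBF)
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# A 300 s checkpoint and an 8-hour MTBF suggest checkpointing
# roughly every 69 minutes.
interval = young_interval(300, 8 * 3600)
```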
3
Our Solution
Proactive FT
- anticipates node failure
- takes preventive action instead of reacting to a failure
  - migrates the whole OS to a healthier physical node
  - entirely transparent to the application (though not to the OS itself)
- hence avoids the high overhead of reactive schemes (the overhead associated with our scheme is very small)
4
Design space
1. A mechanism to predict/anticipate the failure of a node
- OpenIPMI
- lm_sensors (more system-specific: x86 Linux)
2. A mechanism to identify the best target node
- custom centralized approaches: do not scale and are unreliable
- scalable distributed approach: Ganglia
3. Most importantly, a mechanism (the preventive action) that supports relocating the running application with
- its state preserved
- minimal overhead on the application itself
- Xen virtualization with live migration support [C. Clark et al., May 2005]: open source
5
Mechanisms explained
1. Health monitoring with OpenIPMI
- Baseboard Management Controller (BMC): equipped with sensors to monitor properties such as temperature, fan speed, and voltage on each node
- IPMI (Intelligent Platform Management Interface)
  - increasingly common in HPC
  - standard message-based interface to monitor hardware
  - raw messaging is hard to use and debug
- OpenIPMI: open-source, higher-level abstraction over the raw IPMI message-response system for communicating with the BMC (i.e., reading sensors)
- we use OpenIPMI to gather health information about the nodes
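OpenIPMI itself exposes a fairly involved C API; as a rough stand-in, the same class of health data is visible on many Linux nodes through the hwmon sysfs tree. A sketch under that assumption (a real deployment would go through OpenIPMI to the BMC instead):

```python
from pathlib import Path

def read_temperatures(hwmon_root="/sys/class/hwmon"):
    # Collect (sensor, degrees C) pairs; hwmon files report
    # temperatures in millidegrees Celsius.
    temps = []
    root = Path(hwmon_root)
    if not root.is_dir():
        return temps
    for chip in sorted(root.iterdir()):
        for f in sorted(chip.glob("temp*_input")):
            raw = int(f.read_text().strip())
            name = f"{chip.name}/{f.name.removesuffix('_input')}"
            temps.append((name, raw / 1000.0))
    return temps
```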
6
Mechanisms explained
2. Ganglia
- widely used, scalable distributed load-monitoring tool
- every node in the cluster runs a Ganglia daemon, and each node has an approximate view of the entire cluster
- UDP is used to transfer messages
- measures CPU, memory, and network usage by default
- we use Ganglia to identify the least-loaded node as the migration target
- also extended to distribute IPMI sensor data
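Ganglia's gmond publishes its cluster view as an XML dump (on TCP port 8649 by default), so picking a migration target reduces to scanning that XML for the standard load_one metric. A sketch; the PFTd's actual query path may differ:

```python
import xml.etree.ElementTree as ET

def least_loaded(gmond_xml, metric="load_one"):
    # Scan every <HOST> for the given <METRIC> and return the
    # (hostname, value) pair with the smallest load.
    best = None
    for host in ET.fromstring(gmond_xml).iter("HOST"):
        for m in host.iter("METRIC"):
            if m.get("NAME") == metric:
                val = float(m.get("VAL"))
                if best is None or val < best[1]:
                    best = (host.get("NAME"), val)
    return best
```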
7
Mechanisms explained
[Diagram: a privileged VM and a guest VM (running MPI tasks) on top of the Xen VMM, with the hardware below]

3. Fault tolerance with Xen
- para-virtualized environment
  - OS modified
  - application unchanged
- privileged VM and guest VMs run on the Xen hypervisor/VMM
- guest VMs can live-migrate to other hosts with little overhead
  - the state of the VM is preserved
  - the VM is halted only for an insignificant period of time
  - migration phases:
    - phase 1: send guest image to destination node, app running
    - phase 2: send repeated diffs to destination node, app still running
    - phase 3: commit final diffs to destination node, OS/app frozen
    - phase 4: activate guest on destination, app running again
8
Overall set-up of the components

[Diagram: each node runs the Xen VMM over hardware with a BMC; the privileged VM hosts Ganglia and a PFT daemon, and three nodes also run a guest VM with MPI tasks. One node is a stand-by Xen host with no guest. On deteriorating health, the guest VM (with the MPI app) migrates to the stand-by host.]

BMC = Baseboard Management Controller
9
Overall set-up of the components

[Same diagram: on deteriorating health, the guest VM (with the MPI app) migrates to the stand-by host.]

The destination host generates an unsolicited ARP reply advertising that the guest VM's IP has moved to a new location [C. Clark et al., 2005]; this causes peers to resend packets to the new host.

BMC = Baseboard Management Controller
10
Proactive Fault Tolerance (PFT) Daemon
Runs on the privileged VM (host).

Initialization:
- reads safe thresholds from a config file: <sensor name> <low threshold> <high threshold>
- CPU temperature and fan speeds today; extensible (corrupt sectors, network, voltage fluctuations, ...)
- initializes the connection with the IPMI BMC using authentication parameters and the hostname
- gathers a list of the sensors available in the system and validates it against our list

[Flowchart: Initialize -> Health Monitor (reads the IPMI Baseboard Management Controller) -> threshold breached? If no, continue monitoring; if yes, raise an alarm for maintenance of the system and hand control to Load Balance (Ganglia).]
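The threshold table and breach test at the heart of that loop can be sketched as follows (the config format mirrors the `<sensor name> <low> <high>` lines above; the sensor names and values are illustrative):

```python
def parse_thresholds(config_text):
    # One '<sensor name> <low> <high>' triple per line; the last two
    # tokens are the thresholds, everything before them is the name.
    table = {}
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        name = " ".join(parts[:-2])
        table[name] = (float(parts[-2]), float(parts[-1]))
    return table

def breached(table, readings):
    # Sensors whose current reading falls outside [low, high].
    return [name for name, value in readings.items()
            if name in table
            and not (table[name][0] <= value <= table[name][1])]
```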
11
PFT Daemon
Health monitoring:
- interacts with the IPMI BMC (via OpenIPMI) to read the sensors
- periodic sampling of data (event-driven operation is also supported)
- once a threshold is exceeded, control is handed over to load balancing

Load balancing:
- PFTd determines the migration target by contacting Ganglia
- load-based selection (lowest load, obtained via the /proc file system)
- invokes Xen live migration for the guest VM

Xen user-land tools (at the VM/host):
- command-line interface for live migration
- the PFT daemon initiates migration of the guest VM
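Once Ganglia names a target, the hand-off to Xen is just the user-land migration command. A sketch (the domain and host names are placeholders; `dry_run` keeps the sketch testable without a Xen installation):

```python
import subprocess

def migrate_guest(domain, target_host, dry_run=True):
    # Xen 3.x user-land live migration, as PFTd invokes it.
    cmd = ["xm", "migrate", "--live", domain, target_host]
    if dry_run:
        return cmd
    return subprocess.run(cmd, check=True)
```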
12
Experimental Framework
Cluster of 16 nodes (dual-core, dual Opteron 265, 1 Gbps Ethernet)
Xen 3.0.2-3 VMM
Privileged and guest VMs run a ported Linux kernel, version 2.6.16
Guest VM:
- same configuration as the privileged VM
- 1 GB RAM
- booted on the VMM with PXE netboot via NFS
- has access to NFS (same as the privileged VM)
Ganglia on the privileged VM (and also the guest VM) on all nodes
Node sensors read via OpenIPMI
13
Experimental Framework
NAS Parallel Benchmarks run on the guest virtual machine
MPICH-2 with an MPD ring on n guest VMs (no job pause required!)
A process on the privileged domain:
- monitors the MPI task runs
- issues the migration command (NFS used for synchronization)
Measured:
- wall-clock time with and without migration
- actual downtime + migration overhead (modified Xen migration)
Benchmarks run 10 times; results report the average
NPB V3.2.1: BT, CG, EP, LU, and SP benchmarks
- IS runs are too short
- MG requires more than 1 GB for class C
14
Experimental Results
1. Single-node failure / 2. Double-node failure

[Charts (NPB class B / 4 nodes and NPB class C / 4 nodes): execution time in seconds for BT, CG, EP, LU, and SP, without migration vs. with 1 migration, and with 2 migrations for the double-failure case.]

Single-node failure: overhead of 1-4% over total wall-clock time
Double-node failure: overhead of 2-8% over total wall-clock time
15
Experimental Results
3. Behavior of problem scaling

[Chart (NPB, 4 nodes): overhead in seconds for BT, CG, EP, LU, and SP, classes B and C, split into migration overhead and actual downtime.]

Overhead generally increases with problem size (CG is an exception)
The chart depicts only the overhead portion
The dark region represents the time for which the VM was halted
The light region represents the delay incurred due to migration (diff operations, etc.)
16
Experimental Results
4. Behavior of task scaling

[Chart (NPB class C): overhead in seconds for BT, CG, EP, LU, and SP on 4, 8, and 16 nodes, split into migration overhead and actual downtime.]

Generally we expect the overhead to decrease as the number of nodes increases
Some discrepancies observed for BT and LU (migration takes 40 s, but here we see 60 s)
17
Experimental Results
5. Migration duration

[Charts: migration duration in seconds for BT, CG, EP, LU, and SP; class B vs. class C inputs on 4 nodes, and class C inputs on 4, 8, and 16 nodes.]

A minimum of 13 s is needed to transfer a 1 GB VM without any active processes
A maximum of 40 s is needed from migration initiation to completion
The duration depends on the network bandwidth, the RAM size, and the application
18
Experimental Results
6. Scalability (total execution time)

[Chart (NPB class C): speedup for BT, CG, EP, LU, and SP on 4, 8, and 16 nodes, with the loss in speedup marked.]

Speedup is not greatly affected
19
Related Work
FT: the reactive approach is more common

Automatic:
- checkpoint/restart (e.g., BLCR, Berkeley Lab Checkpoint/Restart) [S. Sankaran et al., LACSI '03], [G. Stellner, IPPS '96]
- log-based (message logging + temporal ordering) [G. Bosilca, Supercomputing 2002]

Non-automatic:
- explicit invocation of checkpoint routines [R. T. Aulwes et al., IPDPS 2004], [G. E. Fagg and J. J. Dongarra, 2000]

Virtualization in HPC incurs little or no overhead [W. Huang et al., ICS '06]

To make virtualization competitive for message-passing environments, VMM-bypass I/O in VMs has been explored [J. Liu et al., USENIX '06]

Network virtualization can be optimized [A. Menon et al., USENIX '06]
20
Conclusion
In contrast to the currently available reactive FT schemes, we have built a proactive system with much lower overhead

Transparent and automatic FT for arbitrary MPI applications

Ideally complements long-running MPI jobs

A proactive system complements reactive systems well: it helps greatly reduce the high overhead associated with reactive schemes