Arun Babu Nagarajan, Frank Mueller
North Carolina State University
Proactive Fault Tolerance for HPC
using Xen Virtualization
2
Problem Statement
Trends in HPC: high-end systems with thousands of processors
- increased probability of a node failure: MTBF becomes shorter
MPI widely accepted in scientific computing
- problem with MPI: the standard provides no recovery from faults
FT schemes exist today, but...
- only reactive: process checkpoint/restart
- must restart the entire job, which is inefficient if only one node (or a few) fails
- overhead due to redoing some of the work
- open issue: at what frequency should one checkpoint?
- a 100-hour job will run for an additional 150 hours on a petaflop machine (even without a failure) [I. Philp, 2005]
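The checkpoint-frequency question has a classical first-order answer: Young's approximation balances checkpoint cost against the expected rework after a failure. A minimal sketch (the 5-minute checkpoint cost and 8-hour MTBF are invented, illustrative numbers):

```python
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    # Young's first-order optimum: T_opt = sqrt(2 * C * MTBF)
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# A 300 s checkpoint and an 8-hour MTBF suggest checkpointing
# roughly every 69 minutes.
interval = young_interval(300, 8 * 3600)
```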
3
Our Solution
Proactive FT
- anticipates node failure
- takes preventive action instead of reacting to a failure
  - migrates the whole OS to a healthier physical node
  - entirely transparent to the application (though not to the OS itself)
- hence avoids the high overhead of reactive schemes (the overhead associated with our scheme is very small)
4
Design space
1. A mechanism to predict/anticipate the failure of a node
- OpenIPMI
- lm_sensors (more system-specific: x86 Linux)
2. A mechanism to identify the best target node
- custom centralized approaches: do not scale and are unreliable
- scalable distributed approach: Ganglia
3. Most importantly, a mechanism (the preventive action) that supports relocating the running application with
- its state preserved
- minimal overhead on the application itself
- Xen virtualization with live migration support [C. Clark et al., May 2005]: open source
5
Mechanisms explained
1. Health monitoring with OpenIPMI
- Baseboard Management Controller (BMC): equipped with sensors to monitor properties such as temperature, fan speed, and voltage on each node
- IPMI (Intelligent Platform Management Interface)
  - increasingly common in HPC
  - standard message-based interface to monitor hardware
  - raw messaging is hard to use and debug
- OpenIPMI: open-source, higher-level abstraction over the raw IPMI message-response system for communicating with the BMC (i.e., reading sensors)
- we use OpenIPMI to gather health information about the nodes
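OpenIPMI itself exposes a fairly involved C API; as a rough stand-in, the same class of health data is visible on many Linux nodes through the hwmon sysfs tree. A sketch under that assumption (a real deployment would go through OpenIPMI to the BMC instead):

```python
from pathlib import Path

def read_temperatures(hwmon_root="/sys/class/hwmon"):
    # Collect (sensor, degrees C) pairs; hwmon files report
    # temperatures in millidegrees Celsius.
    temps = []
    root = Path(hwmon_root)
    if not root.is_dir():
        return temps
    for chip in sorted(root.iterdir()):
        for f in sorted(chip.glob("temp*_input")):
            raw = int(f.read_text().strip())
            name = f"{chip.name}/{f.name.removesuffix('_input')}"
            temps.append((name, raw / 1000.0))
    return temps
```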
6
Mechanisms explained
2. Ganglia
- widely used, scalable distributed load-monitoring tool
- every node in the cluster runs a Ganglia daemon, and each node has an approximate view of the entire cluster
- UDP is used to transfer messages
- measures CPU, memory, and network usage by default
- we use Ganglia to identify the least-loaded node as the migration target
- also extended to distribute IPMI sensor data
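Ganglia's gmond publishes its cluster view as an XML dump (on TCP port 8649 by default), so picking a migration target reduces to scanning that XML for the standard load_one metric. A sketch; the PFTd's actual query path may differ:

```python
import xml.etree.ElementTree as ET

def least_loaded(gmond_xml, metric="load_one"):
    # Scan every <HOST> for the given <METRIC> and return the
    # (hostname, value) pair with the smallest load.
    best = None
    for host in ET.fromstring(gmond_xml).iter("HOST"):
        for m in host.iter("METRIC"):
            if m.get("NAME") == metric:
                val = float(m.get("VAL"))
                if best is None or val < best[1]:
                    best = (host.get("NAME"), val)
    return best
```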
7
Mechanisms explained
[Diagram: a privileged VM and a guest VM (running MPI tasks) on top of the Xen VMM, with the hardware below]

3. Fault tolerance with Xen
- para-virtualized environment
  - OS modified
  - application unchanged
- privileged VM and guest VMs run on the Xen hypervisor/VMM
- guest VMs can live-migrate to other hosts with little overhead
  - the state of the VM is preserved
  - the VM is halted only for an insignificant period of time
  - migration phases:
    - phase 1: send guest image to destination node, app running
    - phase 2: send repeated diffs to destination node, app still running
    - phase 3: commit final diffs to destination node, OS/app frozen
    - phase 4: activate guest on destination, app running again
8
Overall set-up of the components

[Diagram: each node runs the Xen VMM over hardware with a BMC; the privileged VM hosts Ganglia and a PFT daemon, and three nodes also run a guest VM with MPI tasks. One node is a stand-by Xen host with no guest. On deteriorating health, the guest VM (with the MPI app) migrates to the stand-by host.]

BMC = Baseboard Management Controller
9
Overall set-up of the components

[Same diagram: on deteriorating health, the guest VM (with the MPI app) migrates to the stand-by host.]

The destination host generates an unsolicited ARP reply advertising that the guest VM's IP has moved to a new location [C. Clark et al., 2005]; this causes peers to resend packets to the new host.

BMC = Baseboard Management Controller
10
Proactive Fault Tolerance (PFT) Daemon
Runs on the privileged VM (host).

Initialization:
- reads safe thresholds from a config file: <sensor name> <low threshold> <high threshold>
- CPU temperature and fan speeds today; extensible (corrupt sectors, network, voltage fluctuations, ...)
- initializes the connection with the IPMI BMC using authentication parameters and the hostname
- gathers a list of the sensors available in the system and validates it against our list

[Flowchart: Initialize -> Health Monitor (reads the IPMI Baseboard Management Controller) -> threshold breached? If no, continue monitoring; if yes, raise an alarm for maintenance of the system and hand control to Load Balance (Ganglia).]
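The threshold table and breach test at the heart of that loop can be sketched as follows (the config format mirrors the `<sensor name> <low> <high>` lines above; the sensor names and values are illustrative):

```python
def parse_thresholds(config_text):
    # One '<sensor name> <low> <high>' triple per line; the last two
    # tokens are the thresholds, everything before them is the name.
    table = {}
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        name = " ".join(parts[:-2])
        table[name] = (float(parts[-2]), float(parts[-1]))
    return table

def breached(table, readings):
    # Sensors whose current reading falls outside [low, high].
    return [name for name, value in readings.items()
            if name in table
            and not (table[name][0] <= value <= table[name][1])]
```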
11
PFT Daemon
Health monitoring:
- interacts with the IPMI BMC (via OpenIPMI) to read the sensors
- periodic sampling of data (event-driven operation is also supported)
- once a threshold is exceeded, control is handed over to load balancing

Load balancing:
- PFTd determines the migration target by contacting Ganglia
- load-based selection (lowest load, obtained via the /proc file system)
- invokes Xen live migration for the guest VM

Xen user-land tools (at the VM/host):
- command-line interface for live migration
- the PFT daemon initiates migration of the guest VM
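Once Ganglia names a target, the hand-off to Xen is just the user-land migration command. A sketch (the domain and host names are placeholders; `dry_run` keeps the sketch testable without a Xen installation):

```python
import subprocess

def migrate_guest(domain, target_host, dry_run=True):
    # Xen 3.x user-land live migration, as PFTd invokes it.
    cmd = ["xm", "migrate", "--live", domain, target_host]
    if dry_run:
        return cmd
    return subprocess.run(cmd, check=True)
```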
12
Experimental Framework
Cluster of 16 nodes (dual-core, dual Opteron 265, 1 Gbps Ethernet)
Xen 3.0.2-3 VMM
Privileged and guest VMs run a ported Linux kernel, version 2.6.16
Guest VM:
- same configuration as the privileged VM
- 1 GB RAM
- booted on the VMM with PXE netboot via NFS
- has access to NFS (same as the privileged VM)
Ganglia on the privileged VM (and also the guest VM) on all nodes
Node sensors read via OpenIPMI
13
Experimental Framework
NAS Parallel Benchmarks run on the guest virtual machine
MPICH-2 with an MPD ring on n guest VMs (no job pause required!)
A process on the privileged domain:
- monitors the MPI task runs
- issues the migration command (NFS used for synchronization)
Measured:
- wall-clock time with and without migration
- actual downtime + migration overhead (modified Xen migration)
Benchmarks run 10 times; results report the average
NPB V3.2.1: BT, CG, EP, LU, and SP benchmarks
- IS runs are too short
- MG requires more than 1 GB for class C
14
Experimental Results
1. Single-node failure / 2. Double-node failure

[Charts (NPB class B / 4 nodes and NPB class C / 4 nodes): execution time in seconds for BT, CG, EP, LU, and SP, without migration vs. with 1 migration, and with 2 migrations for the double-failure case.]

Single-node failure: overhead of 1-4% over total wall-clock time
Double-node failure: overhead of 2-8% over total wall-clock time
15
Experimental Results
3. Behavior of problem scaling

[Chart (NPB, 4 nodes): overhead in seconds for BT, CG, EP, LU, and SP, classes B and C, split into migration overhead and actual downtime.]

Overhead generally increases with problem size (CG is an exception)
The chart depicts only the overhead portion
The dark region represents the time for which the VM was halted
The light region represents the delay incurred due to migration (diff operations, etc.)
16
Experimental Results
4. Behavior of task scaling

[Chart (NPB class C): overhead in seconds for BT, CG, EP, LU, and SP on 4, 8, and 16 nodes, split into migration overhead and actual downtime.]

Generally we expect the overhead to decrease as the number of nodes increases
Some discrepancies observed for BT and LU (migration takes 40 s, but here we see 60 s)
17
Experimental Results
5. Migration duration

[Charts: migration duration in seconds for BT, CG, EP, LU, and SP; class B vs. class C inputs on 4 nodes, and class C inputs on 4, 8, and 16 nodes.]

A minimum of 13 s is needed to transfer a 1 GB VM without any active processes
A maximum of 40 s is needed from migration initiation to completion
The duration depends on the network bandwidth, the RAM size, and the application
18
Experimental Results
6. Scalability (total execution time)

[Chart (NPB class C): speedup for BT, CG, EP, LU, and SP on 4, 8, and 16 nodes, with the loss in speedup marked.]

Speedup is not greatly affected
19
Related Work
FT: the reactive approach is more common

Automatic:
- checkpoint/restart (e.g., BLCR, Berkeley Lab Checkpoint/Restart) [S. Sankaran et al., LACSI '03], [G. Stellner, IPPS '96]
- log-based (message logging + temporal ordering) [G. Bosilca, Supercomputing 2002]

Non-automatic:
- explicit invocation of checkpoint routines [R. T. Aulwes et al., IPDPS 2004], [G. E. Fagg and J. J. Dongarra, 2000]

Virtualization in HPC incurs little or no overhead [W. Huang et al., ICS '06]

To make virtualization competitive for message-passing environments, VMM-bypass I/O in VMs has been explored [J. Liu et al., USENIX '06]

Network virtualization can be optimized [A. Menon et al., USENIX '06]
20
Conclusion
In contrast to the currently available reactive FT schemes, we have built a proactive system with much lower overhead

Transparent and automatic FT for arbitrary MPI applications

Ideally complements long-running MPI jobs

A proactive system complements reactive systems well: it helps greatly reduce the high overhead associated with reactive schemes