CIFTS: Coordinated Infrastructure for Fault Tolerant Systems
Agenda
• The Problem and the purpose
• The CIFTS framework
• The CIFTS team
• Getting Involved
*System        CPUs     Reliability
ASCI-Q         8,192    MTBI: 6.5 hrs (storage, CPU, memory)
ASCI-W         8,192    MTBF: 5 hrs (’01) and 40 hrs (’03) (storage, CPU, 3rd party HW)
PSC Lemieux    3,016    MTBI: 9.7 hrs
Google         15,000   20 reboots/day; 2-3% machines replaced/year (storage, memory)
*“A Power-aware Run-Time System for High-Performance Computing”, Chung-hsing Hsu and Wu-chun Feng, IEEE International Supercomputing Conference (SC), 2005
Current HPC Systems
• Top 500 statistics
– Performance growth: 35.86 TF/s (2002) to 280 TF/s (2007)
– Average node count growth: 128-258 (2002) to 1024-2048 (2007)
Downtime Cost
*Service                       Cost of One Hour Downtime
Brokerage Operations           $6,450,000
Credit Card Authorization      $2,600,000
eBay                           $225,000
Amazon                         $180,000
Package Shipping Services      $150,000
Home Shopping Channel          $113,000
Catalog Sales Center           $90,000
*“A Power-aware Run-Time System for High-Performance Computing”, Chung-hsing Hsu and Wu-chun Feng, IEEE International Supercomputing Conference (SC), 2005
“Faults directly impact system downtime and TCO”
Fault Tolerance in HPC
• Available for some HPC components
– Storage (RAID variations) and file systems (dCache, TeraGrid FS, Panasas, IBRIX, BulkFS)
– Checkpointing software (application checkpointing, e.g. BLCR, Condor; operating system checkpointing, e.g. TICK)
– Software built using hardware technologies like lmsensors, OpenMPI, BMC and other monitoring software like Ganglia
– Middleware (FT-MPI, MPICH-V, FE-MPI, FT ARMCI)
Components mostly deal with faults on an individual basis!
Sharing of fault information globally is missing!
A typical scenario
1. The job scheduler launches MPI Job 1.
2. The MPI application (job 1) detects a “communication failure” with node X.
3. MPI aborts, and the application aborts.
4. The job scheduler launches MPI Job 2.
5. More failures follow.
Other software on the cluster is unaware of this MPI job failure.
Other software is also unaware of the reason for the MPI job failure!
The CIFTS Framework
[Figure: the Fault Tolerant Backplane at the center, connecting two groups of software]
• System components, libraries and applications
– Linear algebra libraries
– HPC middleware
– Job scheduler/resource manager
– File systems
– Operating systems
– Networking libraries
– System monitoring software
– System management hardware
– Applications
• Autonomics
– Universal logger
– Automatic actions
– Diagnostics tools
– Event analysis
CIFTS - Usage Scenario
Trigger: I/O node failure; the file system is down.
• Parallel FS: shares this information
• Job scheduler: launches jobs with the NFS file system; migrates existing jobs
• MPI-IO: prints a coherent error message
• Applications: checkpoint themselves
CIFTS - Usage Scenario
Trigger: a hardware sensor detects increasing disk temperature on node X.
• Hardware sensor: shares this knowledge
• Job scheduler: does not launch jobs on node X until further diagnosis
• Diagnostics utility: runs scripts for further root-causing
• MPI: starts checkpointing
• Parallel FS: prepares for I/O data migration from node X
• Application: starts checkpointing
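The disk-temperature scenario can be sketched from the job scheduler's side: it reacts to a shared event by keeping node X out of scheduling until a diagnosis clears it. This is a minimal illustration, not FTB code. The event kinds (`EV_DISK_TEMP_RISING`, `EV_DIAGNOSIS_OK`), the `scheduler_t` type and the function names are all hypothetical, and the "diagnosis OK" event is an assumption about how the exclusion would be lifted.

```c
#include <assert.h>
#include <string.h>

#define NUM_NODES 4

/* Hypothetical event kinds; the slides do not define event names for this case. */
typedef enum { EV_DISK_TEMP_RISING, EV_DIAGNOSIS_OK } event_kind_t;

typedef struct {
    event_kind_t kind;
    int node;                 /* node the event refers to */
} event_t;

typedef struct {
    int excluded[NUM_NODES];  /* 1 = do not launch jobs on this node */
} scheduler_t;

void scheduler_init(scheduler_t *s)
{
    memset(s, 0, sizeof *s);
}

/* Handler the scheduler would run for each event it has subscribed to. */
void scheduler_on_event(scheduler_t *s, const event_t *ev)
{
    assert(ev->node >= 0 && ev->node < NUM_NODES);
    if (ev->kind == EV_DISK_TEMP_RISING)
        s->excluded[ev->node] = 1;   /* hold the node back until diagnosed */
    else if (ev->kind == EV_DIAGNOSIS_OK)
        s->excluded[ev->node] = 0;   /* assumed: diagnostics cleared the node */
}

/* Returns the first node that may host a new job, or -1 if none is available. */
int scheduler_pick_node(const scheduler_t *s)
{
    for (int n = 0; n < NUM_NODES; n++)
        if (!s->excluded[n])
            return n;
    return -1;
}
```

The point of the sketch is that the scheduler never talks to the sensor directly; it only needs the event, which is what sharing fault information through the backplane provides.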
Lifecycle of a component interaction with FTB
1. Register with FTB
2. Subscribe for events
3. Publish events
4. Deregister from FTB
[Figure: component instances register with the distributed Fault Tolerant Backplane, which is made up of FTB Agents, then subscribe to a set of events and publish events through it]
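The lifecycle steps imply an ordering: a component has to register before it can subscribe or publish, and nothing is valid once it has deregistered. A minimal sketch of that ordering as a small state machine follows. It does not use the FTB client library; `component_t`, the `comp_*` functions and the return codes are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lifecycle states; not part of the FTB API. */
typedef enum { COMP_NEW, COMP_REGISTERED, COMP_DEREGISTERED } comp_state_t;

enum { COMP_OK = 0, COMP_ERR_STATE = -1 };

typedef struct {
    comp_state_t state;
    int subscriptions;   /* active subscriptions */
    int published;       /* events published so far */
} component_t;

void comp_init(component_t *c)
{
    assert(c != NULL);
    c->state = COMP_NEW;
    c->subscriptions = 0;
    c->published = 0;
}

/* Step 1: register. Only valid once, from the initial state. */
int comp_register(component_t *c)
{
    if (c->state != COMP_NEW)
        return COMP_ERR_STATE;
    c->state = COMP_REGISTERED;
    return COMP_OK;
}

/* Step 2: subscribe. Requires a registered component. */
int comp_subscribe(component_t *c)
{
    if (c->state != COMP_REGISTERED)
        return COMP_ERR_STATE;
    c->subscriptions++;
    return COMP_OK;
}

/* Step 3: publish. Requires a registered component. */
int comp_publish(component_t *c)
{
    if (c->state != COMP_REGISTERED)
        return COMP_ERR_STATE;
    c->published++;
    return COMP_OK;
}

/* Step 4: deregister. Drops subscriptions and ends the lifecycle. */
int comp_deregister(component_t *c)
{
    if (c->state != COMP_REGISTERED)
        return COMP_ERR_STATE;
    c->subscriptions = 0;
    c->state = COMP_DEREGISTERED;
    return COMP_OK;
}
```

Steps 2 and 3 can repeat in any order while the component is registered; only steps 1 and 4 are one-shot.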
Delving deeper in FTB framework
[Figure: two software stacks, each running on Linux, BGL or Cray]
• Component software stack (top to bottom)
– Component 1 ... Component n
– FTB Client API
– Client Library
– Manager Library
– Network (Network Module 1, Network Module 2)
• FTB Agent software stack (top to bottom)
– FTB Agent
– FTB Manager API
– Manager Library
– Network (Network Module 1, Network Module 2)
FTB Internal Architecture Layers
[Figure: the component software stack (Component 1 ... Component n, Client Library) and the FTB Agent software stack (FTB Agent, FTB Manager API), each built on the Manager Library and the Network layer (Network Module 1, Network Module 2), on Linux, BGL or Cray]
What you need to know!
Just the FTB Client API
CIFTS API* Snapshot
• FTB_Init (IN FTB_comp_info_t *comp_info, OUT FTB_client_handle_t *client_handle, OUT char *error_msg)
• FTB_Publish_event (IN FTB_client_handle_t handle, IN char *event_name, IN FTB_event_data_t *datadetails, OUT char *error_msg)
• FTB_Create_mask (INOUT FTB_event_mask_t *evt_mask, IN char *field_name, IN char *field_val, OUT char *error_msg)
• FTB_Subscribe (IN FTB_client_handle_t chandle, IN FTB_event_mask_t *event_mask, OUT FTB_subscribe_handle_t *shandle, OUT char *error_msg, IN int (*callback)(OUT FTB_catch_event_info_t *, OUT void*), IN void *arg)
• FTB_Poll_for_event (IN FTB_subscribe_handle_t shandle, OUT FTB_catch_event_info_t *catch_event, OUT char *error_msg)
• FTB_Finalize (IN FTB_client_handle_t handle)
*Work in progress
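The API offers two ways to receive events: a callback passed to FTB_Subscribe, or explicit calls to FTB_Poll_for_event. One way to picture the polling style is a bounded FIFO per subscription that the backplane fills and the component drains. The sketch below is an assumption about how such a queue could behave, not a description of the FTB implementation. `sub_queue_t`, `queue_deliver`, `queue_poll`, the capacity and the drop-when-full policy are all hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define QUEUE_CAP 8    /* assumed capacity */
#define NAME_LEN  32

typedef struct {
    char name[NAME_LEN];              /* event name only, for brevity */
} caught_event_t;

typedef struct {
    caught_event_t slots[QUEUE_CAP];  /* ring buffer storage */
    int head;                         /* index of the oldest event */
    int count;                        /* number of queued events */
} sub_queue_t;

void queue_init(sub_queue_t *q)
{
    assert(q != NULL);
    q->head = 0;
    q->count = 0;
}

/* Backplane side: queue a matching event. Returns -1 if the queue is full
 * (the event is dropped in this sketch), 0 otherwise. */
int queue_deliver(sub_queue_t *q, const char *event_name)
{
    if (q->count == QUEUE_CAP)
        return -1;
    int tail = (q->head + q->count) % QUEUE_CAP;
    snprintf(q->slots[tail].name, NAME_LEN, "%s", event_name);
    q->count++;
    return 0;
}

/* Component side: non-blocking poll. Returns 1 and fills *out with the
 * oldest event if one is waiting, 0 if the queue is empty. */
int queue_poll(sub_queue_t *q, caught_event_t *out)
{
    if (q->count == 0)
        return 0;
    *out = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return 1;
}
```

With a callback, the handler would run when the event arrives and no queue of this kind would need to be visible to the component; polling lets the component choose when it is safe to handle events.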
FTB-enabled Software -- Planned
[Figure: software planned to connect to the Fault Tolerant Backplane]
BLCR, FT-LA, SWIM IPS, LAMMPS, OpenMPI, PVFS, MPICH2, MVAPICH2, LAM/MPI, Cobalt, ScaLAPACK, ROMIO, NWChem, ZeptoOS, CCA Applications
Status
• Alpha version in progress
– Demos available on SC exhibit floor
• Client API to be finalized by Q4 CY07
• Beta release targeted for Q1 CY08
– Platforms supported: Linux clusters, IBM BGL, Cray XT
CIFTS team
• Argonne National Laboratory
– Pete Beckman, Rinku Gupta, Ewing Lusk, Rob Ross, Rajeev Thakur
• Indiana University
– Andrew Lumsdaine
• Lawrence Berkeley National Laboratory
– Paul Hargrove
• Oak Ridge National Laboratory
– Al Geist, David Bernholdt, Pratul Agarwal, Scott Hampton, Byung-Hoon Park, Aniruddha Shet
• Ohio State University
– D.K. Panda
• University of Tennessee, Knoxville
– Jack Dongarra
Call for Action
[Figure: software that could connect to the Fault Tolerant Backplane]
BLCR, FT-LA, SWIM IPS, PBS/Pro, LAMMPS, OpenMPI, Lustre, PVFS, Scali MPI, Global Arrays, Intel MPI, MPICH2, Polyserv, GPFS, GFS, IBRIX, MVAPICH2, MPICH-MX, Panasas, LAM/MPI, SGE, MAUI, Condor, LSF, Cobalt, Intel MKL, ScaLAPACK, ROMIO, SLURM, NWChem, Fluent, MM5, LS-Dyna, ZeptoOS, Linux, Eclipse, BLAST, Star-CD, other applications
Need more information?
• SC’07 Exhibit floor
– Demos and/or talks at ANL, ORNL and LBNL booths
• CIFTS website
– http://www.mcs.anl.gov/research/cifts/
• CIFTS wiki
– http://wiki.mcs.anl.gov/cifts
• CIFTS mailing list
– [email protected]
Discussion Topics
• Need for CIFTS infrastructure in enterprise environments
• Requirements/constraints for adoption of CIFTS?
• …..
Backup
CIFTS - The working view
[Figure: software grouped as follows, together with a bootstrap server]
• Libraries and Applications
– Middleware like MPI, MPI-IO
– Linear algebra libraries
• System Components
– Checkpoint/restart system
– PVFS
– Resource manager/JS
• Autonomics
– Universal logger
– Automatic actions
– Diagnostics tools
– Event analysis
• Bootstrap server
Building an FTB-enabled sample component
1. List the events you may want to publish in an XML file (for convenience)
2. Use the API to make the component FTB-enabled
3. Publish and subscribe to events
FTB-Enabled Component Development (Step 1)
STEP 1: Create an XML file, outlining the publishable events
<ftb_component_details>
  <namespace>ftb.ftb_examples.watchdog</namespace>
  <publish_event>
    <event_name>WATCH_DOG_EVENT</event_name>
    <event_severity>Info</event_severity>
    <event_desc>This event is used by watchdog</event_desc>
  </publish_event>
  <publish_event>
    …
  </publish_event>
</ftb_component_details>
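Developing an FTB-enabled component (Step 2)
STEP 2: Enabling your FTB component
#include <string.h>
#include "libftb.h"
#include "ftb_event_def.h"
#include "ftb_throw_events.h"

int main (int argc, char *argv[])
{
    FTB_comp_info_t cinfo;
    FTB_client_handle_t handle;
    FTB_event_mask_t mask;
    FTB_subscribe_handle_t shandle;
    FTB_catch_event_info_t caught_event;
    FTB_event_data_t *publish_event_data = NULL;  /* assumption: event carries no payload */
    char err_msg[1024];                            /* assumption: buffer size is not given on the slide */

    /* Describe this component instance and register with FTB */
    strcpy(cinfo.comp_namespace, "FTB.FTB_EXAMPLES.Watchdog");
    strcpy(cinfo.schema_ver, "0.5");
    strcpy(cinfo.inst_name, "watchdog");
    strcpy(cinfo.jobid, "watchdog-111");
    strcpy(cinfo.catch_style, "FTB_POLLING_CATCH");
    FTB_Init(&cinfo, &handle, err_msg);

    /* Declare the events listed in the XML file */
    FTB_Register_publishable_events(handle, ftb_ftb_examples_watchdog_events, FTB_FTB_EXAMPLES_WATCHDOG_TOTAL_EVENTS, err_msg);

    /* Subscribe to all events; no callback, because events are polled */
    FTB_Create_mask(&mask, "all", "init", err_msg);
    FTB_Subscribe(handle, &mask, &shandle, err_msg, NULL, NULL);

    /* Publish one event, then poll for a caught event */
    FTB_Publish_event(handle, "WATCH_DOG_EVENT", publish_event_data, err_msg);
    FTB_Poll_for_event(shandle, &caught_event, err_msg);

    /* Deregister from FTB */
    FTB_Finalize(handle);
    return 0;
}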
Developing an FTB-enabled component (Step 2, contd.)
Creating your subscribe event mask
Create a mask to catch all events
1. FTB_Create_mask(&mask, "all", "init", err_msg);
Create a mask to catch "WATCH_DOG_EVENT"
1. FTB_Create_mask(&mask, "all", "init", err_msg);
2. FTB_Create_mask(&mask, "event_name", "WATCH_DOG_EVENT", err_msg);
Create a mask to catch events of severity fatal
1. FTB_Create_mask(&mask, "all", "init", err_msg);
2. FTB_Create_mask(&mask, "severity", "FTB_FATAL", err_msg);
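The mask recipes suggest a pattern: the ("all", "init") call resets the mask so that it matches every event, and each later call narrows one field. A minimal sketch of that behavior follows, under the assumption that an unset field acts as a wildcard and that set fields are combined with AND. The slides do not spell out these semantics, so `mask_t`, `mask_set` and `mask_matches` are hypothetical stand-ins, and the "NODE_CRASH" and "FTB_INFO" strings used with them are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FIELD_LEN 64

/* Hypothetical mask: an empty string means "match any value". */
typedef struct {
    char event_name[FIELD_LEN];
    char severity[FIELD_LEN];
} mask_t;

typedef struct {
    const char *event_name;
    const char *severity;
} event_t;

/* Mirrors the call shape of FTB_Create_mask(&mask, field, value, ...).
 * Returns 0 on success, -1 for an unknown field. */
int mask_set(mask_t *m, const char *field, const char *value)
{
    assert(m != NULL && field != NULL && value != NULL);
    if (strcmp(field, "all") == 0 && strcmp(value, "init") == 0) {
        m->event_name[0] = '\0';   /* reset: match everything */
        m->severity[0] = '\0';
        return 0;
    }
    if (strcmp(field, "event_name") == 0) {
        snprintf(m->event_name, FIELD_LEN, "%s", value);
        return 0;
    }
    if (strcmp(field, "severity") == 0) {
        snprintf(m->severity, FIELD_LEN, "%s", value);
        return 0;
    }
    return -1;
}

/* An event matches when every field that is set in the mask is equal. */
int mask_matches(const mask_t *m, const event_t *e)
{
    if (m->event_name[0] != '\0' && strcmp(m->event_name, e->event_name) != 0)
        return 0;
    if (m->severity[0] != '\0' && strcmp(m->severity, e->severity) != 0)
        return 0;
    return 1;
}
```

Under these assumptions, starting every recipe with the ("all", "init") call matters: it guarantees that fields left over from an earlier use of the same mask do not silently narrow the new subscription.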
Developing an FTB-enabled component (Step 3)
STEP 3: Provide options to end user to compile your code with FTB
• Modify configure.in and makefiles, so that your code can be compiled with FTB
• ./configure --with-ftb=<PATH to FTB install directory>
Setting up FTB environment
Compiling FTB
• Download FTB
1. ./configure --with-platform=linux --with-bstrap-name=hostname
2. make
3. make install
Using FTB
Starting FTB
1. ./ftb_database_server
2. ./ftb_agent on all Linux nodes
3. Run your component executables
Connection Topology
[Figure: a bootstrap DB server and several FTB Agents]
1. Agent contacts the bootstrap (BS) server
2. BS server provides the parent address
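Because the bootstrap server hands each new agent a parent address, the agents end up arranged as a tree. The sketch below shows one possible assignment policy: agents are numbered in arrival order and agent i gets agent (i - 1) / FANOUT as its parent. The policy, the fan-out value, and the names `bootstrap_t`, `bootstrap_join` and `bootstrap_depth` are assumptions for illustration; the slides do not say how FTB actually picks a parent.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_AGENTS 64
#define FANOUT 2          /* assumed maximum number of children per agent */
#define NO_PARENT (-1)    /* marks the root agent */

typedef struct {
    int count;                /* agents that have joined so far */
    int parent[MAX_AGENTS];   /* parent[i] = id of agent i's parent */
} bootstrap_t;

void bootstrap_init(bootstrap_t *b)
{
    assert(b != NULL);
    b->count = 0;
}

/* An agent contacts the server. Returns the new agent's id (or -1 if the
 * table is full) and stores the id of its assigned parent in *parent_out. */
int bootstrap_join(bootstrap_t *b, int *parent_out)
{
    assert(b != NULL && parent_out != NULL);
    if (b->count >= MAX_AGENTS)
        return -1;
    int id = b->count++;
    b->parent[id] = (id == 0) ? NO_PARENT : (id - 1) / FANOUT;
    *parent_out = b->parent[id];
    return id;
}

/* Number of hops from agent `id` up to the root. */
int bootstrap_depth(const bootstrap_t *b, int id)
{
    assert(id >= 0 && id < b->count);
    int depth = 0;
    while (b->parent[id] != NO_PARENT) {
        id = b->parent[id];
        depth++;
    }
    return depth;
}
```

With a fixed fan-out the depth grows logarithmically with the number of agents, which keeps the number of hops an event travels small even on large machines.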
Open Issues
We don’t know the answers to these questions, so we should not be discussing them in the BOF?
• Policy management
– Global knowledge of component prioritization for handling events
• How can components announce their FT capabilities?
• How can components request action from other components?
• How do we establish scoping of events?