
FAULT TOLERANCE IN MPI

Minaashi Kalyanaraman Pragya Upreti

CSS 534 Parallel Programming

OVERVIEW

•Fault Tolerance in MPI

•Levels of survival in MPI

•Approaches to fault tolerance in MPI

•Advantages & disadvantages of implementing fault tolerance in MPI

•Extending MPI to HARNESS
  ● Why FT-MPI
  ● Implementation

•Comparison of MPI and FT-MPI

•Performance considerations
•Conclusion
•Future scope

MPI IS NOT FAULT TOLERANT! IS THAT TRUE?

• It is a common misconception about MPI.

• MPI provides considerable flexibility in the handling of errors.

FAULT TOLERANCE IS A PROPERTY OF AN MPI PROGRAM!

[Diagram: Job1 contains processes P1 through P4 in MPI_COMM_WORLD. While all processes are alive, MPI calls return MPI_SUCCESS. When process P2 dies, the default MPI_ERRORS_ARE_FATAL handler causes the other processes to detect the error and abort.]

LEVELS OF SURVIVAL OF AN MPI IMPLEMENTATION

Each level below pairs with an approach to achieving fault tolerance in MPI.

Level 1 – The MPI implementation automatically recovers from the failure and continues without significant change to its behavior. This is the highest level of survival and the most difficult to implement.

Level 2 – The MPI program is notified of the problem and is prepared to take corrective action.

Example: Using Intercommunicators

Level 3 – In case of failure, certain MPI operations, although not all, become invalid.

Example: Modifying MPI Semantics, Extending MPI

Level 4 – In case of failure, the MPI program can abort and be restarted from a checkpoint.

Example: Checkpointing

(The program state of the failed process is retained so that the overall computation can proceed.)

THE MPI STANDARD AND FAULT TOLERANCE

Reliable Communication:
• The MPI implementation is responsible for detecting and handling network faults.
• The MPI implementation can retransmit the message or inform the application that an error has occurred, allowing the application to take its own corrective action.

Error Handlers:
• Error handlers are set on communicators with MPI_Comm_set_errhandler.
• The default is MPI_ERRORS_ARE_FATAL; it can be changed to MPI_ERRORS_RETURN.
• Users can define their own error handlers and attach them to communicators.
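As an illustration of attaching a user-defined handler, a minimal C sketch (the handler name and message are hypothetical, not from the slides):

#include <mpi.h>
#include <stdio.h>

/* Hypothetical user-defined handler: report the error and keep
   running instead of aborting (the MPI_ERRORS_ARE_FATAL default). */
void report_errhandler(MPI_Comm *comm, int *errcode, ...)
{
    char msg[MPI_MAX_ERROR_STRING];
    int len;
    MPI_Error_string(*errcode, msg, &len);
    fprintf(stderr, "MPI error caught: %s\n", msg);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Errhandler eh;
    MPI_Comm_create_errhandler(report_errhandler, &eh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);

    /* ... application code: errors on MPI_COMM_WORLD now invoke
       report_errhandler instead of killing the job ... */

    MPI_Errhandler_free(&eh);
    MPI_Finalize();
    return 0;
}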

ERROR HANDLING - CONTINUED

• In C++, MPI::ERRORS_THROW_EXCEPTIONS is defined to handle errors.

• If an error is returned, the standard does not require that subsequent operations succeed or that they fail.

• Thus the standard allows implementations to take various approaches to the fault tolerance issue.

APPROACH TO FAULT TOLERANCE IN MPI PROGRAMS

1. Checkpointing:
• This is a common technique that periodically saves the state of a computation, allowing the computation to be restarted from that point in the event of a failure (a code sketch follows below).

The cost of checkpointing is determined by:
● Cost to create and write a checkpoint.
● Cost to read and restore a checkpoint.
● Probability of failure.
● Time between checkpoints.
● Total time to run without checkpoints.
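A standard first-order estimate that combines these factors (Young's approximation; not on the original slide) gives the checkpoint interval that minimizes expected lost work:

T_opt ≈ sqrt(2 × C × M)

where C is the time to write one checkpoint and M is the mean time between failures. For example, with C = 60 s and M = 24 h (86,400 s), T_opt ≈ sqrt(2 × 60 × 86,400) ≈ 3,220 s, i.e. a checkpoint roughly every 54 minutes.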

Types of checkpointing:
● User-directed checkpointing.
● System-directed checkpointing.

Advantage & disadvantage:
● It is easy to implement.
● The cost of saving and restoring checkpoints must be relatively small.
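A minimal sketch of user-directed checkpointing (illustrative; the filenames, state size, and checkpoint interval are assumptions): each rank periodically writes its part of the state to a file and, after a restart, resumes from the last saved iteration.

#include <mpi.h>
#include <stdio.h>

#define N 1000000              /* hypothetical per-rank state size */
#define CHECKPOINT_EVERY 100   /* hypothetical iterations between checkpoints */

static double state[N];

/* Each rank writes its local state, tagged with the iteration count. */
static void save_checkpoint(int rank, int iter)
{
    char fname[64];
    snprintf(fname, sizeof fname, "ckpt_rank%d.dat", rank);
    FILE *f = fopen(fname, "wb");
    if (!f) return;
    fwrite(&iter, sizeof iter, 1, f);
    fwrite(state, sizeof *state, N, f);
    fclose(f);
}

/* Returns the iteration to resume from, or 0 if no usable checkpoint. */
static int load_checkpoint(int rank)
{
    char fname[64];
    snprintf(fname, sizeof fname, "ckpt_rank%d.dat", rank);
    FILE *f = fopen(fname, "rb");
    if (!f) return 0;
    int iter = 0;
    if (fread(&iter, sizeof iter, 1, f) != 1 ||
        fread(state, sizeof *state, N, f) != N)
        iter = 0;
    fclose(f);
    return iter;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int start = load_checkpoint(rank);   /* 0 on a fresh run */
    for (int iter = start; iter < 10000; iter++) {
        /* ... one step of computation and communication ... */
        if (iter % CHECKPOINT_EVERY == 0)
            save_checkpoint(rank, iter);
    }

    MPI_Finalize();
    return 0;
}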

2. Using Intercommunicators:
• An intercommunicator contains two groups of processes.
• All communication occurs between processes in one group and processes in the other group.

Example: Manager-Worker (see the sketch after this slide)

● The manager process keeps track of a pool of tasks and dispatches them to worker processes for completion.

● Workers return results to the manager, simultaneously requesting a new task.

Advantages & disadvantage:
● The manager can easily recognize that a particular worker has failed and communicate this to the other processes.
● Each group can keep track of the state held by the other group.
● Difficult to implement in complex systems.
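A minimal manager-side sketch (assumptions: a hypothetical "worker" executable and one-integer tasks; whether a send to a dead worker actually returns an error is implementation-dependent) showing the intercommunicator returned by MPI_Comm_spawn:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Spawn 4 workers from a hypothetical "worker" executable. The
       result is an intercommunicator: the manager on one side, the
       workers on the other. */
    MPI_Comm workers;
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &workers, MPI_ERRCODES_IGNORE);

    /* Return error codes instead of aborting, so a failed worker can
       be recognized and its task handed to another worker. */
    MPI_Comm_set_errhandler(workers, MPI_ERRORS_RETURN);

    int task = 42;   /* hypothetical task payload */
    int rc = MPI_Send(&task, 1, MPI_INT, 0 /* worker rank */, 0, workers);
    if (rc != MPI_SUCCESS)
        fprintf(stderr, "worker 0 unreachable; reassigning its task\n");

    MPI_Finalize();
    return 0;
}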

APPROACH TO FAULT TOLERANCE IN MPI PROGRAMS

3. Modifying MPI Semantics:
• Takes advantage of existing MPI objects that contain more state and MPI functions defined in the standard.

Example:

● MPI guarantees that the size of a communicator and the rank of each process within it remain constant.

● This property can be used by the program (illustrated in the fragment below):
  • To decompose data according to a communicator's size.
  • To calculate the data assigned to a process using its rank.

Advantage & disadvantage:
● Fault-tolerant programs can be written for a wider set of algorithms.
● This approach uses only the already existing semantics and therefore provides fewer fault tolerance features than the other approaches.
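An illustrative fragment (not from the slides) of the decomposition this property enables: each process derives its block of data purely from the communicator's size and its own rank, so the mapping is stable for the communicator's lifetime.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1000000;               /* hypothetical global problem size */
    int chunk = (n + size - 1) / size;   /* ceiling division */
    int lo = rank * chunk;               /* first index owned by this rank */
    int hi = (lo + chunk > n) ? n : lo + chunk;

    /* ... work on elements [lo, hi); since size and rank are constant,
       every process computes the same decomposition ... */
    (void)lo; (void)hi;

    MPI_Finalize();
    return 0;
}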

APPROACH TO FAULT TOLERANCE IN MPI PROGRAMS

4. Extending MPI:
• This approach was developed to address the difficulty of using MPI communicators when processes may fail.

• It is difficult to construct a communicator consisting of two individual processes.

• If the manager group has failed, this is even more difficult because of the collective semantics of communicator construction in MPI.


ADVANTAGES OF USING MPI FAULT TOLERANCE FEATURES

•It is simple and easy to use the existing error handling features in MPI.

•Users can go beyond MPI_ERRORS_RETURN and define error handling specific to their needs.

•Error handling is purely local. Every process can have a different handler.

•The ability to attach error handlers on a communicator increases the modularity of MPI.

•MPI provides the ability to define one’s own application-specific error handler which is an important approach to fault tolerance.

LIMITATIONS OF FAULT TOLERANCE IN MPI

•The specification makes no demands on MPI to survive failures.
•The defined MPI error classes serve only to identify the source of the error for the user.

•It is difficult for MPI to notify users of failures of a given function that happen after the function has already returned.

•There is no description of when error notification will happen relative to the occurrence of the error.

•It is not possible for one application process to ask to be informed of errors on other processes or for the application to be informed of specific classes of errors.

HARNESS/ FAULT TOLERANT MPI: AN EXTENSION TO MPI

• HARNESS (Heterogeneous Adaptive Reconfigurable Networked SyStem)

• An experimental system that provides a highly dynamic, fault-tolerant computing environment for high-performance computing applications

• HARNESS is a joint DOE-funded project involving Oak Ridge National Laboratory (ORNL), the University of Tennessee at Knoxville (UTK/ICL), and Emory University in Atlanta, GA.

HARNESS : AN EXTENSION TO MPI

• Current MPI implementations either abort on failure or rely on checkpointing

• Communication happens only via communicators

• The MPI communicator is based on a static process model

IMPLEMENTATION

• FT-MPI (part of HARNESS) extends MPI

• Allows the application to decide what to do when errors occur:
  ● Restart the failed node
  ● Continue with fewer nodes

• When a member of a communicator fails:
  ● The communicator state changes to indicate the problem
  ● Message transfers continue if safe, or are stopped or ignored
  ● The user application can fix or abort the communicator to continue

COMPARISON OF FT-MPI AND MPI: COMMUNICATOR AND PROCESS STATES

Communicator states:
  FT-MPI: FT_OK, FT_DETECTED, FT_RECOVER, FT_RECOVERED, FT_FAILED
  MPI:    VALID, INVALID

Process states:
  FT-MPI: OK, UNAVAILABLE, JOINING, FAILED
  MPI:    OK, FAILED

IMPLEMENTATION: EXTENDING MPI

• When running an FT-MPI application, two parameters specify the modes in which the application runs.

• The first parameter, the 'communicator mode', indicates the status of an MPI object after recovery. It can be specified when starting the application:

ABORT   – Like MPI, FT-MPI can abort on an error.
BLANK   – Failed processes are not replaced.
REBUILD – Failed processes are respawned; surviving processes keep the same rank. This is the default mode.
SHRINK  – Failed processes are not replaced; the communicator is shrunk so there are no gaps in the list of processes.

FT-MPI:

• The second parameter, the 'communication mode', indicates how messages are handled. There are two modes:

CONT/CONTINUE – All operations that returned the MPI_SUCCESS code will finish properly.
NOOP/RESET    – All ongoing messages are dropped; on error, the application is returned to its last consistent state.

FT-MPI: COMMUNICATOR (COMM.) FAILURE HANDLING

• A communicator is invalidated if a failure is detected

• The underlying system sends a state update to all processes for that communicator

• System behavior depends on the communicator mode chosen

• Not all communicators are updated on:
  • communication errors
  • process exit

FT-MPI USAGE

• Takes the form of an error check

• Followed by some corrective action, such as a communicator rebuild

For example* (simple FT-MPI send usage; argument names are placeholders):

rc = MPI_Send(buf, count, datatype, dest, tag, com);
if (rc == MPI_ERR_OTHER) {
    MPI_Comm_dup(com, &newcom);  /* under REBUILD mode, duplicating the
                                    failed communicator yields a recovered one */
    com = newcom;
}

In an SPMD master-worker program, only the master code needs to check for errors, provided the user treats the master as the only point of failure.

EXAMPLE: MPI ERROR HANDLING
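A minimal reconstruction of such an example (standard MPI only; the out-of-range destination is just a way to provoke an error):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Make MPI calls return error codes instead of aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int size, data = 1;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Destination rank 'size' is out of range, so the send fails. */
    int rc = MPI_Send(&data, 1, MPI_INT, size, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI_Send failed: %s\n", msg);
        /* take corrective action here rather than aborting */
    }

    MPI_Finalize();
    return 0;
}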

EXAMPLE OF ERROR HANDLING USING FT-MPI
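A hedged reconstruction built from the deck's own send/rebuild fragment (the retry loop, task payload, and worker rank are assumptions, and REBUILD mode is presumed):

#include <mpi.h>

/* Retry a send; if the peer has failed (FT-MPI reports MPI_ERR_OTHER),
   rebuild the communicator and try again. Under FT-MPI's REBUILD mode,
   duplicating the failed communicator respawns the failed ranks. */
int send_with_recovery(int task, int worker, MPI_Comm *com)
{
    int rc;
    do {
        rc = MPI_Send(&task, 1, MPI_INT, worker, 0, *com);
        if (rc == MPI_ERR_OTHER) {
            MPI_Comm newcom;
            MPI_Comm_dup(*com, &newcom);
            *com = newcom;
        }
    } while (rc != MPI_SUCCESS);
    return rc;
}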

PERFORMANCE CONSIDERATION

• The fault-free overhead of point-to-point communication in MPI/FT is negligible in long-running applications.

• Checkpointing increases communication overhead considerably, so the user must choose a suitably low checkpoint frequency.

CONCLUSIONS

• FT-MPI is a tool that provides methods of dealing with failures within MPI applications

• FT-MPI is useful for experimenting with:
  ● Self-tuning collective communications
  ● Distributed control algorithms
  ● Dynamic library download methods

FUTURE SCOPE

• Developing further implementations that support more restrictive environments (e.g., embedded clusters)

• Creating a number of drop-in library templates to simplify the construction of fault-tolerant applications

• High performance and survivability

REFERENCES

• Fault Tolerance in MPI Programs: http://www.mcs.anl.gov/~lusk/papers/fault-tolerance.pdf
• LEGION: http://legion.virginia.edu/documentation/FAQ_mpi_run.html
• HARNESS: http://icl.cs.utk.edu/ftmpi/index.html
• MPI 3.0 Fault Tolerance Working Group: http://meetings.mpi-forum.org/mpi3.0_ft.php
• Graham E. Fagg, George Bosilca, Thara Angskun, Zizhong Chen, Jelena Pjesivac-Grbovic, Kevin London, and Jack J. Dongarra, "Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems," HARNESS manual.
• Graham E. Fagg, Antonin Bukovsky, and Jack J. Dongarra, "HARNESS and fault tolerant MPI," Parallel Computing 27 (2001), pp. 1479-1495.
• Graham E. Fagg and Jack J. Dongarra, "Building and Using a Fault-Tolerant MPI Implementation," The International Journal of High Performance Computing Applications, Vol. 18, No. 3, Fall 2004, pp. 353-361.
• Conference proceedings: FT-MPI presentation, Graham E. Fagg and Jack J. Dongarra.

Q & A ?

FAQS

1. MPI vs. TCP sockets:
• Arguably, one of the biggest weaknesses of MPI is its lack of resilience: most (if not all) MPI implementations will kill an entire MPI job if any individual process dies. This is in contrast to the reliability of TCP sockets, for example: if a process on one side of a socket suddenly goes away, the peer just gets a stale socket.

2. Does MPI guarantee that a user-defined handler is invoked in the same way as MPI_ERRORS_RETURN?

• The specification does not state whether an error that would cause MPI functions to return an error code under the MPI_ERRORS_RETURN error handler would cause a user-defined error handler to be called during the same MPI function or at some earlier or later point in time.

3. Relation between checkpointing and I/O:
• The practicality of checkpointing is tied to the performance of parallel I/O, as checkpoint data is saved to a parallel file system.


4. Usability of HARNESS FT-MPI

•The fault tolerance features provided by HARNESS depend on its implementation. The HARNESS team actively works on reported bugs and releases new versions.

5. Data Recovery in MPI

•The MPI standard does not provide a way to recover data. It depends on the implementation of the MPI program.

6. Can fault tolerance in MPI be made transparent?

• It is very difficult to make fault tolerance in MPI transparent, because of the complexity involved in communication between processes.

Reference Slides

REFERENCE: STRUCTURE OF FT-MPI

DERIVED DATATYPE HANDLING

• Reduces memory copies while allowing overlapping of the three stages of data handling:
  ● Gather/scatter
  ● Encoding/decoding
  ● Send/receive package
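For context, an illustrative standard-MPI fragment (not FT-MPI's internal code) in which a derived datatype describes a strided column, so the library performs the gather/scatter stage instead of the application copying into a contiguous buffer. Run with at least two processes.

#include <mpi.h>

#define ROWS 4
#define COLS 8

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double a[ROWS][COLS] = {{0}};
    if (rank == 0)
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                a[i][j] = i * COLS + j;

    /* One matrix column: ROWS blocks of 1 double, stride COLS. */
    MPI_Datatype column;
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)
        MPI_Send(&a[0][0], 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&a[0][0], 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}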

HANDLING OF COMPACTED DATATYPES: ONLY MPI SEND AND RECEIVE WERE USED

PERFORMANCE CONSIDERATION

• Tests show that compacted data handling gives a 10% to 19% improvement.

• The benefits of buffer reuse and reordering of data elements lead to considerable improvements on heterogeneous networks.