Publication

Run-Through Stabilization: An MPI Proposal for Process Fault Tolerance...

Publication Type
Conference Paper
Publication Date
Page Numbers
329 to 332
Volume
6960
Conference Name
EuroMPI 2011
Conference Location
Santorini, Greece
Conference Date
-

The MPI standard lacks semantics and interfaces for sustained application execution in the presence of process failures. Exascale HPC systems may require scalable, fault-resilient MPI applications. The mission of the MPI Forum's Fault Tolerance Working Group is to enhance the standard to enable the development of scalable, fault-tolerant HPC applications. This paper presents an overview of the Run-Through Stabilization proposal, which allows an application to continue running even if MPI processes fail during execution. The discussion introduces the implications for point-to-point and collective operations over communicators, though the full proposal addresses all aspects of the MPI standard.