Publication

A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI...

by Joshua J Hursey, Thomas J Naughton III, Geoffroy R Vallee, Richard L Graham
Publication Type
Conference Paper
Publication Date
Page Numbers
255 to 263
Volume
6960
Conference Name
EuroMPI 2011
Conference Location
Santorini, Greece
Conference Date
-

The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI standard does not provide standardized fault tolerance interfaces and semantics. The MPI Forum's Fault Tolerance Working Group is proposing a collective fault tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault tolerant applications. This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation to provide the necessary functionality for the new collective without any additional messages. Error handling mechanisms are described that preserve the fault tolerance properties while maintaining overall scalability.
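
The idea of folding a reduction into a tree-based two-phase commit can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the implementation from the paper: it assumes a heap-ordered binary tree, a bitwise AND as the reduction, and no process failures, so the error handling mechanisms are left out entirely. The names `children` and `tree_agree` are hypothetical and are not MPI calls.

```python
def children(rank, size):
    """Ranks of the children of `rank` in a heap-ordered binary tree."""
    return [c for c in (2 * rank + 1, 2 * rank + 2) if c < size]


def tree_agree(flags):
    """Simulate a fault-free two-phase agreement over a binary tree.

    flags: one integer contribution per rank, combined with bitwise AND.
    Returns (decisions, messages, depth).
    Assumption: no process fails, so no recovery path is modeled.
    """
    size = len(flags)
    if size == 0:
        raise ValueError("need at least one process")

    combined = list(flags)
    messages = 0

    # Phase 1 (vote): every non-root rank sends one message to its parent.
    # The message carries the AND of the sender's value and its subtree,
    # so the reduction rides on the vote and needs no extra messages.
    # Children always have higher ranks than their parent, so visiting
    # ranks in descending order handles each subtree before its root.
    for rank in range(size - 1, 0, -1):
        parent = (rank - 1) // 2
        combined[parent] &= combined[rank]
        messages += 1

    # Phase 2 (commit): the root's value is final and is pushed back down
    # the same tree, one message per non-root rank.
    decisions = [None] * size
    decisions[0] = combined[0]
    for rank in range(size):
        for child in children(rank, size):
            decisions[child] = decisions[rank]
            messages += 1

    # Tree depth: number of hops from the last rank up to the root.
    depth, rank = 0, size - 1
    while rank > 0:
        rank = (rank - 1) // 2
        depth += 1

    return decisions, messages, depth


if __name__ == "__main__":
    flags = [0b111, 0b101, 0b111, 0b110, 0b111, 0b111, 0b111, 0b111]
    decisions, messages, depth = tree_agree(flags)
    print(decisions, messages, depth)
```

In this simplified model each phase sends one message per non-root rank, and the number of sequential hops per phase equals the tree depth, which grows logarithmically with the number of processes.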