Publication

The Impact of a Fault Tolerant MPI on Scalable Systems Services and Applications...

by Richard L Graham, Joshua J Hursey, Geoffroy R Vallee, Thomas J Naughton III, Swen Boehm
Publication Type
Conference Paper
Conference Name
Cray User Group (CUG)
Conference Location
Stuttgart, Germany

Exascale-targeted scientific applications must be prepared for a highly
concurrent computing environment where failure will be a regular event
during execution. Natural and algorithm-based fault tolerance (ABFT)
techniques can often manage failures more efficiently than traditional
checkpoint/restart techniques alone. Central to many petascale applications
is an MPI standard that lacks support for ABFT. The Run-Through
Stabilization (RTS) proposal, under consideration for MPI 3, allows an
application to continue execution when processes fail. The requirements of
scalable, fault-tolerant MPI implementations and applications will stress
the capabilities of many system services. System services must evolve to
efficiently support such applications and libraries in the presence of
system component failures. This paper discusses how the RTS proposal
affects system services, highlighting specific requirements. Early
experimental results from Cray systems at ORNL, obtained with prototype MPI
and runtime implementations, are presented. Additionally, this paper outlines
fault tolerance techniques targeted at leadership-class applications.