Combining Partial Redundancy and Checkpointing for HPC...

Publication Type
Conference Paper
Conference Name
32nd International Conference on Distributed Computing Systems (ICDCS) 2012
Conference Location
Macau, China

Today's largest High Performance Computing (HPC) systems exceed one Petaflop/s (10^15 floating-point operations per second), and exascale systems are projected within seven years. But reliability is becoming one of the major challenges faced by exascale computing. With billion-core parallelism, the mean time to failure is projected to be in the range of minutes or hours instead of days. Failures are becoming the norm rather than the exception during the execution of HPC applications.
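This projection follows from a simple scaling argument: under independent, exponentially distributed node failures, the system-wide mean time to failure shrinks roughly in proportion to the node count. A minimal back-of-the-envelope sketch (the per-node MTTF and node count below are assumptions for illustration, not figures from the paper):

    # Illustrative only: system MTTF under independent node failures.
    node_mttf_hours = 5 * 365 * 24        # assumed per-node MTTF of 5 years
    node_count = 200_000                  # assumed size of an exascale-class machine
    system_mttf_hours = node_mttf_hours / node_count
    print(f"system MTTF ~ {system_mttf_hours * 60:.0f} minutes")   # about 13 minutes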
Current fault tolerance techniques in HPC focus on reactive ways to mitigate faults, namely via checkpoint and restart (C/R). Apart from storage overheads, C/R-based fault recovery comes at an additional cost in terms of application performance because normal execution is disrupted when checkpoints are taken. Studies have shown that applications running at large scale spend more than 50% of their total time saving checkpoints, restarting, and redoing lost work.
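That figure can be illustrated with a first-order checkpoint model in the spirit of Young's and Daly's approximations; this is a commonly used approximation, not necessarily the model developed in this paper, and the parameter values below are assumptions:

    import math

    def optimal_interval(ckpt_time, mttf):
        """Young's first-order approximation of the optimal checkpoint interval."""
        return math.sqrt(2.0 * ckpt_time * mttf)

    def waste_fraction(ckpt_time, mttf):
        """Approximate fraction of wallclock time spent writing checkpoints
        plus re-executing work lost since the last checkpoint."""
        tau = optimal_interval(ckpt_time, mttf)
        return ckpt_time / tau + tau / (2.0 * mttf)

    mttf = 30 * 60     # assumed system MTTF: 30 minutes (in seconds)
    ckpt = 5 * 60      # assumed time to write one checkpoint: 5 minutes
    print(f"checkpoint every {optimal_interval(ckpt, mttf) / 60:.1f} min, "
          f"~{100 * waste_fraction(ckpt, mttf):.0f}% of time lost")   # roughly 58%

With a system MTTF of half an hour and five-minute checkpoints, even this optimistic model loses more than half of the machine to fault handling, consistent with the figure cited above.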
Redundancy is another fault tolerance technique; it employs redundant processes that perform the same task. If a process fails, a replica of it can take over its execution. Thus, redundant copies can decrease the overall failure rate. The downside of redundancy is that extra resources are required and that communication and synchronization incur additional overhead.
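The effect of replication on the failure rate can be seen in a simplified survival model, assuming independent exponential failures and no repair (an illustration, not the model developed in the paper): a rank is lost only when all of its replicas have failed, and the run is interrupted when any rank is lost.

    import math

    def job_survival(t_hours, n_ranks, r_replicas, replica_mttf_hours):
        """P(no rank loses all of its r replicas before time t),
        assuming independent exponential failures and no repair."""
        p_replica_dead = 1.0 - math.exp(-t_hours / replica_mttf_hours)
        p_rank_dead = p_replica_dead ** r_replicas
        return (1.0 - p_rank_dead) ** n_ranks

    for r in (1, 2, 3):
        p = job_survival(t_hours=24, n_ranks=100_000,
                         r_replicas=r, replica_mttf_hours=5 * 365 * 24)
        print(f"{r}x redundancy: P(24-hour run survives) = {p:.3f}")
    # prints roughly 0.000, 0.970, and 1.000 for 1x, 2x, and 3x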
This work contributes a model and analyzes the benefit of C/R in coordination with redundancy at different degrees to minimize the total wallclock time and resource utilization of HPC applications. We further conduct experiments with an implementation of redundancy within the MPI layer on a cluster. Our experimental results confirm the benefit of dual and triple redundancy, but not of partial redundancy, and show a close fit to the model. At 80,000 processes, dual redundancy requires twice the number of processing resources for an application but allows two jobs of 128 hours wallclock time to finish within the time of just one job without redundancy. For narrow ranges of processor counts, partial redundancy results in the lowest time. Once the count exceeds 770,000, triple redundancy has the lowest overall cost. Thus, redundancy allows one to trade off additional resource requirements against wallclock time, which provides a tuning knob for users to adapt to resource availability.
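A minimal sketch of how such a trade-off can be explored, combining the toy ingredients above (independent exponential failures, no repair of failed replicas, and a Young/Daly-style checkpoint interval); all parameter values are assumptions for illustration, and the numbers it prints are not the paper's results:

    import math

    def effective_mttf(n_ranks, r, replica_mttf, steps=200_000):
        """Expected time until some rank has lost all r of its replicas, obtained by
        numerically integrating the survival curve S(t) = (1 - (1 - e^(-t/M))^r)^n."""
        lam = 1.0 / replica_mttf

        def surv(t):
            return (1.0 - (1.0 - math.exp(-lam * t)) ** r) ** n_ranks

        horizon = replica_mttf / n_ranks          # r = 1 scale as a starting point
        while surv(horizon) > 1e-9:               # grow until survival is negligible
            horizon *= 2.0
        dt = horizon / steps
        return sum(surv((i + 0.5) * dt) for i in range(steps)) * dt   # midpoint rule

    def waste_fraction(ckpt_time, mttf):
        """Young-style first-order C/R waste: checkpoint cost plus expected rework.
        (Optimistic when the MTTF approaches the checkpoint time itself.)"""
        tau = math.sqrt(2.0 * ckpt_time * mttf)
        return ckpt_time / tau + tau / (2.0 * mttf)

    n_ranks, solve_hours = 80_000, 128.0              # assumed job size and pure compute time
    replica_mttf, ckpt = 5 * 365 * 24.0, 10.0 / 60.0  # assumed 5-year node MTTF, 10-min checkpoints
    for r in (1, 2, 3):
        m = effective_mttf(n_ranks, r, replica_mttf)
        wall = solve_hours * (1.0 + waste_fraction(ckpt, m))
        print(f"{r}x: effective MTTF {m:7.1f} h, wallclock {wall:6.1f} h, "
              f"node-hours {r * n_ranks * wall / 1e6:5.2f} M")

Under these toy assumptions, the non-replicated run loses a large fraction of its time to checkpointing and rework, while dual and triple redundancy shorten wallclock time at the cost of additional node-hours, which is the shape of the trade-off described above.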