
A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance

by Chao Wang, Frank Mueller, Christian Engelmann, Steven L. Scott
Publication Type
Conference Paper
Book Title
CD Proceedings
Publication Date
Conference Name
21st International Parallel and Distributed Processing Symposium (IPDPS) 2007
Conference Location
Long Beach, California, United States of America

Checkpoint/restart (C/R) has become a requirement for long-running jobs in large-scale clusters due to a mean time to failure (MTTF) on the order of hours. After a failure, C/R mechanisms generally require a complete restart of an MPI job from the last checkpoint. A complete restart, however, is unnecessary since all but one node are typically still alive. Furthermore, a restart may result in lengthy job requeuing even though the original job had not exceeded its time quantum.

In this paper, we overcome these shortcomings. Instead of job restart, we have developed a transparent mechanism for job pause within LAM/MPI+BLCR. This mechanism allows live nodes to remain active and roll back to the last checkpoint while failed nodes are dynamically replaced by spares before resuming from the last checkpoint. Our methodology includes LAM/MPI enhancements in support of scalable group communication with a fluctuating number of nodes, reuse of network connections, transparent coordinated checkpoint scheduling, and a BLCR enhancement for job pause. Experiments in a cluster with the NAS Parallel Benchmark suite show that our overhead for job pause is comparable to that of a complete job restart. A minimal overhead of 5.6% is incurred only when migration takes place, while the regular checkpoint overhead remains unchanged. Yet, our approach alleviates the need to reboot the LAM run-time environment, which accounts for considerable overhead and results in net savings for our scheme in the experiments. Our solution further provides full transparency and automation with the additional benefit of reusing existing resources. Execution continues after failures within the scheduled job, i.e., the application staging overhead is not incurred again, in contrast to a restart. Our scheme offers additional potential for savings through incremental checkpointing and proactive diskless live migration, which we are currently working on.
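
The failure-handling flow described in the abstract can be summarized as follows: on a node failure, live ranks are paused in place and rolled back to the last coordinated checkpoint, the failed rank is restored from its checkpoint image on a spare node, and network connections are re-established before the job resumes without rebooting the run-time environment. The C sketch below illustrates only this coordination logic; the helper functions (pause_and_rollback, restore_on_spare, reconnect_and_resume) are hypothetical placeholders for illustration and are not LAM/MPI or BLCR API calls.

    /*
     * Illustrative sketch of the coordinator-side job-pause flow.
     * All helper functions are hypothetical placeholders, not
     * actual LAM/MPI or BLCR interfaces.
     */
    #include <stdio.h>
    #include <stdbool.h>

    #define NUM_RANKS 4

    /* Placeholder: tell a live rank to pause and roll its in-memory
     * state back to the last coordinated checkpoint. */
    static void pause_and_rollback(int rank) {
        printf("rank %d: paused, rolled back to last checkpoint\n", rank);
    }

    /* Placeholder: restart a failed rank on a spare node from its
     * checkpoint file, reusing the job's existing run-time environment. */
    static void restore_on_spare(int rank, const char *spare) {
        printf("rank %d: restored from checkpoint on %s\n", rank, spare);
    }

    /* Placeholder: re-establish connections among all ranks and
     * resume execution of the paused job in place. */
    static void reconnect_and_resume(void) {
        printf("all ranks reconnected; job resumed\n");
    }

    int main(void) {
        /* Example failure scenario: rank 2 has failed. */
        bool failed[NUM_RANKS] = { false, false, true, false };
        const char *spare = "spare-node-0";

        for (int r = 0; r < NUM_RANKS; r++) {
            if (failed[r])
                restore_on_spare(r, spare);  /* migrate failed rank to a spare */
            else
                pause_and_rollback(r);       /* keep live rank, roll it back */
        }
        reconnect_and_resume();              /* no LAM run-time reboot needed */
        return 0;
    }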