
A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance

by Chao Wang, Frank Mueller, Christian Engelmann, Steven L. Scott
Publication Type
Conference Paper
Book Title
CD Proceedings
Publication Date
Conference Name
21st International Parallel and Distributed Processing Symposium (IPDPS) 2007
Conference Location
Long Beach, California, United States of America

Checkpoint/restart (C/R) has become a requirement for long-running jobs in large-scale clusters due to a mean time to failure (MTTF) on the order of hours. After a failure, C/R mechanisms generally require a complete restart of an MPI job from the last checkpoint. A complete restart, however, is unnecessary since all but one node are typically still alive. Furthermore, a restart may result in lengthy job requeuing even though the original job had not exceeded its time quantum.

In this paper, we overcome these shortcomings. Instead of job restart, we have developed a transparent mechanism for job pause within LAM/MPI+BLCR. This mechanism allows live nodes to remain active and roll back to the last checkpoint while failed nodes are dynamically replaced by spares before resuming from the last checkpoint. Our methodology includes LAM/MPI enhancements in support of scalable group communication with a fluctuating number of nodes, reuse of network connections, transparent coordinated checkpoint scheduling, and a BLCR enhancement for job pause. Experiments in a cluster with the NAS Parallel Benchmark suite show that our overhead for job pause is comparable to that of a complete job restart. A minimal overhead of 5.6% is incurred only when migration takes place, while the regular checkpoint overhead remains unchanged. Yet, our approach alleviates the need to reboot the LAM run-time environment, which accounts for considerable overhead and results in net savings for our scheme in the experiments. Our solution further provides full transparency and automation with the additional benefit of reusing existing resources. Execution continues after failures within the scheduled job, i.e., the application staging overhead is not incurred again, in contrast to a restart. Our scheme offers additional potential for savings through incremental checkpointing and proactive diskless live migration, which we are currently working on.
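
The failure-handling flow described in the abstract can be summarized as follows: on a node failure, live ranks are paused in place and rolled back to the last coordinated checkpoint, the failed rank is restored from its checkpoint image on a spare node, and network connections are re-established before the job resumes without rebooting the run-time environment. The C sketch below illustrates only this coordination logic; the helper functions (pause_and_rollback, restore_on_spare, reconnect_and_resume) are hypothetical placeholders for illustration and are not LAM/MPI or BLCR API calls.

    /*
     * Illustrative sketch of the coordinator-side job-pause flow.
     * All helper functions are hypothetical placeholders, not
     * actual LAM/MPI or BLCR interfaces.
     */
    #include <stdio.h>
    #include <stdbool.h>

    #define NUM_RANKS 4

    /* Placeholder: tell a live rank to pause and roll its in-memory
     * state back to the last coordinated checkpoint. */
    static void pause_and_rollback(int rank) {
        printf("rank %d: paused, rolled back to last checkpoint\n", rank);
    }

    /* Placeholder: restart a failed rank on a spare node from its
     * checkpoint file, reusing the job's existing run-time environment. */
    static void restore_on_spare(int rank, const char *spare) {
        printf("rank %d: restored from checkpoint on %s\n", rank, spare);
    }

    /* Placeholder: re-establish connections among all ranks and
     * resume execution of the paused job in place. */
    static void reconnect_and_resume(void) {
        printf("all ranks reconnected; job resumed\n");
    }

    int main(void) {
        /* Example failure scenario: rank 2 has failed. */
        bool failed[NUM_RANKS] = { false, false, true, false };
        const char *spare = "spare-node-0";

        for (int r = 0; r < NUM_RANKS; r++) {
            if (failed[r])
                restore_on_spare(r, spare);  /* migrate failed rank to a spare */
            else
                pause_and_rollback(r);       /* keep live rank, roll it back */
        }
        reconnect_and_resume();              /* no LAM run-time reboot needed */
        return 0;
    }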