Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems...

by Devesh Tiwari, Saurabh Gupta, Sudharshan S Vazhkudai

Publication Type

Conference Paper

Publication Date

June, 2014

Page Numbers

25 to 36

Conference Name

The 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2014)

Conference Location

Atlanta, Georgia, United States of America

Conference Date

Jun 23, 2014 - Jun 26, 2014

Abstract

The continuing increase in the computational power of supercomputers has enabled large-scale scientific applications in astrophysics, fusion, climate, and combustion to run larger and longer simulations, facilitating deeper scientific insights. However, these long-running simulations are often interrupted by multiple system failures. These applications therefore rely on "checkpointing" as a resilience mechanism: they store application state to permanent storage and recover from it after a failure.

Unfortunately, checkpointing incurs excessive I/O overhead on supercomputers because of the large size of checkpoints, resulting in suboptimal performance and resource utilization. In this paper, we devise novel mechanisms to show how checkpointing overhead can be mitigated significantly by exploiting the temporal characteristics of system failures. We provide new insights into, and a detailed quantitative understanding of, the checkpointing overheads and trade-offs on large-scale machines. Our prototype implementation shows the viability of our approach on extreme-scale machines.
