Orchestrating Fault Prediction with Live Migration and Checkpointing...

by Subhendu Behera, Lipeng Wan, Frank Mueller, Matthew D Wolf, Scott A Klasky

Publication Type

Conference Paper

Journal Name

Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing

Book Title

HPDC '20: Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing

Publication Date

June 2020

Page Numbers

167 to 171

Conference Name

International Symposium on High-Performance Parallel and Distributed Computing (HPDC '20)

Conference Location

Stockholm, Sweden

Conference Sponsor

ACM

Conference Date

Jun 23, 2020 - Jun 26, 2020

Abstract

Checkpoint/Restart (C/R) is widely used to provide fault tolerance on High-Performance Computing (HPC) systems. However, Parallel File System (PFS) overhead and failure uncertainty cause significant application overhead. This paper develops an adaptive multi-level C/R model that incorporates a failure prediction and analysis model, which orchestrates failure prediction, checkpointing, checkpoint frequency, and proactive live migration along with the additional benefit of Burst Buffers (BB). It effectively reduces the overheads due to failures, checkpointing, and recovery. Simulation results for the Summit supercomputer yield a reduction of ~20%-86% in application overhead due to BBs, orchestrated failure prediction, and migration. We also observe a ~29% decrease in checkpoint writes to BBs, which can increase the longevity of the BB storage devices.
