LADR: low-cost application-level detector for reducing silent output corruptions...

by Chao Chen, Greg Eisenhauer, Matthew D Wolf, Santosh Pande

Publication Type

Conference Paper

Journal Name

ACM Digital Library

Publication Date

June, 2018

Page Numbers

156 to 167

Conference Name

International Symposium on High-Performance Parallel and Distributed Computing

Conference Location

New York, New York, United States of America

Conference Sponsor

ACM

Conference Date

Jun 11, 2018 - Jun 15, 2018

Abstract

Applications running on future high performance computing (HPC) systems are more likely to experience transient faults due to technology scaling trends with respect to higher circuit density, smaller transistor size and near-threshold voltage (NTV) operations. A transient fault could corrupt application state without warning, possibly leading to incorrect application output. Such errors are called silent data corruptions (SDCs).

In this paper, we present LADR, a low-cost application-level SDC detector for scientific applications. LADR protects scientific applications from SDCs by watching for data anomalies in their state variables (those of scientific interest). It employs compile-time data-flow analysis to minimize the number of monitored variables, thereby reducing runtime and memory overheads while maintaining a high level of fault coverage with low false positive rates. We evaluated LADR with 4 scientific workloads and results show that LADR achieved > 80% fault coverage with only ∼ 3% runtime overheads and ∼ 1% memory overheads. As compared to prior state-of-the-art anomaly-based detection methods, LADR achieved comparable or improved fault coverage, but reduced runtime overheads by 21% ∼ 75%, and memory overheads by 35% ∼ 55% for the evaluated workloads. We believe that such an approach with low memory and runtime overheads coupled with attractive detection precision makes LADR a viable approach for assuring the correct output from large-scale high performance simulations.
