Publication

LADR: low-cost application-level detector for reducing silent output corruptions...

by Chao Chen, Greg Eisenhauer, Matthew D Wolf, Santosh Pande
Publication Type
Conference Paper
Journal Name
ACM Digital Library
Publication Date
Page Numbers
156 to 167
Volume
0
Issue
0
Conference Name
International Symposium on High-Performance Parallel and Distributed Computing
Conference Location
New York, New York, United States of America
Conference Sponsor
ACM
Conference Date
-

Applications running on future high performance computing (HPC) systems are more likely to experience transient faults because of technology scaling trends: higher circuit density, smaller transistor sizes, and near-threshold voltage (NTV) operation. A transient fault can corrupt application state without warning, possibly leading to incorrect application output. Such errors are called silent data corruptions (SDCs).

In this paper, we present LADR, a low-cost application-level SDC detector for scientific applications. LADR protects scientific applications from SDCs by watching for data anomalies in their state variables (those of scientific interest). It employs compile-time data-flow analysis to minimize the number of monitored variables, thereby reducing runtime and memory overheads while maintaining a high level of fault coverage with low false positive rates. We evaluated LADR with 4 scientific workloads, and the results show that LADR achieved > 80% fault coverage with only ∼3% runtime overhead and ∼1% memory overhead. Compared to prior state-of-the-art anomaly-based detection methods, LADR achieved comparable or improved fault coverage while reducing runtime overheads by 21% to 75% and memory overheads by 35% to 55% for the evaluated workloads. We believe that this combination of low memory and runtime overheads with attractive detection precision makes LADR a viable approach for assuring correct output from large-scale high performance simulations.
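
To make the idea of anomaly-based detection on a state variable concrete, the following is a minimal generic sketch and not LADR's actual algorithm, which the abstract does not specify. It assumes a monitored variable that evolves smoothly between time steps, predicts the next value by linear extrapolation from the last two accepted values, and flags any observation that deviates from the prediction by more than a tolerance. The names `AnomalyDetector`, `flip_bit`, `rel_tolerance`, and `abs_tolerance` are hypothetical and chosen only for illustration.

```python
import math
import struct


class AnomalyDetector:
    """Generic temporal anomaly check for one monitored variable.

    Illustrative assumption: the variable changes smoothly, so a linear
    extrapolation from the last two accepted values is a usable prediction.
    """

    def __init__(self, rel_tolerance=0.1, abs_tolerance=1e-9):
        self.rel_tolerance = rel_tolerance
        self.abs_tolerance = abs_tolerance
        self.prev = None   # most recent accepted value
        self.prev2 = None  # accepted value before that

    def check(self, value):
        """Return True if `value` looks anomalous, otherwise accept it."""
        # NaN and infinity are always treated as suspect.
        if not math.isfinite(value):
            return True
        # Warm-up: the first two observations are accepted unconditionally.
        if self.prev is None or self.prev2 is None:
            self.prev2, self.prev = self.prev, value
            return False
        predicted = 2.0 * self.prev - self.prev2
        allowed = self.abs_tolerance + self.rel_tolerance * max(
            abs(predicted), abs(self.prev)
        )
        if abs(value - predicted) > allowed:
            return True  # suspect value is not added to the history
        self.prev2, self.prev = self.prev, value
        return False


def flip_bit(x, bit):
    """Simulate a transient fault by flipping one bit of a 64-bit float."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


# Usage: feed a smooth series, then a value with one exponent bit flipped.
detector = AnomalyDetector()
smooth_flags = [detector.check(v) for v in (1.0, 1.1, 1.2, 1.3)]
corrupted_flag = detector.check(flip_bit(1.4, 61))
```

In this sketch each monitored variable costs two stored values and a few arithmetic operations per check, which is why reducing the number of monitored variables, as the abstract describes, directly lowers both runtime and memory overhead.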