Publication

A Tunable, Software-based DRAM Error Detection and Correction Library for HPC...

by David J Fiala, Kurt Ferreira, Frank Mueller, Christian Engelmann
Publication Type
Conference Paper
Book Title
Lecture Notes in Computer Science: Proceedings of the 17th European Conference on Parallel and Distributed Computing (Euro-Par) 2011 Workshops, Part II: 4th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids
Publication Date
Page Numbers
251 to 261
Volume
7156
Publisher Location
Berlin, Germany
Conference Name
4th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids 2011
Conference Location
Bordeaux, France
Conference Date

Proposed exascale systems will present a number of
considerable resiliency challenges. In particular, DRAM
soft errors, or bit flips, are expected to increase greatly
due to the increased memory density of these systems.
Current hardware-based fault-tolerance methods will be
unsuitable for addressing the expected soft error
rate. As a result, additional software will be needed to
address this challenge. In this paper we introduce LIBSDC,
a tunable, transparent silent data corruption detection and
correction library for HPC applications. LIBSDC provides
comprehensive SDC protection for program memory by
implementing on-demand page integrity verification.
Experimental benchmarks with Mantevo HPCCG show that, once
tuned, LIBSDC is able to achieve SDC protection with 50%
resource overhead, less than the 100% needed for double
modular redundancy.
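
The abstract does not describe how LIBSDC implements on-demand page integrity verification, so the following is only a toy illustration of the general idea, not the library's design or API. It assumes a 4096-byte page, uses CRC32 as a stand-in integrity check, and models memory as Python byte arrays; the names `PageStore`, `verify`, and `flip_bit` are hypothetical. Each page carries a stored checksum that is checked whenever the page is accessed, so a bit flip that bypasses normal writes is caught on the next access.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size in bytes (not taken from the paper)


class PageStore:
    """Toy model: memory split into pages, each with a stored checksum."""

    def __init__(self, num_pages):
        self.pages = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        self.checksums = [zlib.crc32(p) for p in self.pages]

    def verify(self, page):
        # On-demand check: recompute the checksum and compare with the stored one.
        if zlib.crc32(self.pages[page]) != self.checksums[page]:
            raise RuntimeError(f"silent data corruption detected in page {page}")

    def read(self, page, offset, length):
        # Verify the page before handing data back to the caller.
        self.verify(page)
        return bytes(self.pages[page][offset:offset + length])

    def write(self, page, offset, data):
        # Verify first, apply the legitimate write, then refresh the checksum.
        self.verify(page)
        self.pages[page][offset:offset + len(data)] = data
        self.checksums[page] = zlib.crc32(self.pages[page])

    def flip_bit(self, page, offset, bit):
        # Simulate a soft error: the data changes but the checksum does not.
        self.pages[page][offset] ^= 1 << bit


store = PageStore(num_pages=2)
store.write(0, 10, b"hpc")
store.flip_bit(0, 10, 0)  # inject a single bit flip into page 0
try:
    store.read(0, 10, 3)
    detected = False
except RuntimeError:
    detected = True
```

This sketch only detects corruption. Correcting it would need additional redundancy, such as a second copy of the page or an error-correcting code, which is not shown here.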