Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility (Conference Paper, November 2015)
Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems... (Conference Paper, June 2015)
Experience with GPUs on the Titan Supercomputer from a Reliability, Performance and Power Perspective (Conference Paper, May 2015)
Analyzing the Interplay of Failures and Workload on a Leadership-Class Supercomputer (Conference Paper, April 2015)
Understanding GPU Errors on Large-scale HPC Systems and the Implications for System Design and Operation... (Conference Paper, February 2015)