Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility

by Devesh Tiwari, Saurabh Gupta, George Gallarno, James H Rogers II, Don E Maxwell

Publication Type

Conference Paper

Book Title

Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis

Publication Date

November, 2015

Page Number

Publisher Location

New Jersey, United States of America

Conference Name

Supercomputing (SC)

Conference Location

Austin, Texas, United States of America

Conference Date

Nov 15, 2015

Abstract

The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large scale. The world’s second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.
