Publication

GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability...

Publication Type
Conference Paper
Book Title
SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Publication Date
Conference Name
SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Conference Location
Atlanta, Georgia, United States of America (held virtually due to COVID-19)
Conference Sponsor
Association for Computing Machinery
Conference Date
-

The Cray XK7 Titan was the top supercomputer system in the world for a long time and remained critically important throughout its nearly seven-year life. It was an interesting machine from a reliability viewpoint, as most of its power came from 18,688 GPUs that had to undergo three rework cycles: two on the GPU mechanical assembly and one on the GPU circuit boards. We report on the last rework cycle and present a reliability analysis of over 100,000 years of GPU lifetimes during Titan’s 6-year-long productive period. Using time-between-failures analysis and statistical survival analysis techniques, we find that GPU reliability depends on heat dissipation to an extent that strongly correlates with detailed nuances of the cooling architecture and job scheduling. We describe the history, data collection, cleaning, and analysis, and give recommendations for future supercomputing systems. We make the data and our analysis codes publicly available.
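
For readers unfamiliar with survival analysis, the sketch below shows one standard technique from that family, the Kaplan-Meier estimator, applied to unit lifetimes with censoring (units still working when observation ended). The abstract does not state which estimators the authors used, so the choice of Kaplan-Meier is an assumption made for illustration. The function name `kaplan_meier` and the toy lifetimes are hypothetical and are not taken from the Titan data.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Estimate the survival curve from lifetimes with right-censoring.

    durations: time each unit was under observation (e.g. years in service).
    observed:  True if the unit failed at that time, False if it was still
               working when observation ended (censored).
    Returns a list of (time, survival_probability) at each failure time.
    """
    failures = Counter(t for t, failed in zip(durations, observed) if failed)
    leaving = Counter(durations)  # failed or censored units leaving at time t
    at_risk = len(durations)      # units still under observation
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = failures.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk units that survive time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Hypothetical toy data: six units, three failures, three censored.
durations = [2.0, 3.0, 3.0, 5.0, 6.0, 6.0]
observed = [True, True, False, True, False, False]
curve = kaplan_meier(durations, observed)
for time, prob in curve:
    print(f"t={time:.1f}  S(t)={prob:.3f}")
```

Censored units count toward the at-risk total until they leave observation but never trigger a drop in the curve. This lets such an analysis use units that were replaced or still running when data collection stopped.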