Publication

Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications...

by Saurabh Gupta, Tirthak Patel, Christian Engelmann, Devesh Tiwari
Publication Type
Conference Paper
Book Title
Proceedings of the 30th IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC'17) 2017
Conference Name
30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017
Conference Location
Denver, Colorado, United States of America
Conference Sponsor
IEEE/ACM

Resilience is one of the key challenges in maintaining high efficiency of future extreme-scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of eight years. We confirm previous findings that remain valid, report new findings, and discuss their implications.