Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications...

by Saurabh Gupta, Tirthak Patel, Christian Engelmann, Devesh Tiwari

Publication Type

Conference Paper

Book Title

Proceedings of the 30th IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC'17) 2017

Publication Date

November, 2017

Conference Name

30th IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC) 2017

Conference Location

Denver, Colorado, United States of America

Conference Sponsor

IEEE/ACM

Conference Date

Nov 12, 2017 - Nov 17, 2017

Abstract

Resilience is one of the key challenges in maintaining high efficiency of future extreme scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings which continue to be valid, discover new findings, and discuss their implications.
