Publication

Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System...

Publication Type
Conference Paper
Book Title
Proceedings of the 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018
Publication Date
Page Numbers
107 to 114
Conference Name
48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018
Conference Location
Luxembourg City, Luxembourg
Conference Sponsor
IEEE Computer Society
Conference Date
-

Today's High Performance Computing (HPC) systems can deliver performance on the order of petaflops thanks to fast computing devices, network interconnects, and back-end storage systems. In particular, interconnect resilience and congestion resolution methods have a major impact on overall interconnect and application performance. This is especially true for scientific applications that run multiple processes on different compute nodes, because they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks state-of-practice experience reports that detail how different interconnect errors and congestion events occur on large-scale HPC systems. Therefore, in this paper, we process and analyze interconnect data from the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. We also study the interaction between interconnect errors, network congestion, and application characteristics.