Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System...

Publication Type

Conference Paper

Book Title

Proceedings of the 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018

Publication Date

June, 2018

Page Numbers

107 to 114

Conference Name

48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018

Conference Location

Luxembourg City, Luxembourg

Conference Sponsor

IEEE Computer Society

Conference Date

Jun 25, 2018 - Jun 28, 2018

Abstract

Today's High Performance Computing (HPC) systems are capable of delivering performance on the order of petaflops thanks to fast computing devices, network interconnects, and back-end storage systems. In particular, interconnect resilience and congestion resolution methods have a major impact on overall interconnect and application performance. This is especially true for scientific applications that run multiple processes on different compute nodes, as they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks state-of-practice experience reports that detail how different interconnect errors and congestion events occur on large-scale HPC systems. Therefore, in this paper, we process and analyze interconnect data from the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. We also study the interaction between interconnect errors, network congestion, and application characteristics.
