Publication

Towards High Availability for High-Performance Computing System Services: Accomplishments and Limitations...

by Christian Engelmann, Steven L Scott, Leangsuksun Chokchai, X. He
Publication Type
Conference Paper
Book Title
Internet Proceedings
Conference Name
High Availability and Performance Workshop (HAPCW) 2006
Conference Location
Santa Fe, New Mexico, United States of America

During the last several years, our teams at Oak Ridge National Laboratory, Louisiana Tech University, and Tennessee Technological University have focused on efficient redundancy strategies for the head and service nodes of high-performance computing (HPC) systems to pave the way for high availability (HA) in HPC. These nodes typically run critical HPC system services, such as job and resource management, and represent single points of failure and control for an entire HPC system. The overarching goal of our research is to provide high-level reliability, availability, and serviceability (RAS) for HPC systems by combining HA and HPC technology. This paper summarizes our accomplishments, such as developed concepts and implemented proof-of-concept prototypes, and describes existing limitations, such as performance issues, that must be addressed before production-type deployment.