Publication

Pattern-based Modeling of High-Performance Computing Resilience...

by Saurabh Hukerikar, Christian Engelmann
Publication Type
Conference Paper
Book Title
Lecture Notes in Computer Science: Proceedings of the 23rd European Conference on Parallel and Distributed Computing (Euro-Par) 2017 Workshops
Page Numbers
557 to 568
Volume
10659
Conference Name
10th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids at the 23rd International European Conference on Parallel and Distributed Computing (Euro-Par)
Conference Location
Santiago de Compostela, Spain
Conference Sponsor
Euro-Par

With the growing scale and complexity of high-performance computing (HPC) systems, resilience solutions that ensure continuity of service despite frequent errors and component failures must be methodically designed to balance reliability requirements against performance and power overheads. Design patterns enable a structured approach to the development of resilience solutions, providing hardware and software designers with the building blocks for rapidly developing novel solutions and for adapting existing technologies to emerging, extreme-scale HPC environments. In this paper, we develop analytical models that enable designers to evaluate the reliability and performance characteristics of the design patterns. These models are particularly useful in building a unified framework for analyzing and comparing resilience solutions built from combinations of patterns.