Publication

Toward a Performance/Resilience Tool for Hardware/Software Co-Design of High-Performance Computing Systems...

by Christian Engelmann, Thomas J. Naughton III
Publication Type
Conference Paper
Book Title
Proceedings of the 42nd International Conference on Parallel Processing (ICPP) 2013: 4th International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI)
Page Numbers
962 to 971
Conference Name
International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI) 2013
Conference Location
Lyon, France

xSim is a simulation-based performance investigation toolkit that permits running high-performance computing (HPC) applications in a controlled environment with millions of concurrent execution threads, while observing application performance in a simulated extreme-scale system for hardware/software co-design. The presented work details newly developed features for xSim that permit the injection of MPI process failures, the propagation/detection/notification of such failures within the simulation, and their handling using application-level checkpoint/restart. These new capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique.
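To make the idea of failure injection combined with application-level checkpoint/restart concrete, the following is a minimal toy sketch in Python. It is not xSim's interface and does not use MPI; the names `InjectedFailure`, `Checkpoint`, and `run` are hypothetical and exist only for illustration. The sketch assumes an iterative application that saves its state every few steps, a failure injected once at a chosen step, and a restart that resumes from the most recent checkpoint.

```python
from dataclasses import dataclass


class InjectedFailure(Exception):
    """Stands in for a simulated process failure (hypothetical name)."""


@dataclass
class Checkpoint:
    step: int   # next step to execute after a restart
    value: int  # application state at that point


def run(total_steps, checkpoint_interval, fail_at):
    """Run a toy iterative computation with checkpoint/restart.

    fail_at is a set of step indices; a failure is injected once at each.
    Returns (final_value, steps_executed, restarts), where steps_executed
    counts re-executed work after a rollback as well.
    """
    pending = set(fail_at)
    ckpt = Checkpoint(step=0, value=0)  # initial state acts as checkpoint 0
    executed = 0
    restarts = 0
    while True:
        # (Re)start from the most recent checkpoint.
        step, value = ckpt.step, ckpt.value
        try:
            while step < total_steps:
                if step in pending:
                    pending.remove(step)  # inject each failure only once
                    raise InjectedFailure(step)
                value += step             # the "work" of this step
                step += 1
                executed += 1
                if step % checkpoint_interval == 0:
                    ckpt = Checkpoint(step, value)
            return value, executed, restarts
        except InjectedFailure:
            # Work done since the last checkpoint is lost and redone.
            restarts += 1


if __name__ == "__main__":
    print(run(10, 4, {6}))    # one failure at step 6
    print(run(10, 4, set()))  # failure-free baseline
```

Comparing the executed step count of a run with an injected failure against the failure-free baseline gives a rough sense of the recomputation overhead that a checkpoint interval implies, which is the kind of behavior the abstract says can be observed under simulated failures.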