Publication

Concepts for OpenMP Target Offload Resilience...

by Christian Engelmann, Geoffroy R Vallee, Swaroop S Pophale
Publication Type
Conference Paper
Book Title
OpenMP: Conquering the Full Hardware Spectrum
Publication Date
Page Numbers
78 to 93
Volume
11718
Conference Name
15th International Workshop on OpenMP (IWOMP 2019)
Conference Location
Auckland, New Zealand
Conference Sponsor
N/A
Conference Date
-

Recent reliability issues with Titan at Oak Ridge National Laboratory (ORNL), one of the fastest supercomputers in the world, demonstrated the need for resilience in large-scale heterogeneous computing. OpenMP currently does not address error and failure behavior. This paper takes a first step toward resilience for heterogeneous systems by providing the concepts for resilient OpenMP offload to devices. Drawing on real-world error and failure observations, it defines the concepts and terminology for resilient OpenMP target offload, including error and failure classes and resilience strategies, and details the general-purpose graphics processing unit (GPGPU) errors and failures experienced on Titan. It further proposes improvements to OpenMP, including a preliminary prototype design, that support resilient offload to devices so that errors and failures can be handled efficiently in heterogeneous high-performance computing (HPC) systems.