Publication

Towards Acceptance Testing at the Exascale Frontier...

by Veronica G Melesse Vergara, Michael J Brim, Arnold N Tharrington, Reuben D Budiardja, Wayne D Joubert
Publication Type
Conference Paper
Book Title
Conference Proceedings of the Cray User Group 2020
Conference Name
Cray User Group 2020
Conference Location
Virtual, Tennessee, United States of America
Conference Sponsor
Cray Inc.

At the 2007 Cray User Group meeting, the Oak Ridge Leadership Computing Facility (OLCF) introduced the OLCF Test Harness (OTH), a framework[1] used for acceptance testing of the Jaguar supercomputer[2]. Since then, the OTH framework has evolved to version 2.0, which adds new features and streamlines usability. The OTH is the key piece of software used to orchestrate acceptance testing for all OLCF computational resources, including our leadership-class high-performance computing (HPC) systems, before they are deployed for production use. The OTH framework is written in Python and is publicly available[3].
In this paper, we first describe the requirements, design, and structure of the OTH. Then, we present specific improvements developed to support acceptance testing of the OLCF’s Summit system[4]. We also showcase new OTH features that have been added to streamline the acceptance test process, along with the motivation behind those changes. As part of this work, we evaluated different workflow tools to determine whether they could complement the OTH in two key areas: automation and reporting. We discuss the advantages and disadvantages identified for each tool. Lastly, we summarize the challenges and lessons learned from using the OTH for the acceptance of the last three flagship systems at the OLCF. These may be useful to other HPC centers that are developing their own testing frameworks or are interested in using the OTH.