Publication

Characterizing Machine Learning I/O Workloads on Leadership Scale HPC Systems

by Arnab Kumar Paul, Ahmad Maroof Karimi, and Feiyi Wang
Publication Type
Conference Paper
Journal Name
IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (IEEE MASCOTS)
Book Title
2021 29th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)
Publication Date
2021
Page Numbers
1 to 8
Publisher Location
New Jersey, United States of America
Conference Name
IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)
Conference Location
Virtual (Tennessee, United States of America)
Conference Sponsor
IEEE Computer Society
Conference Date
-

High performance computing (HPC) is no longer limited to traditional workloads such as simulation and modeling. With the growing popularity of machine learning (ML) and deep learning (DL) technologies, an increasing number of HPC users are incorporating ML methods into their workflows and scientific discovery processes, across a wide spectrum of science domains such as biology, earth science, and physics. This gives rise to a more diverse set of I/O patterns than the traditional checkpoint/restart-based HPC I/O behavior. The I/O characteristics of such ML workloads have not been studied extensively on large-scale leadership HPC systems. This paper aims to fill that gap with an in-depth analysis of the I/O behavior of ML workloads using Darshan, an I/O characterization tool designed for lightweight tracing and profiling. We study the Darshan logs of more than 23,000 HPC ML I/O jobs over a one-year period on Summit, the second-fastest supercomputer in the world. This paper provides a systematic I/O characterization of ML jobs running on a leadership-scale supercomputer to understand how I/O behavior differs across science domains and workload scales, and to analyze how ML workloads use the parallel file system and burst buffer.
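As an illustration of the kind of per-job analysis the abstract describes, the sketch below reads a single Darshan log with pydarshan, the Python bindings for Darshan (installed via pip install darshan). The log file name is hypothetical, and the paper's actual analysis pipeline is not specified at this level of detail; this is a minimal sketch assuming a recent pydarshan release in which record collections expose a to_df() helper.

```python
# Minimal sketch: inspecting one Darshan log with pydarshan.
# The file name "ml_job.darshan" is hypothetical.
import darshan

# Parse the binary log; read_all=True loads records from every
# instrumented module (POSIX, MPI-IO, STDIO, ...).
report = darshan.DarshanReport("ml_job.darshan", read_all=True)

# Job-level metadata: user ID, number of processes, start/end times.
print(report.metadata["job"])

# Per-file POSIX counters as pandas DataFrames (assumes a pydarshan
# release that provides to_df() on record collections).
if "POSIX" in report.records:
    posix = report.records["POSIX"].to_df()
    totals = posix["counters"][
        ["POSIX_READS", "POSIX_WRITES",
         "POSIX_BYTES_READ", "POSIX_BYTES_WRITTEN"]
    ].sum()
    print(totals)
```

Repeating this per log and aggregating the resulting counters is one plausible way a fleet-wide characterization across many thousands of jobs could be assembled.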