Publication

Characterizing Machine Learning I/O Workloads on Leadership Scale HPC Systems

by Arnab Kumar Paul, Ahmad Maroof Karimi, and Feiyi Wang
Publication Type
Conference Paper
Journal Name
IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (IEEE MASCOTS)
Book Title
2021 29th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)
Publication Date
2021
Page Numbers
1 to 8
Publisher Location
New Jersey, United States of America
Conference Name
IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)
Conference Location
Virtual (Tennessee, United States of America)
Conference Sponsor
IEEE Computer Society
Conference Date
-

High performance computing (HPC) is no longer limited to traditional workloads such as simulation and modeling. With the growing popularity of machine learning (ML) and deep learning (DL) technologies, an increasing number of HPC users are incorporating ML methods into their workflows and scientific discovery processes, across a wide spectrum of science domains such as biology, earth science, and physics. This gives rise to a more diverse set of I/O patterns than the traditional checkpoint/restart-based HPC I/O behavior. The I/O characteristics of such ML workloads have not been studied extensively on large-scale leadership HPC systems. This paper aims to fill that gap with an in-depth analysis of the I/O behavior of ML workloads using Darshan, an I/O characterization tool designed for lightweight tracing and profiling. We study the Darshan logs of more than 23,000 HPC ML I/O jobs over a one-year period on Summit, the second-fastest supercomputer in the world. This paper provides a systematic I/O characterization of ML jobs running on a leadership-scale supercomputer to understand how I/O behavior differs across science domains and workload scales, and to analyze how ML workloads use the parallel file system and burst buffer.
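As an illustration of the kind of per-job analysis the abstract describes, the sketch below reads a single Darshan log with pydarshan, the Python bindings for Darshan (installed via pip install darshan). The log file name is hypothetical, and the paper's actual analysis pipeline is not specified at this level of detail; this is a minimal sketch assuming a recent pydarshan release in which record collections expose a to_df() helper.

```python
# Minimal sketch: inspecting one Darshan log with pydarshan.
# The file name "ml_job.darshan" is hypothetical.
import darshan

# Parse the binary log; read_all=True loads records from every
# instrumented module (POSIX, MPI-IO, STDIO, ...).
report = darshan.DarshanReport("ml_job.darshan", read_all=True)

# Job-level metadata: user ID, number of processes, start/end times.
print(report.metadata["job"])

# Per-file POSIX counters as pandas DataFrames (assumes a pydarshan
# release that provides to_df() on record collections).
if "POSIX" in report.records:
    posix = report.records["POSIX"].to_df()
    totals = posix["counters"][
        ["POSIX_READS", "POSIX_WRITES",
         "POSIX_BYTES_READ", "POSIX_BYTES_WRITTEN"]
    ].sum()
    print(totals)
```

Repeating this per log and aggregating the resulting counters is one plausible way a fleet-wide characterization across many thousands of jobs could be assembled.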