Machine Learning Assisted HPC Workload Trace Generation for Leadership Scale Storage Systems...

by Arnab Kumar Paul, Jong Youl Choi, Ahmad Maroof Karimi, Feiyi Wang

Publication Type

Conference Paper

Journal Name

The 31st International Symposium on High-Performance Parallel and Distributed Computing

Book Title

HPDC '22: Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing

Publication Date

June, 2022

Page Numbers

199 to 212

Publisher Location

New York, United States of America

Conference Name

The 31st International Symposium on High-Performance Parallel and Distributed Computing (HPDC)

Conference Location

Minneapolis, Minnesota, United States of America

Conference Sponsor

ACM SIGARCH, ACM SIGHPC, University of Minnesota

Conference Date

Jun 27, 2022 - Jul 1, 2022

Abstract

Monitoring and analyzing a wide range of I/O activities in an HPC cluster is important for maintaining mission-critical performance in a large-scale, multi-user, parallel storage system. Center-wide I/O traces can provide both high-level information and fine-grained activity per application or per user running in the system. Studying such large-scale traces can provide helpful insights into the system: they can be used to develop predictive methods for decision making, adjust scheduling policies, or inform the design of next-generation systems. However, sharing real-world I/O traces to expedite such research raises two concerns: i) sharing the traces is expensive because of their large size, and ii) the traces raise privacy concerns.

We address these issues by building an end-to-end machine learning (ML) workflow that can generate I/O traces for large-scale HPC applications. We leverage ML-based feature selection and generative models for I/O trace generation. The generative models are trained on I/O traces collected by the Darshan I/O characterization tool over a period of one year. We present a two-step generation process consisting of two deep-learning models, called the feature generator and the trace generator. Combining the two generative models provides robustness by reducing the bias of the model and accounting for the stochastic nature of the I/O traces across different runs of an application. We evaluate the performance of the generative models and show that the two-step model can generate time-series I/O traces with less than 20% root mean square error.
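To make the reported error metric concrete, the following is a minimal sketch of how a percentage root mean square error between a real and a generated time-series trace might be computed. The abstract does not say how the error is normalized, so dividing by the range of the real trace is an assumption made here for illustration. The function name percent_rmse and the sample values are hypothetical and do not come from the paper.

```python
import math


def percent_rmse(real, generated):
    """RMSE between two equal-length series, as a percentage of the real series' range.

    Range normalization is an assumption for this sketch; the paper's abstract
    does not specify which normalization it uses.
    """
    if len(real) != len(generated) or not real:
        raise ValueError("series must be non-empty and of equal length")
    mse = sum((r - g) ** 2 for r, g in zip(real, generated)) / len(real)
    value_range = max(real) - min(real)
    if value_range == 0:
        raise ValueError("real series is constant; range normalization is undefined")
    return 100.0 * math.sqrt(mse) / value_range


# Hypothetical per-interval I/O volumes (illustrative numbers, not from the paper).
real_trace = [0.0, 10.0, 20.0, 10.0]
generated_trace = [1.0, 9.0, 21.0, 9.0]
print(f"{percent_rmse(real_trace, generated_trace):.1f}%")  # prints "5.0%"
```

Other normalizations, such as dividing by the mean of the real trace, would give different percentages for the same pair of series, so a figure like "less than 20%" is only comparable across studies that use the same convention.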
