Publication

Power Profile Monitoring and Tracking Evolution of System-Wide HPC Workloads

by Ahmad Maroof Karimi, Naw Safrin Sattar, Woong Shin, Feiyi Wang
Publication Type
Conference Paper
Book Title
2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS)
Publication Date
Page Numbers
93 to 104
Publisher Location
New Jersey, United States of America
Conference Name
44th IEEE International Conference on Distributed Computing Systems (ICDCS)
Conference Location
Jersey City, New Jersey, United States of America
Conference Sponsor
IEEE
Conference Date
-

The power and energy demands of HPC machines have grown significantly. Modern exascale HPC systems require tens of megawatts of combined power for computing resources and cooling facilities at full capacity. The current energy trend is not sustainable for future HPC systems, and there is a need to treat energy efficiency as an aspect of HPC performance. Energy awareness of HPC applications at the job level is essential for running an efficient HPC system. This work aims to develop a pipeline that provides a production-level, system-wide overview of HPC workloads' power profiles while handling evolving workloads that exhibit new power trends. We developed an open-set classification model for HPC jobs based on the properties of their power profiles to continuously provide a holistic, system-wide view of recently completed jobs. The pipeline helps continuously monitor the job-level power usage patterns of an HPC system and enables us to capture new trends in applications' power behavior. We employed a comprehensive set of techniques to generate job-level data, custom-designed feature extraction methods to extract critical features from jobs' power profiles, clustering techniques powered by generative modeling, and open-set classification to assign job profiles to known classes or to an unknown set. With extensive evaluations, we demonstrate the effectiveness of each component in our pipeline. We provide an analysis of the resulting clusters, which characterize the power profile landscape of the Summit supercomputer from more than 60K jobs executed in a year. The open-set classifier assigns known data points to known classes with high accuracy and identifies unknown data points with over 85% accuracy.
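To make the idea of open-set classification concrete, the following is a minimal illustrative sketch and not the authors' method. It assumes three simple summary features per job (mean, standard deviation, and range of the power trace) and swaps in a nearest-centroid classifier with a distance threshold in place of the paper's generative-model-based clustering and classifier. The names `extract_features`, `OpenSetNearestCentroid`, and `threshold`, as well as the label `-1` for "unknown", are hypothetical.

```python
import numpy as np

def extract_features(power_trace):
    """Summarize one job's power time series (watts) as a small feature vector.
    Assumed features for illustration only: mean, standard deviation, range."""
    trace = np.asarray(power_trace, dtype=float)
    return np.array([trace.mean(), trace.std(), trace.max() - trace.min()])

class OpenSetNearestCentroid:
    """Nearest-centroid classifier that rejects far-away samples as unknown (-1)."""

    def __init__(self, threshold):
        # Maximum Euclidean distance to a centroid for a sample to count as "known".
        self.threshold = threshold

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # One centroid per known class, computed as the mean feature vector.
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Pairwise distances: shape (n_samples, n_classes).
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        labels = self.classes_[nearest]
        # Samples farther than the threshold from every centroid are labeled -1.
        return np.where(dists.min(axis=1) <= self.threshold, labels, -1)

# Toy training data: class 0 = steady low power, class 1 = steady high power.
train_traces = [[100] * 10, [110] * 10, [300] * 10, [310] * 10]
train_labels = [0, 0, 1, 1]
X_train = np.stack([extract_features(t) for t in train_traces])
clf = OpenSetNearestCentroid(threshold=20.0).fit(X_train, train_labels)

# Two jobs resembling known classes and one oscillating job never seen in training.
new_traces = [[104] * 10, [302] * 10, [100, 300] * 5]
X_new = np.stack([extract_features(t) for t in new_traces])
predictions = clf.predict(X_new)
print(predictions)
```

In this toy setup the oscillating job lies far from both centroids and is therefore assigned to the unknown set; in a monitoring pipeline, such rejected jobs would be the candidates for a newly emerging power-profile class.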