HPC Usage Behavior Analysis and Performance Estimation with Machine Learning Techniques

by Hao Zhang, Haihang You, Bilel Hadri, Mark R Fahey

Publication Type

Conference Paper

Publication Date

July 2012

Conference Name

18th International Conference on Parallel and Distributed Processing Techniques and Applications

Conference Location

Las Vegas, Nevada, United States of America

Conference Date

Jul 16, 2012 - Jul 19, 2012

Abstract

Most researchers with little high performance computing (HPC) experience have difficulty using supercomputing resources productively. To address this issue, we investigated usage behaviors on Kraken, the world’s fastest academic supercomputer, and built a knowledge-based recommendation system to improve user productivity. Six clustering techniques, along with three cluster validation measures, were implemented to investigate the underlying patterns of usage behaviors. In addition to a manually defined category for very large job submissions, six behavior categories were identified, which cleanly separated data-intensive jobs from computation-intensive jobs. Job statistics for each behavior category were then used to develop a knowledge-based recommendation system that gives users guidance on choosing appropriate software packages, setting job parameter values, and estimating job queuing time and runtime. The proposed recommendation system was evaluated in experiments that included 127 job submissions by users from different research fields. Positive user feedback indicated that the provided information was useful. The experiments achieved an average runtime estimation accuracy of 64.2%, with a 28.9% job termination rate, which almost doubled the average accuracy in the Kraken dataset.
