Publication

SchedInspector: A Batch Job Scheduling Inspector Using Reinforcement Learning...

by Di Zhang, Dong Dai, Bing Xie
Publication Type
Conference Paper
Book Title
HPDC '22: Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing
Page Numbers
97 to 109
Publisher Location
New York, United States of America
Conference Name
International Symposium on High-Performance Parallel and Distributed Computing (HPDC)
Conference Location
Minneapolis, Minnesota, United States of America
Conference Sponsor
ACM

Improving job execution performance, for example by minimizing job waiting time, slowdown, or completion time, is an important goal of HPC batch job schedulers. This goal is often pursued using carefully designed heuristics based on job features such as job size and job duration. However, these heuristics overlook important runtime factors (e.g., cluster availability and waiting job patterns), which vary over time and can invalidate a previously sound scheduling decision. In this study, we propose a new approach that incorporates runtime factors into batch job scheduling for better job execution performance. The key idea is to add a scheduling inspector on top of the base job scheduler to scrutinize its scheduling decisions. The inspector takes the runtime factors into consideration and determines the fitness of the scheduled job accordingly. It then either accepts the scheduled job or rejects it and asks the base scheduler to try again later. We realize such an inspector, namely SchedInspector, using reinforcement learning. Through extensive experiments, we show that SchedInspector can integrate runtime factors into various batch job scheduling policies, including the state-of-the-art one, to achieve better job execution performance, such as smaller average bounded job slowdown (up to 69% better) or average job waiting time (up to 52% better), across various real-world workloads. We also show that, although rejecting scheduling decisions may leave resources idle and hence affect system utilization, SchedInspector achieves these improvements with marginal impact on system utilization (typically less than 1%). A key advantage of SchedInspector is that it automatically learns to work with and improve existing job scheduling policies without changing them, which makes it promising as a generic enhancer for various batch job scheduling policies.
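The accept-or-reject loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `Job`, `sjf`, `toy_accept`, and `inspect_and_schedule` are hypothetical, the base policy is plain shortest-job-first, and the accept rule is a hand-written placeholder standing in for the learned reinforcement-learning policy. The `bounded_slowdown` helper uses the commonly cited definition of the metric with an assumed 10-second threshold; the paper's exact settings may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Job:
    job_id: int
    submit_time: float        # seconds
    requested_nodes: int
    requested_runtime: float  # seconds


def bounded_slowdown(wait: float, run: float, threshold: float = 10.0) -> float:
    """Common definition: max((wait + run) / max(run, threshold), 1)."""
    return max((wait + run) / max(run, threshold), 1.0)


def sjf(queue: List[Job]) -> Job:
    """Base policy (assumption): shortest requested runtime first."""
    return min(queue, key=lambda j: j.requested_runtime)


def toy_accept(job: Job, queue: List[Job], free_nodes: int, now: float) -> bool:
    """Placeholder for the learned inspector (NOT the paper's policy).

    Rejects a job that would take more than 80% of the free nodes
    while other jobs are still waiting.
    """
    return not (job.requested_nodes > 0.8 * free_nodes and len(queue) > 1)


def inspect_and_schedule(
    queue: List[Job],
    free_nodes: int,
    now: float,
    base_policy: Callable[[List[Job]], Job],
    accept: Callable[[Job, List[Job], int, float], bool],
) -> Optional[Job]:
    """Ask the base scheduler for a job, then let the inspector decide.

    Returns the job to start, or None if nothing fits or the inspector
    rejects the decision (the base scheduler is asked again later).
    """
    if not queue:
        return None
    candidate = base_policy(queue)
    if candidate.requested_nodes > free_nodes:
        return None  # the job does not fit right now
    if accept(candidate, queue, free_nodes, now):
        queue.remove(candidate)
        return candidate
    return None  # rejected: resources stay idle until the next attempt


queue = [Job(1, 0.0, 8, 100.0), Job(2, 0.0, 2, 500.0)]
rejected = inspect_and_schedule(queue, 8, 0.0, sjf, toy_accept)
accepted = inspect_and_schedule(queue, 8, 0.0, sjf, lambda *args: True)
```

In this sketch the base policy is passed in unchanged, which mirrors the abstract's point that the inspector sits on top of an existing scheduler rather than modifying it. A rejection leaves nodes idle until the next scheduling attempt, which is the utilization trade-off the abstract mentions.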