Abstract
In high-performance computing (HPC) workflows, data analytics is typically utilized to gain insights from scientific simulations. Approaching the era of exascale, online analysis is gaining popularity due to the savings of I/O to persistent storage. As computing capability keeps growing, power consumption is becoming critical to HPC facilities. Enforcing power limits is emerging as a practical trend for power-constrained HPC facilities. However, it remains unclear how to choose the appropriate power limits for various HPC workflows and how to distribute the power limit of a workflow between simulation and analysis. In addition, given a power limit, it is unclear what the optimal scales and power capping levels are for various workflows, especially when taking reliability into account. In order to resolve these issues in power-constrained HPC, in this paper, we propose a reliability-aware model to determine the aforementioned platform configurations for HPC workflows. We also validate our model and present model-driven studies for a wide range of real-system scenarios. Our study reveals interesting insights about how platform configuration affects the performance and energy efficiency of HPC workflows under power constraints.