Optimizing Communication in 2D Grid-Based MPI Applications at Exascale

Publication Type: Conference Paper
Book Title: EuroMPI '23: Proceedings of the 30th European MPI Users' Group Meeting
Publication Date:
Page Numbers: 1–11
Publisher Location: New York, NY, USA
Conference Name: EuroMPI '23: European MPI Users' Group Meeting
Conference Location: Bristol, United Kingdom
Conference Sponsor: University of Bristol
Conference Date: -

Exascale computing poses many challenges to achieving optimal performance on large numbers of nodes. A key challenge is the efficient use of the Message Passing Interface (MPI), a critical component for interprocess communication. This paper explores communication optimization strategies that harness the GPU-accelerated architectures of these supercomputers. We focus on MPI applications whose processes are arranged in a two-dimensional grid, a common configuration in applications built around dense matrix operations. This configuration offers a unique opportunity to apply strategies that improve performance while maintaining effective load distribution. We study two applications, Dist-FW (all-pairs shortest path, APSP) and HPL-MxP (mixed-precision LU factorization), on two accelerated systems: Summit (IBM Power and NVIDIA V100) and Frontier (AMD EPYC and MI250X). These supercomputers are operated by the Oak Ridge Leadership Computing Facility (OLCF) and are currently ranked #1 (Frontier) and #5 (Summit) on the TOP500 list. We show how to scale both applications to exascale levels and address the MPI challenges related to implementation, synchronization, and performance. We also compare the performance of several communication strategies at an unprecedented scale. As the computational scale grows, accurately predicting application performance becomes crucial for cost reduction. To that end, we propose a hyperbolic model as a better alternative to the traditional one-sided asymptotic model for predicting future application performance at such large scales.
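
For readers unfamiliar with the layout the abstract refers to, the short C sketch below shows one standard way to set up a two-dimensional process grid with MPI's Cartesian-topology routines (MPI_Dims_create, MPI_Cart_create, MPI_Cart_sub) and to derive the row and column communicators that dense-matrix kernels typically broadcast along. It is an illustrative assumption about the general technique, not the implementation used in the paper.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Let MPI pick a near-square Pr x Pc factorization of the process count. */
    int dims[2] = {0, 0};
    MPI_Dims_create(world_size, 2, dims);

    /* Build the 2D Cartesian grid (no wraparound; allow rank reordering). */
    int periods[2] = {0, 0};
    MPI_Comm grid_comm;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid_comm);

    int grid_rank, coords[2];
    MPI_Comm_rank(grid_comm, &grid_rank);
    MPI_Cart_coords(grid_comm, grid_rank, 2, coords);

    /* Split out row and column communicators so collectives (e.g. panel
       broadcasts in LU-style factorizations) run along one grid dimension. */
    int keep_cols[2] = {0, 1};   /* vary column index -> ranks in the same row    */
    int keep_rows[2] = {1, 0};   /* vary row index    -> ranks in the same column */
    MPI_Comm row_comm, col_comm;
    MPI_Cart_sub(grid_comm, keep_cols, &row_comm);
    MPI_Cart_sub(grid_comm, keep_rows, &col_comm);

    printf("world rank %d -> grid position (%d,%d) in a %d x %d grid\n",
           world_rank, coords[0], coords[1], dims[0], dims[1]);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Comm_free(&grid_comm);
    MPI_Finalize();
    return 0;
}

Once the row and column communicators exist, a step of a 2D dense-matrix algorithm typically reduces to a broadcast or reduction over row_comm or col_comm, which is where scale-dependent choices of communication strategy, such as those compared in the paper, matter most.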