A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems

by Piyush K Sao, Xiaoye Li, Richard Vuduc

Publication Type

Journal

Journal Name

Journal of Parallel and Distributed Computing

Publication Date

April, 2019

Page Numbers

218 to 234

Volume

131

Issue

View DOI Listing

Abstract

We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D algorithm for sparse LU uses a three-dimensional MPI process grid, exploits elimination tree parallelism, and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (specifically for 3D grids and meshes). For a planar graph with n vertices, our algorithm reduces communication volume asymptotically in n by a factor of O (√log n) and latency by a factor of O (log n). For nonplanar cases, our algorithm can reduce the per-process communication volume by 3× and latency by O (n 1/3) times. In all cases, the memory needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in SuperLU_DIST. Our new 3D code achieves empirical speedups up to 27× for planar graphs and up to 3.3× for non-planar graphs over the baseline 2D SuperLU_DIST when run on 24,000 cores of a Cray XC30. We extend the 3D algorithm for heterogeneous architectures by adding the Highly Asynchronous Lazy Offload (Halo) algorithm for co-processor offload [44]. On 4096 nodes of a Cray XK7 with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups up to 24× for planar graphs and 3.5× for non-planar graphs over the baseline 2D SuperLU_DIST with co-processor acceleration.

A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems

Abstract

Researchers

Organizations