Research Highlight

3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication

Overview of the 3D Coded SUMMA (left) and comparison of the total execution time between 3D SUMMA (no resilience), replication, and 3D Coded SUMMA. Image: Computer Science and Mathematics Division (CSMD), ORNL.

The Science

A team of researchers from Oak Ridge National Laboratory (ORNL), Carnegie Mellon University (CMU), University of California, Berkeley (UCB), and Penn State University (PSU) developed a novel algorithm for resilient and communication-efficient parallel matrix multiplication in high-performance computing (HPC) systems.

The algorithm, known as 3D Coded SUMMA:

  • performs communication-efficient parallel matrix multiplication and can recover from compute node failures by using redundancy introduced through coded computation.
  • requires 50% less redundancy than traditional replication and has an execution time overhead of only 5-10%.
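
To make the idea of coded computation concrete, the following is a minimal sketch of a single-parity scheme, not the 3D Coded SUMMA algorithm itself. It assumes matrix A is split into row blocks, each handled by one simulated worker, with one extra worker multiplying the elementwise sum of all blocks by B. All function names (`encode`, `decode`, `coded_multiply`) and the `failed` parameter are hypothetical and chosen only for illustration.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def add(X, Y):
    """Elementwise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def sub(X, Y):
    """Elementwise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def encode(blocks):
    """Append one parity block: the elementwise sum of all data blocks."""
    parity = blocks[0]
    for blk in blocks[1:]:
        parity = add(parity, blk)
    return blocks + [parity]


def decode(results):
    """Rebuild the full product from worker results with at most one loss."""
    k = len(results) - 1  # number of data workers; the last one is parity
    missing = [i for i, r in enumerate(results) if r is None]
    if len(missing) > 1:
        raise ValueError("a single parity block tolerates only one failure")
    if missing and missing[0] < k:
        lost = missing[0]
        # (sum of all A_i) * B minus the surviving A_i * B leaves A_lost * B.
        recovered = results[k]
        for i in range(k):
            if i != lost:
                recovered = sub(recovered, results[i])
        results[lost] = recovered
    # Stack the data blocks back into one matrix.
    return [row for blk in results[:k] for row in blk]


def coded_multiply(A_blocks, B, failed=None):
    """Simulate workers computing block products; `failed` marks a lost one."""
    coded = encode(A_blocks)
    results = [None if i == failed else matmul(blk, B)
               for i, blk in enumerate(coded)]
    return decode(results)


# Usage: three data workers plus one parity worker, with worker 1 failing.
A_blocks = [[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]]
B = [[1, 0], [2, 1]]
reference = matmul([row for blk in A_blocks for row in blk], B)
assert coded_multiply(A_blocks, B, failed=1) == reference
```

In this toy scheme, one extra worker protects any number of data workers against a single failure, whereas replication would need a duplicate of every worker. The algorithm described here builds on the same principle but applies it within a communication-efficient 3D decomposition.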

The Impact

Current HPC strategies for obtaining timely and efficient results despite failures (checkpoint/restart, algorithm-based fault tolerance, etc.) may not provide the necessary failure tolerance at a reasonable cost in systems that experience high failure rates.

The developed algorithm:

  • offers a new capability for such systems by providing the failure tolerance of traditional redundant computing at significantly lower cost.
  • applies the latest advances in coding theory to failure-tolerant computing, opening up an entirely new area of research.

PI(s): Pulkit Grover (CMU) and Christian Engelmann (ORNL)
ASCR Program/Facility: Early Career
Funding: DOE/ASCR for ORNL, NSF for CMU, UCB, and PSU
Publication: Haewon Jeong, Yaoqing Yang, Christian Engelmann, Vipul Gupta, Tze Meng Low, Pulkit Grover, Viveck Cadambe, and Kannan Ramchandran. 3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication. In Lecture Notes in Computer Science: Proceedings of the 26th European Conference on Parallel and Distributed Computing (Euro-Par) 2020, Warsaw, Poland, August 24-28, 2020. DOI: https://doi.org/10.1007/978-3-030-57675-2_25