Optimizing Blocking and Nonblocking Reduction Operations for Multicore Systems: Hierarchical Design and Implementation

This work proposed a design for implementing blocking and nonblocking reduction collective operations on modern multicore systems. An implementation based on this design performed an order of magnitude better than the state of the art on a variety of systems, including Cray and InfiniBand systems. The Conjugate Gradient solver using this implementation completed over 195% faster than it did using the state-of-the-art implementation. These reduction implementations are integrated into Open MPI, a popular implementation of the MPI standard, and we expect to release them publicly as part of a future Open MPI release. A paper describing the design, implementation, and evaluation of these reductions has been accepted for publication in the IEEE Cluster 2013 conference proceedings.
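
The summary above does not spell out how the hierarchy is organized, so the following is only a minimal sketch of the general idea behind a two-level reduction, not the design from the paper. It simulates the ranks inside a single process: values are first combined within each "node", and the per-node partial results are then combined across nodes. The function name `hierarchical_sum`, the grouping of consecutive ranks into nodes, and the choice of a sum as the reduction operator are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the paper or from Open MPI): sum-reduce the
 * values held by simulated ranks in two levels. Ranks are grouped into
 * "nodes" of ranks_per_node consecutive ranks; the last node may be smaller. */
double hierarchical_sum(const double *rank_values, size_t num_ranks,
                        size_t ranks_per_node)
{
    assert(ranks_per_node > 0);

    double total = 0.0;
    for (size_t start = 0; start < num_ranks; start += ranks_per_node) {
        size_t end = start + ranks_per_node;
        if (end > num_ranks) {
            end = num_ranks;
        }

        /* Level 1: combine the values of all ranks on this node, as if
         * reducing onto a single node leader. */
        double node_partial = 0.0;
        for (size_t r = start; r < end; ++r) {
            node_partial += rank_values[r];
        }

        /* Level 2: combine the node leaders' partial results. */
        total += node_partial;
    }
    return total;
}
```

In a real MPI program the two levels would typically map onto separate communicators rather than loops, with `MPI_Reduce` as the blocking reduction call and `MPI_Ireduce` as its nonblocking counterpart.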

Team Members: Manjunath Gorentla Venkata, Pavel Shamis, Richard L. Graham, Joshua S. Ladd, and Rahul Sampath


