Abstract
The performance and scalability of collective operations play a key role in the
performance and scalability of many scientific applications. Within the Open
MPI code base we have developed a general-purpose hierarchical collective
operations framework called Cheetah, and applied it at large scale on the Oak
Ridge Leadership Computing Facility's (OLCF) Jaguar platform, obtaining better
performance and scalability than the native MPI implementation. This paper discusses
Cheetah's design and implementation, and optimizations to the framework for
Cray XT5 platforms. Our results show that Cheetah's Broadcast and Barrier perform
better than the native MPI implementation. For medium data, Cheetah's Broadcast
outperforms the native MPI implementation by 93% at 49,152 processes.
For small and large data, it outperforms the native MPI implementation by 10% and 9%, respectively,
at 24,576 processes. Cheetah's Barrier performs 10% better than the
native MPI implementation at 12,288 processes.