Publication

Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks

Publication Type
Conference Paper
Book Title
Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation
Publication Date
Page Numbers
1027–1040
Publisher Location
Renton, Washington, United States of America
Conference Name
NSDI '22: 19th USENIX Symposium on Networked Systems Design and Implementation
Conference Location
Renton, Washington, United States of America
Conference Sponsor
USENIX
Conference Date
-

This work develops new techniques within Horovod, a generic communication library supporting data parallel training across deep learning frameworks. In particular, we improve the Horovod control plane by implementing a new coordination scheme that exploits a key characteristic of the typical data parallel training paradigm: the repeated execution of collectives on the gradients of a fixed set of tensors. Using a caching strategy, we execute Horovod's existing coordinator-worker logic only once during a typical training run and replace it, for the remaining training duration, with a more efficient decentralized orchestration strategy that relies on the cached data and a global intersection of a readiness bitvector. Next, we introduce a feature that lets end users explicitly group collective operations, enabling finer-grained control over communication buffer sizes. To evaluate the proposed strategies, we conduct experiments on Summit, a world-class supercomputer. Compared with Horovod's original design, we observe a 2x performance improvement at a scale of 6,000 GPUs; compared with tf.distribute and torch.DDP, we achieve 12% better and comparable performance, respectively, using up to 1,536 GPUs; compared with BytePS in typical HPC settings, we achieve about 20% better performance at a scale of 768 GPUs. Finally, we test our strategies on a scientific application (STEMDL) using up to 27,600 GPUs (the entire Summit) and achieve a near-linear scaling efficiency of 0.93 with a sustained performance of 1.54 exaflops (standard error ±0.02) in FP16 precision.
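
A minimal sketch of the decentralized coordination idea summarized above, written with mpi4py and NumPy (both assumptions; this is not Horovod's internal implementation): each worker sets a bit for every cached tensor whose gradient is locally ready, and a single bitwise-AND allreduce yields the set of collectives that all workers can launch without any coordinator round-trip.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    def globally_ready(local_bits):
        # local_bits: uint8 bitvector; bit i is set if cached tensor i has a
        # gradient ready for allreduce on this worker.
        global_bits = np.empty_like(local_bits)
        # A single bitwise-AND allreduce replaces the coordinator-worker
        # negotiation: a collective is launched only for tensors whose bit
        # is set on every rank.
        comm.Allreduce(local_bits, global_bits, op=MPI.BAND)
        return global_bits

    # Example: 16 cached tensors packed into a 2-byte bitvector.
    ready = np.zeros(16, dtype=np.uint8)
    ready[[0, 3, 7]] = 1                      # tensors ready on this worker
    agreed = globally_ready(np.packbits(ready))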
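
A hedged example of the explicit grouping feature from an end user's perspective, written against Horovod's PyTorch API; the num_groups argument and the placeholder model and optimizer are assumptions based on recent Horovod releases rather than a restatement of the paper's exact interface.

    import torch
    import horovod.torch as hvd

    hvd.init()
    torch.cuda.set_device(hvd.local_rank())

    model = torch.nn.Linear(1024, 1024).cuda()            # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Fuse gradient allreduces into a fixed number of groups, trading larger
    # communication buffers against fewer collective launches.
    optimizer = hvd.DistributedOptimizer(
        optimizer,
        named_parameters=model.named_parameters(),
        num_groups=4,  # assumed grouping knob; see the Horovod docs for details
    )

Grouping the gradient collectives this way gives the user direct control over how many fused buffers are exchanged per step, rather than relying solely on Horovod's automatic tensor fusion.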