Publication

Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks

Publication Type
Conference Paper
Book Title
Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation
Publication Date
Page Numbers
1027–1040
Publisher Location
Renton, Washington, United States of America
Conference Name
NSDI '22: 19th USENIX Symposium on Networked Systems Design and Implementation
Conference Location
Renton, Washington, United States of America
Conference Sponsor
USENIX
Conference Date
-

This work develops new techniques within Horovod, a generic communication library supporting data parallel training across deep learning frameworks. In particular, we improve the Horovod control plane by implementing a new coordination scheme that exploits a key characteristic of the typical data parallel training paradigm: the repeated execution of collectives on the gradients of a fixed set of tensors. Using a caching strategy, we execute Horovod's existing coordinator-worker logic only once during a typical training run and replace it, for the remaining training duration, with a more efficient decentralized orchestration strategy that relies on the cached data and a global intersection of a readiness bitvector. Next, we introduce a feature that lets end users explicitly group collective operations, enabling finer-grained control over communication buffer sizes. To evaluate the proposed strategies, we conduct experiments on Summit, a world-class supercomputer. Compared with Horovod's original design, we observe a 2x performance improvement at a scale of 6,000 GPUs; compared with tf.distribute and torch.DDP, we achieve 12% better and comparable performance, respectively, using up to 1,536 GPUs; compared with BytePS in typical HPC settings, we achieve about 20% better performance at a scale of 768 GPUs. Finally, we test our strategies on a scientific application (STEMDL) using up to 27,600 GPUs (the entire Summit) and achieve a near-linear scaling efficiency of 0.93 with a sustained performance of 1.54 exaflops (standard error ±0.02) in FP16 precision.
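
A minimal sketch of the decentralized coordination idea summarized above, written with mpi4py and NumPy (both assumptions; this is not Horovod's internal implementation): each worker sets a bit for every cached tensor whose gradient is locally ready, and a single bitwise-AND allreduce yields the set of collectives that all workers can launch without any coordinator round-trip.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    def globally_ready(local_bits):
        # local_bits: uint8 bitvector; bit i is set if cached tensor i has a
        # gradient ready for allreduce on this worker.
        global_bits = np.empty_like(local_bits)
        # A single bitwise-AND allreduce replaces the coordinator-worker
        # negotiation: a collective is launched only for tensors whose bit
        # is set on every rank.
        comm.Allreduce(local_bits, global_bits, op=MPI.BAND)
        return global_bits

    # Example: 16 cached tensors packed into a 2-byte bitvector.
    ready = np.zeros(16, dtype=np.uint8)
    ready[[0, 3, 7]] = 1                      # tensors ready on this worker
    agreed = globally_ready(np.packbits(ready))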
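
A hedged example of the explicit grouping feature from an end user's perspective, written against Horovod's PyTorch API; the num_groups argument and the placeholder model and optimizer are assumptions based on recent Horovod releases rather than a restatement of the paper's exact interface.

    import torch
    import horovod.torch as hvd

    hvd.init()
    torch.cuda.set_device(hvd.local_rank())

    model = torch.nn.Linear(1024, 1024).cuda()            # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Fuse gradient allreduces into a fixed number of groups, trading larger
    # communication buffers against fewer collective launches.
    optimizer = hvd.DistributedOptimizer(
        optimizer,
        named_parameters=model.named_parameters(),
        num_groups=4,  # assumed grouping knob; see the Horovod docs for details
    )

Grouping the gradient collectives this way gives the user direct control over how many fused buffers are exchanged per step, rather than relying solely on Horovod's automatic tensor fusion.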