Deep learning (DL) is a popular technique for building models from large quantities of data, such as pictures, videos, and messages, generated at a rapid pace by edge devices all over the world. It is often infeasible to migrate large quantities of data from the edge to centralized data center(s) over WANs for training due to privacy, cost, and performance reasons. At the same time, training large DL models on edge devices is infeasible due to their limited resources. An attractive alternative for DL training on distributed data is to use micro-clouds: small-scale clouds deployed near edge devices in multiple locations. However, micro-clouds present the challenges of both compute and network resource heterogeneity as well as dynamism. In this paper, we introduce DLion, a new and generic decentralized distributed DL system designed to address the key challenges in micro-cloud environments, in order to reduce overall training time and improve model accuracy. We present three key techniques in DLion: (1) weighted dynamic batching to maximize data parallelism under heterogeneous and dynamic compute capacity, (2) per-link prioritized gradient exchange to reduce communication overhead for model updates based on available network capacity, and (3) direct knowledge transfer to improve model accuracy by merging the best-performing model parameters. We build a prototype of DLion on top of TensorFlow and show that DLion achieves up to a 4.2X speedup in an Amazon GPU cluster, and up to a 2X speedup and 26% higher model accuracy in a CPU cluster, over four state-of-the-art distributed DL systems.
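The abstract describes weighted dynamic batching only at a high level. As a rough illustration of the underlying idea, here is a minimal sketch assuming that a fixed global batch is split across workers in proportion to each worker's measured throughput; the names (`Worker`, `assign_batch_sizes`, `samples_per_sec`) are hypothetical and are not DLion's actual API.

```python
# Hypothetical sketch of weighted dynamic batching: each worker's local
# batch size is set in proportion to its measured compute capacity, so
# faster nodes process more samples per step and stragglers process fewer.
# All names here are illustrative, not taken from the DLion prototype.

from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    samples_per_sec: float  # throughput measured over recent training steps

def assign_batch_sizes(workers, global_batch, min_batch=1):
    """Split a fixed global batch across workers by relative throughput."""
    total = sum(w.samples_per_sec for w in workers)
    sizes = {
        w.name: max(min_batch, round(global_batch * w.samples_per_sec / total))
        for w in workers
    }
    # Repair rounding drift so local sizes still sum to the global batch.
    drift = global_batch - sum(sizes.values())
    fastest = max(workers, key=lambda w: w.samples_per_sec).name
    sizes[fastest] += drift
    return sizes

if __name__ == "__main__":
    cluster = [Worker("gpu-a", 900.0), Worker("gpu-b", 450.0), Worker("cpu-c", 150.0)]
    print(assign_batch_sizes(cluster, global_batch=512))
    # e.g. {'gpu-a': 307, 'gpu-b': 154, 'cpu-c': 51}
```

Re-measuring `samples_per_sec` periodically and recomputing the split would let the batch assignment track dynamic capacity changes, which is the dynamism the abstract refers to in micro-cloud environments.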
|Original language||English (US)|
|Title of host publication||HPDC 2021 - Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing|
|Publisher||Association for Computing Machinery, Inc|
|Number of pages||12|
|State||Published - Jun 21 2021|
|Event||30th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2021 - Virtual, Online, Sweden|
Duration: Jun 21 2021 → Jun 25 2021
|Publication series||HPDC 2021 - Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing|
|Conference||30th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2021|
|Period||6/21/21 → 6/25/21|
Bibliographical note
Funding Information: This work is supported in part by NSF grant CNS-1717834.
© 2020 ACM.
Keywords
- deep learning
- edge computing
- resource allocation