TY - JOUR
T1 - Hybrid-order distributed SGD
T2 - Balancing communication overhead, computational complexity, and convergence rate for distributed learning
AU - Omidvar, Naeimeh
AU - Hosseini, Seyed Mohammad
AU - Maddah-Ali, Mohammad Ali
N1 - Publisher Copyright:
© 2024 Elsevier B.V.
PY - 2024/9/28
Y1 - 2024/9/28
AB - Communication overhead, computation load, and convergence speed are three major challenges in the scalability of distributed stochastic optimization algorithms for training large neural networks. In this paper, we propose the hybrid-order distributed stochastic gradient descent (HO-SGD) approach, which strikes a better balance among these three than previous methods for a general class of non-convex stochastic optimization problems. In particular, we advocate that by properly interleaving zeroth-order and first-order gradient updates, it is possible to significantly reduce the communication and computation overheads while guaranteeing fast convergence. The proposed method guarantees the same order of convergence rate as the fastest distributed methods (i.e., fully synchronous SGD) with significantly lower computational complexity and communication overhead per iteration, and the same order of communication overhead as state-of-the-art communication-efficient methods with order-wise lower computational complexity. Moreover, it achieves an order-wise improvement in the convergence rate over zeroth-order SGD methods. Finally, empirical studies demonstrate that the proposed hybrid-order approach provides significantly higher test accuracy and better generalization than all baselines, owing to its novel exploration mechanism.
KW - Communication overhead
KW - Computational complexity
KW - Convergence rate
KW - Distributed learning
KW - Distributed optimization
KW - Generalization
KW - Non-convex
KW - Stochastic optimization
UR - http://www.scopus.com/inward/record.url?scp=85197420076&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85197420076&partnerID=8YFLogxK
U2 - 10.1016/j.neucom.2024.128020
DO - 10.1016/j.neucom.2024.128020
M3 - Article
AN - SCOPUS:85197420076
SN - 0925-2312
VL - 599
JO - Neurocomputing
JF - Neurocomputing
M1 - 128020
ER -