This paper studies a class of adaptive gradient based momentum algorithms that update the search directions and learning rates simultaneously using past gradients. This class, which we refer to as the “Adam-type”, includes the popular algorithms such as Adam (Kingma & Ba, 2014), AMSGrad (Reddi et al., 2018), AdaGrad (Duchi et al., 2011). Despite their popularity in training deep neural networks (DNNs), the convergence of these algorithms for solving non-convex problems remains an open question. In this paper, we develop an analysis framework and a set of mild sufficient conditions that guarantee the convergence of the Adam-type methods, with a convergence rate of order O(log T/T) for non-convex stochastic optimization. Our convergence analysis applies to a new algorithm called AdaFom (AdaGrad with First Order Momentum). We show that the conditions are essential, by identifying concrete examples in which violating the conditions makes an algorithm diverge. Besides providing one of the first comprehensive analysis for Adam-type methods in the non-convex setting, our results can also help the practitioners to easily monitor the progress of algorithms and determine their convergence behavior.
|Original language||English (US)|
|State||Published - Jan 1 2019|
|Event||7th International Conference on Learning Representations, ICLR 2019 - New Orleans, United States|
Duration: May 6 2019 → May 9 2019
|Conference||7th International Conference on Learning Representations, ICLR 2019|
|Period||5/6/19 → 5/9/19|