A Finite Sample Analysis of the Actor-Critic Algorithm

Zhuoran Yang, Kaiqing Zhang, Mingyi Hong, Tamer Basar

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

We study the finite-sample performance of the batch actor-critic algorithm for reinforcement learning with nonlinear function approximation. Specifically, in the critic step we estimate the action-value function of the actor's current policy within some parametrized function class, while in the actor step the policy is updated using the policy gradient estimated from the critic, so as to minimize the objective function defined as the expected value of the discounted cumulative rewards. Under this setting, for the parameter sequence generated by the actor steps, we show that the gradient norm of the objective function at any limit point is close to zero up to some fundamental error. In particular, we show that this error corresponds to the statistical rate of policy evaluation with nonlinear function approximation. For the special class of linear functions, and when the number of samples goes to infinity, our result recovers the classical convergence results for the online actor-critic algorithm, which are based on the asymptotic behavior of two-time-scale stochastic approximation.
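
As a rough illustration of the batch actor-critic scheme described above, the sketch below alternates a critic step (policy evaluation from a finite batch of trajectories) with an actor step (a policy-gradient update using the critic's estimate). It is not the paper's algorithm: the synthetic MDP, the tabular softmax policy (a stand-in for the linear special case), the Monte Carlo regression critic, and all batch sizes and step sizes are assumptions made only for this example.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite MDP (illustrative data, not from the paper).
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
R = rng.uniform(0.0, 1.0, size=(S, A))       # expected reward r(s, a)

def softmax_policy(theta, s):
    """Action distribution pi_theta(.|s) under a tabular softmax parameterization."""
    logits = theta[s]
    z = np.exp(logits - logits.max())
    return z / z.sum()

def sample_batch(theta, n_traj=100, horizon=20):
    """Collect a finite batch of trajectories under the current policy."""
    batch = []
    for _ in range(n_traj):
        s, traj = int(rng.integers(S)), []
        for _ in range(horizon):
            a = int(rng.choice(A, p=softmax_policy(theta, s)))
            traj.append((s, a, R[s, a]))
            s = int(rng.choice(S, p=P[s, a]))
        batch.append(traj)
    return batch

def critic_step(batch):
    """Critic: estimate Q^pi by averaging discounted Monte Carlo returns,
    i.e., policy evaluation restricted to a tabular (linear) function class."""
    Q_hat, counts = np.zeros((S, A)), np.zeros((S, A))
    for traj in batch:
        G = 0.0
        for s, a, r in reversed(traj):        # discounted return-to-go
            G = r + gamma * G
            Q_hat[s, a] += G
            counts[s, a] += 1
    return Q_hat / np.maximum(counts, 1.0)

def actor_step(theta, batch, Q_hat, lr=0.1):
    """Actor: one policy-gradient step, grad log pi(a|s) weighted by the critic's Q."""
    grad, n = np.zeros_like(theta), 0
    for traj in batch:
        for s, a, _ in traj:
            pi = softmax_policy(theta, s)
            g_log = -pi                        # gradient of log pi(a|s) w.r.t. the logits of s
            g_log[a] += 1.0
            grad[s] += g_log * Q_hat[s, a]
            n += 1
    return theta + lr * grad / n               # ascent on the expected discounted return

theta = np.zeros((S, A))
for _ in range(30):
    batch = sample_batch(theta)
    Q_hat = critic_step(batch)                 # critic: finite-sample policy evaluation
    theta = actor_step(theta, batch, Q_hat)    # actor: policy-gradient update

print("mean per-step reward after training:",
      np.mean([r for traj in sample_batch(theta) for _, _, r in traj]))

For simplicity the sketch drops the discount weighting on state visitation in the gradient and performs ascent on the expected return directly, whereas the abstract phrases the problem as minimizing the corresponding objective.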

Original language: English (US)
Title of host publication: 2018 IEEE Conference on Decision and Control, CDC 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 2759-2764
Number of pages: 6
ISBN (Electronic): 9781538613955
DOIs: https://doi.org/10.1109/CDC.2018.8619440
State: Published - Jan 18 2019
Event: 57th IEEE Conference on Decision and Control, CDC 2018 - Miami, United States
Duration: Dec 17 2018 - Dec 19 2018

Publication series

Name: Proceedings of the IEEE Conference on Decision and Control
Volume: 2018-December
ISSN (Print): 0743-1546

Conference

Conference: 57th IEEE Conference on Decision and Control, CDC 2018
Country: United States
City: Miami
Period: 12/17/18 - 12/19/18

Fingerprint

Nonlinear Approximation, Function Approximation, Nonlinear Function, Objective Function, Gradient, Stochastic Approximation, Limit Point, Reinforcement Learning, Expected Value, Reward, Value Function, Linear Function, Convergence Results, Batch, Time Scales, Asymptotic Behavior, Infinity, Actors, Minimise

Cite this

Yang, Z., Zhang, K., Hong, M., & Basar, T. (2019). A Finite Sample Analysis of the Actor-Critic Algorithm. In 2018 IEEE Conference on Decision and Control, CDC 2018 (pp. 2759-2764). [8619440] (Proceedings of the IEEE Conference on Decision and Control; Vol. 2018-December). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CDC.2018.8619440
