Gaussian process temporal-difference learning with scalability and worst-case performance guarantees

Research output: Contribution to journal › Conference article › peer-review

Abstract

Value function approximation is a crucial module for policy evaluation in reinforcement learning when the state space is large or continuous. The present paper revisits policy evaluation via temporal-difference (TD) learning from the Gaussian process (GP) perspective. Leveraging random features to approximate the GP prior, an online scalable (OS) approach, termed OS-GPTD, is developed to estimate the value function of a given policy by observing a sequence of state-reward pairs. To benchmark the performance of OS-GPTD even in the adversarial setting, where the modeling assumptions are violated, complementary worst-case analyses are performed. Both the cumulative Bellman error and the long-term reward prediction error are upper bounded relative to their counterparts from a fixed value function estimator with the entire state-reward trajectory in hindsight. The performance of the novel OS-GPTD approach is evaluated on two benchmark problems.
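
To give a rough sense of the random-feature idea mentioned in the abstract, the sketch below pairs random Fourier features (approximating an RBF-kernel GP prior) with a plain linear TD(0) update for online value estimation from state-reward pairs. It is a minimal stand-in under stated assumptions, not the paper's OS-GPTD recursion; the class name RandomFeatureTD and all hyperparameters (num_features, sigma, gamma, step_size) are illustrative choices, not taken from the paper.

# Minimal sketch: random Fourier features + linear TD(0) value estimation.
# Not the authors' OS-GPTD algorithm; hyperparameters are illustrative.
import numpy as np

class RandomFeatureTD:
    def __init__(self, state_dim, num_features=100, sigma=1.0,
                 gamma=0.95, step_size=0.05, seed=0):
        rng = np.random.default_rng(seed)
        # Spectral frequencies and phases for random Fourier features of the
        # RBF kernel k(s, s') = exp(-||s - s'||^2 / (2 * sigma^2)).
        self.W = rng.normal(scale=1.0 / sigma, size=(num_features, state_dim))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
        self.theta = np.zeros(num_features)   # weights on the random features
        self.gamma = gamma
        self.step_size = step_size
        self.num_features = num_features

    def features(self, s):
        # phi(s) with phi(s)^T phi(s') ~ k(s, s') as num_features grows.
        z = self.W @ np.asarray(s, dtype=float) + self.b
        return np.sqrt(2.0 / self.num_features) * np.cos(z)

    def value(self, s):
        return self.features(s) @ self.theta

    def update(self, s, r, s_next):
        # One TD(0) step on the observed (state, reward, next state) transition.
        phi = self.features(s)
        td_error = r + self.gamma * self.value(s_next) - phi @ self.theta
        self.theta += self.step_size * td_error * phi

# Toy usage on a synthetic 2-D state trajectory (stand-in dynamics and reward).
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    est = RandomFeatureTD(state_dim=2)
    s = rng.normal(size=2)
    for _ in range(1000):
        s_next = 0.9 * s + 0.1 * rng.normal(size=2)
        r = -np.linalg.norm(s)
        est.update(s, r, s_next)
        s = s_next
    print("estimated value at the origin:", est.value(np.zeros(2)))

A fully recursive (e.g., least-squares style) update would track an inverse covariance over the feature weights, which is closer in spirit to a GP posterior; the stochastic TD(0) rule above is kept only for brevity.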

Original language: English (US)
Pages (from-to): 3485-3489
Number of pages: 5
Journal: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2021-June
DOIs
State: Published - 2021
Event: 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021 - Virtual, Toronto, Canada
Duration: Jun 6, 2021 – Jun 11, 2021

Bibliographical note

Funding Information:
This work was supported in part by NSF grants 1711471 and 1901134.

Publisher Copyright:
© 2021 IEEE.

Keywords

  • Gaussian process
  • Temporal-difference learning
  • Worst-case performance analysis
