Value function approximation is a crucial module for policy evaluation in reinforcement learning when the state space is large or continuous. The present paper revisits policy evaluation via temporal-difference (TD) learning from the Gaussian process (GP) perspective. Leveraging random features to approximate the GP prior, an online scalable (OS) approach, termed OS-GPTD, is developed to estimate the value function of a given policy from a sequence of state-reward pairs. To benchmark the performance of OS-GPTD even in the adversarial setting, where the modeling assumptions are violated, complementary worst-case analyses are performed: both the cumulative Bellman error and the long-term reward prediction error are upper bounded relative to their counterparts from a fixed value function estimator with the entire state-reward trajectory in hindsight. Performance of the novel OS-GPTD is evaluated on two benchmark problems.
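To make the idea concrete, the sketch below (illustrative names and hyperparameters, not the authors' implementation) approximates an RBF GP prior on the value function with D random Fourier features, so V(s) ≈ θᵀφ(s). Under the standard GPTD observation model r_t = V(s_t) − γV(s_{t+1}) + noise, estimation reduces to Bayesian linear regression on the feature differences φ(s_t) − γφ(s_{t+1}), which admits a recursive per-transition update:

```python
import numpy as np

# Minimal sketch of online GP-TD with random Fourier features; all names
# (OSGPTDSketch, hyperparameter defaults) are hypothetical, not from the paper.
class OSGPTDSketch:
    def __init__(self, state_dim, n_features=100, lengthscale=1.0,
                 gamma=0.99, noise_var=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # Random Fourier features approximating the RBF kernel
        # k(s, s') = exp(-||s - s'||^2 / (2 * lengthscale^2)).
        self.W = rng.normal(scale=1.0 / lengthscale,
                            size=(n_features, state_dim))
        self.b = rng.uniform(0.0, 2 * np.pi, size=n_features)
        self.D = n_features
        self.gamma = gamma
        self.noise_var = noise_var
        # Gaussian posterior over feature weights: mean theta, covariance Sigma.
        self.theta = np.zeros(n_features)
        self.Sigma = np.eye(n_features)

    def features(self, s):
        return np.sqrt(2.0 / self.D) * np.cos(self.W @ s + self.b)

    def value(self, s):
        # Posterior-mean value estimate V(s) = theta^T phi(s).
        return self.theta @ self.features(s)

    def update(self, s, r, s_next):
        # One transition (s, r, s'): regress r on h = phi(s) - gamma * phi(s').
        h = self.features(s) - self.gamma * self.features(s_next)
        Sh = self.Sigma @ h
        denom = self.noise_var + h @ Sh           # predictive variance of r
        err = r - h @ self.theta                  # Bellman (TD) residual
        self.theta += Sh * (err / denom)          # rank-one mean update
        self.Sigma -= np.outer(Sh, Sh) / denom    # rank-one covariance update
```

Each update costs O(D²) operations regardless of how many transitions have been processed, which is what makes the random-feature approximation online and scalable, in contrast to exact GP-based TD whose per-step complexity grows with the trajectory length.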
Original language: English (US)
Number of pages: 5
Journal: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
State: Published - 2021
Event: 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021 - Virtual, Toronto, Canada
Duration: Jun 6, 2021 → Jun 11, 2021
Bibliographical note (funding information): This work was supported in part by NSF grants 1711471 and 1901134.
Keywords:
- Gaussian process
- Temporal-difference learning
- Worst-case performance analysis