Information-directed policy sampling for episodic Bayesian Markov decision processes

Research output: Contribution to journal › Article › peer-review

Abstract

We consider finite-stage Markov Decision Processes (MDPs) under incomplete information, where the decision-maker knows only that the true transition probability and reward matrices belong to given, finite sets. The decision-maker interacts with the system over a finite number of episodes. The first episode begins with a probabilistic belief about the true probability and reward matrices, and this belief is updated at the end of each episode using observed events. The goal is to maximize the expected total reward earned over all episodes. In the resulting model-based episodic Bayesian MDP, it suffices to consider only the (known) policies that are each optimal for one of the candidate probability and reward matrices. Nevertheless, the decision-maker should execute policies that provide information about the true probabilities and rewards (exploration), while also exploiting this knowledge to increase rewards. We propose a framework called Information-Directed Policy Sampling (IDPS). In each episode, the decision-maker balances the exploration-exploitation trade-off by executing a randomized policy that minimizes a so-called convex information ratio. We derive a regret bound that is independent of state- and action-space cardinalities when the set of matrices is exogenously determined. Numerical experiments show that IDPS outperforms a state-of-the-art approach called Posterior Sampling.
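The core computational step the abstract describes is choosing, each episode, a randomized policy that minimizes an information ratio over the finite set of candidate-optimal policies. The sketch below is a hedged toy illustration, not the paper's algorithm: the value matrix `V`, the posterior `post`, and the variance-based information-gain proxy are all invented here for illustration, and the paper's convex information ratio may be defined differently.

```python
import numpy as np

# Toy setup (illustrative, not from the paper):
# 3 candidate models, 3 candidate policies; V[m, p] is the expected
# episode reward of policy p if model m is the true one.
V = np.array([[1.0, 0.4, 0.2],
              [0.3, 0.9, 0.5],
              [0.2, 0.5, 0.8]])
post = np.array([0.5, 0.3, 0.2])      # current posterior belief over models

# Expected regret of each policy under the posterior.
best = V.max(axis=1)                  # optimal value in each model
delta = post @ (best[:, None] - V)    # delta[p]: expected regret of policy p

# Crude per-policy information-gain proxy: posterior variance of the
# policy's value across models (a stand-in for a mutual-information term).
mean_v = post @ V
gain = post @ (V - mean_v) ** 2

def info_ratio(q):
    """Squared expected regret divided by expected information gain."""
    return (q @ delta) ** 2 / (q @ gain)

# In information-directed-sampling settings the minimizer of such a ratio
# is supported on at most two policies, so a grid search over mixtures of
# every pair of policies suffices for this small example.
best_q, best_val = None, np.inf
alphas = np.linspace(0.0, 1.0, 1001)
n = len(delta)
for i in range(n):
    for j in range(n):
        for a in alphas:
            q = np.zeros(n)
            q[i] += a
            q[j] += 1.0 - a
            val = info_ratio(q)
            if val < best_val:
                best_q, best_val = q, val

print("sampling distribution over policies:", np.round(best_q, 3))
print("information ratio:", round(best_val, 4))
```

The resulting distribution `best_q` is what the decision-maker would sample a policy from for the next episode; after observing that episode, the posterior `post` is updated and the minimization is repeated.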

Original language: English (US)
Pages (from-to): 905-919
Number of pages: 15
Journal: IISE Transactions
Volume: 57
Issue number: 8
DOIs
State: Published - 2025
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2024 IISE.

Keywords

  • Bayesian learning
  • Model-based reinforcement learning
  • Regret analysis
