Inverse Markov decision processes with unknown transition probabilities

Zahra Ghatrani, Archis Ghate

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Inverse optimization involves recovering parameters of a mathematical model using observed values of decision variables. In Markov Decision Processes (MDPs), it has been applied to estimate rewards that render observed policies optimal. A counterpart is not available for transition probabilities. We study two variants of this problem. First, the decision-maker wonders whether there exist a policy and transition probabilities that attain given target values of expected total discounted rewards over an infinite horizon. We derive necessary and sufficient existence conditions, and formulate a feasibility linear program whose solution yields the requisite policy and transition probabilities. We extend these results to the case where the decision-maker wants to render the target values optimal. In the second variant, the decision-maker wishes to find transition probabilities that make a given policy optimal. The resulting problem is nonconvex bilinear, and we propose tailored versions of two heuristics, the Convex-Concave Procedure and Sequential Linear Programming (SLP). Their performance is compared via numerical experiments against an exact method. Computational experiments on randomly generated MDPs reveal that SLP outperforms the other two both in runtime and objective values. Further insights into SLP’s performance are derived via numerical experiments on inverse inventory control, equipment replacement, and multi-armed bandit problems.
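
To make the second variant concrete, the following minimal Python sketch checks whether a candidate set of transition probabilities makes a given deterministic policy optimal in a discounted MDP. It is not the paper's method (it does not implement the Convex-Concave Procedure, SLP, or the exact method); it only illustrates the standard optimality check that any recovered transition probabilities would have to pass. The function names, array layout, and the small two-state example are assumptions made for illustration.

```python
import numpy as np

def evaluate_policy(P, r, policy, gamma):
    """Value of a deterministic policy.

    P: (S, A, S) array, P[s, a, s2] = probability of moving s -> s2 under a.
    r: (S, A) array of rewards. policy: (S,) array of action indices.
    Solves the linear system (I - gamma * P_pi) v = r_pi.
    """
    S = r.shape[0]
    idx = np.arange(S)
    P_pi = P[idx, policy]   # (S, S) transitions under the policy
    r_pi = r[idx, policy]   # (S,) rewards under the policy
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def makes_policy_optimal(P, r, policy, gamma, tol=1e-9):
    """True if no single-action deviation improves on the policy's value."""
    v = evaluate_policy(P, r, policy, gamma)
    q = r + gamma * (P @ v)  # (S, A) one-step lookahead values
    return bool(np.all(q.max(axis=1) <= v + tol))

# Hypothetical example: action 0 stays in place, action 1 switches state.
P = np.zeros((2, 2, 2))
P[:, 0, :] = np.eye(2)
P[:, 1, :] = np.eye(2)[::-1]
r = np.array([[1.0, 0.0], [0.0, 2.0]])

print(makes_policy_optimal(P, r, np.array([0, 1]), gamma=0.9))
print(makes_policy_optimal(P, r, np.array([1, 0]), gamma=0.9))
```

The check relies on the Bellman optimality condition: a policy is optimal exactly when its value function is not improved by any one-step deviation. An inverse method in the spirit of the abstract would search over P subject to this condition, which is where the bilinear structure arises, since both P and the value function become unknowns.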

Original language: English (US)
Pages (from-to): 588-601
Number of pages: 14
Journal: IISE Transactions
Volume: 55
Issue number: 6
DOIs
State: Published - 2023
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2022 “IISE”.

Keywords

  • bilinear programs
  • linear programs
  • quadratic programs
