Abstract
We introduce a family of s-rectangular robust Markov decision processes (s-RMDPs) indexed by ρ ∈ [1, ∞]. In each state, the ambiguity set of transition probability mass functions (pmfs) across actions equals a sublevel set of the ℓ_ρ-norm of a vector of distances from reference pmfs. Setting ρ = ∞ recovers (s, a)-RMDPs. For any s-RMDP from this family, there is an (s, a)-RMDP whose robust optimal value is at least as good, and vice versa. This occurs because s- and (s, a)-RMDPs can employ different ambiguity set radii, casting doubt on the belief that (s, a)-RMDPs are more conservative than s-RMDPs. If the distance is lower semicontinuous and convex, then, for any s-RMDP, there exists an (s, a)-RMDP with an identical robust optimal value. We also study data-driven s-RMDPs, where the reference pmf is constructed from state transition samples. If the distance satisfies a Pinsker-type inequality, the robust optimal and out-of-sample values both converge to the true optimal value as the sample size grows. We derive rates of convergence and sample complexity when the distance satisfies a concentration inequality. Under this concentration inequality, the robust optimal value provides a probabilistic lower bound on the out-of-sample value. An artifact of the analyses behind these guarantees is the surprising conclusion that (s, a)-RMDPs might be the least conservative among all s-RMDPs within our family. The asymptotic and finite-sample properties also extend to a class of nonrectangular RMDPs.
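For concreteness, a minimal sketch of an s-rectangular ambiguity set of the kind described above, written directly from the sentence in the abstract; the symbols d (distance), p̂_{s,a} (reference pmfs), and θ_s (state-wise radius) are illustrative notation assumed here, not taken from the paper.

```latex
% Illustrative s-rectangular ambiguity set (notation assumed, not from the paper):
% collect one transition pmf per action and bound the \ell_\rho-norm of the vector of
% distances from the reference pmfs \hat{p}_{s,a} by a state-wise radius \theta_s.
\mathcal{P}_s \;=\; \Bigl\{ (p_{s,a})_{a \in \mathcal{A}} \;:\;
  \bigl\| \bigl( d(p_{s,a}, \hat{p}_{s,a}) \bigr)_{a \in \mathcal{A}} \bigr\|_{\rho} \le \theta_s \Bigr\},
  \qquad \rho \in [1, \infty].
% Taking \rho = \infty decouples the constraint across actions,
% i.e., d(p_{s,a}, \hat{p}_{s,a}) \le \theta_s for every a,
% which recovers an (s, a)-rectangular ambiguity set.
```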
| Original language | English (US) |
| --- | --- |
| Pages (from-to) | 1540-1568 |
| Number of pages | 29 |
| Journal | SIAM Journal on Optimization |
| Volume | 34 |
| Issue number | 2 |
| DOIs | |
| State | Published - Jun 2024 |
| Externally published | Yes |
Bibliographical note
Publisher Copyright: © 2024 Society for Industrial and Applied Mathematics.
Keywords
- distributionally robust optimization
- dynamic programming
- probabilistic performance guarantees
- sample complexity
- value convergence