Abstract
A multi-armed bandit (MAB) problem is described as follows. At each time-step, a decision-maker selects one arm from a finite set. A reward is earned from this arm and the state of that arm evolves stochastically. The goal is to determine an arm-pulling policy that maximizes expected total discounted reward over an infinite horizon. We study MAB problems where the rewards are multivariate Gaussian, to account for data-driven estimation errors. We employ a percentile optimization approach, wherein the goal is to find an arm-pulling policy that maximizes the sum of percentiles of expected total discounted rewards earned from individual arms. The idea is motivated by recent work on percentile optimization in Markov decision processes. We demonstrate that, when applied to MABs, this yields an intractable second-order cone program (SOCP) whose size is exponential in the number of arms. We use Lagrangian relaxation to break the resulting curse of dimensionality. Specifically, we show that the relaxed problem can be reformulated as an SOCP whose size is linear in the number of arms. We propose three approaches to recover feasible arm-pulling decisions at run-time from an off-line optimal solution of this SOCP. Our numerical experiments suggest that one of these three methods is more effective than the other two.
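To make the percentile idea concrete, the following is a minimal illustrative sketch, not the paper's formulation: it assumes hypothetical per-arm Gaussian reward parameters (`mus`, `Ls`) and placeholder linear constraints, and uses `cvxpy` to show how maximizing a sum of Gaussian percentiles becomes a second-order cone program through a norm term in the objective.

```python
# Illustrative sketch (not the paper's exact model): a percentile objective for
# Gaussian rewards reformulated as a second-order cone program.
# For w^T r with r ~ N(mu, Sigma), the delta-percentile equals
#   mu^T w - z_{1-delta} * || Sigma^{1/2} w ||_2,
# so maximizing a sum of such percentiles over linear constraints is an SOCP.
import numpy as np
import cvxpy as cp
from scipy.stats import norm

rng = np.random.default_rng(0)
n_arms, dim = 3, 4           # hypothetical problem size
delta = 0.1                  # percentile level (e.g., the 10th percentile)
z = norm.ppf(1 - delta)      # Gaussian quantile acting as a safety factor

# Hypothetical per-arm reward means and covariance factors (Sigma_i = L_i L_i^T).
mus = [rng.normal(size=dim) for _ in range(n_arms)]
Ls = [np.tril(rng.normal(size=(dim, dim))) for _ in range(n_arms)]

# One decision vector per arm (stand-ins for, e.g., discounted occupation
# measures); the simplex constraints below are placeholders, not the paper's.
ws = [cp.Variable(dim, nonneg=True) for _ in range(n_arms)]
constraints = [cp.sum(w) == 1 for w in ws]

# Sum of per-arm percentiles: the negative norm is concave, so this is a
# convex SOCP when maximized over a polyhedral feasible set.
objective = cp.Maximize(
    sum(mus[i] @ ws[i] - z * cp.norm(Ls[i].T @ ws[i], 2) for i in range(n_arms))
)
cp.Problem(objective, constraints).solve()
print([w.value for w in ws])
```

In this toy form the per-arm percentile terms are separable, which is the structural feature that a Lagrangian relaxation of the coupling constraints can exploit to keep the problem size linear in the number of arms.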
| Field | Value |
| --- | --- |
| Original language | English (US) |
| Pages (from-to) | 837-862 |
| Number of pages | 26 |
| Journal | Annals of Operations Research |
| Volume | 340 |
| Issue number | 2-3 |
| DOIs | |
| State | Published - Sep 2024 |
| Externally published | Yes |
Bibliographical note
Publisher Copyright: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
Keywords
- Chance-constrained programming
- Dynamic programming
- Lagrangian relaxation