Abstract
Motivated by applications in personalized web services and clinical research, we consider a multi-armed bandit problem in which the mean reward of each arm depends on covariates. A multi-stage randomized allocation algorithm with arm elimination is proposed, combining flexibility in reward function modeling with a theoretical guarantee of the minimax rate for cumulative regret. When the smoothness parameter of the reward function is unknown, the algorithm is equipped with a histogram-based smoothness parameter selector using Lepski’s method, and is shown to maintain the minimax regret rate up to a logarithmic factor under a “self-similarity” condition.
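To make the setting concrete, the following is a minimal sketch of a covariate bandit that combines histogram bins, uniform randomization over the remaining arms, and confidence-bound arm elimination. It is a generic illustration, not the algorithm analyzed in the paper: it has no multi-stage schedule and no Lepski-type smoothness selection, and the class name `BinnedEliminationBandit`, the `simulate` helper, the scalar covariate in [0, 1], the confidence radius, and the toy reward model are all assumptions made for this example.

```python
import math
import random


class BinnedEliminationBandit:
    """Hypothetical sketch: histogram bins over a scalar covariate in [0, 1],
    with uniform randomization over the arms still active in each bin."""

    def __init__(self, n_arms, n_bins, horizon, seed=0):
        self.n_bins = n_bins
        self.horizon = horizon
        self.rng = random.Random(seed)
        # Per-bin pull counts, reward sums, and sets of surviving arms.
        self.counts = [[0] * n_arms for _ in range(n_bins)]
        self.sums = [[0.0] * n_arms for _ in range(n_bins)]
        self.active = [set(range(n_arms)) for _ in range(n_bins)]

    def _bin(self, x):
        # Map x in [0, 1] to a bin index; x == 1.0 falls in the last bin.
        return min(int(x * self.n_bins), self.n_bins - 1)

    def select(self, x):
        # Randomized allocation: uniform draw among the bin's active arms.
        return self.rng.choice(sorted(self.active[self._bin(x)]))

    def _radius(self, n):
        # Hoeffding-style radius for rewards in [0, 1] (an assumed choice).
        return math.sqrt(2.0 * math.log(self.horizon) / n)

    def update(self, x, arm, reward):
        b = self._bin(x)
        self.counts[b][arm] += 1
        self.sums[b][arm] += reward
        act = self.active[b]
        if any(self.counts[b][a] == 0 for a in act):
            return  # wait until every active arm has been pulled once
        mean = {a: self.sums[b][a] / self.counts[b][a] for a in act}
        rad = {a: self._radius(self.counts[b][a]) for a in act}
        best_lower = max(mean[a] - rad[a] for a in act)
        # Drop arms whose upper bound is below the best lower bound.
        # The arm attaining best_lower always survives this filter.
        self.active[b] = {a for a in act if mean[a] + rad[a] >= best_lower}


def simulate(horizon=20000, n_bins=4, seed=0):
    """Toy environment (an assumption): two arms with Bernoulli rewards
    whose means are x and 1 - x for a uniform covariate x."""
    env = random.Random(seed + 1)
    bandit = BinnedEliminationBandit(2, n_bins, horizon, seed=seed)
    for _ in range(horizon):
        x = env.random()
        arm = bandit.select(x)
        p = x if arm == 0 else 1.0 - x
        reward = 1.0 if env.random() < p else 0.0
        bandit.update(x, arm, reward)
    return bandit
```

Calling `simulate()` returns the bandit, whose `active` attribute lists the surviving arms per bin. In this toy model, arm 1 should survive for small covariate values and arm 0 for large ones. The fixed bin count stands in for the smoothness-dependent bin width that a data-driven selector would choose.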
Original language | English (US) |
---|---|
Pages (from-to) | 242-270 |
Number of pages | 29 |
Journal | Electronic Journal of Statistics |
Volume | 10 |
Issue number | 1 |
DOIs | |
State | Published - 2016 |
Bibliographical note
Publisher Copyright: © 2016, Institute of Mathematical Statistics. All rights reserved.
Keywords
- Adaptive estimation
- Contextual bandit problem
- MABC
- Nonparametric bandit
- Regret bound