Abstract
We develop and evaluate a deep contextual bandit framework for multivariate off-policy evaluation within a controlled simulation-based validation setting. Using real covariate distributions from the Adult, Boston Housing, and Wine Quality datasets, we construct synthetic treatment assignments and multivariate potential outcomes to enable rigorous benchmarking under known data-generating processes. We compare CNN-LSTM, LSTM, and Feed-forward Neural Network (FNN) architectures as nonlinear action-value estimators. To examine representation learning under structured dependence, an (Formula presented.) feature augmentation scheme is employed, while multivariate outcomes are standardized using empirical copula transformations to preserve cross-dimensional dependence. Policy values are estimated using Stabilized Importance Sampling (SIPS) and doubly robust (DR) estimators with bootstrap inference. Although the decision problem is strictly one-step, empirical results indicate that CNN-LSTM architectures provide competitive action-value calibration under temporal augmentation. Across all datasets, the DR estimator demonstrates substantially lower variance and greater stability than SIPS, consistent with its theoretical variance-reduction properties. Diagnostic analyses—including propensity overlap assessment, cumulative oracle regret (with oracle values known by construction), calibration evaluation, and sensitivity analysis—support the reliability of the proposed evaluation framework. Overall, the results demonstrate that combining copula-normalized multivariate outcomes with doubly robust off-policy evaluation yields a statistically principled and variance-efficient approach for offline policy learning in high-dimensional simulated environments.
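To make the comparison between the two estimators concrete, the following is a minimal sketch of a one-step off-policy evaluation on simulated data, not the author's implementation. It assumes that SIPS denotes the self-normalized form of inverse propensity scoring, uses a scalar outcome rather than the multivariate copula-normalized outcomes described above, and replaces the neural action-value estimators with a deliberately biased stand-in `q_hat`. All function names (`sips_value`, `dr_value`, `bootstrap_ci`, `simulate`) and the data-generating process are hypothetical and chosen only for illustration.

```python
import numpy as np


def _weights(actions, target_probs, logging_probs):
    """Importance weights pi(a_i | x_i) / mu(a_i | x_i) for the logged actions."""
    idx = np.arange(len(actions))
    return target_probs[idx, actions] / logging_probs[idx, actions]


def sips_value(rewards, actions, target_probs, logging_probs):
    """Self-normalized IPS estimate (assumed here to be what SIPS denotes)."""
    w = _weights(actions, target_probs, logging_probs)
    return float(np.sum(w * rewards) / np.sum(w))


def dr_value(rewards, actions, target_probs, logging_probs, q_hat):
    """Doubly robust estimate: model-based term plus weighted residual correction."""
    idx = np.arange(len(actions))
    w = _weights(actions, target_probs, logging_probs)
    direct = np.sum(target_probs * q_hat, axis=1)      # E_pi[q_hat(x, a)]
    correction = w * (rewards - q_hat[idx, actions])   # reweighted residuals
    return float(np.mean(direct + correction))


def bootstrap_ci(estimator, arrays, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval obtained by resampling logged rows."""
    rng = np.random.default_rng(seed)
    n = len(arrays[0])
    stats = []
    for _ in range(n_boot):
        b = rng.integers(0, n, size=n)
        stats.append(estimator(*(a[b] for a in arrays)))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


def simulate(n=5000, seed=0):
    """Hypothetical two-action simulation with known (oracle) mean rewards."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    q_true = np.column_stack([0.5 * x, 1.0 - 0.5 * x])  # oracle action values
    p1 = np.clip(1.0 / (1.0 + np.exp(-x)), 0.1, 0.9)    # clipping enforces overlap
    logging_probs = np.column_stack([1.0 - p1, p1])
    actions = (rng.random(n) < p1).astype(int)
    rewards = q_true[np.arange(n), actions] + rng.normal(size=n)
    target_probs = np.tile([0.2, 0.8], (n, 1))          # fixed target policy
    q_hat = q_true + 0.2                                # biased stand-in model
    true_value = float(np.mean(np.sum(target_probs * q_true, axis=1)))
    return rewards, actions, target_probs, logging_probs, q_hat, true_value


if __name__ == "__main__":
    r, a, tp, lp, q_hat, v = simulate()
    print("oracle value:", v)
    print("SIPS:", sips_value(r, a, tp, lp), bootstrap_ci(sips_value, (r, a, tp, lp)))
    print("DR:  ", dr_value(r, a, tp, lp, q_hat),
          bootstrap_ci(dr_value, (r, a, tp, lp, q_hat)))
```

Because the logging propensities are known by construction in this sketch, the DR estimate remains unbiased even though `q_hat` is offset from the oracle values; the outcome model only affects its variance. The width of the two bootstrap intervals gives a rough sense of the stability comparison reported in the abstract, though results from this toy setup should not be read as reproducing the paper's findings.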
| Original language | English (US) |
|---|---|
| Article number | 846 |
| Journal | Mathematics |
| Volume | 14 |
| Issue number | 5 |
| State | Published - Mar 2026 |
Bibliographical note
Publisher Copyright: © 2026 by the author.
Keywords
- CNN-LSTM
- contextual bandits
- doubly robust estimator
- off-policy evaluation (OPE)
Fingerprint
Dive into the research topics of 'Deep Contextual Bandits with Multivariate Outcomes: Empirical Copula Normalization, Temporal Feature Learning, and Doubly Robust Policy Evaluation'. Together they form a unique fingerprint.