Deep Contextual Bandits with Multivariate Outcomes: Empirical Copula Normalization, Temporal Feature Learning, and Doubly Robust Policy Evaluation

Research output: Contribution to journal, Article, peer-reviewed

Abstract

We develop and evaluate a deep contextual bandit framework for multivariate off-policy evaluation within a controlled simulation-based validation setting. Using real covariate distributions from the Adult, Boston Housing, and Wine Quality datasets, we construct synthetic treatment assignments and multivariate potential outcomes to enable rigorous benchmarking under known data-generating processes. We compare CNN-LSTM, LSTM, and feed-forward neural network (FNN) architectures as nonlinear action-value estimators. To examine representation learning under structured dependence, an (Formula presented.) feature augmentation scheme is employed, while multivariate outcomes are standardized using empirical copula transformations to preserve cross-dimensional dependence. Policy values are estimated using Stabilized Importance Sampling (SIPS) and doubly robust (DR) estimators with bootstrap inference. Although the decision problem is strictly one-step, empirical results indicate that CNN-LSTM architectures provide competitive action-value calibration under temporal augmentation. Across all datasets, the DR estimator demonstrates substantially lower variance and greater stability than SIPS, consistent with its theoretical variance-reduction properties. Diagnostic analyses, including propensity overlap assessment, cumulative oracle regret (with oracle values known by construction), calibration evaluation, and sensitivity analysis, support the reliability of the proposed evaluation framework. Overall, the results demonstrate that combining copula-normalized multivariate outcomes with doubly robust off-policy evaluation yields a statistically principled and variance-efficient approach for offline policy learning in high-dimensional simulated environments.
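
For readers unfamiliar with the estimators named in the abstract, the following is a minimal sketch, not the article's implementation. It assumes that the empirical copula transformation means column-wise rank normalization to pseudo-observations, that SIPS denotes importance sampling with self-normalized weights, and that a scalar reward has already been derived from the multivariate outcome. All function and argument names (`empirical_copula_transform`, `sips_value`, `dr_value`, `q_hat`, and so on) are hypothetical.

```python
import numpy as np


def empirical_copula_transform(Y):
    """Map each outcome column to pseudo-observations rank / (n + 1) in (0, 1).

    Assumption: this rank-based form stands in for the article's
    empirical copula normalization. Ties are broken by order of appearance.
    """
    Y = np.asarray(Y, dtype=float)
    n = Y.shape[0]
    # argsort of argsort yields 0-based ranks within each column
    ranks = np.argsort(np.argsort(Y, axis=0), axis=0) + 1
    return ranks / (n + 1.0)


def _importance_weights(actions, behavior_probs, target_probs):
    """Ratio pi_target(a_i | x_i) / pi_behavior(a_i | x_i) for logged actions."""
    idx = np.arange(len(actions))
    return target_probs[idx, actions] / behavior_probs[idx, actions]


def sips_value(rewards, actions, behavior_probs, target_probs):
    """Importance sampling with self-normalized (stabilized) weights."""
    w = _importance_weights(actions, behavior_probs, target_probs)
    return np.sum(w * rewards) / np.sum(w)


def dr_value(rewards, actions, behavior_probs, target_probs, q_hat):
    """Doubly robust estimate: model-based term plus weighted residual.

    q_hat has shape (n, n_actions) and holds action-value predictions,
    for example from a fitted neural network (hypothetical here).
    """
    idx = np.arange(len(actions))
    w = _importance_weights(actions, behavior_probs, target_probs)
    # Expected predicted reward under the target policy
    direct = np.sum(target_probs * q_hat, axis=1)
    # Correction using the observed reward for the logged action
    correction = w * (rewards - q_hat[idx, actions])
    return np.mean(direct + correction)
```

When the target policy equals the behavior policy, every weight is 1 and `sips_value` reduces to the sample mean of the rewards, which is a convenient sanity check. The residual term in `dr_value` vanishes when `q_hat` predicts the logged rewards exactly, which is the intuition behind the variance reduction the abstract reports.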

Original language: English (US)
Article number: 846
Journal: Mathematics
Volume: 14
Issue number: 5
State: Published - Mar 2026

Bibliographical note

Publisher Copyright:
© 2026 by the author.

Keywords

  • CNN-LSTM
  • contextual bandits
  • doubly robust estimator
  • off-policy evaluation (OPE)
