Exp3.P-Based Autonomous Decision Algorithm Against Nonstationary Opponents With Partially Known Policies

Research output: Contribution to journal › Article › peer-review

Abstract

This article considers multiagent games in which the opponents can change their policies and whose policy sets are partially known. Our goal is to generate an effective policy such that our agent obtains a higher reward while guaranteeing bounded regret. For such games against nonstationary opponents with partially known policies, an Exp3.P-based autonomous decision (EAD) algorithm is proposed, which consists of three steps. First, we learn an embedding of the opponent’s policy via a conditional encoder–decoder and employ conditional reinforcement learning to generate the targeted policy. Second, we estimate the opponent’s policy through online Bayesian belief updates. Finally, we select between the adversarial and the targeted policy via a multiarmed bandit algorithm. Theoretical analysis is performed for the EAD algorithm: we give a lower bound on the expected reward when using the targeted policy and prove that the EAD algorithm has bounded regret. Experimental results on Kuhn poker and Grid-world Predator–Prey show the effectiveness of the proposed EAD algorithm.
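The second and third steps described in the abstract (online Bayesian belief updates over a partially known opponent policy set, and Exp3.P-style bandit selection among candidate response policies) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names, the Exp3.P parameter choices, and the toy usage loop below are assumptions made purely for illustration.

```python
# Minimal sketch (assumptions, not the paper's code) of two EAD ingredients:
# (i) Bayesian belief updates over a finite set of known opponent policies,
# (ii) Exp3.P (Auer et al., 2002) selection between candidate response policies.
import numpy as np


class BayesianOpponentBelief:
    """Belief over a finite set of (partially) known opponent policies."""

    def __init__(self, opponent_policies):
        # opponent_policies: list of callables pi(state, action) -> probability
        self.policies = opponent_policies
        self.belief = np.full(len(opponent_policies), 1.0 / len(opponent_policies))

    def update(self, state, opponent_action):
        # Bayes rule: b_i <- b_i * pi_i(a | s), then renormalize.
        likelihood = np.array([pi(state, opponent_action) for pi in self.policies])
        posterior = self.belief * likelihood
        total = posterior.sum()
        if total > 0:
            self.belief = posterior / total
        return self.belief


class Exp3P:
    """Exp3.P bandit over K candidate response policies (e.g., targeted vs. adversarial)."""

    def __init__(self, num_arms, horizon, delta=0.05):
        self.K, self.T = num_arms, horizon
        self.alpha = 2.0 * np.sqrt(np.log(num_arms * horizon / delta))
        self.gamma = min(0.6, 2.0 * np.sqrt(0.6 * num_arms * np.log(num_arms) / horizon))
        self.weights = np.full(
            num_arms,
            np.exp(self.alpha * self.gamma / 3.0 * np.sqrt(horizon / num_arms)),
        )

    def probabilities(self):
        w = self.weights / self.weights.sum()
        return (1.0 - self.gamma) * w + self.gamma / self.K

    def select(self, rng):
        p = self.probabilities()
        return rng.choice(self.K, p=p), p

    def update(self, arm, reward, probs):
        # Importance-weighted reward estimate plus a confidence bonus, as in Exp3.P.
        reward = float(np.clip(reward, 0.0, 1.0))
        x_hat = np.zeros(self.K)
        x_hat[arm] = reward / probs[arm]
        bonus = self.alpha / (probs * np.sqrt(self.K * self.T))
        self.weights *= np.exp(self.gamma / (3.0 * self.K) * (x_hat + bonus))


# Toy usage: arm 0 plays the role of the conditionally generated targeted policy,
# arm 1 an adversarial (safe) policy; rewards here are synthetic placeholders.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bandit = Exp3P(num_arms=2, horizon=1000)
    for t in range(1000):
        arm, probs = bandit.select(rng)
        reward = rng.uniform(0.6, 1.0) if arm == 0 else rng.uniform(0.0, 0.4)
        bandit.update(arm, reward, probs)
    print("selection probabilities:", bandit.probabilities())
```

In this sketch the bandit concentrates on whichever arm yields higher empirical reward while the exploration term keeps the regret bounded; the belief class would drive which targeted policy is generated when the opponent switches.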

Original language: English (US)
Pages (from-to): 975-988
Number of pages: 14
Journal: IEEE Transactions on Games
Volume: 17
Issue number: 4
DOIs
State: Published - 2025
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2018 IEEE.

Keywords

  • Exp3.P-based autonomous decision (EAD)
  • multiarmed bandits
  • nonstationary opponents with partially known policies
  • opponent modeling
