The Epoch-Greedy algorithm for contextual multi-armed bandits

TLDR

The complexity term S in a standard supervised learning sample‑complexity bound underlies the analysis. The authors introduce Epoch‑Greedy, an algorithm for contextual multi‑armed bandits with side information. Epoch‑Greedy alternates exploration and exploitation epochs, leveraging contextual side information to guide action selection. Epoch‑Greedy achieves horizon‑independent performance, with regret bounded by O(T^{2/3}S^{1/3}) or better, driven by the hypothesis‑class sample‑complexity bound.

Abstract

We present Epoch-Greedy, an algorithm for contextual multi-armed bandits (also known as bandits with side information). Epoch-Greedy has the following properties: 1. No knowledge of a time horizon T is necessary. 2. The regret incurred by Epoch-Greedy is controlled by a sample complexity bound for a hypothesis class. 3. The regret scales as O(T2/3S1/3) or better (sometimes, much better). Here S is the complexity term in a sample complexity bound for standard supervised learning.

References

Page 1

	Year	Citations

Page 1