Publication | Closed Access
The Sample Complexity of Exploration in the Multi-Armed Bandit Problem
328
Citations
6
References
2004
Year
EngineeringGame TheoryComputational ComplexitySample ComplexityStochastic GameCombinatorial OptimizationDecision TheoryStatisticsOnline AlgorithmLower BoundProbability TheoryComputer ScienceGamesExploration V ExploitationContextual BanditMulti-armed Bandit ProblemStochastic OptimizationStatistical InferenceLower Bounds
The PAC formulation of the multi‑armed bandit problem has a known sample complexity of O((n/e²)log(1/δ)) trials to find an e‑optimal arm, and the case of known arm statistics but unknown identities is also considered. We aim to establish matching lower bounds on the expected number of trials for any sampling policy and to derive lower bounds on expected regret in the PAC setting. We generalize the lower bound to depend explicitly on unknown arm statistics, provide a Bayesian analogue, and analyze the maximum‑sample‑path complexity, yielding Θ((n/e²)log(1/δ)) bounds. For the case of known statistics, we prove a Θ((1/e²)(n+log(1/δ))) lower bound on expected trials with a matching policy, establish matching upper and lower bounds on the maximum number of trials, and derive regret lower bounds analogous to Lai and Robbins.
We consider the multi-armed bandit problem under the PAC (probably approximately correct) model. It was shown by Even-Dar et al. (2002) that given n arms, a total of O((n/e2)log(1/δ)) trials suffices in order to find an e-optimal arm with probability at least 1-δ. We establish a matching lower bound on the expected number of trials under any sampling policy. We furthermore generalize the lower bound, and show an explicit dependence on the (unknown) statistics of the arms. We also provide a similar bound within a Bayesian setting. The case where the statistics of the arms are known but the identities of the arms are not, is also discussed. For this case, we provide a lower bound of Θ((1/e2)(n+log(1/δ))) on the expected number of trials, as well as a sampling policy with a matching upper bound. If instead of the expected number of trials, we consider the maximum (over all sample paths) number of trials, we establish a matching upper and lower bound of the form Θ((n/e2)log(1/δ)). Finally, we derive lower bounds on the expected regret, in the spirit of Lai and Robbins.
| Year | Citations | |
|---|---|---|
Page 1
Page 1