Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

Abstract

We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that, given n arms, it suffices to pull the arms a total of O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability at least 1-δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures for reinforcement learning algorithms. We describe a framework based on learning confidence intervals around the value function or the Q-function and eliminating actions that are, with high probability, not optimal. We provide both model-based and model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.
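To make the idea of confidence-interval-based action elimination concrete, the following is a minimal sketch for the bandit setting. It is not taken from the paper: it assumes rewards in [0, 1], uses a Hoeffding radius with a union bound over arms and rounds, and the names `successive_elimination` and `pull` are hypothetical. This simple scheme carries an extra logarithmic factor in its sample complexity, so it illustrates the elimination principle rather than the O((n/ε²) log(1/δ)) bound stated above.

```python
import math
import random


def successive_elimination(pull, n_arms, epsilon, delta):
    """Return an arm that is epsilon-optimal with probability >= 1 - delta.

    Illustrative sketch only. Assumes pull(a) returns a reward in [0, 1].
    """
    active = list(range(n_arms))
    sums = [0.0] * n_arms  # cumulative reward per arm
    t = 0                  # number of pulls of each still-active arm
    while True:
        t += 1
        for a in active:
            sums[a] += pull(a)
        # Hoeffding radius; summing the failure probability over all arms
        # and all rounds t >= 1 stays below delta.
        radius = math.sqrt(math.log(4.0 * n_arms * t * t / delta) / (2.0 * t))
        best_mean = max(sums[a] / t for a in active)
        # Eliminate arms whose upper confidence bound falls below the
        # lower confidence bound of the empirically best arm.
        active = [a for a in active if sums[a] / t + radius >= best_mean - radius]
        # Stopping condition: one arm left, or intervals narrow enough that
        # the empirical best is epsilon-optimal on the high-probability event.
        if len(active) == 1 or 2.0 * radius <= epsilon:
            return max(active, key=lambda a: sums[a] / t)


# Usage with hypothetical Bernoulli arms.
rng = random.Random(0)
means = [0.2, 0.3, 0.5, 0.9, 0.4]
best = successive_elimination(
    lambda a: 1.0 if rng.random() < means[a] else 0.0,
    n_arms=len(means), epsilon=0.1, delta=0.05,
)
```

The stopping test mirrors the role of the stopping conditions mentioned in the abstract: sampling ends as soon as the confidence intervals certify that the returned arm is approximately optimal.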

References

Reference	Year	Citations
Chris Watkins. Learning from delayed rewards. OpenGrey (Institut de l'Information Scientifique et Technique).	1989	5.5K
Andrew G. Barto. Reinforcement Learning. IFAC Proceedings Volumes.	1998	3K
Tze-Leung Lai, Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics.	1985	2.4K
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, et al. The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing.	2002	2.2K
Michael Kearns, Satinder Singh. Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning.	2002	849
Sham M. Kakade, John Langford. Approximately Optimal Approximate Reinforcement Learning.	2002	590
Dana Angluin, Leslie G. Valiant. Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences.	1979	586
Vivek S. Borkar, Sean Meyn. The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning. SIAM Journal on Control and Optimization.	2000	523
Shie Mannor, John N. Tsitsiklis. The Sample Complexity of Exploration in the Multi-Armed Bandit Problem. Journal of Machine Learning Research.	2004	328
Michael Kearns, Satinder Singh. Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms.	1998	198
