Publication | Closed Access
Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems
Citations: 370
References: 16
Year: 2006
Topics: Artificial Intelligence, Engineering, Machine Learning, Game Theory, Confidence Interval, Education, Reinforcement Learning (Educational Psychology), Multi-agent Learning, Lifelong Reinforcement Learning, Stopping Conditions, Reinforcement Learning (Computer Engineering), Stochastic Game, Action Elimination, Decision Theory, Statistical Confidence Intervals, Reinforcement Learning Algorithms, Sequential Decision Making, Computer Science, Probability Theory, Exploration vs. Exploitation, Contextual Bandit, Deep Reinforcement Learning, Reinforcement Learning Problems
We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability of at least 1-δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide model-based and model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.
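The abstract's bandit result rests on eliminating arms whose confidence interval lies below that of the empirically best arm. The sketch below illustrates that idea with a simple successive-elimination loop using Hoeffding-style confidence radii; it is not the paper's exact procedure, and the Bernoulli reward model, sampling schedule, and parameter choices are illustrative assumptions.

```python
# Sketch of confidence-interval-based action elimination for the multi-armed bandit
# (illustrative only; not the paper's exact algorithm). Arms whose empirical mean
# falls clearly below the current leader's are eliminated; the loop stops once the
# confidence radius guarantees the survivors are epsilon-optimal with high probability.
import math
import random


def eliminate_arms(arm_means, epsilon=0.1, delta=0.05, seed=0):
    """Return an (epsilon, delta)-PAC arm index via successive elimination.

    arm_means: true Bernoulli success probabilities, used only to simulate pulls.
    """
    rng = random.Random(seed)
    n = len(arm_means)
    active = list(range(n))
    counts = [0] * n
    sums = [0.0] * n
    t = 0
    while len(active) > 1:
        t += 1
        # Pull every surviving arm once per round.
        for a in active:
            sums[a] += 1.0 if rng.random() < arm_means[a] else 0.0
            counts[a] += 1
        # Hoeffding-style confidence radius, shrinking with the number of pulls t.
        radius = math.sqrt(math.log(4.0 * n * t * t / delta) / (2.0 * t))
        if radius <= epsilon / 2.0:
            break  # all remaining arms are epsilon-optimal with high probability
        estimates = {a: sums[a] / counts[a] for a in active}
        best = max(estimates.values())
        # Eliminate arms whose upper confidence bound is below the leader's lower bound.
        active = [a for a in active if estimates[a] >= best - 2.0 * radius]
    return max(active, key=lambda a: sums[a] / counts[a])


if __name__ == "__main__":
    print(eliminate_arms([0.2, 0.5, 0.55, 0.9], epsilon=0.1, delta=0.05))
```

The same principle drives the paper's reinforcement learning variants: maintain confidence intervals around Q-values and stop exploring actions that are dominated with high probability.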