Concepedia

Publication | Open Access

Online Markov Decision Processes Under Bandit Feedback

Citations: 102
References: 23
Year: 2014

Abstract

We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete, in terms of total reward received, with the best stationary policy in hindsight. Specifically, in each time step the agent observes the current state and the reward associated with the last transition; however, it does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state-of-the-art result for this setting is an algorithm with an expected regret of O(T^(2/3) ln T). In this paper, assuming that stationary policies mix uniformly fast, we show that after T time steps the expected regret of this algorithm (more precisely, of a slightly modified version thereof) is O(T^(1/2) ln T), giving the first rigorously proven, essentially tight regret bound for the problem.
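The interaction protocol described in the abstract — known transitions, an adversarially chosen reward sequence, and bandit feedback revealing only the reward of the action actually taken — can be sketched as follows. The 2-state MDP and the per-state Exp3-style learner below are illustrative assumptions for exposition, not the paper's modified algorithm or its regret-optimal tuning:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-state, 2-action MDP; P[s, a, s'] is known to the agent.
n_states, n_actions, T = 2, 2, 2000
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.5, 0.5], [0.9, 0.1]]])

# An oblivious adversary fixes the entire reward sequence in advance,
# independently of the agent's actions.
rewards = rng.uniform(0.0, 1.0, size=(T, n_states, n_actions))

# Per-state Exp3-style learner (a hedged stand-in for the paper's algorithm).
eta, gamma = 0.05, 0.05       # assumed learning and exploration rates
log_w = np.zeros((n_states, n_actions))
s, total_reward = 0, 0.0
for t in range(T):
    w = np.exp(log_w[s] - log_w[s].max())          # numerically stable weights
    probs = (1 - gamma) * w / w.sum() + gamma / n_actions
    a = rng.choice(n_actions, p=probs)
    r = rewards[t, s, a]      # bandit feedback: only this reward is observed
    total_reward += r
    log_w[s, a] += eta * r / probs[a]              # importance-weighted update
    s = rng.choice(n_states, p=P[s, a])            # known transition dynamics

print(f"average per-step reward: {total_reward / T:.3f}")
```

The importance-weighted update compensates for the fact that unchosen actions yield no feedback; the paper's analysis concerns how such estimates, combined with the uniform mixing assumption on stationary policies, yield O(T^(1/2) ln T) regret against the best stationary policy in hindsight.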

