Apprenticeship learning via inverse reinforcement learning

TLDR

This paper addresses learning in a Markov decision process when no explicit reward function is given and the learner instead observes an expert demonstrating the task. The setting is relevant to tasks like driving, where it is difficult to specify exactly how different desiderata should be traded off. The authors assume the expert is maximizing a reward function expressible as a linear combination of known features, and they propose an algorithm that uses inverse reinforcement learning to try to recover that unknown reward. The algorithm terminates in a small number of iterations and, even though it may never recover the expert's true reward function, outputs a policy whose performance, measured against the expert's unknown reward, is close to that of the expert.

Abstract

We consider learning in a Markov decision process where we are not explicitly given a reward function, but where instead we can observe an expert demonstrating the task that we want to learn to perform. This setting is useful in applications (such as the task of driving) where it may be difficult to write down an explicit reward function specifying exactly how different desiderata should be traded off. We think of the expert as trying to maximize a reward function that is expressible as a linear combination of known features, and give an algorithm for learning the task demonstrated by the expert. Our algorithm is based on using "inverse reinforcement learning" to try to recover the unknown reward function. We show that our algorithm terminates in a small number of iterations, and that even though we may never recover the expert's reward function, the policy output by the algorithm will attain performance close to that of the expert, where here performance is measured with respect to the expert's unknown reward function.
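To make the last claim concrete, here is a minimal sketch of why a linear reward lets performance be compared without knowing the reward weights. It assumes the reward has the form R(s) = w · phi(s), that performance is the discounted sum of rewards, and that the weight vector satisfies ||w||_2 <= 1; the abstract states none of these details, so treat the discounting, the norm bound, and all names and toy data below as illustrative assumptions. Under them, the value of a policy equals w · mu, where mu is its discounted feature expectation, and the Cauchy-Schwarz inequality bounds the value gap between expert and learner by the distance between their feature expectations.

```python
import math
from typing import List, Sequence

Trajectory = Sequence[Sequence[float]]  # one feature vector phi(s_t) per time step


def feature_expectations(trajectories: Sequence[Trajectory], gamma: float) -> List[float]:
    """Empirical discounted feature expectations: mean over trajectories of sum_t gamma^t * phi(s_t)."""
    k = len(trajectories[0][0])
    mu = [0.0] * k
    for traj in trajectories:
        for t, phi in enumerate(traj):
            for i in range(k):
                mu[i] += (gamma ** t) * phi[i]
    return [m / len(trajectories) for m in mu]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy data: two features, one short trajectory per agent.
expert_trajs = [[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]]
learner_trajs = [[[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]]
gamma = 0.9

# Hypothetical "true" weights, unknown to the learner; chosen so ||w||_2 = 1.
w_true = [0.6, -0.8]

mu_expert = feature_expectations(expert_trajs, gamma)
mu_learner = feature_expectations(learner_trajs, gamma)

# Value under the linear reward is w . mu, so the gap is |w . (mu_E - mu_pi)|.
value_gap = abs(dot(w_true, mu_expert) - dot(w_true, mu_learner))
feature_gap = l2_distance(mu_expert, mu_learner)

print(f"value gap   = {value_gap:.4f}")
print(f"feature gap = {feature_gap:.4f}")
```

The value gap never exceeds the feature gap for any weight vector with norm at most 1, so a learner that matches the expert's feature expectations closely performs nearly as well under the expert's reward without ever identifying w.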
