Relative Entropy Inverse Reinforcement Learning

TLDR

Imitation learning is hard when the demonstrations cover only a small part of a large state space. IRL generalizes from the demonstrations by assuming the expert acts optimally in an MDP, but most existing methods must compute near-optimal policies for many candidate reward functions, which is impractical in large or continuous state spaces. The authors propose a model-free IRL algorithm that uses stochastic gradient descent to minimize the relative entropy between the empirical distribution of state-action trajectories under a baseline policy and their distribution under the learned policy. They compare it with well-known IRL algorithms that use learned MDP models. Experiments on simulated car racing, gridworld, and ball-in-a-cup tasks show that the approach learns good policies from a small number of demonstrations.

Abstract

We consider the problem of imitation learning where the examples, demonstrated by an expert, cover only a small part of a large state space. Inverse Reinforcement Learning (IRL) provides an efficient tool for generalizing the demonstration, based on the assumption that the expert is optimally acting in a Markov Decision Process (MDP). Most of the past work on IRL requires that a (near-)optimal policy can be computed for different reward functions. However, this requirement can hardly be satisfied in systems with a large, or continuous, state space. In this paper, we propose a model-free IRL algorithm, where the relative entropy between the empirical distribution of the state-action trajectories under a baseline policy and their distribution under the learned policy is minimized by stochastic gradient descent. We compare this new approach to well-known IRL algorithms using learned MDP models. Empirical results on simulated car racing, gridworld and ball-in-a-cup problems show that our approach is able to learn good policies from a small number of demonstrations.
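
As a rough illustration of the model-free idea described in the abstract, the sketch below assumes a reward that is linear in trajectory feature counts and a set of trajectories already sampled from the baseline policy. It reweights those samples with softmax weights proportional to exp(theta · f) and moves theta along the difference between the expert's mean features and the reweighted sample mean. The function name, the array layout, and the full-batch update (used here in place of a stochastic one) are assumptions made for illustration and are not taken from the paper; regularization is omitted.

```python
import numpy as np


def relent_irl_weights(expert_features, sampled_features, lr=0.1, steps=500):
    """Fit reward weights by importance-weighted feature matching (hypothetical helper).

    expert_features:  shape (d,), mean feature counts of the demonstrations.
    sampled_features: shape (n, d), feature counts of n trajectories drawn
                      from the baseline policy (assumed to be given).
    """
    theta = np.zeros(sampled_features.shape[1])
    for _ in range(steps):
        # Softmax weights over the sampled trajectories; subtracting the
        # maximum keeps the exponentials numerically stable.
        logits = sampled_features @ theta
        logits -= logits.max()
        weights = np.exp(logits)
        weights /= weights.sum()
        # Gradient of theta . f_expert - log(sum_i exp(theta . f_i)).
        grad = expert_features - weights @ sampled_features
        theta += lr * grad
    return theta
```

No transition model is needed in this sketch: only feature counts of sampled trajectories enter the update, which is what makes the approach usable when computing optimal policies for each candidate reward is impractical.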
