Safe and Efficient Off-Policy Reinforcement Learning

TLDR

The authors introduce Retrace(λ), an off-policy, return-based reinforcement learning algorithm that has low variance, safely uses samples from any behaviour policy, and makes efficient use of samples from near on-policy behaviour policies. They express existing algorithms in a common form, analyze the contraction properties of the associated operator for both policy evaluation and control, derive online sample-based algorithms, and illustrate Retrace(λ)'s benefits on a standard suite of Atari 2600 games. The authors believe Retrace(λ) is the first return-based off-policy control algorithm to converge almost surely to Q* without the GLIE (Greedy in the Limit with Infinite Exploration) assumption. As a corollary, they prove the convergence of Watkins' Q(λ), an open problem since 1989.

Abstract

In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace($λ$), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) it is efficient as it makes the best use of samples collected from near on-policy behaviour policies. We analyze the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based off-policy control algorithm converging a.s. to $Q^*$ without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q($λ$), which was an open problem since 1989. We illustrate the benefits of Retrace($λ$) on a standard suite of Atari 2600 games.
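
The three properties can be made concrete with a minimal sketch. The abstract does not spell out the update, so the sketch assumes the form given in the full paper: the per-step trace coefficient is $c_s = λ \min(1, π(a_s|x_s)/μ(a_s|x_s))$, and the correction to $Q(x_0, a_0)$ is $\sum_t γ^t (c_1 \cdots c_t) δ_t$. The function names and argument layout are hypothetical, and the TD errors $δ_t$ are assumed to be precomputed by the caller.

```python
def retrace_coefficients(pi_probs, mu_probs, lam=1.0):
    """Truncated importance weights c_s = lam * min(1, pi/mu), one per step.

    pi_probs[s] and mu_probs[s] are the target-policy and behaviour-policy
    probabilities of the action actually taken at step s (mu must be > 0).
    """
    return [lam * min(1.0, p / m) for p, m in zip(pi_probs, mu_probs)]


def retrace_correction(td_errors, pi_probs, mu_probs, gamma=0.99, lam=1.0):
    """Return sum_t gamma^t * (c_1 * ... * c_t) * delta_t for one trajectory.

    td_errors[t] is assumed to be
    delta_t = r_t + gamma * E_pi[Q(x_{t+1}, .)] - Q(x_t, a_t).
    """
    cs = retrace_coefficients(pi_probs, mu_probs, lam)
    total = 0.0
    trace = 1.0  # empty product at t = 0
    for t, delta in enumerate(td_errors):
        if t > 0:
            # Extend the discounted trace by one step; c_0 is never used.
            trace *= gamma * cs[t]
        total += trace * delta
    return total
```

Because each coefficient is capped at 1, the product of traces cannot blow up (low variance). When the behaviour policy is close to the target policy the ratios are near 1, so the trace is barely cut (efficiency). When a sampled action has zero probability under the target policy, the trace drops to 0 and later TD errors are ignored (safety).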
