Publication | Closed Access
Least-Squares Temporal Difference Learning
Citations: 261
References: 12
Year: 1999
Venue: Unknown
TD(λ) is a widely used incremental policy evaluation method, but it makes inefficient use of data and requires the user to tune a step-size schedule by hand. The Least-Squares TD (LSTD) algorithm removes step-size parameters entirely and improves data efficiency for linear value functions. This work extends LSTD in three ways: a simpler derivation of the algorithm, a generalization from λ = 0 to arbitrary λ (with λ = 1 shown to be a practical formulation of supervised linear regression), and a novel model-based reinforcement learning interpretation.
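As a rough illustration of why LSTD needs no step size, the sketch below accumulates the sufficient statistics of LSTD(0) and recovers the weights with a single linear solve. The transition format, the feature map `phi`, and the small ridge term `eps` are illustrative assumptions, not details from the paper:

```python
import numpy as np

def lstd0(transitions, phi, gamma=0.95, eps=1e-6):
    """Least-Squares TD(0) sketch for linear value functions V(s) ~= theta @ phi(s).

    transitions : list of (s, r, s_next) tuples (assumed format)
    phi         : feature map, phi(s) -> np.ndarray of shape (k,)
    """
    k = phi(transitions[0][0]).shape[0]
    A = eps * np.eye(k)            # small ridge term keeps A invertible (an assumption)
    b = np.zeros(k)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)   # accumulate sufficient statistics
        b += f * r
    return np.linalg.solve(A, b)   # one linear solve; no step-size schedule
```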
TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it makes inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996) eliminates all stepsize parameters and improves data efficiency. This paper extends Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting algorithm is shown to be a practical formulation of supervised linear regression. Third, it presents a novel, intuitive interpretation of LSTD as a model-based reinforcement learning technique.
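A minimal sketch of the λ-generalization described in the abstract, using eligibility traces: setting `lam=0` recovers LSTD(0), while `lam=1` corresponds to regressing on full Monte Carlo returns. The episode format, feature map `phi`, terminal handling via a `done` flag, and ridge term `eps` are assumptions for illustration:

```python
import numpy as np

def lstd_lambda(episodes, phi, gamma=0.95, lam=0.5, eps=1e-6):
    """LSTD(lambda) sketch with eligibility traces.

    episodes : list of trajectories, each a list of (s, r, s_next, done) tuples (assumed format)
    phi      : feature map, phi(s) -> np.ndarray of shape (k,)
    """
    k = phi(episodes[0][0][0]).shape[0]
    A = eps * np.eye(k)                     # ridge term for invertibility (an assumption)
    b = np.zeros(k)
    for episode in episodes:
        z = np.zeros(k)                     # eligibility trace, reset each episode
        for s, r, s_next, done in episode:
            f = phi(s)
            f_next = np.zeros(k) if done else phi(s_next)  # terminal state has zero value
            z = gamma * lam * z + f         # decay trace, add current features
            A += np.outer(z, f - gamma * f_next)
            b += z * r
    return np.linalg.solve(A, b)
```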