Publication | Open Access
An analysis of temporal-difference learning with function approximation
1.2K
Citations
19
References
1997
Year
Temporal-difference Learning AlgorithmEngineeringMachine LearningData ScienceStochastic OptimizationSequential LearningOnline UpdatingSequential Decision MakingProbability TheoryComputer ScienceRobot LearningFunction ApproximationLearning ControlApproximation TheoryMarkov ChainMarkov Decision Process
The paper investigates temporal‑difference learning for approximating the cost‑to‑go function in infinite‑horizon discounted Markov chains. It analyzes an online linear TD algorithm that updates a function approximator along a single endless trajectory of an irreducible aperiodic Markov chain, using a novel analytical framework to elucidate TD dynamics. The authors prove almost‑sure convergence, characterize the limiting value, bound the approximation error, and show that divergence can arise when updates are not trajectory‑based or when nonlinear approximators are used, thereby reconciling conflicting results in the literature.
We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator online during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence (with probability one), a characterization of the limit of convergence, and a bound on the resulting approximation error. Furthermore, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporal-difference learning. In addition to proving new and stronger positive results than those previously available, we identify the significance of online updating and potential hazards associated with the use of nonlinear function approximators. First, we prove that divergence may occur when updates are not based on trajectories of the Markov chain. This fact reconciles positive and negative results that have been discussed in the literature, regarding the soundness of temporal-difference learning. Second, we present an example illustrating the possibility of divergence when temporal difference learning is used in the presence of a nonlinear function approximator.
| Year | Citations | |
|---|---|---|
Page 1
Page 1