A theoretical and empirical analysis of Expected Sarsa

TLDR

Expected Sarsa reduces update variance by exploiting knowledge of stochasticity in the behavior policy, allowing higher learning rates and, in deterministic environments, zero-variance updates that enable a learning rate of 1. The paper analyzes Expected Sarsa, a variant of the classic on-policy TD method Sarsa for model-free reinforcement learning, both theoretically and empirically, comparing it to Sarsa and Q-learning across multiple domains. The authors prove that Expected Sarsa converges under the same conditions as Sarsa, hypothesize when it outperforms Sarsa and Q-learning, and confirm these hypotheses experimentally.

Abstract

This paper presents a theoretical and empirical analysis of Expected Sarsa, a variation on Sarsa, the classic on-policy temporal-difference method for model-free reinforcement learning. Expected Sarsa exploits knowledge about stochasticity in the behavior policy to perform updates with lower variance. Doing so allows for higher learning rates and thus faster learning. In deterministic environments, Expected Sarsa's updates have zero variance, enabling a learning rate of 1. We prove that Expected Sarsa converges under the same conditions as Sarsa and formulate specific hypotheses about when Expected Sarsa will outperform Sarsa and Q-learning. Experiments in multiple domains confirm these hypotheses and demonstrate that Expected Sarsa has significant advantages over these more commonly used methods.
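
The difference between the two update rules can be illustrated with a minimal tabular sketch. It is not taken from the paper: the epsilon-greedy policy, the list-of-lists table `Q`, and all function and parameter names are assumptions made for illustration. Sarsa bootstraps from the value of the single next action that was sampled, whereas Expected Sarsa bootstraps from the policy-weighted average over all next actions, which removes the variance contributed by action selection.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities under an epsilon-greedy policy (assumed policy).

    Ties are broken in favour of the lowest action index.
    """
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Sarsa: the target depends on the sampled next action a_next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, epsilon):
    """Expected Sarsa: the target averages over the policy's next actions."""
    probs = epsilon_greedy_probs(Q[s_next], epsilon)
    expected_value = sum(p * q for p, q in zip(probs, Q[s_next]))
    target = r + gamma * expected_value
    Q[s][a] += alpha * (target - Q[s][a])


if __name__ == "__main__":
    # Hypothetical two-state, two-action table used only for illustration.
    Q = [[0.0, 0.0], [1.0, 3.0]]
    expected_sarsa_update(Q, s=0, a=0, r=1.0, s_next=1,
                          alpha=1.0, gamma=0.5, epsilon=0.5)
    print(Q[0][0])
```

In this sketch the Expected Sarsa target for a given transition is the same no matter which next action the agent goes on to sample, while the Sarsa target changes with `a_next`. When the environment is deterministic as well, no randomness remains in the target, which is the situation in which the abstract says a learning rate of 1 becomes usable.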
