
Publication | Open Access

Algorithmic Framework for Model-based Deep Reinforcement Learning with Theoretical Guarantees

Citations: 101

References: 60

Year: 2018

TLDR

Model‑based reinforcement learning promises to reduce the sample complexity that limits model‑free methods, yet its theoretical foundations remain sparse. This work proposes a new algorithmic framework for constructing and analyzing model‑based RL algorithms that come with provable guarantees. The framework introduces a meta‑algorithm that iteratively builds a lower bound on expected reward from an estimated dynamics model and sample trajectories, jointly optimizes policy and model to ensure monotone improvement, and extends optimism‑in‑face‑of‑uncertainty to nonlinear dynamics without explicit uncertainty quantification, yielding the SLBO variant. Empirical results show that SLBO attains state‑of‑the‑art performance on continuous‑control benchmarks using at most one million samples.

Abstract

Model-based reinforcement learning (RL) is considered to be a promising approach to reduce the sample complexity that hinders model-free RL. However, the theoretical understanding of such methods has been rather limited. This paper introduces a novel algorithmic framework for designing and analyzing model-based RL algorithms with theoretical guarantees. We design a meta-algorithm with a theoretical guarantee of monotone improvement to a local maximum of the expected reward. The meta-algorithm iteratively builds a lower bound of the expected reward based on the estimated dynamical model and sample trajectories, and then maximizes the lower bound jointly over the policy and the model. The framework extends the optimism-in-face-of-uncertainty principle to non-linear dynamical models in a way that requires no explicit uncertainty quantification. Instantiating our framework with simplification gives a variant of model-based RL algorithms, Stochastic Lower Bounds Optimization (SLBO). Experiments demonstrate that SLBO achieves state-of-the-art performance when only one million or fewer samples are permitted on a range of continuous control benchmark tasks.
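
The monotone-improvement claim can be made concrete with a short worked derivation. The sketch below is one way to write the argument, not a quotation of the paper's statement: the symbols V, M*, D, d, and δ and the two conditions labeled (A1) and (A2) are assumptions introduced here for illustration, since the abstract gives no notation.

```latex
% Notation assumed for illustration (not given in the abstract):
%   V^{\pi,M}          expected reward of policy \pi under dynamics model M
%   M^\star            the true dynamics, assumed to lie in the model class
%   D_{\pi_k}(M,\pi)   discrepancy term estimated from trajectories of \pi_k
%   d(\pi,\pi_k) \le \delta   a trust region around the current policy \pi_k
\begin{align*}
&\text{(A1) lower bound:} && V^{\pi,M^\star} \ge V^{\pi,M} - D_{\pi_k}(M,\pi)
   \quad \text{whenever } d(\pi,\pi_k)\le\delta,\\
&\text{(A2) exactness:} && D_{\pi_k}(M^\star,\pi) = 0,\\
&\text{update:} && (\pi_{k+1}, M_{k+1}) \in
   \arg\max_{\pi,\,M\,:\,d(\pi,\pi_k)\le\delta}\; V^{\pi,M} - D_{\pi_k}(M,\pi),\\
&\text{consequence:} && V^{\pi_{k+1},M^\star}
   \ge V^{\pi_{k+1},M_{k+1}} - D_{\pi_k}(M_{k+1},\pi_{k+1})
   \ge V^{\pi_k,M^\star} - D_{\pi_k}(M^\star,\pi_k)
   = V^{\pi_k,M^\star}.
\end{align*}
```

Under these assumptions, the first inequality is (A1) applied to the new policy, which lies inside the trust region. The second holds because the pair (π_k, M*) is itself feasible for the maximization, so the maximizer scores at least as well. The final equality is (A2). The true expected reward therefore never decreases from one iteration to the next, and the joint maximization over the model is what lets the objective act optimistically without a separate uncertainty estimate.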
