Reinforcement Learning for Partially Observable Dynamic Processes: Adaptive Dynamic Programming Using Measured Output Data

TLDR

Approximate dynamic programming (ADP) is widely used for feedback control of dynamical systems, but it usually requires full state information, which is often unavailable in practice. This work considers deterministic linear systems, whose stochastic counterpart is a class of partially observable Markov decision processes, and shows how to implement ADP using only measured input/output data. Policy iteration and value iteration algorithms are developed that converge to an optimal controller requiring only output feedback. The methods need no knowledge of the system dynamics, only the system order and an upper bound on its observability index, and they yield a polynomial autoregressive moving-average (ARMA) controller whose performance is equivalent to that of the optimal state-feedback gain.

Abstract

Approximate dynamic programming (ADP) is a class of reinforcement learning methods that have shown their importance in a variety of applications, including feedback control of dynamical systems. ADP generally requires full information about the system's internal states, which is usually not available in practical situations. In this paper, we show how to implement ADP methods using only measured input/output data from the system. In control system theory, methods of this type are referred to as output feedback (OPFB). Linear dynamical systems with deterministic behavior are considered herein, which are systems of great interest in the control system community. The stochastic equivalent of the systems dealt with in this paper is a class of partially observable Markov decision processes. We develop both policy iteration and value iteration algorithms that converge to an optimal controller that requires only OPFB. It is shown that, similar to Q-learning, the new methods have the important advantage that knowledge of the system dynamics is not needed for the implementation of these learning algorithms or for the OPFB control. Only the order of the system, as well as an upper bound on its "observability index," must be known. The learned OPFB controller is in the form of a polynomial autoregressive moving-average controller whose performance is equivalent to that of the optimal state variable feedback gain.
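
A standard linear-systems fact helps explain why an upper bound on the observability index matters: for an observable system x_{k+1} = A x_k + B u_k, y_k = C x_k, the current state is a linear function of the last N inputs and outputs whenever N is at least the observability index. The Python sketch below checks that identity numerically. It is an illustration only, not the paper's learning algorithm: the matrices A, B, C and the helper names `io_state_map` and `simulate` are hypothetical, and the model is used here solely to verify the identity, whereas the methods described in the abstract do not need the dynamics.

```python
import numpy as np

def io_state_map(A, B, C, N):
    """Return (M_u, M_y) such that x_k = M_u @ U + M_y @ Y, where
    U stacks u_{k-N}, ..., u_{k-1} and Y stacks y_{k-N}, ..., y_{k-1}.
    Assumes the pair (A, C) is observable with observability index <= N."""
    m, p = B.shape[1], C.shape[0]
    powers = [np.linalg.matrix_power(A, i) for i in range(N + 1)]

    # Observability matrix: Y = O @ x_{k-N} + T @ U
    O = np.vstack([C @ powers[i] for i in range(N)])

    # Block lower-triangular Toeplitz matrix of Markov parameters
    T = np.zeros((N * p, N * m))
    for i in range(N):
        for j in range(i):
            T[i * p:(i + 1) * p, j * m:(j + 1) * m] = C @ powers[i - 1 - j] @ B

    # Input-to-state map over the window: x_k = A^N x_{k-N} + Ctrl @ U
    Ctrl = np.hstack([powers[N - 1 - j] @ B for j in range(N)])

    # Eliminate x_{k-N} using the left inverse of O
    M_y = powers[N] @ np.linalg.pinv(O)
    M_u = Ctrl - M_y @ T
    return M_u, M_y

def simulate(A, B, C, x0, inputs):
    """Simulate the system; xs[k] is x_k and ys[k] is y_k = C x_k."""
    xs, ys = [x0], []
    x = x0
    for u in inputs:
        ys.append(C @ x)
        x = A @ x + B @ u
        xs.append(x)
    return xs, ys

# Hypothetical second-order example (values chosen only for illustration)
A = np.array([[0.9, 0.2], [0.0, 0.7]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
N = 2  # upper bound on the observability index

M_u, M_y = io_state_map(A, B, C, N)
rng = np.random.default_rng(0)
inputs = [rng.normal(size=1) for _ in range(10)]
xs, ys = simulate(A, B, C, np.array([1.0, -2.0]), inputs)

k = 5
U = np.concatenate(inputs[k - N:k])
Y = np.concatenate(ys[k - N:k])
print(np.allclose(xs[k], M_u @ U + M_y @ Y))
```

Because the state is linear in the stacked data window, a quadratic value function of the state can be rewritten as a quadratic form in past inputs and outputs, which is consistent with the abstract's statement that only measured input/output data are required.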
