Input generalization in delayed reinforcement learning: an algorithm and performance comparisons

TLDR

Delayed reinforcement learning offers a promising unsupervised framework for autonomous agents, yet its application to realistic problems is hindered by several challenges, and prior input generalization has relied on connectionist backpropagation. The study addresses the input generalization problem in delayed reinforcement learning by proposing the G algorithm. The G algorithm recursively partitions the state space using statistical reinforcement differences, and its performance is compared analytically and empirically against back‑propagation. The G algorithm’s statistical foundation allows reliable prediction of its applicability, and in a nonlinear domain it consistently found the optimal policy while back‑propagation failed, whereas back‑propagation succeeded only in a linear domain.

Abstract

Delayed reinforcement learning is an attractive framework for the unsupervised learning of action policies for autonomous agents. Some existing delayed reinforcement learning techniques have shown promise in simple domains. However, a number of hurdles must be passed before they are applicable to realistic problems. This paper describes one such difficulty, the input generalization problem (whereby the system must generalize to produce similar actions in similar situations) and an implemented solution, the G algorithm. This algorithm is based on recursive splitting of the state space based on statistical measures of differences in reinforcements received. Connectionist backpropagation has previously been used for input generalization in reinforcement learning. We compare the two techniques analytically and empirically. The G algorithm's sound statistical basis makes it easy to predict when it should and should not work, whereas the behavior of back-propagation is unpredictable. We found that a previous successful use of backpropagation can be explained by the linearity of the application domain. We found that in another domain, G reliably found the optimal policy, whereas none of a set of runs of backpropagation with many combinations of parameters did.

References

Page 1

	Year	Citations

Page 1