Recurrent Models of Visual Attention

TLDR

Applying convolutional neural networks to large images is expensive because computation scales linearly with pixel count. This paper introduces a recurrent neural network that adaptively selects a sequence of image or video regions and processes only those regions at high resolution. The model keeps a degree of the translation invariance that CNNs have, while its computation can be controlled independently of image size. Because the model is non-differentiable, it is trained with reinforcement learning to learn task-specific policies. On image classification it significantly outperforms a CNN baseline on cluttered images, and on a dynamic visual control task it learns to track a simple object without an explicit training signal for doing so.

Abstract

Applying convolutional neural networks to large images is computationally expensive because the amount of computation scales linearly with the number of image pixels. We present a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Like convolutional neural networks, the proposed model has a degree of translation invariance built-in, but the amount of computation it performs can be controlled independently of the input image size. While the model is non-differentiable, it can be trained using reinforcement learning methods to learn task-specific policies. We evaluate our model on several image classification tasks, where it significantly outperforms a convolutional neural network baseline on cluttered images, and on a dynamic visual control problem, where it learns to track a simple object without an explicit training signal for doing so.
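
The claim that computation can be controlled independently of image size can be illustrated with a minimal sketch. It assumes a single fixed-size square patch per step and zero padding at the borders; the helper names `extract_glimpse` and `glimpse_sequence` are hypothetical, and the random locations are only a stand-in for the learned policy, which the abstract does not specify in detail.

```python
import numpy as np

def extract_glimpse(image, center, size):
    """Return a size x size patch of `image` around `center` (row, col).

    Assumption: pixels outside the image are filled with zeros.
    """
    half = size // 2
    padded = np.pad(image, half, mode="constant")
    # Clip the centre to the image; in padded coordinates the patch
    # then starts at the clipped (row, col) itself.
    row = int(np.clip(center[0], 0, image.shape[0] - 1))
    col = int(np.clip(center[1], 0, image.shape[1] - 1))
    return padded[row:row + size, col:col + size]

def glimpse_sequence(image, locations, size=8):
    """Stack one glimpse per location into a (steps, size, size) array."""
    return np.stack([extract_glimpse(image, loc, size) for loc in locations])

rng = np.random.default_rng(0)
for side in (64, 512):
    image = rng.random((side, side))
    # Random locations stand in for a learned attention policy.
    locations = rng.integers(0, side, size=(6, 2))
    glimpses = glimpse_sequence(image, locations)
    print(side, glimpses.shape)  # (6, 8, 8) for both image sizes
```

The number of pixels handed to later processing depends only on the number of steps and the patch size, not on the dimensions of the input image.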
