Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition

Abstract

Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition. They split every 2D image into a fixed number of patches, each of which is treated as a token. Generally, representing an image with more tokens leads to higher prediction accuracy, but it also drastically increases computational cost. To achieve a decent trade-off between accuracy and speed, the number of tokens is empirically set to 16x16 or 14x14. In this paper, we argue that every image has its own characteristics, and ideally the token number should be conditioned on each individual input. In fact, we have observed that there exist a considerable number of "easy" images which can be accurately predicted with as few as 4x4 tokens, while only a small fraction of "hard" ones need a finer representation. Inspired by this phenomenon, we propose a Dynamic Transformer to automatically configure a proper number of tokens for each input image. This is achieved by cascading multiple Transformers with increasing numbers of tokens, which are sequentially activated in an adaptive fashion at test time, i.e., the inference is terminated once a sufficiently confident prediction is produced. We further design efficient feature reuse and relationship reuse mechanisms across different components of the Dynamic Transformer to reduce redundant computations. Extensive empirical results on ImageNet, CIFAR-10, and CIFAR-100 demonstrate that our method significantly outperforms the competitive baselines in terms of both theoretical computational efficiency and practical inference speed. Code and pre-trained models (based on PyTorch and MindSpore) are available at https://github.com/blackfeather-wang/Dynamic-Vision-Transformer and https://github.com/blackfeather-wang/Dynamic-Vision-Transformer-MindSpore.
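The adaptive inference procedure described above (run the coarsest model first, stop as soon as a prediction is confident enough) can be illustrated with a minimal sketch. This is not the released implementation: the names `cascade_predict`, `toy_stages`, `easy`, and `hard` are hypothetical, the use of the maximum softmax probability as the confidence measure and the threshold value 0.9 are assumptions made for illustration, and the feature reuse and relationship reuse mechanisms are omitted entirely. Each "stage" here is a stand-in for a Transformer operating on a given number of tokens.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cascade_predict(
    image,
    stages: Sequence[Callable[[object], Sequence[float]]],
    thresholds: Sequence[float],
) -> Tuple[int, int]:
    """Run stages from coarse to fine and exit at the first confident one.

    stages:     callables mapping an image to class logits, ordered by
                increasing token count (assumed interface, hypothetical).
    thresholds: one confidence threshold per stage except the last,
                which always produces the final prediction.
    Returns (predicted class index, index of the stage that exited).
    """
    assert len(thresholds) == len(stages) - 1
    last = len(stages) - 1
    for i, stage in enumerate(stages):
        probs = softmax(stage(image))
        confidence = max(probs)
        # Exit early when confident; the last stage always exits.
        if i == last or confidence >= thresholds[i]:
            return probs.index(confidence), i
    raise ValueError("stages must not be empty")


# Toy stand-ins for a coarse (e.g. 4x4 tokens) and a fine (e.g. 14x14
# tokens) model: each simply looks up precomputed logits for the input.
toy_stages = [lambda x: x["coarse"], lambda x: x["fine"]]

# An "easy" input: the coarse model is already confident.
easy = {"coarse": [5.0, 0.0, 0.0], "fine": [6.0, 0.0, 0.0]}
# A "hard" input: the coarse model is unsure, so the fine model is used.
hard = {"coarse": [0.2, 0.1, 0.0], "fine": [0.0, 4.0, 0.0]}

print(cascade_predict(easy, toy_stages, [0.9]))
print(cascade_predict(hard, toy_stages, [0.9]))
```

In this sketch the easy input exits at the first stage and never pays for the finer model, while the hard input falls through to the second stage. Raising or lowering the thresholds trades accuracy against average computation.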