Rethinking Attention with Performers

Abstract

We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial for accurately comparing softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and for investigating optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate results competitive with other examined efficient sparse and dense attention methods, showcasing the effectiveness of the novel attention-learning paradigm leveraged by Performers.
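The linear-complexity idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single head, non-causal attention, and plain i.i.d. Gaussian projections in place of the orthogonal ones FAVOR+ uses, and the function names (`positive_random_features`, `linear_attention`, `softmax_attention`) are hypothetical. It relies on the identity exp(x·y) = E[exp(w·x - |x|²/2) exp(w·y - |y|²/2)] for w drawn from a standard normal distribution, which gives strictly positive features.

```python
import numpy as np

def positive_random_features(x, w):
    """Map rows of x (L, d) to positive features (L, m) using projections w (m, d)."""
    m = w.shape[0]
    half_sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    # exp(w.x - |x|^2 / 2) is always > 0, unlike trigonometric random features.
    return np.exp(x @ w.T - half_sq_norm) / np.sqrt(m)

def linear_attention(q, k, v, w):
    """Approximate softmax attention without forming the L x L attention matrix."""
    d = q.shape[-1]
    scale = d ** -0.25  # so that (q*scale).(k*scale) = q.k / sqrt(d)
    q_feat = positive_random_features(q * scale, w)   # (L, m)
    k_feat = positive_random_features(k * scale, w)   # (L, m)
    kv = k_feat.T @ v                                 # (m, d_v), cost linear in L
    normalizer = q_feat @ k_feat.sum(axis=0)          # (L,), approximates row sums
    return (q_feat @ kv) / normalizer[:, None]

def softmax_attention(q, k, v):
    """Exact quadratic-cost softmax attention, for comparison only."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
L, d, m = 8, 4, 10_000                    # sequence length, head dim, feature count
q = 0.3 * rng.standard_normal((L, d))
k = 0.3 * rng.standard_normal((L, d))
v = rng.standard_normal((L, d))
w = rng.standard_normal((m, d))           # i.i.d. Gaussian (assumption, not orthogonal)

max_abs_error = np.max(np.abs(linear_attention(q, k, v, w) - softmax_attention(q, k, v)))
print("max abs error vs. exact softmax attention:", max_abs_error)
```

Because the keys and values are contracted first (`k_feat.T @ v`), time and memory grow linearly with the sequence length `L` for a fixed number of features `m`. The approximation error shrinks as `m` grows; the small input scale used here is chosen only to keep the estimator's variance low in this toy setting.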
