Publication | Open Access
Rethinking Attention with Performers
Citations: 122 | References: 38 | Year: 2020
Structured Prediction, Engineering, Machine Learning, Full-rank-attention Transformers, Music Psychology, Musicology, Natural Language Processing, Data Science, Performance Theory, Multi-task Learning, Video Transformer, Machine Translation, Large AI Model, Regular Transformers, Vision Language Model, Computer Science, Deep Learning, Performance Studies, Arts, Softmax Attention-kernels, Audience Reception
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and to investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence, and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing the effectiveness of the novel attention-learning paradigm leveraged by Performers.
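The following is a minimal NumPy sketch of the idea described in the abstract: softmax attention is approximated by mapping queries and keys through positive orthogonal random features, so that the attention output can be computed in time linear in sequence length. The function names (`orthogonal_gaussian`, `positive_random_features`, `favor_plus_attention`) and the parameter `num_features` are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def orthogonal_gaussian(m, d, rng):
    # Draw m random directions that are orthogonal within blocks of size d,
    # then rescale each row so its norm matches an i.i.d. Gaussian draw.
    blocks = []
    for _ in range(-(-m // d)):  # ceil(m / d) blocks
        q, _ = np.linalg.qr(rng.standard_normal((d, d)))
        blocks.append(q.T)
    w = np.concatenate(blocks, axis=0)[:m]
    norms = np.sqrt(rng.chisquare(d, size=(m, 1)))
    return w * norms

def positive_random_features(x, w):
    # Positive feature map: phi(x) = exp(w.x - ||x||^2 / 2) / sqrt(m),
    # so that E[phi(q) . phi(k)] = exp(q . k), i.e. the softmax kernel.
    m = w.shape[0]
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(x @ w.T - sq_norm) / np.sqrt(m)

def favor_plus_attention(Q, K, V, num_features=256, seed=0):
    # Linear-complexity approximation of softmax attention:
    # softmax(Q K^T / sqrt(d)) V  ~  D^{-1} (phi(Q) (phi(K)^T V)),
    # computed in O(L * m * d) time and memory instead of O(L^2).
    L, d = Q.shape
    rng = np.random.default_rng(seed)
    w = orthogonal_gaussian(num_features, d, rng)
    q_prime = positive_random_features(Q / d ** 0.25, w)  # (L, m)
    k_prime = positive_random_features(K / d ** 0.25, w)  # (L, m)
    kv = k_prime.T @ V                                     # (m, d_v)
    normalizer = q_prime @ k_prime.sum(axis=0)             # (L,)
    return (q_prime @ kv) / normalizer[:, None]

# Toy usage: compare against exact softmax attention on random inputs.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    L, d = 128, 16
    Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
    approx = favor_plus_attention(Q, K, V, num_features=1024)
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    exact = (weights / weights.sum(axis=-1, keepdims=True)) @ V
    print("max abs error:", np.abs(approx - exact).max())
```

Note the ordering of the matrix products: computing `phi(K)^T V` first avoids ever materializing the L-by-L attention matrix, which is where the linear space and time complexity comes from; the positive (exponential) features and the block-orthogonal projections are what keep the estimator's variance low, per the abstract's guarantees.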