Publication | Open Access
Understanding and Overcoming the Challenges of Efficient Transformer Quantization
Citations: 73
References: 68
Year: 2021
Transformer-based architectures have become the de facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges, namely high dynamic activation ranges that are difficult to represent with a low-bit fixed-point format. We establish that these activations contain structured outliers in the residual connections that encourage specific attention patterns, such as attending to the special separator token.
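To illustrate why a high dynamic activation range is hard to represent in a low-bit fixed-point format, the following minimal sketch applies plain min-max uniform quantization to a synthetic activation vector, with and without a single large outlier. It is not the method proposed in the paper: the helper names (`quantize_dequantize`, `mean_squared_error`), the Gaussian activations, and the outlier value of 60.0 are all assumptions chosen for illustration.

```python
import random


def quantize_dequantize(values, num_bits=8):
    """Asymmetric min-max uniform quantization, then dequantization.

    Hypothetical helper: maps each value onto a grid of 2**num_bits
    integer levels spanning [min(values), max(values)] and back.
    """
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    out = []
    for v in values:
        q = round((v - lo) / scale)      # nearest integer level
        q = max(0, min(levels, q))       # clamp to the representable grid
        out.append(q * scale + lo)       # map back to a real value
    return out


def mean_squared_error(a, b):
    """Mean squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)

# Synthetic "activations" (assumption): 1000 standard-normal samples.
activations = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Same vector plus one large outlier (assumed value, for illustration only).
with_outlier = activations + [60.0]

mse_plain = mean_squared_error(activations, quantize_dequantize(activations))

# Measure the error on the original 1000 values only, so the comparison
# isolates how the outlier degrades the resolution left for everything else.
reconstructed = quantize_dequantize(with_outlier)
mse_outlier = mean_squared_error(activations, reconstructed[:-1])

print(f"8-bit MSE without outlier: {mse_plain:.2e}")
print(f"8-bit MSE with outlier:    {mse_outlier:.2e}")
```

Because the quantization grid must stretch to cover the outlier, the step size grows and the error on the remaining values grows with it. Under these assumptions, this is the trade-off between range and precision that the abstract describes.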