16.2 A 28nm 53.8TOPS/W 8b Sparse Transformer Accelerator with In-Memory Butterfly Zero Skipper for Unstructured-Pruned NN and CIM-Based Local-Attention-Reusable Engine

Abstract

Transformer networks, from BERT and GPT to AlphaFold, have demonstrated unprecedented advances in a variety of AI tasks. Fig. 16.2.1 shows the computing flow of self-attention, the fundamental operation in transformers. Queries (Q), keys (K), and values (V) are first obtained by multiplying the inputs with three weight matrices. Scores that evaluate Q-K relevance are then computed as scaled dot products and converted to probabilities through the softmax function. The probabilities are multiplied by V to generate the final self-attention results. Transformer networks have led to an explosion in parameter counts, for example, 175B parameters for GPT-3, which demands significant growth in computing hardware and memory. Owing to expanding network sizes and the corresponding power consumption, compute-in-memory (CIM) block-wise sparsity-aware architectures were proposed for matrix-multiplication [1] and local-attention [2] accelerators, where weight storage and computation are skipped for zero-value blocks. Yet such structured sparsity comes at the cost of notable accuracy loss [3]. Consequently, a challenge for CIM-based accelerators is how to handle unstructured-pruned NNs while maintaining high efficiency. These unstructured patterns take two forms: 1) irregularly distributed zero weights inside matrices, and 2) varied local-attention spans for different attention heads.
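The self-attention flow described above can be sketched in a few lines of plain Python. This is a minimal software illustration only, not the accelerator's datapath: the function names (`matmul`, `softmax`, `self_attention`), the optional `span` parameter used to model a local-attention window, and the toy inputs are all assumptions introduced here for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row; -inf entries get probability 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v, span=None):
    """Single-head self-attention.

    x: tokens x features; w_q, w_k, w_v: the three weight matrices.
    span (hypothetical parameter): if set, token i only attends to
    tokens j with |i - j| <= span, modeling a local-attention window.
    """
    # Step 1: obtain Q, K and V by multiplying the inputs with the weights.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    # Step 2: Q-K relevance scores as scaled dot products.
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    # Optional: mask out positions outside the local span.
    if span is not None:
        scores = [
            [s if abs(i - j) <= span else -math.inf for j, s in enumerate(row)]
            for i, row in enumerate(scores)
        ]
    # Step 3: convert scores to probabilities with softmax.
    probs = [softmax(row) for row in scores]
    # Step 4: weight V by the probabilities.
    return probs, matmul(probs, v)

# Toy example: 3 tokens, 2 features, identity weights (illustrative values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
probs, out = self_attention(x, eye, eye, eye)
local_probs, local_out = self_attention(x, eye, eye, eye, span=0)
```

Calling `self_attention` with different `span` values per head is one simple way to picture the varied local-attention spans mentioned above: a smaller span leaves more of the score matrix masked, so fewer Q-K products contribute to the result.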
