15.4 A 5.99-to-691.1TOPS/W Tensor-Train In-Memory-Computing Processor Using Bit-Level-Sparsity-Based Optimization and Variable-Precision Quantization - Concepedia

Concepedia

Abstract

Computing-in-memory (CIM) improves energy efficiency by enabling parallel multiply-and-accumulate (MAC) operations and reducing memory accesses [1-4]. However, today's typical neural networks (NNs) usually exceed on-chip memory capacity. Thus, a CIM-based processor may encounter a memory bottleneck [5]. Tensor-train (TT) is a tensor decomposition method, which decomposes a d-dimensional tensor to d 4D tensor-cores (TCs: G <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">k</sub> [r <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">k-1</sub> , n <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">k</sub> , m <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">k</sub> , r <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">k</sub> ], k = 1, ... , d) [6]. G <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">k</sub> can be viewed as a 2D n <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">k×</sub> m <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">k</sub> array, where each element is an r <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">k-1</sub> ×r <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">k</sub> matrix. The TCs require Σ <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">k∈[1,d]</sub> r <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">k-1</sub> n <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">k</sub> m <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">k</sub> r <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">k</sub> parameters to represent the original tensor, which has Π <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">k∈[1,d]</sub> n <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">k</sub> m <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">k</sub> parameters. Since rk is typically small, kernels and weight matrices of convolutional, fully-connected and recurrent layers can be compressed significantly by using TT decomposition, thereby enabling storage of an entire NN in a CIM-based processor.

References

	Year	Citations

Page 1