A Communication-Aware DNN Accelerator on ImageNet Using In-Memory Entry-Counting Based Algorithm-Circuit-Architecture Co-Design in 65-nm CMOS

Abstract

This article presents a communication-aware processing-in-memory deep neural network accelerator, which implements an in-memory entry-counting scheme for low bit-width quantized multiplication-and-accumulations (MACs). To maintain good accuracy on ImageNet, the proposed design adopts a full-stack co-design methodology, from algorithms, circuits to architectures. In the algorithm level, an entry-counting based MAC is proposed to fit the learned step-sized quantization scheme, and exploit the sparsity of both activations and weights intrinsically. In the circuit level, content addressable memory cells and multiplexed arrays are developed in the processing-in-memory macro. In the architecture level, the proposed design is compatible with different stationary dataflow mappings, further reducing the memory access. An in-memory entry-counting silicon prototype and its entire peripheral circuits are fabricated in 65nm LP CMOS technology with an active area of <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$0.76\times 0.66$ </tex-math></inline-formula> mm <sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sup> . The 7.36-Kb processing-in-memory macro with 128 search entries can reduce the multiplication number by <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$12.8\times $ </tex-math></inline-formula> . The peak throughput is 3.58 GOPS, achieved at a clock rate of 143MHz and a power supply of 1.23V. The peak energy efficiency of the processing-in-memory macro is 11.6 TOPS/W, achieved at a clock rate of 40MHz and a power supply of 1.01V. Note that the physical design of the entry-counting memory is completed in a standard digital placement and routing flow by augmenting the library with two dedicated memory cells. A 3-bit quantized ResNet-18 on the ImageNet dataset is performed, where the top-1 accuracy is 64.4%.

References

Page 1

	Year	Citations

Page 1