Publication | Closed Access
Scalable and Programmable Neural Network Inference Accelerator Based on In-Memory Computing
Citations: 93 | References: 36 | Year: 2021
Keywords: Engineering, Machine Learning, Inference Accelerator, Neural Network, Hardware Algorithm, Computer Architecture, Embedded Systems, Hardware Systems, High-performance Architecture, Computing Systems, Parallel Computing, Technology Co-optimization, Computer Engineering, Computer Science, Hardware Acceleration, Analog Computations, Many-core Architecture, Domain-specific Accelerator, Brain-like Computing, In-memory Computing
In‑memory computing accelerates matrix‑vector multiplications, the dominant operation in neural networks, by reducing memory accesses. The study presents a programmable in‑memory computing accelerator that scales neural‑network inference using high‑signal‑to‑noise capacitor‑based analog technology. The accelerator comprises a configurable on‑chip network and scalable core array that combine mixed‑signal in‑memory computing with programmable near‑memory SIMD digital logic, enabling data‑ and pipeline‑parallel execution mappings across models. A 4×4 core prototype fabricated in 16 nm CMOS achieved 3 TOPS peak MAC throughput and 30 TOPS/W energy efficiency for 8‑bit operations, with analog accuracy matching bit‑true simulations, and demonstrated 91.5 % CIFAR‑10 and 73.3 % ImageNet accuracy at 7,815 and 581 images/s, respectively, with 51.5 k and 3.0 k images/s/W using 4‑bit weights and activations.
This work demonstrates a programmable in-memory-computing (IMC) inference accelerator for scalable execution of neural network (NN) models, leveraging a high-signal-to-noise-ratio (SNR) capacitor-based analog technology. IMC accelerates computation and reduces memory accesses for matrix-vector multiplies (MVMs), which dominate in NNs. The accelerator architecture focuses on scalable execution, addressing the overheads of state swapping and the challenge of maintaining high utilization across highly dense and parallel hardware. The architecture is based on a configurable on-chip network (OCN) and a scalable array of cores, which integrate mixed-signal IMC with programmable near-memory single-instruction multiple-data (SIMD) digital computing, configurable buffering, and programmable control. The cores enable flexible NN execution mappings that exploit data- and pipeline-parallelism to address utilization and efficiency across models. A prototype incorporating a 4 × 4 array of cores is demonstrated in 16 nm CMOS, achieving peak multiply-accumulate (MAC)-level throughput of 3 TOPS and peak MAC-level energy efficiency of 30 TOPS/W, both for 8-b operations. The measured results show high accuracy of the analog computations, matching bit-true simulations, which enables the abstractions required for robust and scalable architectural and software integration. Developed software libraries and NN-mapping tools are used to demonstrate CIFAR-10 and ImageNet classification, with an 11-layer CNN and ResNet-50, respectively, achieving accuracies of 91.51% and 73.33%, throughputs of 7815 and 581 images/s, and energy efficiencies of 51.5 k and 3.0 k images/s/W, with 4-b weights and activations.
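To make the tiled-execution idea concrete, the following is a minimal behavioral sketch (not the paper's actual implementation or toolchain) of how an MVM might be split into tiles, with each tile's dot products computed by an IMC core and the per-tile partial sums accumulated digitally near memory. The tile size, core-array dimensions, and function names are illustrative assumptions.

```python
# Hypothetical sketch: tiling a matrix-vector multiply (MVM) across
# in-memory-computing cores with low-bit operands. TILE and the helper
# names are assumptions for illustration, not from the paper.

TILE = 64  # assumed per-core MVM tile width (illustrative)

def quantize(vals, bits=4):
    """Clamp values to signed `bits`-bit integers (e.g., -8..7 for 4-b)."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return [max(lo, min(hi, int(v))) for v in vals]

def core_mvm(w_tile, x_tile):
    """Stand-in for one core's analog MAC operation: dot products of a
    weight tile against an input tile (here computed exactly)."""
    return [sum(w * x for w, x in zip(row, x_tile)) for row in w_tile]

def tiled_mvm(W, x):
    """Split W column-wise into TILE-wide slices, run each slice as one
    core-level MVM, and accumulate partial sums digitally (near-memory)."""
    acc = [0] * len(W)
    for start in range(0, len(x), TILE):
        w_tile = [row[start:start + TILE] for row in W]
        partial = core_mvm(w_tile, x[start:start + TILE])
        acc = [a + p for a, p in zip(acc, partial)]
    return acc

# Example: a 2x128 weight matrix is processed as two 64-wide tiles.
W = [[1] * 128, [2] * 128]
x = quantize([1] * 128)
print(tiled_mvm(W, x))  # [128, 256]
```

In the accelerator described above, multiple such tiles would run concurrently across the core array, with the on-chip network carrying activations and partial results between cores to realize the data- and pipeline-parallel mappings.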