CHIMERA: A 0.92-TOPS, 2.2-TOPS/W Edge AI Accelerator With 2-MByte On-Chip Foundry Resistive RAM for Efficient Training and Inference

TLDR

Edge AI inference and training are hindered by memory constraints, a problem that worsens as deep neural networks grow larger. The authors introduce CHIMERA, the first non-volatile DNN chip that performs both edge AI training and inference using on-chip resistive RAM (RRAM) macros and no off-chip memory, fabricated in 40-nm CMOS. CHIMERA's accelerator is optimized for RRAM and delivers 0.92 TOPS peak performance at 2.2 TOPS/W. Linking six chips scales inference to DNNs up to 6× larger with only 4% execution-time and 5% energy overhead. A low-rank incremental training algorithm matches the accuracy of conventional algorithms with up to 283× fewer RRAM weight-update steps and a 340× better energy-delay product. Combined with the ENDURER remapping module, which provides resilience to write-endurance failures, this enables ten years of incremental training at 20 samples per minute.

Abstract

Implementing edge artificial intelligence (AI) inference and training is challenging with current memory technologies. As deep neural networks (DNNs) grow in size, this problem is only getting worse. This article presents CHIMERA, the first non-volatile DNN chip for both edge AI training and inference using foundry on-chip resistive RAM (RRAM) macros and no off-chip memory, fabricated in 40-nm CMOS. CHIMERA's DNN accelerator is specifically optimized for RRAM and achieves 0.92-TOPS peak performance and 2.2-TOPS/W energy efficiency. We scale inference up to 6× larger DNNs by connecting six CHIMERAs in an illusion system with just 4% overhead in measured execution time and 5% in energy, enabled by communication-sparse DNN mappings that exploit RRAM non-volatility through quick chip wake-up and shutdown (< 33 μs). Our incremental edge AI training algorithm, called low-rank training, overcomes RRAM write energy, speed, and endurance challenges and achieves the same accuracy as traditional algorithms with up to 283× fewer RRAM weight update steps and 340× better energy-delay product. Combined with ENDUrance REsiliency using random Remapping (ENDURER), a hardware module that provides resilience to write endurance failures, we enable ten years of 20-samples/min incremental edge AI training.
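
To give a sense of scale for the endurance claim, the following is a back-of-the-envelope sketch of how many training samples ten years at 20 samples/min amounts to. Only the rate and the duration come from the abstract. The function name, the 365-day year, and the assumption of uninterrupted operation are illustrative choices, not details from the paper.

```python
# Rough arithmetic on the endurance figures quoted in the abstract.
# Assumptions (not from the paper): continuous operation, 365-day years.

MINUTES_PER_YEAR = 60 * 24 * 365  # leap days ignored


def total_training_samples(samples_per_minute: int = 20, years: int = 10) -> int:
    """Number of samples seen at a constant rate over the given number of years."""
    return samples_per_minute * MINUTES_PER_YEAR * years


if __name__ == "__main__":
    total = total_training_samples()
    print(f"Samples over ten years at 20 samples/min: {total:,}")
```

Under these assumptions the total is on the order of 10^8 samples, which is the training workload the RRAM write endurance has to sustain.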
