A 22nm 832Kb Hybrid-Domain Floating-Point SRAM In-Memory-Compute Macro with 16.2-70.2TFLOPS/W for High-Accuracy AI-Edge Devices

Abstract

Advanced artificial-intelligence (AI) edge devices require high energy efficiency <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$(\eta_{\mathrm{E}})$</tex> and high inference accuracy [2,4-6]. SRAM-based compute-in-memory (CIM) built around MAC operations is well suited to improving the <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$\eta_{\mathrm{E}}$</tex> of AI edge devices. However, without support for floating-point (FP) computation, AI chips using integer-based SRAM-CIMs (INT-CIM) [2,4-5] are prone to precision loss when applied to complex datasets or neural network models. Product <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$(\text{PD}=\text{IN}\times \mathrm{W})$</tex> alignment-based FP-MACs align the product's mantissa <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$(\text{PD}_{\mathrm{M}})$</tex> prior to accumulation, based on the product's exponent <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$(\text{PD}_{\mathrm{E}})$</tex>. This approach is commonly used for digital circuits [3] and for near-memory compute [1], but is not practical for in-memory-compute (IMC) macros: each <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$\text{PD}_{\mathrm{E}}$</tex> within a physical row/column is different, so the corresponding products cannot be accumulated directly. An INT-IMC with off-macro digital circuits and off-chip software pre-alignment was used in [6] to process the exponents of inputs <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$(\text{IN}_{\mathrm{E}})$</tex> and weights <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$(\mathrm{W}_{\mathrm{E}})$</tex> externally for the FP-MAC.
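In equation form, the product-alignment idea described above can be sketched as follows. The term index i and the term count N are notation introduced here, the largest product exponent is written as a subscript E-MAX, and signs and exponent bias are omitted for brevity:

<tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$\sum_{i=1}^{N}\text{PD}_{\mathrm{M},i}\cdot 2^{\text{PD}_{\mathrm{E},i}}=2^{\text{PD}_{\text{E-MAX}}}\sum_{i=1}^{N}\text{PD}_{\mathrm{M},i}\cdot 2^{\text{PD}_{\mathrm{E},i}-\text{PD}_{\text{E-MAX}}},\quad \text{PD}_{\text{E-MAX}}=\max_{i}\text{PD}_{\mathrm{E},i}$</tex>

Each mantissa is scaled by a non-positive power of two (a right shift), so all terms share one exponent and can be summed as integers.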
An INT-CIM with extra FP-to-INT converters can emulate an FP-MAC, but incurs additional area, power consumption, and latency (PPA). Researchers have yet to develop a true FP-IMC macro capable of both exponent and mantissa computation. Analog CIMs suffer from low readout accuracy due to intrinsic transistor variation. Digital CIMs are insensitive to variation, but are limited in compute parallelism by routing congestion, as Fig. 7.1.1 shows. This paper presents a true FP-IMC macro featuring (1) a hybrid-domain macro structure that enables computation of both the exponent and mantissa in an FP-MAC within the same IMC macro. High <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$\eta_{\mathrm{E}}$</tex> and accuracy are achieved by exploiting the advantages of computing in the time, digital, and analog-voltage domains, each applied to the functional blocks of the FP-MAC it suits best [2,4-5]. (2) Time-domain-based <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$\text{PD}_{\mathrm{E}}$</tex> generation, a <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$\text{maximum-PD}_{\mathrm{E}}\ (\text{PD}_{\text{E-MAX}})$</tex> finder (TD-MPEF), and a <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$\text{PD}_{\mathrm{E}}-\text{PD}_{\text{E-MAX}}$</tex> generator <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$(\text{TD-PD}_{\mathrm{E}}\text{-DG})$</tex> to achieve a high <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$\eta_{\mathrm{E}}$</tex> for all exponent computation.
(3) A <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$\text{PD}_{\mathrm{E}}$</tex>-based input-mantissa alignment (PEB-IMA) scheme to enable accumulation of <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$\text{PD}_{\mathrm{M}}$</tex> in the same column. (4) A place-value-dependent digital/analog-hybrid computing scheme for mantissa computation with high inference accuracy and <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$\eta_{\mathrm{E}}$</tex>. A 22-nm 832-kb FP SRAM-IMC macro is fabricated using foundry-provided compact-6T SRAM cells. The FP SRAM-IMC supports FP-MACs with 128-accumulators (ACCU) for BF16 inputs (IN) and weights (W) with FP32 outputs (OUT) and achieves the highest reported FP-MAC <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$\eta_{\mathrm{E}}$</tex>, 70.2TFLOPS/W.
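The exponent and alignment steps listed above can be illustrated with a small software model. This is only a behavioral sketch under stated assumptions, not the macro's circuit: the function names, the `guard_bits` parameter, the `MIN_EXP` placeholder for zero operands, and the truncating right shift are all hypothetical choices made for illustration. It forms each product exponent, finds the maximum, shifts each input mantissa by the exponent difference, and then accumulates integer mantissa products.

```python
import math

MANT_BITS = 8    # BF16 significand width, including the hidden bit
MIN_EXP = -126   # assumed placeholder exponent for zero operands

def decompose(x):
    """Return (sign, exp, mant) with x == sign * mant * 2**(exp - MANT_BITS)
    for BF16-representable x; other values are truncated to MANT_BITS bits."""
    if x == 0.0:
        return 1, MIN_EXP, 0
    m, e = math.frexp(abs(x))            # abs(x) == m * 2**e, 0.5 <= m < 1
    mant = int(m * (1 << MANT_BITS))     # integer mantissa in [128, 255]
    return (-1 if x < 0 else 1), e, mant

def fp_mac_input_aligned(inputs, weights, guard_bits=0):
    """Model of an FP-MAC that aligns input mantissas before accumulation."""
    ins = [decompose(v) for v in inputs]
    ws = [decompose(v) for v in weights]

    # Exponent path: product exponent per term, then the maximum.
    pd_e = [i_e + w_e for (_, i_e, _), (_, w_e, _) in zip(ins, ws)]
    pd_e_max = max(pd_e)

    acc = 0
    for (i_s, _, i_m), (w_s, _, w_m), e in zip(ins, ws, pd_e):
        shift = pd_e_max - e                        # always >= 0
        aligned_in = (i_m << guard_bits) >> shift   # truncating alignment
        acc += i_s * w_s * aligned_in * w_m         # integer mantissa MAC

    # Restore the shared exponent once, after accumulation.
    return acc * 2.0 ** (pd_e_max - 2 * MANT_BITS - guard_bits)

print(fp_mac_input_aligned([1.5, 0.5, -2.0], [2.0, 4.0, 0.25]))
print(fp_mac_input_aligned([256.0, 1.0078125], [1.0, 1.0]))
print(fp_mac_input_aligned([256.0, 1.0078125], [1.0, 1.0], guard_bits=8))
```

In this model, a term whose product exponent is far below the maximum can be shifted out entirely when no guard bits are kept. Widening the aligned mantissa preserves it at the cost of a wider integer accumulation.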
