WRA: A 2.2-to-6.3 TOPS Highly Unified Dynamically Reconfigurable Accelerator Using a Novel Winograd Decomposition Algorithm for Convolutional Neural Networks

Publication (Closed Access) · Year: 2019 · Citations: 43 · References: 43

Abstract

As convolutional neural networks (CNNs) grow more diverse and complex, CNN acceleration increasingly faces the bottleneck of balancing performance, energy efficiency, and flexibility in a unified architecture. This paper proposes a Winograd-based, highly efficient, dynamically Reconfigurable Accelerator (WRA) for quickly evolving CNN models. A cost-effective convolution decomposition method (CDW) is proposed that extends the applicability of the fast Winograd algorithm. Based on CDW, a high-throughput, reconfigurable processing-element (PE) array is designed to exploit the parallelism of the Winograd algorithm. In addition, a highly compact memory structure employs four levels of data reuse to maximize on-chip reuse and minimize external bandwidth requirements. With its dynamic reconfigurability, WRA implements CDW and other convolutions (e.g., standard convolution, depthwise separable convolution, and group convolution) on a unified hardware architecture. WRA was implemented on a Xilinx XCVU9P platform running at a 330 MHz clock frequency, controlled by a POWER8 processor via the coherent accelerator processor interface (CAPI). Across its configurations, WRA delivers 2.2-6.3 TOPS for different convolution shapes. The average performance and energy efficiency are 5288 GOP/s at 151.2 GOP/s/W for VGG16, 3478 GOP/s at 99.4 GOP/s/W for AlexNet, 2674 GOP/s at 76.4 GOP/s/W for MobileNetV1, and 2194 GOP/s at 62.7 GOP/s/W for MobileNetV2. WRA achieves a 1.7×-24× speedup over previous FPGA-based designs.
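The abstract's paper-specific CDW decomposition is not detailed here, but the fast Winograd algorithm it builds on can be illustrated with a minimal sketch. The example below shows the standard 1-D case F(2, 3): two outputs of a 3-tap convolution computed with 4 multiplications instead of 6, using the classic B, G, and A transform matrices (the matrices and `winograd_f23` name are illustrative, not taken from the paper).

```python
import numpy as np

def winograd_f23(d, g):
    """Winograd F(2,3): compute 2 outputs of a 1-D convolution with a
    3-tap filter g over a 4-element input tile d, using 4 multiplies.
    Like CNN "convolution", this is correlation (no filter flip)."""
    # Input transform B^T, filter transform G, output transform A^T
    BT = np.array([[1, 0, -1, 0],
                   [0, 1,  1, 0],
                   [0, -1, 1, 0],
                   [0, 1,  0, -1]], dtype=float)
    G = np.array([[1.0, 0.0, 0.0],
                  [0.5, 0.5, 0.5],
                  [0.5, -0.5, 0.5],
                  [0.0, 0.0, 1.0]])
    AT = np.array([[1, 1,  1,  0],
                   [0, 1, -1, -1]], dtype=float)
    # y = A^T [ (G g) ⊙ (B^T d) ] — 4 elementwise multiplies
    return AT @ ((G @ g) * (BT @ d))

# Sliding a filter of ones over [1,2,3,4] gives outputs 1+2+3=6 and 2+3+4=9
print(winograd_f23(np.array([1., 2., 3., 4.]), np.array([1., 1., 1.])))
```

The hardware win is that the transforms use only additions and shifts, so only the elementwise product costs multipliers; 2-D tiles nest the same transforms along both axes.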
