FPGA-Based Training Accelerator Utilizing Sparseness of Convolutional Neural Network

Abstract

Training of convolutional neural networks (CNNs) is almost exclusively performed on large clusters of GPUs, which consume vast amounts of power. High-speed training systems with low power consumption are therefore desired. This paper proposes an FPGA-based training accelerator that exploits the sparseness of a CNN; the accelerator consists of universal convolutional units and pooling units with distributed stacks. The proposed universal convolution architecture supports the various convolution operations used in modern CNNs, such as point-wise, depth-wise, large-kernel, and atrous convolutions. Additionally, we use a fine-tuning scheme that starts from a pre-trained dense CNN and reduces the memory required for the training process, while retaining the important connections to preserve recognition accuracy. Our training scheme removes 85% of the parameters, which accelerates the training computation and reduces the on-chip memory size, thereby eliminating energy-consuming DRAM accesses. We implemented the proposed training accelerator on a Xilinx Virtex UltraScale+ VCU1525 acceleration development board. Experimental results show that, compared with an NVIDIA RTX 2080 Ti GPU, the proposed sparse CNN training accelerator on the FPGA achieves four times faster training, 2.9 times lower power consumption, and 11.6 times better performance per watt (4 × 2.9).
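
The abstract does not describe the internal datapath of the universal convolutional unit. As a rough software analogy only, the four convolution types it lists can all be expressed as one grouped, dilated 2-D convolution that differs only in kernel size `K`, `groups`, and `dilation`. The NumPy sketch below assumes that model; `universal_conv2d` and its parameters are hypothetical names chosen for illustration, not the paper's hardware design.

```python
import numpy as np


def universal_conv2d(x, w, stride=1, padding=0, dilation=1, groups=1):
    """Direct grouped, dilated 2-D convolution (cross-correlation, as in CNNs).

    x: input feature map, shape (C_in, H, W)
    w: filters, shape (C_out, C_in // groups, K, K)
    Returns the output feature map, shape (C_out, H_out, W_out).
    """
    c_in, h, wd = x.shape
    c_out, c_per_group, k, _ = w.shape
    assert c_in % groups == 0 and c_out % groups == 0
    assert c_per_group == c_in // groups
    xp = np.pad(x, ((0, 0), (padding, padding), (padding, padding)))
    span = dilation * (k - 1) + 1  # effective window size of a dilated kernel
    h_out = (h + 2 * padding - span) // stride + 1
    w_out = (wd + 2 * padding - span) // stride + 1
    out = np.zeros((c_out, h_out, w_out), dtype=np.result_type(x, w))
    oc_per_group = c_out // groups
    for oc in range(c_out):
        g = oc // oc_per_group  # group whose input channels this filter reads
        xs = xp[g * c_per_group:(g + 1) * c_per_group]
        for i in range(h_out):
            for j in range(w_out):
                r, c = i * stride, j * stride
                # Take every `dilation`-th pixel inside the window.
                patch = xs[:, r:r + span:dilation, c:c + span:dilation]
                out[oc, i, j] = np.sum(patch * w[oc])
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
w_pw = rng.standard_normal((6, 4, 1, 1))  # point-wise: 1x1, mixes channels only
w_dw = rng.standard_normal((4, 1, 3, 3))  # depth-wise: one 3x3 filter per channel
w_lk = rng.standard_normal((4, 4, 7, 7))  # large kernel: 7x7 window
w_at = rng.standard_normal((4, 4, 3, 3))  # atrous: 3x3 taps spread over 5x5

pw = universal_conv2d(x, w_pw)
dw = universal_conv2d(x, w_dw, padding=1, groups=4)
lk = universal_conv2d(x, w_lk, padding=3)
at = universal_conv2d(x, w_at, padding=2, dilation=2)
print(pw.shape, dw.shape, lk.shape, at.shape)
```

Only `K`, `groups`, and `dilation` change between the four cases, which is one way to read "universal": a single loop nest (or hardware unit) covers them all. A real accelerator would tile and pipeline this computation rather than run the naive loops shown here.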

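The abstract states that fine-tuning starts from a pre-trained dense CNN, removes 85% of the parameters, and keeps the important connections, but it does not say how importance is measured. The sketch below assumes magnitude-based pruning (a common heuristic, not necessarily the authors' criterion) and shows how a fixed mask keeps pruned weights at zero throughout fine-tuning. `magnitude_mask` and `masked_sgd_step` are hypothetical helpers, and the toy least-squares objective stands in for the real training loss.

```python
import numpy as np


def magnitude_mask(w, sparsity):
    """Boolean mask that prunes the `sparsity` fraction of smallest-|w| weights.

    Assumption: connection importance is approximated by weight magnitude.
    """
    n_prune = int(round(sparsity * w.size))
    order = np.argsort(np.abs(w), axis=None)  # flattened indices, ascending |w|
    mask = np.ones(w.size, dtype=bool)
    mask[order[:n_prune]] = False
    return mask.reshape(w.shape)


def masked_sgd_step(w, grad, mask, lr):
    """One fine-tuning step; multiplying by the mask keeps pruned weights at zero."""
    return (w - lr * grad) * mask


rng = np.random.default_rng(0)
w_dense = rng.standard_normal((64, 64))  # stand-in for a pre-trained dense layer
mask = magnitude_mask(w_dense, sparsity=0.85)
w = w_dense * mask

# Toy objective: make the sparse layer reproduce the dense layer's outputs.
x = rng.standard_normal((32, 64))
y = x @ w_dense.T
for _ in range(100):
    # Gradient of the per-sample squared error, averaged over the batch.
    grad = 2 * (x @ w.T - y).T @ x / len(x)
    w = masked_sgd_step(w, grad, mask, lr=0.01)

print(f"kept {mask.mean():.1%} of weights")
```

With 85% of the weights fixed at zero, only the remaining nonzero values (plus some index information) would need to be stored, which is the kind of reduction that lets a model fit in on-chip memory.
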