A High-speed Low-power Deep Neural Network on an FPGA based on the Nested RNS: Applied to an Object Detector

Abstract

A pre-trained convolutional deep neural network (CNN) performs feed-forward computation and is widely used in embedded vision systems. One application of the CNN is frame-based object detection, which is used in embedded systems such as robots, automobiles, security cameras, and drones that require a device with high performance-power efficiency. In the CNN, the 2D convolutional operation occupies more than 90% of the computation time. Since the 2D convolutional operation performs massive multiply-accumulate (MAC) operations, conventional realizations could not implement a fully parallel CNN. The residue number system (RNS) decomposes an integer into a tuple of integers, namely its residues with respect to a moduli set. Since the moduli must be pairwise coprime, the conventional RNS decomposes the MAC unit into circuits of different sizes, which means that it cannot efficiently utilize the uniformly sized resources of an FPGA. In this paper, we use the nested RNS (NRNS), which recursively decomposes the RNS, so that the MAC unit can be decomposed into circuits of small size. In the CNN using the NRNS, a MAC unit is decomposed into 4-bit ones realized by the look-up tables of the FPGA. This leads to a high clock frequency with less hardware. We designed Tiny YOLOv2 for practical object detection and implemented it, using the CNN based on the NRNS, on a Digilent NetFPGA-SUME FPGA board. Compared with the NVidia GTX1080Ti (Pascal architecture) running the designed Tiny YOLOv2, the FPGA using the NRNS was 3.19 times better in performance-power efficiency.
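The RNS decomposition and the nesting idea described in the abstract can be illustrated with a minimal software sketch. It is not the hardware design of this paper: the moduli sets (7, 11, 13) and (5, 6, 7), the function names, and the sample weights are assumptions chosen only for illustration. Each residue channel performs its MAC independently, and the result is recovered with the Chinese remainder theorem. The nested step replaces one modulo-11 multiplication with smaller multiplications over an inner RNS.

```python
from math import prod

def to_rns(x, moduli):
    """Decompose an integer into its residues for a pairwise-coprime moduli set."""
    return tuple(x % m for m in moduli)

def from_rns(residues, moduli):
    """Recover the integer (mod the product of the moduli) via the CRT."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m) is the modular inverse
    return total % M

def rns_mac(acc, a, b, moduli):
    """One multiply-accumulate step, carried out separately in each residue channel."""
    return tuple((c + x * y) % m for c, x, y, m in zip(acc, a, b, moduli))

def nested_mul_mod(x, y, m, inner):
    """Multiply x*y mod m using an inner RNS (illustrative nesting).

    Assumes prod(inner) > (m - 1)**2 so the exact product fits in the inner range.
    """
    xr, yr = to_rns(x, inner), to_rns(y, inner)
    pr = tuple((p * q) % n for p, q, n in zip(xr, yr, inner))
    return from_rns(pr, inner) % m

# Hypothetical example values: a 3-term dot product of weights and activations.
MODULI = (7, 11, 13)   # assumed outer moduli, dynamic range 7*11*13 = 1001
INNER = (5, 6, 7)      # assumed inner moduli for the modulo-11 channel, range 210
weights = [3, 5, 2]
inputs = [4, 6, 7]

acc = to_rns(0, MODULI)
for w, x in zip(weights, inputs):
    acc = rns_mac(acc, to_rns(w, MODULI), to_rns(x, MODULI), MODULI)

dot = from_rns(acc, MODULI)                     # equals 3*4 + 5*6 + 2*7
nested = nested_mul_mod(9, 10, 11, INNER)       # equals (9*10) % 11
```

In this sketch the inner moduli are all below 8, so every inner multiplication operates on values that fit in three bits, which mirrors the abstract's point that nesting yields small, similarly sized arithmetic units.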
