Publication | Closed Access
Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs
Citations: 206
References: 21
Year: 2017
Unknown Venue
Convolutional Neural Network, Engineering, Machine Learning, Hardware Algorithm, Computer Architecture, Conventional Convolution Algorithm, Image Analysis, Data Science, Pattern Recognition, Video Transformer, Heterogeneous Algorithms, Machine Vision, Object Detection, Computer Engineering, Heterogeneous Systems, Computer Science, Deep Learning, Neural Architecture Search, FPGA Design, Model Compression, Computer Vision, Hardware Acceleration, FPGA Bitstream, Domain-specific Accelerator
Convolutional neural networks (CNNs) are widely used in computer vision, and several algorithms exist for computing them. Winograd's minimal filtering uses fewer computing resources than conventional convolution but puts more pressure on memory bandwidth. The study investigates efficient FPGA acceleration of CNNs by comparing the two algorithms and proposing a fusion architecture in which the algorithm is chosen per layer. The authors implement a layer-fusion architecture that reuses intermediate data, design an optimal algorithm that determines the fusion and algorithm strategy for each layer, and build an automated toolchain that maps Caffe models to FPGA bitstreams using Vivado HLS. Experiments on VGG and AlexNet show up to a 1.99× speedup over the prior fusion-based FPGA accelerator.
Convolutional neural networks (CNNs) are used in a variety of computer vision tasks, ranging from object recognition and detection to scene understanding, owing to their exceptional accuracy. Different algorithms exist for CNN computation. In this paper, we compare the conventional convolution algorithm with a faster algorithm based on Winograd's minimal filtering theory for efficient FPGA implementation. Unlike the conventional convolution algorithm, the Winograd algorithm uses fewer computing resources but puts more pressure on memory bandwidth. We first propose a fusion architecture that can naturally fuse multiple layers of a CNN, reusing the intermediate data. Based on this fusion architecture, we explore heterogeneous algorithms to maximize the throughput of a CNN. We design an optimal algorithm to determine the fusion and algorithm strategy for each layer. We also develop an automated toolchain that eases the mapping from a Caffe model to an FPGA bitstream using Vivado HLS. Experiments using the widely used VGG and AlexNet models demonstrate that our design achieves up to a 1.99× speedup compared to the prior fusion-based FPGA accelerator for CNNs.
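To make the trade-off between the two algorithms concrete, the following is a minimal sketch of the textbook one-dimensional Winograd minimal filtering case F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications where the direct form needs 6. It is an illustration under stated assumptions, not the paper's implementation: the function names `direct_fir` and `winograd_f2_3` are hypothetical, and the tile sizes, the two-dimensional form, and the FPGA mapping used by the authors are not given in the abstract.

```python
def direct_fir(d, g):
    """Direct form: two outputs of a 3-tap filter over a 4-sample tile.

    Uses 6 multiplications (3 per output).
    """
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f2_3(d, g):
    """Winograd minimal filtering F(2,3): same two outputs, 4 multiplications.

    Illustrative sketch only; names and structure are assumptions.
    """
    # Filter transform: depends only on the weights, so it can be
    # precomputed once per filter and reused for every input tile.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]

    # Input transform: additions and subtractions only.
    v0 = d[0] - d[2]
    v1 = d[1] + d[2]
    v2 = d[2] - d[1]
    v3 = d[1] - d[3]

    # Element-wise products: the only 4 multiplications on the data path.
    m0 = v0 * u0
    m1 = v1 * u1
    m2 = v2 * u2
    m3 = v3 * u3

    # Output transform: additions and subtractions only.
    return [m0 + m1 + m2, m1 - m2 - m3]


if __name__ == "__main__":
    tile = [1.0, 2.0, 3.0, 4.0]
    weights = [0.5, 1.0, -0.5]
    print(direct_fir(tile, weights))
    print(winograd_f2_3(tile, weights))
```

The sketch trades multiplications for extra additions and for transformed data that must be produced and moved around, which is one way to read the abstract's remark that the Winograd algorithm saves computing resources while increasing pressure on memory bandwidth.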