Specializing FGPU for Persistent Deep Learning

Abstract

Overlay architectures enable fast development and debugging on FPGAs, at the expense of potentially limited performance compared to fully customized FPGA designs. When used in concert with a hand-tuned FPGA solution, a performant overlay architecture can improve time-to-solution and thus the overall productivity of FPGA solutions. In this work, we tune and specialize FGPU, an open-source, OpenCL-programmable GPU overlay for FPGAs. We demonstrate that our PDL-FGPU architecture maintains the ease of programming and generality of a software-programmable soft GPU while achieving high performance through specialization for the persistent deep learning (PDL) domain. We also propose an easy method to specialize for different domains. PDL-FGPU includes new instructions, along with micro-architecture and compiler enhancements. We evaluate both the FGPU baseline and the proposed PDL-FGPU on a modern high-end Intel Stratix 10 2800 FPGA, running a set of persistent DL applications (RNN, GRU, LSTM) as well as general non-DL applications to demonstrate generality. PDL-FGPU requires 1.5-3x more ALMs, 4.4-6.4x more M20Ks, and 4.6-10x more DSPs than the FGPU baseline, but improves performance by 55-727x for persistent DL applications, with an average 15% degradation on general non-PDL applications. We also demonstrate that PDL-FGPU is only 4-7x slower than the Nvidia Volta V100 GPU.
