BSTC - Concepedia

Abstract

Binarized neural networks (or BNNs) promise tremendous performance improvement over traditional DNNs through simplified bit-level computation and significantly reduced memory access/storage cost. In addition, it has advantages of low-cost, low-energy, and high-robustness, showing great potential in resources-constrained, volatile, and latency-critical applications, which are critical for future HPC, cloud, and edge applications. However, the promised significant performance gain of BNN inference has never been fully demonstrated on general-purpose processors, particularly on GPUs, due to: (i) the challenge of extracting and leveraging sufficient finegrained bit-level-parallelism to saturate GPU cores when the batch size is small; (ii) the fundamental design conflict between bit-based BNN algorithm and word-based architecture; and (iii) architecture & performance unfriendly to BNN network design. To address (i) and (ii), we propose a binarized-soft-tensor-core as a software-hardware codesign approach to construct bit-manipulation capability for modern GPUs and thereby effectively harvest bit-level-parallelism (BLP). To tackle (iii), we propose intra- and inter-layer fusion techniques so that the entire BNN inference execution can be packed into a single GPU kernel, and so avoid the high-cost of frequent launching and releasing. Experiments show that our Singular-Binarized-Neural-Network (SBNN) design can achieve over 1000X speedup for raw inference latency over the state-of-the-art full-precision BNN inference for AlexNet on GPUs. Comparisons with CPU, GPU, FPGA and Xeon-Phi demonstrate the effectiveness of our design. SBNN is opensourced and available at https://github.com/uuudown/SBNN.

References

Page 1

	Year	Citations

Page 1