Publication | Closed Access
A portable, automatic data quantizer for deep neural networks
Citations: 25
References: 51
Year: 2018
Venue: Unknown
Artificial Intelligence, Convolutional Neural Network, Deep Neural Networks, Engineering, Machine Learning, Sparse Neural Network, Automatic Quantization Framework, Autoencoders, Computer Engineering, Quantization Techniques, DNN Frameworks, Neural Architecture Search, Computer Science, Deep Learning, Automatic Data Quantizer, Model Compression
With the proliferation of AI-based applications and services, there are strong demands for efficient processing of deep neural networks (DNNs). DNNs are known to be both compute- and memory-intensive as they require a tremendous amount of computation and large memory space. Quantization is a popular technique to boost the efficiency of DNNs by representing a number with fewer bits, hence reducing both computational strength and memory footprint. However, finding an optimal number representation for a DNN is difficult due to a combinatorial explosion in feasible number representations with varying bit widths, which is only exacerbated by layer-wise optimization. Moreover, existing quantization techniques often target a specific DNN framework and/or hardware platform, lacking portability across execution environments. To address this, we propose libnumber, a portable, automatic quantization framework for DNNs. By introducing the Number abstract data type (ADT), libnumber encapsulates the internal representation of a number, hiding it from the user. The auto-tuner of libnumber then finds a compact representation (type, bit width, and bias) for the number that minimizes the user-supplied objective function while satisfying the accuracy constraint. Thus, libnumber effectively separates the concern of developing an effective DNN model from the low-level optimization of number representation. Our evaluation using eleven DNN models on two DNN frameworks targeting an FPGA platform demonstrates over 8× (7×) reduction in parameter size on average when up to 7% (1%) loss of relative accuracy is tolerable, with a maximum reduction of 16×, compared to the baseline using 32-bit floating-point numbers. This leads to a geometric-mean speedup of 3.79× with a maximum speedup of 12.77× over the baseline, while requiring only minimal programmer effort.
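To make the idea of tuning a number representation concrete, the following is a minimal sketch of a fixed-point format parameterized by bit width and bias, with a brute-force loop that picks the narrowest width whose quantization error stays under a tolerance. All names here (FixedPoint, quantize, quant_error) and the error-based stopping rule are illustrative assumptions, not libnumber's actual API; the real auto-tuner searches over type, bit width, and bias and checks the accuracy constraint by evaluating the model itself.

```cpp
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

// Hypothetical fixed-point "Number": value = raw * 2^(-bias), stored in
// `bit_width` signed bits. An illustration only, not libnumber's real ADT.
struct FixedPoint {
    int bit_width;
    int bias;

    // Round-trip a float through the representation, saturating at the range.
    double quantize(double x) const {
        double scale = std::ldexp(1.0, bias);                 // 2^bias
        double max_raw = std::ldexp(1.0, bit_width - 1) - 1;  // 2^(w-1) - 1
        double raw = std::round(x * scale);
        raw = std::min(std::max(raw, -max_raw - 1.0), max_raw);
        return raw / scale;
    }
};

// Mean absolute quantization error of a parameter tensor under a format.
double quant_error(const std::vector<double>& params, const FixedPoint& fp) {
    double err = 0.0;
    for (double p : params) err += std::abs(p - fp.quantize(p));
    return err / params.size();
}

int main() {
    // Toy stand-in for one layer's weights.
    std::vector<double> weights = {0.031, -0.72, 0.18, 0.004, -0.25, 0.61};

    double max_abs = 0.0;
    for (double w : weights) max_abs = std::max(max_abs, std::abs(w));

    // Auto-tuning loop: take the narrowest bit width whose error stays below
    // a tolerance (a stand-in for the paper's accuracy constraint).
    const double tolerance = 0.01;
    for (int bits = 2; bits <= 16; ++bits) {
        // Largest bias (finest precision) that still covers max_abs.
        int bias = bits - 1 - static_cast<int>(std::ceil(std::log2(max_abs)));
        FixedPoint fp{bits, bias};
        double err = quant_error(weights, fp);
        if (err < tolerance) {
            std::cout << "chose " << bits << "-bit fixed point, bias " << bias
                      << ", mean error " << err << "\n";
            break;
        }
    }
    return 0;
}
```

In this toy setting the loop settles on a 6-bit format for the sample weights; a per-layer search of this kind is what makes the space combinatorial once every layer may choose its own type, width, and bias.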