Publication | Closed Access

Computation offloading for fast CNN inference in edge computing

Citations: 17 | References: 16 | Year: 2019

Abstract

The Convolutional Neural Network (CNN) is an important computation model for many popular mobile artificial intelligence applications. However, CNN inference, i.e., processing input data with well-trained CNN models, is computation-intensive and incurs heavy overhead on mobile devices with limited hardware resources. In this paper, we propose to offload a portion of the CNN inference computation of mobile devices to an edge computing site. We find that batching tasks on the GPU can significantly reduce average inference time. Based on this observation, we design an algorithm that jointly considers the tasks on all mobile devices and the corresponding batching benefit at the edge site, in contrast to existing work on collaborative inference that lets each mobile device make offloading decisions independently. Furthermore, an online algorithm is proposed to handle the scenario in which CNN inference tasks arrive at different times; it significantly reduces average inference time without knowledge of future task arrivals. Finally, extensive simulations are conducted to evaluate the performance of the proposed algorithms, and the results show that they outperform existing work under different settings.
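The batching benefit described above can be sketched with a toy latency model (an illustrative assumption, not the paper's measured GPU profile): each GPU invocation pays a fixed setup cost plus a per-task cost, so amortizing the setup over a batch lowers the average per-task time.

```python
# Toy model of GPU batching benefit. SETUP_MS and PER_TASK_MS are
# hypothetical constants, not measurements from the paper.

SETUP_MS = 5.0      # assumed fixed cost per GPU invocation (ms)
PER_TASK_MS = 1.0   # assumed marginal cost per task in a batch (ms)

def avg_latency(num_tasks: int, batch_size: int) -> float:
    """Average per-task inference time when `num_tasks` tasks are
    processed in batches of `batch_size` under the toy model."""
    total = 0.0
    remaining = num_tasks
    while remaining > 0:
        b = min(batch_size, remaining)
        total += SETUP_MS + PER_TASK_MS * b  # one invocation per batch
        remaining -= b
    return total / num_tasks

# 32 tasks one by one: every task pays the full setup cost.
one_by_one = avg_latency(32, 1)    # 6.0 ms per task
# 32 tasks in a single batch: the setup cost is shared.
batched = avg_latency(32, 32)      # 1.15625 ms per task
```

Under this model, the average inference time falls monotonically with batch size, which is why a joint offloading decision that forms larger batches at the edge site can beat per-device independent decisions.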
