Publication | Closed Access
Joint Model and Data Adaptation for Cloud Inference Serving
17 Citations · 21 References · Year: 2021 · Venue: Unknown
Keywords: Real-time Inference, Machine Learning, Engineering, Deep Learning Models, Cloud Resource Management, Data Science, Cloud Continuum, Management, Computing Systems, Cloud Inference, Data Integration, Serving Requests, Embedded Machine Learning, Performance Improvement, Data Management, Data Modeling, Network Flows, Computer Science, Cloud Service Adaptation, Deep Learning, Neural Architecture Search, Model Compression, Deep Reinforcement Learning, Cloud Computing, Statistical Inference, Resource Optimization
Real-time deep learning inference serving systems often face prohibitive resource demands and diverse user requirements. Existing inference serving systems mainly focus on computation resource efficiency, largely ignoring the trade-off between the computation and bandwidth resources required. Sub-optimal resource utilization typically leads to substantial wasted serving cost. In this paper, we tackle the dual challenge of the computation-bandwidth trade-off and cost-effectiveness by proposing A<sup>2</sup>, an efficient joint Adaptive-model and Adaptive-data deep learning serving solution across geo-distributed datacenters. Inspired by the insight that there is a trade-off between computational cost and bandwidth cost in achieving the same accuracy, we design a real-time inference serving framework that selectively places different "versions" of the deep learning models at different geo-locations and schedules different data sample versions to be sent to those model versions for inference. The goal is to minimize the total serving cost while meeting the latency and accuracy demands of the serving requests. We formulate a joint placement and serving problem and propose an efficient approximation algorithm that solves it with a theoretical performance guarantee. We deploy A<sup>2</sup> on Amazon EC2 for experiments, which show that A<sup>2</sup> achieves a 30%-50% serving cost reduction under the same required latency and accuracy compared to baselines.
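The core selection step the abstract describes — choosing which model version and which data version should serve a request so that total compute plus bandwidth cost is minimized under latency and accuracy constraints — can be sketched as follows. This is a hypothetical, simplified per-request greedy illustration; all class names, costs, and accuracy numbers are invented for the example, and the paper's actual method is a joint placement-and-serving formulation solved by an approximation algorithm, not this exhaustive scan.

```python
# Illustrative sketch (not the paper's algorithm): pick the cheapest
# (model version, data version) pair meeting a request's accuracy and
# latency requirements. Costs and accuracies below are made up.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class ModelVersion:
    name: str
    accuracy: float      # accuracy of this (possibly compressed) model
    compute_cost: float  # cost per 1k inferences at the hosting datacenter

@dataclass(frozen=True)
class DataVersion:
    name: str
    accuracy_factor: float  # fraction of accuracy retained after downsampling
    bandwidth_cost: float   # cost per 1k samples sent over the WAN
    latency_ms: float       # end-to-end transfer + inference latency

def cheapest_pair(models, datas, min_accuracy, max_latency_ms):
    """Return the feasible (model, data) pair with minimum total cost."""
    best, best_cost = None, float("inf")
    for m, d in product(models, datas):
        if m.accuracy * d.accuracy_factor < min_accuracy:
            continue  # violates the accuracy demand
        if d.latency_ms > max_latency_ms:
            continue  # violates the latency demand
        cost = m.compute_cost + d.bandwidth_cost
        if cost < best_cost:
            best, best_cost = (m, d), cost
    return best, best_cost

models = [ModelVersion("full", 0.78, 3.0), ModelVersion("pruned", 0.74, 1.2)]
datas = [DataVersion("raw", 1.00, 2.5, 80.0),
         DataVersion("downsampled", 0.97, 0.8, 60.0)]

pair, cost = cheapest_pair(models, datas, min_accuracy=0.70, max_latency_ms=100.0)
print(pair[0].name, pair[1].name, cost)  # → pruned downsampled 2.0
```

The sketch exposes the trade-off the paper exploits: the cheaper compressed model paired with downsampled data still clears the accuracy floor (0.74 × 0.97 ≈ 0.72), so it beats the full-model, full-data combination on cost.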