Publication | Closed Access
Joint Model and Data Adaptation for Cloud Inference Serving
17 Citations · 21 References · Year: 2021 · Venue: Unknown
Keywords: Real-time Inference, Machine Learning, Engineering, Deep Learning Models, Cloud Resource Management, Data Science, Cloud Continuum, Management, Computing Systems, Cloud Inference, Data Integration, Serving Requests, Embedded Machine Learning, Performance Improvement, Data Management, Data Modeling, Network Flows, Computer Science, Cloud Service Adaptation, Deep Learning, Neural Architecture Search, Model Compression, Deep Reinforcement Learning, Cloud Computing, Statistical Inference, Resource Optimization
Real-time deep learning inference serving systems often face prohibitive resource demands and diverse user requirements. Existing inference serving systems mainly focus on computation resource efficiency, largely ignoring the trade-off between the computation and bandwidth resources required. Sub-optimal resource utilization typically leads to substantial wasted serving cost. In this paper, we tackle the dual challenge of the computation-bandwidth trade-off and cost-effectiveness by proposing A<sup>2</sup>, an efficient joint Adaptive-model and Adaptive-data deep learning serving solution across geo-distributed datacenters. Inspired by the insight that there is a trade-off between computational cost and bandwidth cost in achieving the same accuracy, we design a real-time inference serving framework that selectively places different "versions" of the deep learning models at different geo-locations and schedules different data sample versions to be sent to those model versions for inference. The goal is to minimize the total serving cost while meeting the latency and accuracy demands of the serving requests. We formulate a joint placement and serving problem and propose an efficient approximation algorithm that solves it with a theoretical performance guarantee. We deploy A<sup>2</sup> on Amazon EC2 for experiments, which show that A<sup>2</sup> achieves a 30%-50% serving cost reduction under the same required latency and accuracy compared to baselines.
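The core selection step the abstract describes — choosing which model version and which data version should serve a request so that total compute plus bandwidth cost is minimized under latency and accuracy constraints — can be sketched as follows. This is a hypothetical, simplified per-request greedy illustration; all class names, costs, and accuracy numbers are invented for the example, and the paper's actual method is a joint placement-and-serving formulation solved by an approximation algorithm, not this exhaustive scan.

```python
# Illustrative sketch (not the paper's algorithm): pick the cheapest
# (model version, data version) pair meeting a request's accuracy and
# latency requirements. Costs and accuracies below are made up.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class ModelVersion:
    name: str
    accuracy: float      # accuracy of this (possibly compressed) model
    compute_cost: float  # cost per 1k inferences at the hosting datacenter

@dataclass(frozen=True)
class DataVersion:
    name: str
    accuracy_factor: float  # fraction of accuracy retained after downsampling
    bandwidth_cost: float   # cost per 1k samples sent over the WAN
    latency_ms: float       # end-to-end transfer + inference latency

def cheapest_pair(models, datas, min_accuracy, max_latency_ms):
    """Return the feasible (model, data) pair with minimum total cost."""
    best, best_cost = None, float("inf")
    for m, d in product(models, datas):
        if m.accuracy * d.accuracy_factor < min_accuracy:
            continue  # violates the accuracy demand
        if d.latency_ms > max_latency_ms:
            continue  # violates the latency demand
        cost = m.compute_cost + d.bandwidth_cost
        if cost < best_cost:
            best, best_cost = (m, d), cost
    return best, best_cost

models = [ModelVersion("full", 0.78, 3.0), ModelVersion("pruned", 0.74, 1.2)]
datas = [DataVersion("raw", 1.00, 2.5, 80.0),
         DataVersion("downsampled", 0.97, 0.8, 60.0)]

pair, cost = cheapest_pair(models, datas, min_accuracy=0.70, max_latency_ms=100.0)
print(pair[0].name, pair[1].name, cost)  # → pruned downsampled 2.0
```

The sketch exposes the trade-off the paper exploits: the cheaper compressed model paired with downsampled data still clears the accuracy floor (0.74 × 0.97 ≈ 0.72), so it beats the full-model, full-data combination on cost.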