Concepedia

TLDR

Choosing the optimal cloud configuration for recurring big data analytics jobs is difficult because of the large number of VM instance types and cluster sizes, and a poor choice can degrade performance and raise the cost of a job by 2–3× on average and by up to 12× in the worst case. The study aims to automatically identify the best cloud configuration for diverse analytics applications while keeping search cost low. CherryPick uses Bayesian Optimization to build lightweight performance models that are just accurate enough to separate the best or near‑best cloud configuration from the rest after only a few test runs. On five analytics applications running on AWS EC2, CherryPick had a 45–90 % chance of finding the optimal configuration and otherwise found a near‑optimal one, while reducing search cost by up to 75 % compared with existing solutions.

Abstract

Picking the right cloud configuration for recurring big data analytics jobs running in clouds is hard, because there can be tens of possible VM instance types and even more cluster sizes to pick from. Choosing poorly can significantly degrade performance and increase the cost to run a job by 2-3x on average, and by as much as 12x in the worst case. However, it is challenging to automatically identify the best configuration for a broad spectrum of applications and cloud configurations with low search cost. CherryPick is a system that leverages Bayesian Optimization to build performance models for various applications; the models are just accurate enough to distinguish the best or close-to-the-best configuration from the rest with only a few test runs. Our experiments on five analytic applications in AWS EC2 show that CherryPick has a 45-90% chance of finding the optimal configuration and otherwise finds a near-optimal one, saving up to 75% of the search cost compared to existing solutions.
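
The search loop described above can be illustrated with a minimal sketch of Bayesian Optimization over a discrete set of configurations. This is not CherryPick's implementation: the search space (CONFIGS), the synthetic cost function (run_job), the RBF kernel settings, the expected-improvement acquisition, and the fixed run budget are all assumptions chosen to keep the example small and dependency-free.

```python
import math
import random

# Hypothetical search space (not from the paper): (vCPUs per VM, number of VMs).
CONFIGS = [(c, n) for c in (2, 4, 8, 16) for n in (2, 4, 8, 16, 32)]

def run_job(config):
    """Stand-in for a real test run: a made-up cost with its minimum at (4, 8)."""
    cpus, nodes = config
    return 1.0 + (math.log2(cpus) - 2) ** 2 + (math.log2(nodes) - 3) ** 2

def features(config):
    return tuple(math.log2(v) for v in config)

def kernel(a, b, length=1.0):
    """Squared-exponential (RBF) kernel; the length scale is an assumption."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d2 / (2 * length ** 2))

def cholesky(A):
    """Lower-triangular L with L * L^T = A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper_t(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(X, y, x_new, noise=1e-4):
    """Gaussian-process posterior mean and standard deviation at x_new."""
    mean = sum(y) / len(y)
    scale = math.sqrt(sum((v - mean) ** 2 for v in y) / len(y)) or 1.0
    z = [(v - mean) / scale for v in y]  # standardize the observed costs
    K = [[kernel(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, z))
    k_star = [kernel(a, x_new) for a in X]
    mu = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(kernel(x_new, x_new) - sum(vi * vi for vi in v), 1e-12)
    return mean + scale * mu, scale * math.sqrt(var)

def expected_improvement(mu, sigma, best):
    """Expected improvement over the best observed cost (minimization)."""
    imp = best - mu
    z = imp / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return imp * cdf + sigma * pdf

def search(configs, run, n_init=3, budget=8, seed=0):
    """Run a few random configs, then pick each next run by maximum EI."""
    rng = random.Random(seed)
    tried = rng.sample(configs, n_init)
    costs = [run(c) for c in tried]
    while len(tried) < budget:
        X = [features(c) for c in tried]
        best = min(costs)
        scored = []
        for c in configs:
            if c not in tried:
                mu, sigma = gp_posterior(X, costs, features(c))
                scored.append((expected_improvement(mu, sigma, best), c))
        nxt = max(scored)[1]
        tried.append(nxt)
        costs.append(run(nxt))
    i = costs.index(min(costs))
    return tried[i], costs[i], tried

best_config, best_cost, tried = search(CONFIGS, run_job)
print(best_config, best_cost, len(tried))
```

The sketch evaluates only 8 of the 20 candidate configurations. In a real setting, run_job would launch the analytics job on the chosen cluster and return its measured cost, and a stopping rule based on the remaining expected improvement could replace the fixed budget.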
