Robust estimation of resource consumption for SQL queries using statistical techniques

TLDR

Accurate estimation of SQL query resource consumption is essential for admission control, query scheduling, and costing during query optimization. Recent work has explored statistical techniques as an alternative to manually constructed cost models, because learned models can account for hardware characteristics and bias in cardinality estimates. Existing statistical approaches, however, generalize poorly to queries that differ from the training examples. This study combines database query processing knowledge with statistical models to address that lack of robustness. The authors model resource usage at the operator level, assigning distinct models and features to each operator type while explicitly capturing each operator's asymptotic behavior, and validate the approach on large-scale real-life and benchmark workloads on Microsoft SQL Server. The resulting method achieves significantly better estimation accuracy and can estimate resource usage for arbitrary query plans, even when they differ markedly from the training data.

Abstract

The ability to estimate resource consumption of SQL queries is crucial for a number of tasks in a database system, such as admission control, query scheduling, and costing during query optimization. Recent work has explored the use of statistical techniques for resource estimation in place of the manually constructed cost models used in query optimization. Such techniques, which require examples of query resource usage as training data, offer the promise of superior estimation accuracy since they can account for factors such as hardware characteristics of the system or bias in cardinality estimates. However, the proposed approaches lack robustness in that they do not generalize well to queries that are different from the training examples, resulting in significant estimation errors. Our approach aims to address this problem by combining knowledge of database query processing with statistical models. We model resource usage at the level of individual operators, with different models and features for each operator type, and explicitly model the asymptotic behavior of each operator. This results in significantly better estimation accuracy and the ability to estimate resource usage of arbitrary plans, even when they are very different from the training instances. We validate our approach using various large-scale real-life and benchmark workloads on Microsoft SQL Server.
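
The idea of per-operator models with an explicit asymptotic form can be illustrated with a minimal sketch. This is not the authors' implementation: the operator names, the growth functions (linear for a scan, n log n for a sort), the single-coefficient least-squares fit, and the toy training numbers are all assumptions chosen for illustration. The sketch fits one model per operator type on small inputs and then estimates a plan whose inputs are far larger than anything in the training set.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical asymptotic forms per operator type (illustrative assumptions,
# not taken from the paper): cost is modeled as c * g(rows).
ASYMPTOTIC_FORM: Dict[str, Callable[[float], float]] = {
    "scan": lambda n: n,
    "sort": lambda n: n * math.log2(max(n, 2.0)),
}

def fit_operator_models(samples: List[Tuple[str, float, float]]) -> Dict[str, float]:
    """Fit one coefficient per operator type by least squares through the origin.

    Each sample is (operator_type, input_rows, observed_cost).
    """
    num: Dict[str, float] = {}
    den: Dict[str, float] = {}
    for op, rows, cost in samples:
        g = ASYMPTOTIC_FORM[op](rows)
        num[op] = num.get(op, 0.0) + g * cost
        den[op] = den.get(op, 0.0) + g * g
    return {op: num[op] / den[op] for op in num}

def estimate_plan(plan: List[Tuple[str, float]], coeffs: Dict[str, float]) -> float:
    """Estimate a plan's cost as the sum of its per-operator estimates."""
    return sum(coeffs[op] * ASYMPTOTIC_FORM[op](rows) for op, rows in plan)

# Toy training data on small inputs (made-up costs in arbitrary units).
training = [
    ("scan", 1_000, 2.0), ("scan", 4_000, 8.0),
    ("sort", 1_024, 51.2), ("sort", 4_096, 245.76),
]
coeffs = fit_operator_models(training)

# A plan with inputs far outside the training range.
big_plan = [("scan", 1_000_000), ("sort", 1_000_000)]
estimate = estimate_plan(big_plan, coeffs)
print(coeffs, estimate)
```

Because the growth function is fixed per operator and only the coefficient is learned, the estimate for the large plan follows the assumed asymptotic shape rather than whatever a generic regressor would extrapolate from small training inputs.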
