Just-In-Time Data Virtualization: Lightweight Data Management with ViDa

TLDR

Traditional database architectures struggle with the growing size and heterogeneity of data. Data integration and loading become bottlenecks, and current approaches, which copy data into one or a few repositories and rely on static operators, serve ad-hoc queries poorly. This calls for dynamic, fully adaptive architectures in place of static query processing primitives. ViDa virtualizes data so that it can be processed in its raw format. It generates its query engine just-in-time, with caches and operators that adapt to each query and to the workload, and it provides a language that supports heterogeneous data models and into which existing languages can be translated. Users can therefore choose the language best suited to their analysis.

Abstract

As the size of data and its heterogeneity increase, traditional database system architecture becomes an obstacle to data analysis. Integrating and ingesting (loading) data into databases is quickly becoming a bottleneck in the face of massive data as well as increasingly heterogeneous data formats. Still, state-of-the-art approaches typically rely on copying and transforming data into one (or a few) repositories. Queries, on the other hand, are often ad-hoc and supported by pre-cooked operators that are not adaptive enough to optimize access to data. As data formats and queries increasingly vary, there is a need to depart from the current status quo of static query processing primitives and build dynamic, fully adaptive architectures. We build ViDa, a system that reads data in its raw format and processes queries using adaptive, just-in-time operators. Our key insight is the use of virtualization, i.e., abstracting data and manipulating it regardless of its original format, and the dynamic generation of operators. ViDa’s query engine is generated just-in-time; its caches and its query operators adapt to the current query and the workload, while also treating raw datasets as its native storage structures. Finally, ViDa features a language expressive enough to support heterogeneous data models, and to which existing languages can be translated. Users therefore have the power to choose the language best suited for an analysis.
