Publication | Closed Access
Photon: A Fast Query Engine for Lakehouse Systems
Citations: 30
References: 26
Year: 2022
Cluster Computing, Engineering, Computer Architecture, Apache Spark API, Semantic Web, Apache Parquet, Information Retrieval, Data Science, Database Support, Data-intensive Platform, Management, Data Integration, Big Data, Parallel Computing, Data Management, High-performance Data Analytics, Computer Science, Distributed Query Processing, Data-intensive Computing, Query Optimization, Data Lakes, Cloud Computing, Parallel Programming, Fast Query Engine, Approximate Query Answering, Massive Data Processing, Data Modeling
Many organizations are shifting to a data management paradigm called the "Lakehouse," which implements the functionality of structured data warehouses on top of unstructured data lakes. This presents new challenges for query execution engines. The engine needs to provide good performance on the raw uncurated datasets that are ubiquitous in data lakes, and excellent performance on structured data stored in popular columnar file formats like Apache Parquet. Toward these goals, we present Photon, a vectorized query engine for Lakehouse environments that we developed at Databricks. Photon can outperform existing warehouses on SQL workloads and also supports the Apache Spark API. We discuss the design choices we made in Photon (e.g., vectorization vs. code generation) and describe its integration with our existing SQL and Apache Spark runtimes, its task model, and its memory manager. Photon has accelerated some customer workloads by over 10x and has recently allowed Databricks to set a new audited performance record for the official 100TB TPC-DS benchmark.