Publication | Open Access
A Cost Model for SPARK SQL
Citations: 26
References: 18
Year: 2018
Cluster Computing, Spark SQL, Engineering, Computer Architecture, Cost Model, Map-reduce, Business Analytics, Operations Research, Data Science, Management, Big Data, Parallel Computing, Data Management, Quantitative Management, Parallel Database, High-performance Data Analytics, Very Large Database, Computer Engineering, Novel Cost Model, Computer Science, Distributed Query Processing, Database Technology, Query Optimization, Cloud Computing, Parallel Programming, Data Modeling
In this paper, we propose a novel cost model for Spark SQL. The cost model covers the class of Generalized Projection, Selection, Join (GPSJ) queries. It takes into account the network and IO costs as well as the most relevant CPU costs. The execution cost is computed starting from a physical plan produced by Spark. The set of operations adopted by Spark when executing a GPSJ query is analytically modeled based on the cluster and application parameters, together with a set of database statistics. Experiments on three benchmarks and on two clusters of different sizes and computational characteristics show that our model estimates the actual execution time with an average error of about 20 percent. This accuracy is good enough to let the system choose the most effective plan even when the execution time differences are limited. The error can be reduced to 14 percent if the analytic model is coupled with our straggler handling strategy.
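To make the two ideas in the abstract concrete, here is a minimal sketch of how estimated plan costs could be compared and how an average estimation error could be measured. It is not the paper's model: the `PlanCost` class, the additive combination of network, IO, and CPU components, the plan names, and all numbers are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanCost:
    """Hypothetical per-plan cost estimate in seconds (not the paper's formulas)."""
    name: str
    network: float
    io: float
    cpu: float

    @property
    def total(self) -> float:
        # Assumption: the components simply add up; the actual model
        # may combine them differently.
        return self.network + self.io + self.cpu


def cheapest_plan(plans):
    """Return the candidate plan with the lowest estimated total cost."""
    return min(plans, key=lambda p: p.total)


def mean_relative_error(estimated, actual):
    """Average of |estimated - actual| / actual over paired measurements."""
    if len(estimated) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(e - a) / a for e, a in zip(estimated, actual)) / len(actual)


# Made-up numbers, for illustration only.
plans = [
    PlanCost("broadcast_join", network=4.0, io=10.0, cpu=3.0),
    PlanCost("shuffle_join", network=9.0, io=10.0, cpu=2.5),
]
best = cheapest_plan(plans)
error = mean_relative_error([12.0, 8.0], [10.0, 10.0])
```

A relative error metric of this kind is one plausible reading of the "20 percent" figure; the abstract does not specify how the error is computed. The point of the sketch is only that plan selection needs the ranking of estimates to be right, which is why a moderate average error can still be sufficient.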