Publication | Closed Access
A comparison of approaches to large-scale data analysis
Citations: 1.1K
References: 14
Year: 2009
Venue: Unknown
Cluster Computing, Engineering, Parallel DBMSs, Computer Architecture, Map-reduce, Basic Control Flow, Data Science, Data Mining, Management, Data Integration, Parallel Computing, Data Management, Statistics, Parallel Database, High-performance Data Analytics, Data Modeling, Knowledge Discovery, Large-scale Data Analysis, Computer Science, Big Data Search, Data-intensive Computing, MR System, Cloud Computing, Parallel Programming, Massive Data Processing, Big Data
MapReduce has generated enthusiasm for large‑scale data analysis, yet its core control flow has existed in parallel SQL database systems for over 20 years. The paper describes and compares the MapReduce and parallel DBMS paradigms in terms of performance and development complexity. The authors ran a benchmark of several tasks on an open‑source MapReduce implementation and on two parallel DBMSs, measuring each system at various degrees of parallelism on a 100‑node cluster. The parallel DBMSs took much longer to load data and to tune, but once running they performed the tasks strikingly faster than the MapReduce system. The authors speculate about the causes of this gap and consider which implementation concepts future systems should take from each architecture.
There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.
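To make the "basic control flow" mentioned in the abstract concrete, the following is a minimal single-process sketch of the map, shuffle, and reduce steps. It is an illustration only, not the benchmark code or any system evaluated in the paper. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_mapreduce`) and the sample `visits` data are hypothetical. The grouped sum it computes is roughly what a SQL system would express as `SELECT site, SUM(duration) FROM visits GROUP BY site`.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical record type for illustration: (site, duration).
Record = Tuple[str, float]
Pair = Tuple[str, float]


def map_phase(records: Iterable[Record],
              mapper: Callable[[Record], Iterable[Pair]]) -> Iterator[Pair]:
    """Apply the user-supplied mapper to every input record."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs: Iterable[Pair]) -> Dict[str, List[float]]:
    """Group intermediate values by key (done across nodes in a real system)."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[float]],
                 reducer: Callable[[str, List[float]], float]) -> Dict[str, float]:
    """Apply the user-supplied reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_mapreduce(records: Iterable[Record],
                  mapper: Callable[[Record], Iterable[Pair]],
                  reducer: Callable[[str, List[float]], float]) -> Dict[str, float]:
    """Chain the three steps: map, then shuffle, then reduce."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Made-up sample data, assumed for the example only.
visits: List[Record] = [("a.example", 1.5), ("b.example", 2.0), ("a.example", 0.5)]

totals = run_mapreduce(
    visits,
    mapper=lambda rec: [(rec[0], rec[1])],    # emit (site, duration)
    reducer=lambda key, values: sum(values),  # total duration per site
)
```

In a distributed setting the shuffle step is where data is repartitioned across nodes. A parallel DBMS performs a comparable repartitioning when it executes a grouped aggregation, which is the sense in which the two control flows resemble each other.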