Publication | Closed Access
Hone
Citations: 17
References: 13
Year: 2013
Cluster Computing, Engineering, Machine Learning, Data Science, Hadoop Jar, Cloud Computing, Data-intensive Platform, Computer Architecture, Parallel Programming, Computer Science, Prototype Runtime, Map-reduce, Parallel Computing, Data Management, Data-intensive Computing, Massive Data Processing, Big Data, High-performance Data Analytics
The underlying assumption behind Hadoop and, more generally, the need for distributed processing is that the data to be analyzed cannot be held in memory on a single machine. Today, this assumption needs to be re-evaluated. Although petabyte-scale data stores are increasingly common, it is unclear whether "typical" analytics tasks require more than a single high-end server. Additionally, we are seeing increased sophistication in analytics, e.g., machine learning, which generally operates over smaller and more refined datasets. To address these trends, we propose "scaling down" Hadoop to run on shared-memory machines. This paper presents a prototype runtime called Hone, intended to be both API and binary compatible with standard (distributed) Hadoop. That is, Hone can take an existing Hadoop jar and efficiently execute it, without modification, on a multi-core shared-memory machine. This allows us to take existing Hadoop algorithms and find the most suitable runtime environment for execution on datasets of varying sizes. Our experiments show that Hone can be an order of magnitude faster than Hadoop pseudo-distributed mode (PDM); on dataset sizes that fit into memory, Hone can outperform a fully distributed 15-node Hadoop cluster in some cases as well.
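To make the "scaling down" idea concrete, the following is a minimal sketch of a map, shuffle, and reduce pipeline that runs entirely inside one process, with intermediate data held in memory. It is an illustration only and not Hone's implementation: Hone executes unmodified Hadoop jars, whereas the function names here (`map_words`, `reduce_counts`, `run_in_memory`) and the word-count job are hypothetical stand-ins chosen for brevity.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(split):
    """Mapper: emit (word, 1) for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def reduce_counts(item):
    """Reducer: sum all values collected for one key."""
    key, values = item
    return key, sum(values)


def run_in_memory(splits, mapper, reducer, workers=4):
    """Run map, shuffle, and reduce inside a single process."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map phase: one task per input split.
        mapped = list(pool.map(mapper, splits))

        # Shuffle phase: group values by key in an ordinary dict,
        # with no disk spill or network transfer.
        groups = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                groups[key].append(value)

        # Reduce phase: one task per key.
        return dict(pool.map(reducer, groups.items()))


# Hypothetical input: each split is a list of text lines.
splits = [
    ["hadoop runs on clusters", "hone runs on one machine"],
    ["hone runs hadoop jars"],
]
counts = run_in_memory(splits, map_words, reduce_counts)
```

The point of the sketch is the shuffle step: when the dataset fits in memory, grouping by key is a dictionary update, not a sort-and-transfer across nodes. In CPython, threads do not run CPU-bound work in parallel, so this shows only the structure of a shared-memory runtime, not the speedups reported above.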