Spark: cluster computing with working sets

TLDR

MapReduce and its variants excel at large-scale data-intensive tasks on commodity clusters, but their acyclic data-flow model is a poor fit for iterative machine learning and interactive analytics. The authors propose Spark, a framework for applications that reuse a working set of data across multiple parallel operations, while preserving MapReduce-style scalability and fault tolerance. Spark achieves this via resilient distributed datasets (RDDs): read-only collections of objects partitioned across a set of machines, whose lost partitions can be rebuilt. Spark can outperform Hadoop by 10x on iterative machine learning jobs and can interactively query a 39 GB dataset with sub-second response time.

Abstract

MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
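To make the RDD idea concrete, here is a minimal single-process sketch in Python. It is not Spark's API: `ToyRDD`, `from_list`, `lose_partition`, and the `computations` counter are hypothetical names invented for illustration, and "machines" are simulated by an in-memory cache. It assumes that a partition is rebuilt by re-running the function that derived it from its parent, which is one plausible reading of "can be rebuilt if a partition is lost".

```python
from functools import reduce
from operator import add


class ToyRDD:
    """Toy read-only, partitioned collection that remembers how to rebuild itself."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute      # maps a partition index to a tuple of items
        self._cache = {}             # partitions currently held "in memory"
        self.computations = 0        # how many partitions were (re)built so far

    @classmethod
    def from_list(cls, data, num_partitions):
        # Stand-in for reading from stable storage: split the data round-robin.
        chunks = [tuple(data[i::num_partitions]) for i in range(num_partitions)]
        return cls(num_partitions, lambda i: chunks[i])

    def map(self, fn):
        # The child records only its parent and the function to apply.
        return ToyRDD(
            self.num_partitions,
            lambda i: tuple(fn(x) for x in self.partition(i)),
        )

    def partition(self, i):
        # Serve from the cache if present; otherwise rebuild from the parent.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
            self.computations += 1
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate a machine failure by dropping one cached partition.
        self._cache.pop(i, None)

    def reduce(self, fn):
        items = [x for i in range(self.num_partitions) for x in self.partition(i)]
        return reduce(fn, items)


base = ToyRDD.from_list(list(range(10)), num_partitions=3)
squares = base.map(lambda x: x * x)

total_before = squares.reduce(add)   # builds all three partitions of squares
squares.lose_partition(1)            # one partition is lost
total_after = squares.reduce(add)    # only the lost partition is rebuilt
```

In this sketch the second `reduce` reuses the two surviving cached partitions and rebuilds only the lost one, which mirrors the working-set reuse that the abstract contrasts with acyclic data flow.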
