Shark - Concepedia

TLDR

Shark is a new data‑analysis system that combines query processing with complex analytics on large clusters, demonstrating that significant speedups can be achieved while preserving a MapReduce‑like execution engine and fine‑grained fault tolerance. It achieves this by leveraging a novel distributed memory abstraction, column‑oriented in‑memory storage, and dynamic mid‑query replanning to run SQL queries and iterative machine learning functions at scale, while efficiently recovering from failures mid‑query. Shark executes SQL queries up to 100× faster than Apache Hive and machine learning programs more than 100× faster than Hadoop, matching the performance of MPP analytic databases while offering fault tolerance and advanced analytics capabilities.

Abstract

Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g. iterative machine learning) at scale, and efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100X faster than Apache Hive, and machine learning programs more than 100X faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine, and the fine-grained fault tolerance properties that such engine provides. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL. The result is a system that matches the speedups reported for MPP analytic databases over MapReduce, while offering fault tolerance properties and complex analytics capabilities that they lack.

References

Page 1

	Year	Citations

Page 1