Concepedia

TLDR

DryadLINQ extends prior systems such as SQL, MapReduce, and Dryad by using strongly typed .NET objects and supporting both imperative and declarative dataset operations, building on Dryad’s proven, scalable execution platform. The authors present DryadLINQ, a system and language extensions that provide a new programming model for large‑scale distributed computing. DryadLINQ compiles LINQ expressions into a distributed execution plan that is automatically translated and executed by the Dryad platform, with a compiler and runtime that support debugging in standard .NET tools and are evaluated on diverse workloads. DryadLINQ achieves excellent absolute performance, sorting 1,012 GB of data in 319 s on a 240‑node cluster, and demonstrates near‑linear scaling across representative applications.

Abstract

DryadLINQ is a system and a set of language extensions that enable a new programming model for large scale distributed computing. It generalizes previous execution environments such as SQL, MapReduce, and Dryad in two ways: by adopting an expressive data model of strongly typed .NET objects; and by supporting general-purpose imperative and declarative operations on datasets within a traditional high-level programming language.A DryadLINQ program is a sequential program composed of LINQ expressions performing arbitrary side-effect-free transformations on datasets, and can be written and debugged using standard .NET development tools. The DryadLINQ system automatically and transparently translates the data-parallel portions of the program into a distributed execution plan which is passed to the Dryad execution platform. Dryad, which has been in continuous operation for several years on production clusters made up of thousands of computers, ensures efficient, reliable execution of this plan.We describe the implementation of the DryadLINQ compiler and runtime. We evaluate DryadLINQ on a varied set of programs drawn from domains such as web-graph analysis, large-scale log mining, and machine learning. We show that excellent absolute performance can be attained--a general-purpose sort of 1012 Bytes of data executes in 319 seconds on a 240-computer, 960- disk cluster--as well as demonstrating near-linear scaling of execution time on representative applications as we vary the number of computers used for a job.

References

YearCitations

Page 1