Managing data transfers in computer clusters with Orchestra

TLDR

Cluster computing applications such as MapReduce and Dryad transfer massive amounts of data between computation stages, and these transfers can account for more than 50% of job completion times, yet little work has focused on optimizing them beyond per-flow traffic management. The authors propose a global management architecture and a set of algorithms that improve transfer times for common communication patterns such as broadcast and shuffle, and that enable transfer-level scheduling policies such as prioritizing one transfer over others. In prototype experiments, broadcast completion times improve by up to 4.5× compared to the status quo in Hadoop, and transfer-level scheduling reduces the completion time of high-priority transfers by 1.7×.

Abstract

Cluster computing applications like MapReduce and Dryad transfer massive amounts of data between their computation stages. These transfers can have a significant impact on job performance, accounting for more than 50% of job completion times. Despite this impact, there has been relatively little work on optimizing the performance of these data transfers, with networking researchers traditionally focusing on per-flow traffic management. We address this limitation by proposing a global management architecture and a set of algorithms that (1) improve the transfer times of common communication patterns, such as broadcast and shuffle, and (2) allow scheduling policies at the transfer level, such as prioritizing a transfer over other transfers. Using a prototype implementation, we show that our solution improves broadcast completion times by up to 4.5X compared to the status quo in Hadoop. We also show that transfer-level scheduling can reduce the completion time of high-priority transfers by 1.7X.
