Improving MapReduce performance in heterogeneous environments

TLDR

MapReduce, especially Hadoop, is widely used for large‑scale data‑parallel tasks, but its scheduler assumes homogeneous nodes and linear task progress, assumptions that fail in heterogeneous settings such as Amazon EC2 virtualized data centers. This work proposes the Longest Approximate Time to End (LATE) scheduling algorithm to mitigate the performance degradation caused by heterogeneity in Hadoop clusters. LATE estimates each task’s remaining execution time and schedules tasks based on these estimates, thereby avoiding speculative re‑execution triggered by incorrect straggler detection. Experiments demonstrate that Hadoop’s scheduler can severely degrade performance in heterogeneous environments, whereas LATE reduces response times by a factor of two on a 200‑VM EC2 cluster.

Abstract

MapReduce is emerging as an important programming model for large-scale data-parallel applications such as web indexing, data mining, and scientific simulation. Hadoop is an open-source implementation of MapReduce enjoying wide adoption and is often used for short jobs where low response time is critical. Hadoop's performance is closely tied to its task scheduler, which implicitly assumes that cluster nodes are homogeneous and tasks make progress linearly, and uses these assumptions to decide when to speculatively re-execute tasks that appear to be stragglers. In practice, the homogeneity assumptions do not always hold. An especially compelling setting where this occurs is a virtualized data center, such as Amazon's Elastic Compute Cloud (EC2). We show that Hadoop's scheduler can cause severe performance degradation in heterogeneous environments. We design a new scheduling algorithm, Longest Approximate Time to End (LATE), that is highly robust to heterogeneity. LATE can improve Hadoop response times by a factor of 2 in clusters of 200 virtual machines on EC2.

References

Page 1

	Year	Citations

Page 1