Publication | Closed Access

Effective straggler mitigation: attack of the clones

Citations: 414

References: 30

Year: 2013

TLDR

Small interactive data-analysis jobs in datacenters suffer from stragglers that are, on average, eight times slower than the median task, raising average job duration by 47%, because current mitigation techniques all rely on waiting and speculation. The authors propose fully cloning small jobs to eliminate waiting and speculation. Cloning adds only marginal utilization overhead because small jobs, although they are the majority of jobs, consume a small fraction of cluster resources. They implement this in Dolly, using delay assignment to keep extra clones from contending for intermediate data. Evaluation on production workloads shows Dolly speeds up small jobs by 34% to 46% while using only 5% additional resources.

Abstract

Small jobs, which are typically run for interactive data analyses in datacenters, continue to be plagued by disproportionately long-running tasks called stragglers. In the production clusters at Facebook and Microsoft Bing, even after applying state-of-the-art straggler mitigation techniques, these latency-sensitive jobs have stragglers that are on average 8 times slower than the median task in that job. Such stragglers increase the average job duration by 47%. This is because current mitigation techniques all involve an element of waiting and speculation. We instead propose full cloning of small jobs, avoiding waiting and speculation altogether. Cloning of small jobs only marginally increases utilization because workloads show that while the majority of jobs are small, they only consume a small fraction of the resources. The main challenge of cloning is, however, that extra clones can cause contention for intermediate data. We use a technique, delay assignment, which efficiently avoids such contention. Evaluation of our system, Dolly, using production workloads shows that the small jobs speed up by 34% to 46% after state-of-the-art mitigation techniques have been applied, using just 5% extra resources for cloning.
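The intuition behind cloning can be illustrated with a toy simulation. The sketch below is not Dolly's implementation: it only assumes that a job finishes when its slowest task finishes, and that a cloned task finishes when its fastest clone does. The 8x slowdown comes from the abstract; the straggler probability, task counts, and all function names are hypothetical choices made for illustration. Delay assignment and intermediate-data contention are not modeled.

```python
import random


def job_duration(task_clone_durations):
    """Return the job's duration.

    Each inner list holds the durations of all clones of one task.
    A task completes when its fastest clone completes; the job
    completes when its slowest task completes.
    """
    return max(min(clones) for clones in task_clone_durations)


def sample_task_duration(rng, straggler_prob=0.1, slowdown=8.0):
    """Sample one clone's duration: 1.0 normally, 8x that for a straggler.

    straggler_prob is an assumed value, not a figure from the paper.
    """
    return slowdown if rng.random() < straggler_prob else 1.0


def mean_job_duration(num_tasks, num_clones, trials=2000, seed=0):
    """Estimate the mean job duration for a given number of clones per task."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        tasks = [
            [sample_task_duration(rng) for _ in range(num_clones)]
            for _ in range(num_tasks)
        ]
        total += job_duration(tasks)
    return total / trials


if __name__ == "__main__":
    for clones in (1, 2, 3):
        print(clones, "clone(s):", round(mean_job_duration(10, clones), 2))
```

In this toy model, a job is slowed only if every clone of some task straggles, so the chance of a slow job drops quickly as clones are added. The real system must also decide how many clones to launch within a resource budget and how clones consume intermediate data, which this sketch leaves out.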
