Publication | Closed Access
Reining in the outliers in map-reduce clusters using Mantri
Citations: 650
References: 35
Year: 2010
Software Maintenance, Cluster Computing, Engineering, Task Workload, Computer Architecture, Software Engineering, Map-reduce, Cluster Technology, Data Science, Data Mining, Systems Engineering, Workload Characterization, Parallel Computing, Job Scheduler, Real-time Operating System, Knowledge Discovery, Computer Science, Present Mantri, Cloud Computing, Parallel Programming, Culls Outliers, Map-reduce Clusters, System Software, Massive Data Processing, Workload Management, Big Data
Experience from an operational Map-Reduce cluster reveals that outliers significantly prolong job completion. The causes for outliers include run-time contention for processor, memory and other resources, disk failures, varying bandwidth and congestion along network paths, and imbalance in task workload. We present Mantri, a system that monitors tasks and culls outliers using cause- and resource-aware techniques. Mantri's strategies include restarting outliers, network-aware placement of tasks and protecting outputs of valuable tasks. Using real-time progress reports, Mantri detects and acts on outliers early in their lifetime. Early action frees up resources that can be used by subsequent tasks and expedites the job overall. Acting based on the causes and the resource and opportunity cost of actions lets Mantri improve over prior work that only duplicates the laggards. Deployment in Bing's production clusters and trace-driven simulations show that Mantri improves job completion times by 32%.
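The idea of detecting outliers early from progress reports can be illustrated with a minimal sketch. This is not Mantri's actual algorithm, which the abstract does not specify and which also weighs causes and resource costs. It assumes each task reports a completed fraction, extrapolates remaining time at a constant rate, and flags tasks whose estimate exceeds a multiple of the median. The names `TaskProgress`, `estimate_remaining`, `find_outliers` and the `slowdown_factor` threshold are hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TaskProgress:
    task_id: str
    elapsed_s: float      # seconds since the task started
    fraction_done: float  # reported progress, between 0 and 1


def estimate_remaining(task: TaskProgress) -> float:
    """Extrapolate remaining seconds, assuming a constant progress rate."""
    if task.fraction_done <= 0:
        return float("inf")  # no progress yet, so no finite estimate
    return task.elapsed_s * (1.0 - task.fraction_done) / task.fraction_done


def find_outliers(tasks, slowdown_factor=2.0):
    """Return ids of tasks whose estimated remaining time exceeds
    slowdown_factor times the median estimate (an assumed heuristic)."""
    remaining = {t.task_id: estimate_remaining(t) for t in tasks}
    finite = [r for r in remaining.values() if r != float("inf")]
    if not finite:
        return []
    typical = median(finite)
    return [tid for tid, r in remaining.items() if r > slowdown_factor * typical]


tasks = [
    TaskProgress("a", elapsed_s=10.0, fraction_done=0.5),
    TaskProgress("b", elapsed_s=10.0, fraction_done=0.5),
    TaskProgress("c", elapsed_s=10.0, fraction_done=0.1),
]
print(find_outliers(tasks))
```

A scheduler built on such a check would still need to decide whether restarting or duplicating a flagged task is worth the resources it consumes, which is the cost-aware part the abstract attributes to Mantri.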