Publication | Closed Access
Job Failure Analysis and Its Implications in a Large-Scale Production Grid
Citations: 36
References: 6
Year: 2006
Cluster Computing, Engineering, Industrial Engineering, Large-scale Data-intensive Grid, Maintenance Scheduling, Interarrival Times, Operations Research, Job Failure Analysis, Reliability Engineering, Data Science, Failure Analysis, Systems Engineering, Logistics, Quantitative Management, Failure Detection, Job Scheduler, Large-scale Production Grid, Supply Chain Management, Computer Science, Scheduling Analysis, Job Failures, Power System Reliability, Cloud Computing, Business, Production Scheduling, Industrial Informatics, Failure Prediction
In this paper we present an initial analysis of job failures in a large-scale data-intensive Grid. Based on three representative periods in production, we characterize the interarrival times and life spans of failed jobs. Different failure types are distinguished, and the analysis is carried further at the Virtual Organization (VO) level. The spatial behavior, namely where job failures occur in the Grid, is also examined. Cross-correlation structures, including how arrivals correlate with life spans of job failures, are analyzed and illustrated. We further investigate statistical models to fit the failure data and propose several failure-aware scheduling strategies at the Grid level. Our results show that the overall failure rates in the Grid are quite significant, ranging from 25% to 33% of all submitted jobs. However, only 5% to 8% of the jobs failed after running on a certain Computing Element (CE). The rest of the failed jobs were aborted or cancelled without running. A majority of failed jobs come from several large production VOs, and a large number of these failures are centered around several main CEs. The interarrival time processes of failed jobs are shown to be bursty, and the life spans exhibit strong autocorrelations. Based on the failure patterns, we argue that it is important for Grid resource brokers to track historical failures and take them into account in decision making. Some proactive measures and accountability issues are also discussed.
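The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the function names, the sample data, and the use of the coefficient of variation as a rough burstiness indicator are all assumptions made here for illustration. The sketch computes interarrival times from failure timestamps, a lag-k autocorrelation of life spans, and a simple ordering of CEs by historical failure rate, such as a failure-aware broker might consult.

```python
from statistics import mean, pstdev


def interarrival_times(timestamps):
    """Gaps between consecutive failure arrivals (input need not be sorted)."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def coefficient_of_variation(values):
    """Standard deviation divided by mean. Values well above 1 suggest
    burstier arrivals than a Poisson process (an assumed heuristic)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def autocorrelation(values, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(values)
    m = mean(values)
    variance_sum = sum((v - m) ** 2 for v in values)
    if variance_sum == 0 or lag >= n:
        return 0.0
    covariance_sum = sum(
        (values[i] - m) * (values[i + lag] - m) for i in range(n - lag)
    )
    return covariance_sum / variance_sum


def rank_ces_by_failure_rate(history):
    """Order CE names from lowest to highest historical failure rate.

    `history` maps a CE name to (failed_jobs, total_jobs); this layout is
    hypothetical, not a format taken from the paper.
    """
    def rate(ce):
        failed, total = history[ce]
        return failed / total if total else 0.0

    return sorted(history, key=rate)


# Made-up sample data, for illustration only.
failure_times = [0, 1, 2, 3, 100, 101, 102, 200]
life_spans = [10, 12, 11, 13, 50, 52, 51, 53]
ce_history = {"ce-a": (30, 100), "ce-b": (5, 100), "ce-c": (12, 100)}

gaps = interarrival_times(failure_times)
burstiness = coefficient_of_variation(gaps)
lag1 = autocorrelation(life_spans, lag=1)
preferred_order = rank_ces_by_failure_rate(ce_history)
```

A broker following the paper's recommendation would presumably combine such a ranking with load and data-locality information rather than use failure rate alone.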