Publication | Closed Access
Exploit failure prediction for adaptive fault-tolerance in cluster computing
Citations: 68
References: 18
Year: 2006
Unknown Venue
Software Maintenance, Cluster Computing, Engineering, Software Engineering, Fault Tolerance, Fault-tolerant Messaging, Software Analysis, Cluster Technology, Reliability Engineering, Dynamic Decision, Systems Engineering, Fault Recovery, Parallel Computing, Adaptive Fault-tolerance, Computer Engineering, Computer Science, High Availability Software, Fault-tolerant Network, Program Analysis, Software Testing, Cloud Computing, Parallel Programming, Failure Prediction, Production Cluster
As cluster computing scales up, it is becoming hard for long-running applications to complete on large clusters without encountering failures. To address this issue, checkpointing/restart is widely used to provide basic fault tolerance, yet it incurs high overhead and is purely reactive. In this work, we propose FT-Pro, an adaptive fault management mechanism that uses failure prediction to optimally choose among migration, checkpointing, or no action, so as to reduce application execution time in the presence of failures. A cost-based evaluation model is presented for making this decision dynamically at run time. Using the actual failure log from a production cluster at NCSA, we demonstrate that even with modest failure prediction accuracy, FT-Pro outperforms the traditional checkpointing/restart strategy by 13%-30% in application execution time despite failures, which is a significant performance improvement for long-running applications.
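The abstract does not spell out the cost model, so the following is only a minimal sketch of what a cost-based three-way decision might look like, not FT-Pro's actual formulation. The `Costs` fields, the `expected_costs` and `choose_action` names, and the expected-time formulas are all hypothetical. The sketch assumes a failure, if it occurs, strikes at the end of the interval, and that migration fully avoids the predicted failure.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    """Hypothetical per-interval parameters, all in seconds."""
    interval: float      # useful work scheduled in the next decision interval
    checkpoint: float    # overhead of writing one checkpoint
    migration: float     # overhead of moving the process off the suspect node
    restart: float       # time to restart from the last checkpoint
    unsaved_work: float  # work done since the last checkpoint


def expected_costs(p_fail: float, c: Costs) -> dict:
    """Expected wall-clock time to get through the next interval per action.

    Assumption: p_fail is the probability that a predicted failure really
    happens during the interval (a stand-in for prediction accuracy).
    """
    return {
        # Do nothing: a failure loses the unsaved work and this interval.
        "no_action": c.interval
        + p_fail * (c.restart + c.unsaved_work + c.interval),
        # Checkpoint first: a failure now only loses this interval.
        "checkpoint": c.checkpoint + c.interval
        + p_fail * (c.restart + c.interval),
        # Migrate first: assumed to avoid the failure entirely.
        "migrate": c.migration + c.interval,
    }


def choose_action(p_fail: float, c: Costs) -> str:
    """Pick the action with the lowest expected cost."""
    costs = expected_costs(p_fail, c)
    return min(costs, key=costs.get)


if __name__ == "__main__":
    c = Costs(interval=600, checkpoint=60, migration=120,
              restart=300, unsaved_work=1200)
    for p in (0.0, 0.01, 0.5):
        print(p, choose_action(p, c), expected_costs(p, c))
```

Under these assumed numbers, a low failure probability favors taking no action, while a high one favors migration. Raising the migration cost (for example, when no spare node is available) shifts the choice to checkpointing.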