Publication | Closed Access
Quantifying effectiveness of failure prediction and response in HPC systems: Methodology and example
Citations: 17
References: 11
Year: 2010
Venue: Unknown
Software Maintenance, Engineering, HPC Systems, Computer Architecture, Software Engineering, Software Analysis, Reliability Engineering, Uncertainty Quantification, Management, Failure Analysis, Systems Engineering, Modeling And Simulation, Failure Detection, Performance Prediction, Reliability, Predictive Analytics, Computer Engineering, Computer Science, Reliability Prediction, Failure Predictors, Reliability Modelling, Program Analysis, Software Testing, Effective Failure Prediction, Process Control, Outlier Behavior, Fault Injection, Failure Prediction
Effective failure prediction and mitigation strategies in high-performance computing systems could provide huge gains in resilience of tightly coupled large-scale scientific codes. These gains would come from prediction-directed process migration and resource servicing, intelligent resource allocation, and checkpointing driven by failure predictors rather than at regular intervals based on nominal mean time to failure. Given probabilistic associations of outlier behavior in hardware-related metrics with eventual failure in hardware, system software, and/or applications, this paper explores approaches for quantifying the effects of prediction and mitigation strategies and demonstrates these using actual production system data. We describe context-relevant methodologies for determining the accuracy and cost-benefit of predictors.
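As a rough illustration of what "determining the accuracy" of a predictor can involve, the sketch below scores a list of alert timestamps against a list of observed failure timestamps. It is not the paper's methodology: the function name `score_predictor`, the `horizon` parameter (how far ahead an alert is allowed to "claim" a failure), and the matching rule are all assumptions made for this example.

```python
from bisect import bisect_left, bisect_right


def score_predictor(alerts, failures, horizon):
    """Return (precision, recall) for a failure predictor.

    Assumed matching rule (hypothetical, for illustration only):
      - an alert at time t counts as correct if a failure occurs in (t, t + horizon];
      - a failure at time f counts as predicted if an alert occurs in [f - horizon, f).
    Timestamps and horizon share one arbitrary unit (e.g. seconds).
    """
    alerts = sorted(alerts)
    failures = sorted(failures)

    # Precision: fraction of alerts followed by a failure within the horizon.
    correct_alerts = 0
    for t in alerts:
        i = bisect_right(failures, t)  # index of first failure strictly after t
        if i < len(failures) and failures[i] <= t + horizon:
            correct_alerts += 1

    # Recall: fraction of failures preceded by an alert within the horizon.
    predicted_failures = 0
    for f in failures:
        i = bisect_left(alerts, f - horizon)  # index of first alert at or after f - horizon
        if i < len(alerts) and alerts[i] < f:
            predicted_failures += 1

    precision = correct_alerts / len(alerts) if alerts else 0.0
    recall = predicted_failures / len(failures) if failures else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up timestamps, not data from the paper.
    p, r = score_predictor(alerts=[10, 50, 90], failures=[15, 100, 200], horizon=10)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Precision bears on the cost side of a cost-benefit analysis, since each false alert triggers an unnecessary migration or checkpoint. Recall bounds the benefit, since an unpredicted failure still costs the work done since the last checkpoint.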