Publication | Open Access
Fail-Slow at Scale
Citations: 113
References: 39
Year: 2018
Software Maintenance, Cluster Computing, Engineering, Computer Architecture, Fault Tolerance, Performance Issue, Hardware Security, Reliability Engineering, Fail-slow Hardware, Failure Analysis, Systems Engineering, Performance Tuning, System Software, Parallel Computing, Failure Detection, Reliability, Fail-slow Hardware Incidents, Hardware Reliability, Computer Engineering, Computer Science, Software Testing, Fault Injection, Root Causes
Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types, including disk, SSD, CPU, memory, and network components, can exhibit performance faults. We make several important observations: faults can convert from one form to another, the chains of cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.