Publication | Closed Access
What Can We Learn from Four Years of Data Center Hardware Failures?
Citations: 126
References: 31
Year: 2017
Venue: Unknown
Software Maintenance, Cluster Computing, Availability, Engineering, Computer Architecture, Software Engineering, Software Analysis, Reliability Engineering, Data Science, Green Data Center, Failure Analysis, Systems Engineering, Data Management, Hardware Failures, Failure Detection, Reliability, Data Center System, Computer Engineering, Data Centers, Computer Science, Data Center Management, Data Center Security, Software Testing, Cloud Computing, Hardware Failure Reports, Failure Prediction
Hardware failures have a major impact on the dependability of large-scale data centers. We study over 290,000 hardware failure reports collected over four years from dozens of data centers with hundreds of thousands of servers. We analyze the dataset statistically to characterize failures along the temporal, spatial, product-line, and component dimensions. We focus in particular on correlations among failures, including batch and repeating failures, and on human operators' responses to failures. We reconfirm or extend findings from previous studies, and we identify many new failure and recovery patterns that are undesirable by-products of state-of-the-art data center hardware and software design.