Publication | Closed Access
Predicting DRAM reliability in the field with machine learning
Citations: 45
References: 21
Year: 2017
Unknown Venue
Engineering, Machine Learning, Fault Forecasting, Computer Architecture, Hardware Security, DRAM Reliability, Reliability Engineering, Data Science, Data Mining, Failure Detection, Reliability, Hardware Reliability, Predictive Analytics, Knowledge Discovery, Computer Engineering, Computer Science, Reliability Prediction, Forecasting, Uncorrectable Errors, Device Reliability, Hardware Failure, Fault Management, Industrial Informatics, Failure Prediction
Uncorrectable errors in dynamic random access memory (DRAM) are a common form of hardware failure in server clusters. Failures are costly in terms of both hardware replacement and service disruption. While a large body of work exists on analyzing DRAM reliability in large production clusters, little has been reported on the automatic prediction of such errors ahead of time. In this paper, we present a highly accurate predictive model based on daily event logs and sensor measurements, going back to 2014, from a large fleet of commodity servers. By correlating correctable errors with sensor metrics, we can use ensemble machine learning techniques to predict uncorrectable errors weeks in advance.
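As a rough illustration of how daily logs might be turned into a supervised prediction problem of the kind the abstract describes, the sketch below labels each logged day by whether an uncorrectable error follows within a fixed horizon, and computes a trailing count of correctable errors as one candidate feature. This is not the authors' pipeline: the function names, the 7-day horizon, and the 3-day window are hypothetical choices made only for the example, and the ensemble model itself is not shown.

```python
from datetime import date, timedelta


def label_days(daily_ce, ue_dates, horizon_days=14):
    """Label a logged day 1 if an uncorrectable error (UE) occurs within
    the next `horizon_days` days (the day itself excluded), else 0.
    The horizon is an assumed parameter, not a value from the paper."""
    labels = {}
    for day in daily_ce:
        window_end = day + timedelta(days=horizon_days)
        labels[day] = int(any(day < d <= window_end for d in ue_dates))
    return labels


def rolling_ce_sum(daily_ce, window_days=7):
    """Trailing sum of correctable errors (CEs) over the `window_days`
    days ending on each logged day, inclusive."""
    feats = {}
    for day in daily_ce:
        start = day - timedelta(days=window_days - 1)
        feats[day] = sum(c for d, c in daily_ce.items() if start <= d <= day)
    return feats


# Made-up daily CE counts for one hypothetical server, 1-5 Jan 2016.
start_day = date(2016, 1, 1)
daily_ce = {start_day + timedelta(days=i): c for i, c in enumerate([0, 2, 5, 9, 0])}
ue_dates = [date(2016, 1, 10)]  # one hypothetical uncorrectable error

labels = label_days(daily_ce, ue_dates, horizon_days=7)
features = rolling_ce_sum(daily_ce, window_days=3)
for day in sorted(daily_ce):
    print(day, features[day], labels[day])
```

Rows produced this way (features plus a look-ahead label) could then be fed to any ensemble classifier; sensor metrics would be added as further feature columns in the same per-day layout.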