Concepedia

Abstract

Uncorrectable errors in dynamic random access memory (DRAM) are a common form of hardware failure in server clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on analyzing DRAM reliability in large production clusters, little has been reported on the automatic prediction of such errors ahead of time. In this paper, we present a highly accurate predictive model, based on daily event logs and sensor measurements, in a large fleet of commodity servers going back to 2014. By correlating correctable errors with sensor metrics, we can use ensemble machine learning techniques to predict uncorrectable errors weeks in advance.

References

YearCitations

Page 1