On Workload-Aware DRAM Failure Prediction in Large-Scale Data Centers

Abstract

DRAM failures are one of the major hardware threats to the reliability of large-scale data centers, since uncorrectable errors in DRAM may cause servers to shut down. Existing works address this problem by predicting DRAM failures in advance with machine learning models, and they generally treat correctable errors (CEs) as the most important feature. The major cause of CEs is the accumulated stress induced by intensive workloads. Moreover, a defective DRAM does not manifest as a system error until its defective cells are accessed by specific workloads. The workloads running on a server are therefore also important for DRAM failure prediction. In this paper, we focus on the impact of workloads on DRAM failures. We design workload features from both macroscopic and microscopic perspectives, i.e., node-level performance metrics and cell-level DRAM access patterns, respectively. Furthermore, we propose the Hierarchical DRAM Error Code (HiDEC) to represent the DRAM access pattern. To demonstrate the generality of the designed features, we apply them to several decision-tree-based models for DRAM failure prediction. Experiments are carried out on a dataset collected from a real-world commercial data center. The results show that both macroscopic and microscopic features bring significant improvements in prediction performance.
