Publication | Open Access
Machine Learning Models for GPU Error Prediction in a Large Scale HPC System
Citations: 73
References: 37
Year: 2018
Unknown Venue
Engineering, Machine Learning, GPU Benchmarking, Machine Learning Tool, Computer Architecture, GPU Error Prediction, Machine Learning Models, GPU Computing, Data Science, Data Mining, Modeling And Simulation, Parallel Computing, Performance Prediction, Predictive Analytics, Computer Engineering, Computer Science, Operational HPC System, GPU Cluster, Computational Science, GPU Architecture, GPU Errors, Parallel Programming
GPUs are widely deployed on large-scale HPC systems to provide powerful computational capability for scientific applications from various domains. As those applications are normally long-running, investigating the characteristics of GPU errors becomes imperative for reliability. In this paper, we first study the system conditions that trigger GPU errors using six-month trace data collected from a large-scale, operational HPC system. Then, we use machine learning to predict the occurrence of GPU errors, by taking advantage of temporal and spatial dependencies of the trace data. The resulting machine learning prediction framework is robust and accurate under different workloads.
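The idea of exploiting temporal and spatial dependencies can be illustrated with a minimal sketch. It is not the paper's actual feature set or model: the trace layout (a list of `(time, node)` error records), the `neighbors` topology map, the `window` length, and the function name `build_features` are all assumptions made for illustration. For a given node and time, the sketch counts recent errors on that node (temporal signal) and on nearby nodes (spatial signal); such counts could then be fed to any classifier.

```python
from collections import defaultdict


def build_features(events, neighbors, window):
    """Return a function mapping (node, t) to (temporal, spatial) counts.

    events    -- iterable of (time, node) error records (hypothetical layout)
    neighbors -- dict: node -> list of physically nearby nodes (assumed)
    window    -- look-back length, in the same units as the timestamps
    """
    # Group error timestamps by the node on which they occurred.
    by_node = defaultdict(list)
    for t, node in events:
        by_node[node].append(t)

    def count_recent(node, t):
        # Errors on `node` in the half-open window [t - window, t).
        return sum(1 for s in by_node.get(node, ()) if t - window <= s < t)

    def features(node, t):
        # Temporal signal: recent errors on the node itself.
        temporal = count_recent(node, t)
        # Spatial signal: recent errors on the node's neighbors.
        spatial = sum(count_recent(n, t) for n in neighbors.get(node, ()))
        return temporal, spatial

    return features


# Tiny made-up trace: two nodes that are each other's neighbors.
trace = [(1, "a"), (5, "a"), (6, "b"), (20, "b")]
topology = {"a": ["b"], "b": ["a"]}
features = build_features(trace, topology, window=10)
print(features("a", 10))
```

The window is half-open so that an error occurring exactly at prediction time `t` is never counted as a predictor of itself.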