Publication | Open Access
Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities
Citations: 60
References: 39
Year: 2017
Keywords: Engineering, Machine Learning, GPU Benchmarking, Computer Architecture, Lower System Reliability, Data Center Network, GPU Computing, Datacenter-scale Computing, Reliability Engineering, Data Science, Green Data Center, Systems Engineering, Soft-error Behaviors, Parallel Computing, Data Management, Reliability, Hardware Reliability, Data Center System, Computer Engineering, Data Centers, Computer Science, Deep Learning, GPU Cluster, Data Center Systems, Power Consumption, GPU Architecture, Data Center Management, Cloud Computing, GPU Errors
GPUs have become part of mainstream high-performance computing facilities, which increasingly require more computational power to simulate physical phenomena quickly and accurately. However, GPU nodes also consume significantly more power than traditional CPU nodes, and high power consumption introduces new system operation challenges, including increased temperature, higher power and cooling costs, and lower system reliability. This paper explores how power consumption and temperature characteristics affect reliability, provides insights into the implications of these findings, and shows how to exploit those insights to predict GPU errors using neural networks.