Publication | Open Access
Understanding and Mitigating Hardware Failures in Deep Learning Training Systems
Citations: 40
References: 35
Year: 2023
Unknown Venue
Artificial Intelligence, Convolutional Neural Network, Deep Neural Networks, Training Workloads, Machine Learning, Data Science, Mitigating Hardware Failures, Engineering, Sparse Neural Network, Machine Learning Model, Adversarial Machine Learning, Computer Engineering, Computer Architecture, Computer Science, Deep Learning, Neural Architecture Search, Deep Neural Network, Hardware Failures
Deep neural network (DNN) training workloads are increasingly susceptible to hardware failures in datacenters. For example, Google experienced "mysterious, difficult to identify problems" in its TPU training systems due to hardware failures [7]. Although these particular problems were subsequently corrected with significant effort, they underscore the urgency of addressing the growing challenges that hardware failures pose to many DNN training workloads.