Concepedia

Publication | Open Access

Understanding and Mitigating Hardware Failures in Deep Learning Training Systems

Citations: 40

References: 35

Year: 2023

Abstract

Deep neural network (DNN) training workloads are increasingly susceptible to hardware failures in datacenters. For example, Google experienced "mysterious, difficult to identify problems" in their TPU training systems due to hardware failures [7]. Although these particular problems were subsequently corrected through significant efforts, they have raised the urgency of addressing the growing challenges emerging from hardware failures impacting many DNN training workloads.
