Publication | Closed Access

An empirical study on program failures of deep learning jobs

Citations: 92

References: 37

Year: 2020

Abstract

Deep learning has achieved significant success in many application areas. To train and test models more efficiently, enterprise developers submit and run their deep learning programs on a shared, multi-tenant platform. However, some of these programs fail after a long execution time because of code or script defects, which reduces development productivity and wastes expensive resources such as GPUs, storage, and network I/O.
