Concepedia

Publication | Closed Access

An Automatic Artificial Intelligence Training Platform Based on Kubernetes

Citations: 12

References: 7

Year: 2020

Abstract

For large-scale AI training, manually allocating GPU resources is inefficient and raises problems of task allocation and restart after failure. This paper presents a fully automated machine learning platform that manages server resources uniformly; users describe the resources they need in configuration files. The platform allocates and schedules AI tasks automatically according to cluster load, addressing low cluster resource utilization and uneven load across machines. It also provides automatic model release and continuous integration, which greatly simplifies configuring a model's runtime environment and publishing it externally, so researchers can focus on tuning models. Experiments show that the additional time incurred by training AI tasks through the platform is negligible, supporting the platform's feasibility.
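The abstract does not give the scheduling algorithm, so the following is only a minimal sketch of what load-based task placement could look like, not the paper's method. It assumes a least-loaded-node policy; the `Node` class, the `place_task` function, the request fields, and the node names are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical view of one GPU server in the cluster."""
    name: str
    total_gpus: int  # assumed to be greater than zero
    used_gpus: int = 0

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - self.used_gpus

    @property
    def load(self) -> float:
        # Fraction of this node's GPUs that are already allocated.
        return self.used_gpus / self.total_gpus


def place_task(nodes: list[Node], request: dict) -> Optional[str]:
    """Assign a task to the least-loaded node that can fit it.

    Returns the chosen node's name, or None if no node has enough free GPUs.
    This policy is an assumption, not the scheduler described in the paper.
    """
    gpus = request["gpus"]
    candidates = [n for n in nodes if n.free_gpus >= gpus]
    if not candidates:
        return None  # a real platform would presumably queue the task
    # Prefer the lowest load; break ties by name so the result is deterministic.
    best = min(candidates, key=lambda n: (n.load, n.name))
    best.used_gpus += gpus
    return best.name


# Hypothetical resource request, as a user might declare it in a config file.
request = {"name": "train-job", "gpus": 2}

nodes = [
    Node("node-a", total_gpus=8, used_gpus=6),
    Node("node-b", total_gpus=8, used_gpus=2),
    Node("node-c", total_gpus=4, used_gpus=1),
]

# Submit the same request three times and record where each copy lands.
placements = [place_task(nodes, request) for _ in range(3)]
print(placements)
```

Under this policy, successive requests spread across the less busy machines rather than piling onto one, which is the kind of behavior the abstract attributes to scheduling "based on the cluster load".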
