Publication | Closed Access
Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters
Year: 2022
Citations: 32
References: 19
Artificial Intelligence, Engineering, Machine Learning, Distributed Algorithms, Model Scale, Data Science, Sparse Neural Network, Parallel Computing, Trillion Parameters, Large AI Model, Machine Learning Model, Computer Engineering, Computer Science, Cold-start Problem, Neural Architecture Search, Deep Learning, Group Recommenders, Hybrid Training Algorithm, Parallel Learning, Parallel Programming, Hybrid Acceleration, Collaborative Filtering
Recent years have witnessed exponential growth in the model scale of deep learning-based recommender systems, from Google's 2016 model with 1 billion parameters to Facebook's latest model with 12 trillion parameters. Each jump in model capacity has brought a significant quality boost, which leads us to believe that the era of 100 trillion parameters is around the corner. However, training such models is challenging even within industrial-scale data centers. We resolve this challenge through careful co-design of both the optimization algorithm and the distributed system architecture. Specifically, to ensure both training efficiency and training accuracy, we design a novel hybrid training algorithm in which the embedding layer and the dense neural network are handled by different synchronization mechanisms; we then build a system called Persia (short for parallel recommendation training system with hybrid acceleration) to support this hybrid training algorithm. Both theoretical demonstrations and empirical studies with up to 100 trillion parameters have been conducted to justify the system design and implementation of Persia. We make Persia publicly available (at github.com/PersiaML/Persia) so that anyone can easily train a recommender model at the scale of 100 trillion parameters.
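The idea of handling the two parts of the model with different synchronization mechanisms can be illustrated with a toy sketch. The abstract does not say which mechanism is applied to which part, so the sketch assumes the dense network is updated synchronously (gradients averaged across workers) and the embedding layer asynchronously (updates applied with a bounded delay). All names here (`sync_dense_step`, `AsyncEmbeddingTable`, `staleness`) are hypothetical and are not part of Persia's API.

```python
from collections import deque


def sync_dense_step(dense, worker_grads, lr):
    """Assumed synchronous path: average all workers' gradients, then apply once."""
    n = len(worker_grads)
    avg = [sum(g[i] for g in worker_grads) / n for i in range(len(dense))]
    return [w - lr * g for w, g in zip(dense, avg)]


class AsyncEmbeddingTable:
    """Assumed asynchronous path: sparse rows whose updates may lag behind."""

    def __init__(self, dim, staleness):
        self.dim = dim
        self.rows = {}            # row id -> list of floats, created lazily
        self.pending = deque()    # queued (row_id, grad, lr) updates
        self.staleness = staleness  # max number of updates allowed to wait

    def lookup(self, row_id):
        # Readers may see a row that is missing the still-pending updates.
        return self.rows.setdefault(row_id, [0.0] * self.dim)

    def push_grad(self, row_id, grad, lr):
        # Queue the update; apply the oldest ones once the bound is exceeded.
        self.pending.append((row_id, grad, lr))
        while len(self.pending) > self.staleness:
            self._apply(*self.pending.popleft())

    def flush(self):
        # Apply every update that is still waiting.
        while self.pending:
            self._apply(*self.pending.popleft())

    def _apply(self, row_id, grad, lr):
        row = self.lookup(row_id)
        for i in range(self.dim):
            row[i] -= lr * grad[i]
```

In this sketch only the small dense vector waits for all workers, while the embedding table, which would hold nearly all of the parameters at this scale, accepts updates without a global barrier.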