Software-hardware co-design for fast and scalable training of deep learning recommendation models

Abstract

Deep learning recommendation models (DLRMs) are used across many business-critical services at Meta and are the single largest AI application in terms of infrastructure demand in its data centers. In this paper, we present Neo, a software-hardware co-designed system for high-performance distributed training of large-scale DLRMs. Neo employs a novel 4D parallelism strategy that combines table-wise, row-wise, column-wise, and data parallelism for training massive embedding operators in DLRMs. In addition, Neo enables extremely high-performance and memory-efficient embedding computations using a variety of critical systems optimizations, including hybrid kernel fusion, software-managed caching, and quality-preserving compression. Finally, Neo is paired with ZionEX, a new hardware platform co-designed with Neo's 4D parallelism to optimize communication for large-scale DLRM training. Our evaluation on 128 GPUs using 16 ZionEX nodes shows that Neo outperforms existing systems by up to 40× when training 12-trillion-parameter DLRMs deployed in production.
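
To make the four sharding dimensions named above concrete, the following is a minimal sketch in plain Python of what table-wise, row-wise, column-wise, and data-parallel placement of embedding tables could look like in general. It is not Neo's implementation: the function names, the round-robin table placement, and the toy table sizes are all assumptions made for illustration, and it ignores communication, load balancing, and GPU memory entirely.

```python
# Illustrative only: toy embedding-table sharding, NOT Neo's actual implementation.
# All function names and the round-robin placement policy are hypothetical.

def shard_row_wise(table, num_workers):
    """Row-wise: split a table (list of rows) into contiguous row blocks."""
    rows_per = -(-len(table) // num_workers)  # ceiling division
    return [table[i * rows_per:(i + 1) * rows_per] for i in range(num_workers)]


def shard_column_wise(table, num_workers):
    """Column-wise: split every row along the embedding dimension."""
    dim = len(table[0])
    cols_per = -(-dim // num_workers)  # ceiling division
    return [
        [row[i * cols_per:(i + 1) * cols_per] for row in table]
        for i in range(num_workers)
    ]


def shard_table_wise(table_names, num_workers):
    """Table-wise: place whole tables on workers (round-robin, an assumption)."""
    placement = [[] for _ in range(num_workers)]
    for idx, name in enumerate(sorted(table_names)):
        placement[idx % num_workers].append(name)
    return placement


def replicate(table, num_workers):
    """Data parallelism: every worker keeps a full copy of the table."""
    return [[row[:] for row in table] for _ in range(num_workers)]


def lookup_row_wise(shards, row_id):
    """Fetch one embedding row from the single shard that owns it."""
    rows_per = len(shards[0])
    return shards[row_id // rows_per][row_id % rows_per]


def lookup_column_wise(shards, row_id):
    """Fetch one embedding row by concatenating a slice from every shard."""
    out = []
    for shard in shards:
        out.extend(shard[row_id])
    return out


# Toy table: 6 rows, embedding dimension 4; entry = row * 10 + column.
table = [[r * 10 + c for c in range(4)] for r in range(6)]
row_shards = shard_row_wise(table, 2)
col_shards = shard_column_wise(table, 2)
```

In this sketch a row-wise lookup touches exactly one worker, while a column-wise lookup gathers a slice from every worker; that difference in access pattern is one reason a system might mix the strategies per table rather than pick a single one.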
