Publication | Closed Access
ZeRO-Infinity
Citations: 193
References: 26
Year: 2021
Unknown Venue
Cluster Computing, GPU Architecture, Engineering, Machine Learning, Data Science, Model Compression, GPU Memory Wall, Computer Architecture, Computer Engineering, Parallel Programming, Computer Science, Aggregate GPU Memory, Parallel Computing, Deep Learning, GPU Cluster, Neural Scaling Law, GPU Memory, GPU Computing
In the last three years, the largest dense deep learning models have grown over 1000x to reach hundreds of billions of parameters, while GPU memory has grown only 5x (from 16 GB to 80 GB). The growth in model scale has therefore been supported primarily through system innovations that allow large models to fit in the aggregate memory of multiple GPUs. However, we are approaching the GPU memory wall: it takes 800 NVIDIA V100 GPUs just to fit a trillion-parameter model for training, and clusters of that size are out of reach for most data scientists. In addition, training models at that scale requires complex combinations of parallelism techniques that put a heavy burden on data scientists to refactor their models.
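The GPU count quoted above can be sanity-checked with a rough estimate. The sketch below is only a back-of-the-envelope calculation under assumptions that do not appear in the abstract: mixed-precision Adam training at roughly 16 bytes of model state per parameter, and 32 GB of memory per V100. It counts model states only, so it gives a lower bound rather than the 800-GPU figure itself. The function names are hypothetical.

```python
# Rough lower bound on the GPUs needed just to hold model states.
# ASSUMPTION (not from the abstract): mixed-precision Adam keeps about
# 16 bytes per parameter: 2 (fp16 weights) + 2 (fp16 gradients)
# + 12 (fp32 weights, momentum, and variance).
BYTES_PER_PARAM = 2 + 2 + 12

GB = 10**9  # decimal gigabyte, used for simplicity


def model_state_bytes(num_params: int) -> int:
    """Bytes of weights, gradients, and optimizer states for num_params."""
    return num_params * BYTES_PER_PARAM


def min_gpus_for_model_states(num_params: int, gpu_mem_bytes: int) -> int:
    """Smallest GPU count whose aggregate memory holds the model states.

    Ignores activations, communication buffers, and fragmentation, so the
    real requirement is higher.
    """
    total = model_state_bytes(num_params)
    return -(-total // gpu_mem_bytes)  # ceiling division on integers


if __name__ == "__main__":
    trillion = 10**12
    # ASSUMPTION: 32 GB V100; the abstract does not state the variant.
    print(model_state_bytes(trillion) / 10**12, "TB of model states")
    print(min_gpus_for_model_states(trillion, 32 * GB), "V100 (32 GB) GPUs")
    print(min_gpus_for_model_states(trillion, 80 * GB), "80 GB GPUs")
```

Under these assumptions a trillion-parameter model needs about 16 TB for model states, or at least 500 GPUs at 32 GB each. The gap between this bound and the 800 GPUs cited in the abstract would plausibly be taken up by activations, temporary buffers, and memory fragmentation, which the sketch leaves out.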