Publication | Open Access
Varuna
Citations: 64
References: 11
Year: 2022
Unknown Venue
Cluster Computing, Massively-parallel Computing, Engineering, Machine Learning, Data Science, Advanced Computing, GPU Cluster, Job Parallelism, Scalability Limits, Computer Architecture, Many-core Architecture, Resource Fragmentation, Parallel Programming, Computer Science, Parallel Computing, Deep Learning, Neural Architecture Search
Systems for training massive deep learning models (billions of parameters) today assume and require specialized "hyperclusters": hundreds or thousands of GPUs wired with specialized high-bandwidth interconnects such as NVLink and InfiniBand. Besides being expensive, this dependence on hyperclusters and custom high-speed interconnects limits the size of such clusters, creating (a) scalability limits on job parallelism and (b) resource fragmentation across hyperclusters.