Publication | Closed Access
Layerweaver: Maximizing Resource Utilization of Neural Processing Units via Layer-Wise Scheduling
Citations: 38
References: 46
Year: 2021
Unknown Venue
Engineering, Computer Architecture, Layer Execution, High-performance Architecture, Sparse Neural Network, Temporal Waste, Embedded Machine Learning, Parallel Computing, Neurocomputers, Computer Engineering, Computer Science, Deep Learning, Neural Architecture Search, Deep Neural Network, Layer-wise Scheduling, Hardware Acceleration, Edge Computing, Neural Processing Units, Cloud Computing, Domain-specific Accelerator, Parallel Programming, Neuroscience, Brain-like Computing, Resource Optimization, Maximizing Resource Utilization
To meet surging demands for deep learning inference services, many cloud computing vendors employ high-performance specialized accelerators, called neural processing units (NPUs). One important challenge for effective use of NPUs is to achieve high resource utilization over a wide spectrum of deep neural network (DNN) models with diverse arithmetic intensities. There is often an intrinsic mismatch between the compute-to-memory bandwidth ratio of an NPU and the arithmetic intensity of the model it executes, leading to under-utilization of either compute resources or memory bandwidth. Ideally, we want to saturate both compute TOP/s and DRAM bandwidth to achieve high system throughput. Thus, we propose Layerweaver, an inference serving system with a novel multi-model time-multiplexing scheduler for NPUs. Layerweaver reduces the temporal waste of computation resources by interweaving layer execution of multiple different models with opposing characteristics: compute-intensive and memory-intensive. Layerweaver hides the memory time of a memory-intensive model by overlapping it with the relatively long computation time of a compute-intensive model, thereby minimizing the idle time of the computation units waiting for off-chip data transfers. For a two-model serving scenario of batch 1 with 16 different pairs of compute- and memory-intensive models, Layerweaver improves the temporal utilization of computation units and memory channels by 44.0% and 28.7%, respectively, to increase the system throughput by 60.1% on average, over the baseline executing one model at a time.
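The overlap idea in the abstract can be illustrated with a toy timing model. The sketch below is not the Layerweaver scheduler: it assumes each layer is described only by a hypothetical pair (memory time, compute time), that the memory channel fetches layers back to back with unlimited on-chip buffering, and that layers from the two models are simply alternated round-robin. The names `simulate` and `round_robin` and the layer timings are illustrative assumptions, not values from the paper.

```python
from itertools import zip_longest

Layer = tuple[float, float]  # (memory_time, compute_time), arbitrary time units


def simulate(layers: list[Layer]) -> float:
    """Return the makespan of running `layers` in order on one NPU.

    Assumptions (not from the paper): one memory channel and one compute
    unit; a layer's data must be fully loaded before it computes; loads are
    issued back to back, so later layers are prefetched during compute.
    """
    mem_done = 0.0   # time at which the memory channel finishes the current load
    comp_done = 0.0  # time at which the compute unit finishes the current layer
    for mem_time, comp_time in layers:
        mem_done += mem_time
        # Compute starts once the data is loaded and the unit is free.
        comp_done = max(comp_done, mem_done) + comp_time
    return comp_done


def round_robin(a: list[Layer], b: list[Layer]) -> list[Layer]:
    """Alternate layers of two models (a simple stand-in for a real scheduler)."""
    merged: list[Layer] = []
    for la, lb in zip_longest(a, b):
        if la is not None:
            merged.append(la)
        if lb is not None:
            merged.append(lb)
    return merged


# Hypothetical models: A is compute-intensive, B is memory-intensive.
model_a = [(1.0, 4.0)] * 4
model_b = [(4.0, 1.0)] * 4

baseline = simulate(model_a) + simulate(model_b)        # one model at a time
interleaved = simulate(round_robin(model_a, model_b))   # layers interwoven
total_compute = sum(c for _, c in model_a + model_b)

print(f"baseline makespan:    {baseline}")
print(f"interleaved makespan: {interleaved}")
print(f"compute utilization:  {total_compute / baseline:.2f} -> "
      f"{total_compute / interleaved:.2f}")
```

With these made-up timings, the memory-intensive model's loads are hidden behind the compute-intensive model's long layers, so the compute unit is rarely idle. A real scheduler would also have to respect on-chip buffer capacity and pick the interleaving order rather than alternate blindly.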