Publication | Closed Access
Boost Vision Transformer with GPU-Friendly Sparsity and Quantization
2023 · 30 Citations · 46 References
Keywords: Engineering, Machine Learning, Computer Architecture, Boost Vision Transformer, Acceleration Deployment, GPU Computing, Structured Pruning, Image Analysis, Image Compression, Pattern Recognition, Parallel Computing, Video Transformer, Machine Vision, Computer Engineering, Computer Science, Deep Learning, Model Compression, Computer Vision, Hardware Acceleration, Image Coding, Vision Transformer
The transformer has extended its success from the language domain to the vision domain. Because of the stacked self-attention and cross-attention blocks, accelerating the deployment of vision transformers on GPU hardware is challenging and rarely studied. This paper designs a compression scheme that maximally utilizes GPU-friendly 2:4 fine-grained structured sparsity and quantization. Specifically, an original large model with dense weight parameters is first pruned into a sparse one by 2:4 structured pruning, which exploits the GPU's acceleration of the 2:4 structured sparse pattern with the FP16 data type. The floating-point sparse model is then quantized into a fixed-point one by sparse-distillation-aware quantization-aware training, which exploits the extra speedup GPUs provide for 2:4 sparse computation on integer tensors. A mixed-strategy knowledge distillation is used during both the pruning and quantization stages. The proposed compression scheme is flexible enough to support both supervised and unsupervised learning styles. Experimental results show that the GPUSQ-ViT scheme achieves state-of-the-art compression, reducing vision transformer models by 6.4–12.7× in model size and 30.3–62× in FLOPs with negligible accuracy degradation on the ImageNet classification, COCO detection, and ADE20K segmentation benchmarks.
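The 2:4 fine-grained structured sparsity pattern mentioned above means that in every contiguous group of four weights, at most two are nonzero. A minimal magnitude-based sketch of this pruning step (a generic illustration, not the paper's exact pruning procedure or NVIDIA's library implementation) can be written in NumPy:

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Apply 2:4 fine-grained structured pruning along the last axis:
    in every contiguous group of 4 weights, keep the 2 with the largest
    magnitude and zero out the other 2."""
    w = weights.reshape(-1, 4)
    # indices of the 2 smallest-magnitude weights in each group of 4
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    mask = np.ones_like(w, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (w * mask).reshape(weights.shape)

w = np.array([0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01])
print(prune_2_4(w))  # each group of 4 retains only its 2 largest-magnitude weights
```

Because the nonzero positions follow this fixed 2-in-4 pattern, Ampere-class GPUs can skip the zeroed multiplications in hardware, which is what makes this sparsity "GPU-friendly" compared with unstructured pruning.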
Moreover, GPUSQ-ViT boosts actual deployment performance by 1.39–1.79× in latency and 3.22–3.43× in throughput on an A100 GPU, and by 1.57–1.69× in latency and 2.11–2.51× in throughput on an AGX Orin.
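The second compression stage converts the floating-point sparse model to fixed-point integers. A minimal sketch of the standard "fake quantization" forward pass used in quantization-aware training (symmetric per-tensor int8 here, as a generic illustration rather than the paper's exact sparse-distillation-aware scheme):

```python
import numpy as np

def fake_quantize_int8(x: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor fake quantization: round values onto an
    int8 grid, then rescale back to float so training sees the
    quantization error while the data stays floating-point."""
    scale = np.max(np.abs(x)) / 127.0   # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

x = np.array([0.5, -1.27, 0.004, 1.0])
print(fake_quantize_int8(x))  # values snapped to multiples of the scale
```

At deployment, the rescaling is dropped and the int8 tensors feed the GPU's integer sparse tensor cores directly, stacking the quantization speedup on top of the 2:4 sparsity speedup.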