High-Performance Tensor Learning Primitives Using GPU Tensor Cores

Abstract

Tensor learning is a powerful tool for big data analytics and machine learning, with applications such as gene analysis and deep learning. However, tensor learning algorithms are compute-intensive because their time and space complexities grow exponentially with the order of the tensors, which hinders their application. In this paper, we exploit the parallelism of tensor learning primitives using GPU tensor cores and develop high-performance tensor learning algorithms. First, we propose novel hardware-oriented optimization strategies for tensor learning primitives on GPU tensor cores. Second, for big data analytics, we employ the optimized tensor learning primitives to accelerate the CP tensor decomposition and then apply it to gene analysis. Third, we optimize the Tucker tensor decomposition and propose a novel Tucker tensor layer to compress deep neural networks. We train the neural networks with natural gradients, which involves only a forward pass without backpropagation and is thus well suited to GPU computation. Compared with the TensorLab and TensorLy libraries on an A100 GPU, our third-order CP tensor decomposition achieves up to $16.32\times$ and $32.25\times$ speedups, respectively, and our third-order Tucker tensor decomposition achieves up to $6.09\times$ and $6.72\times$ speedups, respectively. The proposed fourth-order CP and Tucker tensor decompositions achieve up to $30.65\times$ and $5.41\times$ speedups over TensorLab, respectively. Our CP tensor decomposition for gene analysis achieves up to a $5.88\times$ speedup over TensorLy. Compared with a conventional fully connected neural network, our Tucker tensor layer neural network achieves an accuracy of $97.9\%$, a speedup of $4.47\times$, and a compression ratio of $2.92$ at the cost of a $0.4\%$ drop in accuracy.
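For orientation, the kind of primitives a CP tensor decomposition relies on can be sketched as below. This is a minimal CPU-only NumPy illustration, not the tensor-core implementation described above. It assumes the standard alternating least squares (ALS) formulation for a third-order tensor, which the abstract does not name explicitly, and the function names `khatri_rao`, `unfold`, `cp_als`, and `reconstruct` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (J x R) and (K x R) -> (J*K x R)."""
    J, R = B.shape
    K, _ = C.shape
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

def unfold(X, mode):
    """Mode-n unfolding; the remaining axes are flattened in C order."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def cp_als(X, rank, n_iter=50, seed=0):
    """Rank-`rank` CP decomposition of a third-order tensor via ALS (sketch)."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            # The two factor matrices that are held fixed in this update.
            first, second = [factors[m] for m in range(3) if m != mode]
            # Matricized-tensor times Khatri-Rao product (MTTKRP).
            mttkrp = unfold(X, mode) @ khatri_rao(first, second)
            # Gram matrix of the Khatri-Rao product, formed as a Hadamard product.
            gram = (first.T @ first) * (second.T @ second)
            # Least-squares update of the current factor matrix.
            factors[mode] = mttkrp @ np.linalg.pinv(gram)
    return factors

def reconstruct(factors):
    """Rebuild the tensor from its three CP factor matrices."""
    A, B, C = factors
    return np.einsum('ir,jr,kr->ijk', A, B, C)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_factors = [rng.standard_normal((n, 2)) for n in (6, 5, 4)]
    X = reconstruct(true_factors)
    estimate = cp_als(X, rank=2)
    rel_err = np.linalg.norm(X - reconstruct(estimate)) / np.linalg.norm(X)
    print(f"relative reconstruction error: {rel_err:.2e}")
```

The matrix products in the MTTKRP step and the Gram-matrix solve are dense linear algebra, which is the sort of workload that maps naturally onto matrix-multiply hardware; how the paper actually schedules these primitives on tensor cores is not shown here.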
