Publication | Open Access
Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers
Citations: 201
References: 25
Year: 2018
Numerical Analysis, GPU Tensor Cores, Engineering, GPU Benchmarking, Classical FP16 Arithmetic, Hardware Algorithm, Computer Architecture, Ax = b, GPU Computing, Array Computing, High-Performance Architecture, Computing Systems, General HPC Problem, Parallel Computing, Computer Engineering, Computer Science, GPU Cluster, Computational Science, GPU Architecture, Hardware Acceleration, Parallel Programming, Fast FP16 Arithmetic
Low-precision floating-point arithmetic is a powerful tool for accelerating scientific computing applications, especially those in artificial intelligence. Here, we present an investigation showing that other high-performance computing (HPC) applications can also harness this power. Specifically, we consider the general HPC problem Ax = b, where A is a large dense matrix and a double-precision (FP64) solution is needed for accuracy. Our approach is based on mixed-precision (FP16-FP64) iterative refinement, and we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly tuned implementations. These new methods show how using half-precision Tensor Cores (FP16-TC) for the arithmetic can provide up to 4× speedup. The speedup comes from two sources: the performance boost that FP16-TC provide, and their improved accuracy over classical FP16 arithmetic, which is obtained because the GEMM accumulation occurs in FP32 arithmetic.
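The idea of mixed-precision iterative refinement summarized above can be illustrated with a small sketch. This is not the authors' GPU implementation: it is a pure-Python toy that assumes a small, diagonally dominant matrix (so LU without pivoting is safe), emulates FP16 storage by rounding through `struct`'s half-precision format, and uses classical refinement with FP64 residuals. The function names (`fp16`, `lu_fp16`, `lu_solve`, `refine`), the tolerance, and the test matrix are all hypothetical choices made for illustration.

```python
import struct


def fp16(x):
    """Round a Python float (FP64) to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def lu_fp16(A):
    """LU factorization without pivoting; every stored entry is rounded to FP16.

    Assumption: A is diagonally dominant, so skipping pivoting is safe here.
    """
    n = len(A)
    LU = [[fp16(v) for v in row] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            LU[i][k] = fp16(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = fp16(LU[i][j] - LU[i][k] * LU[k][j])
    return LU


def lu_solve(LU, b):
    """Solve L U x = b by forward and back substitution (unit-diagonal L)."""
    n = len(b)
    x = list(b)
    for i in range(n):                 # forward substitution: L y = b
        for j in range(i):
            x[i] -= LU[i][j] * x[j]
    for i in reversed(range(n)):       # back substitution: U x = y
        for j in range(i + 1, n):
            x[i] -= LU[i][j] * x[j]
        x[i] /= LU[i][i]
    return x


def refine(A, b, tol=1e-12, max_iter=50):
    """Classical iterative refinement: low-precision factors, FP64 residuals."""
    n = len(b)
    LU = lu_fp16(A)                    # the expensive O(n^3) step, done in "FP16"
    x = lu_solve(LU, b)                # initial low-accuracy solution
    for it in range(max_iter):
        # Residual r = b - A x, computed with the original FP64 matrix.
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol * max(abs(v) for v in b):
            return x, it
        d = lu_solve(LU, r)            # correction reuses the cheap factors
        x = [xi + di for xi, di in zip(x, d)]
    return x, max_iter


# Hypothetical test problem with a known solution.
A = [[4.1, 1.3, -0.7],
     [1.2, 5.3, 0.9],
     [-0.8, 1.1, 6.2]]
x_true = [1.0, -2.0, 3.0]
b = [sum(A[i][j] * x_true[j] for j in range(3)) for i in range(3)]

x, iters = refine(A, b)
err = max(abs(xi - ti) for xi, ti in zip(x, x_true))
print(f"refinement steps: {iters}, max error: {err:.2e}")
```

The design point the sketch tries to convey is that the cubic-cost factorization runs once in low precision, while each refinement step costs only a quadratic-cost residual and triangular solve in higher precision. Rounding each stored entry to FP16 only loosely mimics real hardware; in particular it does not model the FP32 accumulation inside Tensor Core GEMMs that the abstract credits for improved accuracy.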