Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis
Year: 2022
Keywords: GPU benchmarking, computer architecture, Nvidia Ampere architecture, GPU computing, tensor core unit, parallel computing, PTX ISA instructions, hardware acceleration, program analysis, parallel programming
Graphics Processing Units (GPUs) are now considered the leading hardware for accelerating general-purpose workloads such as AI, data analytics, and HPC. Over the last decade, researchers have focused on demystifying and evaluating the microarchitectural features of various GPU architectures beyond what vendors reveal. This line of work is necessary to understand the hardware better and to build more efficient workloads and applications. Many works have studied recent Nvidia architectures, such as Volta and Turing, comparing them to their successor, Ampere. However, some microarchitectural features, such as the clock cycles taken by individual instructions, have not been extensively studied for the Ampere architecture. In this paper, we study the clock cycles per instruction for the various data types found in the instruction-set architecture (ISA) of Nvidia GPUs. We measure the clock cycles for PTX ISA instructions and their SASS ISA counterparts using microbenchmarks. We further measure the clock cycles needed to access each memory unit. Moreover, we demystify the new version of the tensor core unit found in the Ampere architecture by measuring, through the WMMA API, its clock cycles per instruction and its throughput for the different data types and input shapes. Our results should guide software developers and hardware architects. Furthermore, clock cycles per instruction are widely used by performance-modeling simulators and tools to model and predict hardware performance.
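To illustrate the kind of instruction-latency microbenchmark the abstract refers to, the sketch below times a chain of data-dependent FMA operations with the per-SM cycle counter and divides by the chain length. This is a generic sketch of the technique, not the paper's actual benchmark code; the kernel name, iteration count, and use of `clock64()` are assumptions.

```cuda
// Hypothetical sketch of a dependent-chain latency microbenchmark.
// Each fmaf() consumes the previous result, so instructions cannot
// overlap in the pipeline and elapsed cycles approximate pure latency.
#include <cstdio>

__global__ void fma_latency(float *out, long long *cycles, int iters) {
    float a = 1.0f, b = 2.0f, c = 3.0f;
    long long start = clock64();          // SM cycle counter before the chain
    for (int i = 0; i < iters; ++i)
        a = fmaf(a, b, c);                // serial dependence: a feeds the next FMA
    long long stop = clock64();
    *out = a;                             // keep the chain live past dead-code elimination
    *cycles = stop - start;
}

int main() {
    float *out; long long *cycles;
    cudaMallocManaged(&out, sizeof(float));
    cudaMallocManaged(&cycles, sizeof(long long));
    const int iters = 1024;
    fma_latency<<<1, 1>>>(out, cycles, iters);  // single thread avoids interference
    cudaDeviceSynchronize();
    printf("~%.2f cycles per FMA (incl. loop overhead)\n",
           (double)*cycles / iters);
    return 0;
}
```

A careful measurement would unroll the loop (or subtract the timed cost of an empty loop) so that branch and loop-counter instructions do not inflate the per-instruction figure; the same skeleton extends to memory loads and WMMA fragments to characterize memory units and tensor cores.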