Publication | Closed Access
A coordinated tiling and batching framework for efficient GEMM on GPUs
Citations: 74
References: 24
Year: 2019
Venue: Unknown
Keywords: Cluster Computing, Engineering, Machine Learning, Efficient GEMM, Computer Architecture, Single CUDA Kernel, GPU Computing, General Matrix Multiplication, Compute Kernel, Data Science, Parallel Computing, Computational Geometry, Massively-Parallel Computing, Computer Engineering, Computer Science, Deep Learning, GPU Cluster, Computational Science, GPU Architecture, Thread Hierarchy, Coordinated Tiling, Many-Core Architecture, Parallel Programming, Batching Framework
General matrix multiplication (GEMM) plays a paramount role in a broad range of domains such as deep learning, scientific computing, and image processing. The primary optimization method is to partition the matrix into many tiles and exploit the parallelism within and between tiles; this tiling hierarchy closely mirrors the thread hierarchy on GPUs. In practice, GPUs can fully unleash their computing power only when the matrices are large, so that there are a sufficient number of tiles and sufficient workload per tile. However, in many real-world applications, especially in the deep learning domain, the matrices are small. To this end, prior work proposes batched GEMM, which processes a group of small independent GEMMs together within a single CUDA kernel.
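To illustrate the batched-GEMM idea described above (this is a minimal NumPy sketch of the concept, not the paper's CUDA implementation), the point is that many small independent products are grouped into one operation rather than dispatched one at a time:

```python
import numpy as np

# A group of small independent GEMMs: C[i] = A[i] @ B[i]
batch, m, k, n = 64, 8, 8, 8
rng = np.random.default_rng(0)
A = rng.standard_normal((batch, m, k))
B = rng.standard_normal((batch, k, n))

# Naive approach: one GEMM at a time
# (analogous to launching one GPU kernel per small matrix product)
C_loop = np.stack([A[i] @ B[i] for i in range(batch)])

# Batched approach: a single call processes the whole group
# (analogous to a single CUDA kernel covering all GEMMs,
#  giving the device enough tiles to keep all cores busy)
C_batched = np.matmul(A, B)

assert np.allclose(C_loop, C_batched)
```

On a GPU the batched form matters because each 8x8 product alone produces too few tiles to occupy the hardware; only the group as a whole exposes enough parallelism.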