Publication | Closed Access
OpenMP to GPGPU
Citations: 413
References: 13
Year: 2009
Venue: Unknown
Keywords: Engineering, GPU Benchmarking, Computer Architecture, GPU Computing, Hardware Security, Compute Kernel, System Software, Parallel Computing, Compilers, Important Kernels, Computer Engineering, Computer Science, GPU Cluster, GPU Architecture, Program Analysis, Parallel Programming, Compiler Framework, CUDA-based GPGPU Applications, OpenMP
GPGPUs have emerged as powerful high‑performance computing platforms, yet programming them via CUDA remains complex and error‑prone. This paper introduces a compiler framework that automatically translates standard OpenMP applications into CUDA‑based GPGPU programs to enhance programmability. The framework performs source‑to‑source translation and applies compile‑time optimizations, including key transformation techniques that enable efficient GPU global memory access. Experiments on the JACOBI and SPMUL kernels and the NAS OpenMP Parallel Benchmarks EP and CG demonstrate that the translator achieves speed‑ups of up to 50× over unoptimized OpenMP translations and up to 328× over serial execution.
GPGPUs have recently emerged as powerful vehicles for general-purpose high-performance computing. Although a new Compute Unified Device Architecture (CUDA) programming model from NVIDIA offers improved programmability for general computing, programming GPGPUs is still complex and error-prone. This paper presents a compiler framework for automatic source-to-source translation of standard OpenMP applications into CUDA-based GPGPU applications. The goal of this translation is to further improve programmability and make existing OpenMP applications amenable to execution on GPGPUs. In this paper, we have identified several key transformation techniques, which enable efficient GPU global memory access, to achieve high performance. Experimental results from two important kernels (JACOBI and SPMUL) and two NAS OpenMP Parallel Benchmarks (EP and CG) show that the described translator and compile-time optimizations work well on both regular and irregular applications, leading to performance improvements of up to 50X over the unoptimized translation (up to 328X over serial).