Portable mapping of data parallel programs to OpenCL for heterogeneous systems

TLDR

GPU-based systems promise massive performance at low cost, yet their potential is limited by the difficulty of programming them. This work proposes a compiler that automatically translates data-parallel OpenMP programs into optimized OpenCL code for GPUs. The compiler leverages existing transformations, especially data transformations, to generate efficient OpenCL kernels, and uses a predictive model to decide whether each program should run as OpenCL on the GPU or as OpenMP on the multi-core host. It was evaluated on the NAS parallel benchmark suite on NVIDIA and AMD GPUs. There, the generated code achieved average speedups of 4.51× and 4.20×, respectively, over a sequential baseline. It also outperformed hand-written OpenCL implementations by independent experts by average factors of 1.63 and 1.56.

Abstract

General-purpose GPU-based systems are highly attractive because they offer potentially massive performance at little cost. Realizing this potential is challenging due to the complexity of programming them. This paper presents a compiler-based approach that automatically generates optimized OpenCL code from data-parallel OpenMP programs for GPUs. The approach combines the benefits of a clear high-level language (OpenMP) with those of an emerging standard (OpenCL) for heterogeneous multi-cores. A key feature of our scheme is that it leverages existing transformations, especially data transformations, to improve performance on GPU architectures. It also uses predictive modeling to automatically determine whether it is worthwhile to run the OpenCL code on the GPU or the OpenMP code on the multi-core host. We applied our approach to the entire NAS parallel benchmark suite and evaluated it on two distinct GPU-based systems: Core i7/NVIDIA GeForce GTX 580 and Core i7/AMD Radeon 7970. We achieved average (maximum) speedups of 4.51x and 4.20x (143x and 67x), respectively, over a sequential baseline. On average, this is 1.63 and 1.56 times faster than a hand-coded, GPU-specific OpenCL implementation developed by independent expert programmers.