Concepedia

Abstract

We design, implement and evaluate a parallel radix sort that simultaneously utilizes the CPU and GPU devices on the AMD Fusion APU. The parallel sort, referred to as Fusion Sort, partitions the sort keys between the CPU and GPU devices and utilizes the integrated memory system of the APU to avoid data copying between the devices. We identify three design issues that impact overhead and performance: the granularity of sharing between the two devices, the scheme of data partitioning and the allocation of data in memory regions accessible by each device. We present three variants of Fusion Sort that share data at coarse and fine granularities and with fixed and variable data partitioning schemes. In each variant, data is allocated to minimize the overhead of non-preferred memory accesses of each device. Our evaluation shows that fine-grain sharing with variable data partitioning performs the best. Further, Fusion Sort outperforms CPU-only and GPU-only parallel radix sorts by up to 1.8X and 1.9X respectively. These results demonstrate the viability of the integrated memory system of the APU in the context of sorting.

References

YearCitations

Page 1