Publication | Closed Access
Enabling coordinated register allocation and thread-level parallelism optimization for GPUs
Citations: 71
References: 44
Year: 2015
Venue: Unknown
Keywords: Hardware Security, GPU Architecture, Engineering, Compute Kernel, GPU Benchmarking, Parallel Performance Evaluation, Computer Engineering, Computer Architecture, Maximum Thread-level Parallelism, Memory Access, Thread Switching, Parallel Programming, Computer Science, Parallel Computing, Thread-level Parallelism Optimization, GPU Cluster, GPU Computing
The key to high performance on GPUs lies in massive threading, which enables thread switching to hide functional-unit and memory-access latency. However, running at the maximum thread-level parallelism (TLP) does not necessarily yield optimal performance, because excessive threads contend for cache resources. As a result, thread throttling techniques limit the number of concurrently executing threads in order to preserve data locality. On the other hand, GPUs are equipped with a large register file to enable fast context switches between threads, and throttling techniques designed to mitigate cache contention leave these registers underutilized. Register allocation is therefore a significant factor for performance: it not only determines single-thread performance but also indirectly affects the TLP.
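The register/TLP tradeoff described above can be sketched with a simple occupancy-style calculation. The figures below (a 65,536-register file per streaming multiprocessor and a 2,048-thread residency cap, typical of Kepler-era NVIDIA GPUs from around 2015) are illustrative assumptions, not values from the paper; the point is only that raising per-thread register allocation shrinks the number of threads that can be resident at once.

```python
# Illustrative sketch of how per-thread register allocation caps TLP.
# Assumed hardware figures (Kepler-class SM, NOT from the paper):
REGISTER_FILE_SIZE = 65536   # 32-bit registers per SM
MAX_RESIDENT_THREADS = 2048  # hardware thread-residency limit per SM

def max_concurrent_threads(regs_per_thread: int) -> int:
    """Threads that can be resident on one SM given per-thread register demand."""
    return min(MAX_RESIDENT_THREADS, REGISTER_FILE_SIZE // regs_per_thread)

# More registers per thread -> fewer resident threads -> lower TLP.
for regs in (32, 64, 128):
    print(f"{regs:3d} regs/thread -> {max_concurrent_threads(regs)} threads")
```

Conversely, a throttling scheme that caps concurrency below these limits leaves part of the register file idle, which is the underutilization the abstract points to.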