The functional and performance tolerance of GPUs to permanent faults in registers

Abstract

Massively parallel many-core Graphics Processing Unit (GPU) architectures offer significant performance speedups over contemporary multicore CPUs on workloads with thread-level parallelism. For this reason, general-purpose computing on GPUs (GPGPU) is a rapidly expanding research direction in many contexts. Unlike graphics processing, GPGPU computing requires reliable operation in the presence of hardware faults, and the probability of such faults is expected to be significant in current and forthcoming advanced manufacturing technologies. In this paper, we focus on the tolerance of GPUs to permanent faults in their most critical storage elements: the register files. By performing a comprehensive fault injection campaign on a cycle-accurate GPGPU architectural simulator, we first evaluate and classify the behavior of NVIDIA GPU CUDA kernels in the presence of permanent faults in registers. We then analyze the performance tolerance of GPUs operating in degraded mode (fewer hardware resources, reduced thread-level parallelism) due to multiple permanent faults in the registers of their streaming multiprocessors. Our findings confirm the intuitively expected tolerance of these architectures to faults and quantify it across different configurations and modes.
