Publication | Open Access

Understanding GPU Memory Corruption at Extreme Scale: The Summit Case Study

Citations: 11

References: 33

Year: 2024

Abstract

GPU memory corruption, and in particular double-bit errors (DBEs), remains one of the least understood aspects of HPC system reliability. Although rare, DBEs always lead to job termination and can cost thousands of node-hours, either in wasted computation or in the overhead of the regular checkpointing needed to limit such losses. As supercomputers and their components simultaneously grow in scale, density, failure rates, and environmental footprint, the efficiency of HPC operations becomes both an imperative and a challenge.
