Publication | Open Access
Understanding GPU Memory Corruption at Extreme Scale: The Summit Case Study
Citations: 11
References: 33
Year: 2024
Unknown Venue
GPU memory corruption, and in particular double-bit errors (DBEs), remains one of the least understood aspects of HPC system reliability. Although DBEs are rare, every occurrence leads to job termination and can cost thousands of node-hours, either through wasted computation or through the overhead of the regular checkpointing needed to limit such losses. As supercomputers and their components simultaneously grow in scale, density, failure rates, and environmental footprint, the efficiency of HPC operations becomes both an imperative and a challenge.