Publication | Closed Access
Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units
Citations: 99
References: 31
Year: 2015
Engineering, GPU Benchmarking, Computer Graphic Technique, Computer Architecture, GPU Computing, Radiation Protection, Hardware Security, Hardening Solutions, Reliability Engineering, Compute Kernel, Systems Engineering, Graphics Processing Units, Radiation-induced Soft Errors, Parallel Computing, Radiation Imaging, Health Sciences, Computer Engineering, Radiation Transport, Computer Science, GPU Cluster, GPU Reliability, Neutron Sensitivity, GPU Architecture, Program Analysis, Parallel Programming
Graphics processing units (GPUs) are increasingly attractive for both safety-critical and high-performance computing applications. GPU reliability is a primary concern for the automotive and aerospace markets and is also becoming an issue for supercomputers: the large number of devices in a data center makes the probability that at least one device is corrupted very high. In this paper, we aim to give novel insights into GPU reliability by evaluating the neutron sensitivity of the memory structures of modern GPUs, highlighting pattern dependence and the occurrence of multiple errors. Additionally, a wide set of parallel codes is exposed to controlled neutron beams to measure the operational error rates of GPUs. From experimental data and algorithm analysis, we derive general insights into the reliability of parallel algorithms and programming approaches. Finally, three hardening strategies (error-correcting code, algorithm-based fault tolerance, and duplication with comparison) are presented and evaluated on GPUs through radiation experiments. We present and compare both the reliability improvement and the overhead imposed by the selected hardening solutions.
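The abstract's point about large installations can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes, for illustration only, that devices fail independently and with the same per-device corruption probability, so that the chance of at least one corrupted device among n is 1 - (1 - p)^n. The function name and the numbers in the usage note are hypothetical.

```python
import math


def prob_at_least_one_corrupted(p_device: float, n_devices: int) -> float:
    """Probability that at least one of n devices is corrupted.

    Assumption (ours, not the paper's): devices are independent and share
    the same corruption probability p_device over the period of interest.
    """
    if not 0.0 <= p_device <= 1.0:
        raise ValueError("p_device must be in [0, 1]")
    if n_devices < 0:
        raise ValueError("n_devices must be non-negative")
    if p_device == 1.0:
        # log1p(-1.0) is undefined, so handle certain failure directly.
        return 1.0 if n_devices > 0 else 0.0
    # 1 - (1 - p)^n, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n_devices * math.log1p(-p_device))


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show the trend.
    for n in (1, 100, 10_000):
        print(n, prob_at_least_one_corrupted(1e-4, n))
```

Under these assumptions, a per-device probability that looks negligible in isolation grows quickly with the device count: with a hypothetical p of 1e-4, ten thousand devices give a probability of roughly 0.63 that at least one is corrupted.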