Publication | Closed Access
Soft fault detection and correction for multigrid
Citations: 13
References: 22
Year: 2017
Hardware Security, Silent Data Corruption, Full Approximation Scheme, Reliability Engineering, Soft Fault Detection, Engineering, Smart Grid, Verification, Fault Analysis, Computer Engineering, Formal Methods, Systems Engineering, Fault Recovery, Fault-tolerant Control, Computer Science, Fault Detection, Formal Verification, Fault Injection
We introduce a novel algorithm-based fault-tolerance scheme to detect and repair soft transient faults (silent data corruption, bitflips) in multigrid solvers: by applying the full approximation scheme (FAS) variant of multigrid to linear systems, we prove invariants that enable fault detection and correction, and ultimately lead to a black-box protection of the smoothing stage. A statistical analysis for a wide range of prototypical problems demonstrates the efficiency of our approach, especially compared with full checksum protection. In particular, the overhead of our new method is negligible in the fault-free case, since we only employ readily available quantities.
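The abstract does not state the invariants themselves, so the following is only a minimal sketch of the general idea of guarding a smoothing step with a quantity the solver already computes. It is not the paper's FAS-based scheme. It assumes a 1D Poisson model problem with damped Jacobi smoothing, for which the residual 2-norm cannot increase in exact arithmetic, and it repairs a detected fault by redoing the sweep from the unmodified input. All names (`poisson_1d`, `jacobi_sweep`, `flip_bit`, `protected_smooth`) are hypothetical.

```python
import numpy as np


def poisson_1d(n):
    """Dense 1D Poisson matrix with Dirichlet boundaries, scaled by 1/h^2."""
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2


def jacobi_sweep(A, u, f, omega=2.0 / 3.0):
    """One damped Jacobi sweep; returns a new array, u is left untouched."""
    return u + omega * (f - A @ u) / np.diag(A)


def flip_bit(x, index, bit):
    """Simulate a soft fault: flip one bit of the float64 entry x[index]."""
    y = x.copy()
    bits = y.view(np.uint64)  # reinterpret the same memory as integers
    bits[index] ^= np.uint64(1) << np.uint64(bit)
    return y


def protected_smooth(A, u, f, inject=None):
    """Smooth once, check the residual norm, and redo the sweep on a fault.

    Assumption: for this model problem a fault-free damped Jacobi sweep
    never increases the residual 2-norm, so an increase signals corruption.
    `inject` is an optional (index, bit) pair used only for fault injection.
    """
    r_before = np.linalg.norm(f - A @ u)
    u_new = jacobi_sweep(A, u, f)
    if inject is not None:
        u_new = flip_bit(u_new, *inject)
    r_after = np.linalg.norm(f - A @ u_new)
    fault = bool(not np.isfinite(r_after) or r_after > r_before)
    if fault:
        # Repair: the input u serves as the checkpoint, so recompute from it.
        u_new = jacobi_sweep(A, u, f)
    return u_new, fault
```

Calling `protected_smooth(A, u0, f, inject=(10, 55))` with `A = poisson_1d(31)`, `f = np.ones(31)` and `u0 = np.zeros(31)` flips an exponent bit in one entry, which the check flags and repairs. A check of this kind is coarse: flips in low-order mantissa bits, or flips that shrink a value, can leave the residual norm below its previous value and pass undetected.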