Publication | Closed Access
Understanding the propagation of transient errors in HPC applications
Citations: 83 | References: 38 | Year: 2015 | Venue: Unknown
Engineering, Computer Architecture, Fault Tolerance, Software Analysis, Hardware Security, Reliability Engineering, Application-driven Error Detection, Fault Analysis, Systems Engineering, Fault Recovery, Modeling And Simulation, Exascale Systems, Failure Detection, Hardware Reliability, Computer Engineering, Computer Science, HPC Applications, Program Analysis, Software Testing, Circuit Reliability, Transient Errors, Fault Injection, System Software
Resiliency of exascale systems has quickly become an important concern for the scientific community. Despite its importance, much remains to be determined about how faults disseminate and at what rate they impact HPC applications. Understanding where and how fast faults propagate could lead to more efficient implementations of application-driven error detection and recovery.