Publication | Closed Access
Simplifying the Recovery Model of User-Level Failure Mitigation
Citations: 21
References: 17
Year: 2014
Unknown Venue
Keywords: Software Maintenance, Engineering, Recovery Model, Computer Architecture, Software Engineering, Fault Tolerance, Fault-tolerant Messaging, Software Analysis, ULFM Usage, Reliability Engineering, Failure Analysis, Systems Engineering, Fault Recovery, Parallel Computing, Reliability, Computer Engineering, Batch Queue, Computer Science, High Availability Software, Program Analysis, Software Testing, Reliability Management, Resilience Research, Business, Parallel Programming, High Availability, Crisis Management, Fault Injection
As resilience research in high-performance computing has matured, so too have the tools, libraries, and languages that result from it. The Message Passing Interface (MPI) Forum is considering the addition of fault tolerance to a future version of the MPI standard, and a new chapter called User-Level Failure Mitigation (ULFM) has been proposed to fill this need. However, as ULFM usage has become more widespread, many potential users are concerned about its complexity and the need to rewrite existing codes. In this paper, we present a usage model that is similar to the usage already common in existing codes but that does not require the user to restart the application (thereby incurring the costs of re-entering the batch queue, startup costs, etc.). We use a new implementation of ULFM in MPICH, a popular open source MPI implementation, and demonstrate the ULFM usage using the Monte Carlo Communication Kernel, a proxy-app developed by the Center for Exascale Simulation of Advanced Reactors. Results show that the approach used incurs a level of intrusiveness into the code similar to that of existing checkpoint/restart models, but with less overhead.