Publication | Open Access
Toward Local Failure Local Recovery Resilience Model using MPI-ULFM
Citations: 74
References: 13
Year: 2014
Venue: Unknown
Keywords: Software Maintenance, Cluster Computing, Engineering, Computer Architecture, Software Engineering, Fault Tolerance, System Reliability, Fault-tolerant Messaging, LFLR Model, Reliability Engineering, Systems Engineering, Fault Recovery, Modeling And Simulation, Parallel Computing, Reliability, Computer Engineering, Computer Science, Current System Reaction, High Availability Software, Reliability Modelling, Software Testing, Reliability Management, Single MPI Process, High Availability, Disaster Risk Reduction, System Software
The current system reaction to the loss of a single MPI process is to kill all the remaining processes and restart the application from the most recent checkpoint. This approach will become infeasible for future extreme-scale systems. We address this issue using an emerging resilient computing model called Local Failure Local Recovery (LFLR), which gives application developers the ability to recover locally and continue application execution when a process is lost. We discuss the design of our software framework, which enables the LFLR model using MPI-ULFM, and demonstrate a resilient version of MiniFE that achieves scalable recovery from process failures.
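
The abstract does not say how the framework stores or restores state, so the following is only a toy illustration of the LFLR idea, not the authors' design and not MPI-ULFM code. It assumes a hypothetical scheme in which each simulated rank keeps a copy of its left neighbour's last saved state, so that a lost rank can be rebuilt from one neighbour while the survivors keep running. All names (`Rank`, `buddy_copy`, `recover_locally`, `demo`) are invented for this sketch, and the failure is injected right after a save so that the restored state is consistent with the survivors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rank:
    """One simulated process (hypothetical model, no real MPI involved)."""
    rank_id: int
    state: Optional[int] = 0   # this rank's local application state
    buddy_copy: int = 0        # last saved state of the left neighbour
    alive: bool = True


def step(ranks: List[Rank]) -> None:
    """Advance every rank's local state by one toy compute step."""
    for r in ranks:
        r.state += r.rank_id + 1


def save_to_buddy(ranks: List[Rank]) -> None:
    """Each rank stores its current state on its right neighbour."""
    n = len(ranks)
    for r in ranks:
        ranks[(r.rank_id + 1) % n].buddy_copy = r.state


def fail(ranks: List[Rank], victim: int) -> None:
    """Simulate the loss of one process: its local state is gone."""
    ranks[victim].alive = False
    ranks[victim].state = None


def recover_locally(ranks: List[Rank], victim: int) -> None:
    """Rebuild only the lost rank; surviving ranks are left untouched."""
    n = len(ranks)
    right = ranks[(victim + 1) % n]   # holds the victim's saved state
    left = ranks[(victim - 1) % n]    # the replacement must protect this rank again
    ranks[victim] = Rank(rank_id=victim,
                         state=right.buddy_copy,
                         buddy_copy=left.state)


def demo() -> List[Optional[int]]:
    ranks = [Rank(rank_id=i) for i in range(4)]
    step(ranks)
    step(ranks)
    save_to_buddy(ranks)
    fail(ranks, victim=2)          # failure injected right after a save
    recover_locally(ranks, victim=2)
    return [r.state for r in ranks]


if __name__ == "__main__":
    print(demo())
```

In this sketch no surviving rank is restarted or rolled back. Only the replacement for the lost rank reads saved data, which is the contrast with the kill-all-and-restart behaviour described above.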