Publication | Closed Access
Lessons Learned Implementing User-Level Failure Mitigation in MPICH
Citations: 17
References: 3
Year: 2015
Venue: unknown
Software Maintenance, Engineering, Software Engineering, Fault Tolerance, New API Calls, Fault-tolerant Messaging, Software Analysis, Reliability Engineering, Systems Engineering, Parallel Computing, Failure Detection, Reliability, Computer Engineering, Computer Science, Runtime System, Software Design, High Availability Software, Program Analysis, Software Testing, User-level Failure Mitigation, Runtime Cost, System Software
User-level failure mitigation (ULFM) is becoming the front-running solution for process fault tolerance in MPI. While not yet adopted into the MPI standard, it is being used by applications and libraries and is being considered by the MPI Forum for future inclusion into MPI itself. In this paper, we introduce an implementation of ULFM in MPICH, a high-performance and widely portable implementation of the MPI standard. We demonstrate that while still a reference implementation, the runtime cost of the new API calls introduced is relatively low.