Publication | Closed Access
Process Fault Tolerance: Semantics, Design and Applications for High Performance Computing
Citations: 54 | References: 16 | Year: 2005
Cluster Computing, Engineering, Computer Architecture, Fault Tolerance, High Performance Computing, Fault-tolerant Messaging, Formal Verification, Reliability Engineering, Systems Engineering, Fault-tolerant MPI, Parallel Computing, Current Machines, Concurrent Programming, Computer Engineering, Computer Science, High Availability Software, Fault-tolerant Network, Distributed Computing, Program Analysis, Formal Methods, Parallel Programming, Concurrent Data Structure, System Software, Link Failures, Process Fault Tolerance
With increasing numbers of processors on current machines, the probability of node or link failures is also increasing. Application-level fault tolerance is therefore becoming an increasingly important issue for both end users and the institutions running the machines. In this paper we present the semantics of a fault-tolerant version of the Message Passing Interface (MPI), the de facto standard for communication in scientific applications, which allows applications to recover from a node or link error and continue execution in a well-defined way. We present the architecture of fault-tolerant MPI, an implementation of MPI based on these semantics, together with benchmark results for various applications. We also detail an example of a fault-tolerant parallel equation solver, along with performance results and the time required to recover from a process failure.