Publication | Closed Access
Checkpointing Message-Passing Interface (MPI) parallel programs
Citations: 16
References: 9
Year: 2002
Unknown Venue
Cluster Computing, Engineering, Message-passing Interface, Computer Architecture, Software Engineering, Fault Tolerance, Fault-tolerant Messaging, Software Analysis, Formal Verification, Concurrency (Computer Science), Systems Engineering, Checkpointing Implementation, Parallel Computing, MPI Programs, Message Passing, Concurrent Programming, Many Scientific Problems, Computer Engineering, Distributed Systems, Computer Science, Distributed Computing, Program Analysis, Parallel Programming, System Software
Many scientific problems can be distributed across a large number of processes to take advantage of low-cost workstations. In a parallel system, a failure on any processor can halt the computation and require restarting all applications. Checkpointing is a simple technique for recovering a failed execution. The Message Passing Interface (MPI) is a standard proposed for writing portable message-passing parallel programs. In this paper, we present a checkpointing implementation for MPI programs that is transparent and requires no changes to the application programs. Our implementation combines coordinated checkpointing, uncoordinated checkpointing, and message logging techniques.
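The abstract does not describe how the implementation works internally, so the following is only a minimal sketch of the general idea behind combining a per-process checkpoint with message logging. It is not the paper's method and it does not use MPI. The `Process` class, its methods, and `run_demo` are hypothetical names introduced for illustration, and the checkpoint and log are held in memory here, whereas a real system would presumably write them to stable storage.

```python
import copy


class Process:
    """Toy stand-in for one rank; its state is a running sum of received values.

    Hypothetical illustration only: no MPI calls, no stable storage.
    """

    def __init__(self):
        self.state = {"total": 0, "received": 0}
        self._checkpoint = copy.deepcopy(self.state)
        self._log = []  # messages received since the last checkpoint

    def _apply(self, value):
        self.state["total"] += value
        self.state["received"] += 1

    def receive(self, value):
        # Log the message before delivering it so it can be replayed later.
        self._log.append(value)
        self._apply(value)

    def checkpoint(self):
        # Save the current state; earlier log entries are no longer needed.
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()

    def crash(self):
        # Simulate a failure: the live state is lost.
        self.state = None

    def recover(self):
        # Roll back to the last checkpoint, then replay the logged messages.
        self.state = copy.deepcopy(self._checkpoint)
        for value in self._log:
            self._apply(value)


def run_demo():
    p = Process()
    for v in (1, 2, 3):
        p.receive(v)
    p.checkpoint()
    for v in (10, 20):
        p.receive(v)
    before = copy.deepcopy(p.state)
    p.crash()
    p.recover()
    return before, p.state
```

Calling `run_demo()` returns the state just before the simulated crash and the state after recovery. The two match because the messages received after the checkpoint are replayed from the log, which is what lets a process checkpoint independently of the others without losing work.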