Concepedia

Abstract

Many scientific problems can be distributed on a large number of processes to take advantage of low cost workstations. In a parallel systems, a failure on any processor can halt the computation and requires restarting all applications. Checkpointing is a simple technique to recover the failed execution. Message Passing Interface (MPI) is a standard proposed for writing portable message-passing parallel programs. In this paper, we present a checkpointing implementation for MPI programs, which is transparent, and requires no changes to the application programs. Our implementation combines coordinated, uncoordinated and message logging techniques.

References

YearCitations

Page 1