Publication | Closed Access
Fault Tolerant Communication Library and Applications for High Performance Computing
Citations: 13
References: 10
Year: 2003
Unknown Venue
With the increasing number of processors on today's machines, the probability of node or link failures is also increasing. Application-level fault tolerance is therefore becoming a more important issue for both end users and the institutions running the machines. This paper presents the semantics of a fault-tolerant version of the Message Passing Interface (MPI), the de facto standard for communication in scientific applications, which gives applications the possibility to recover from a node or link error and continue execution in a well-defined way. The architecture of FT-MPI, an implementation of MPI using these semantics, is presented along with benchmark results for various applications. An example of a fault-tolerant parallel equation solver, its performance results, and the time needed to recover from a process failure are also detailed.