Fault Tolerant Communication Library and Applications for High Performance Computing

Citations: 13

References: 10

Year: 2003

Abstract

With the increasing number of processors on today's machines, the probability of node or link failures is also increasing. Application-level fault tolerance is therefore becoming an increasingly important issue for both end users and the institutions running the machines. This paper presents the semantics of a fault-tolerant version of the Message Passing Interface (MPI), the de facto standard for communication in scientific applications, which gives applications the possibility to recover from a node or link error and continue execution in a well-defined way. The architecture of FT-MPI, an implementation of MPI using these semantics, is presented together with benchmark results for various applications. An example of a fault-tolerant parallel equation solver, performance results, and the time required to recover from a process failure are also detailed.
