Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems

Abstract

End-users and application developers of high performance computing systems today have access to larger machines and more processors than ever before. Systems such as the Earth Simulator, the ASCI-Q machines, or the IBM Blue Gene consist of thousands or even tens of thousands of processors. Machines comprising 100,000 processors are expected within the next few years. A critical issue for systems with such large numbers of processors is the ability of the machine to deal with process failures. Extrapolating from current experience on the top-end machines, a 100,000-processor machine will experience a process failure every few minutes [1]. While on earlier massively parallel processing systems (MPPs) a crashing node often led to a crash of the whole system, current architectures are more robust. Typically, the applications using the failed processor have to abort; the machine as a whole, however, is not affected by the failure. This robustness is the result of improvements in the hardware as well as in the system software.

1.2 Current Parallel Programming Paradigms

Current parallel programming paradigms for high-performance computing systems rely mainly on message passing, especially on the Message-Passing Interface (MPI) [12][13].
