Publication | Closed Access
A Diskless Checkpointing Algorithm for Super-scale Architectures Applied to the Fast Fourier Transform
Citations: 31
References: 9
Year: 2004
Unknown Venue
Cluster Computing, Engineering, Scientific Algorithms, Computer Architecture, Fault Tolerance, Supercomputer Architecture, Fast Fourier Transform, Hardware Security, Clock Recovery, High-performance Architecture, Systems Engineering, Parallel Computing, Massively-parallel Computing, Computer Engineering, Distributed Systems, Computer Science, Signal Processing, Scalable Computing, Super-scale Fault-tolerance, Cloud Computing, Parallel Programming, Super-scale Architectures
This paper discusses the issue of fault-tolerance in distributed computer systems with tens or hundreds of thousands of diskless processor units. Such systems, like the IBM BlueGene/L, are predicted to be deployed in the next five to ten years. Since a 100,000-processor system is going to be less reliable, scientific applications need to be able to recover from occurring failures more efficiently. In this paper, we adapt the present technique of diskless checkpointing to such huge distributed systems in order to equip existing scientific algorithms with super-scalable fault-tolerance. First, we discuss the method of diskless checkpointing, then we adapt this technique to super-scale architectures and finally we present results from an implementation of the Fast Fourier Transform that uses the adapted technique to achieve super-scale fault-tolerance.
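For readers unfamiliar with the term, the general idea of diskless checkpointing can be sketched as follows. This is a minimal illustration of the classic parity-based variant, in which checkpoint data is kept in the memory of other processes instead of on disk. It is not the adapted algorithm the abstract refers to. The function names (`make_parity`, `recover`) and the toy byte-string states are assumptions made for illustration, and all states are assumed to have equal length.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(states):
    """Build an in-memory parity checkpoint by XOR-ing all process states."""
    return reduce(xor_bytes, states)


def recover(states, parity, failed):
    """Rebuild the state of one failed process from the survivors and the parity."""
    survivors = [s for i, s in enumerate(states) if i != failed]
    return reduce(xor_bytes, survivors, parity)


# Toy example: three processes, each holding a four-byte state (hypothetical data).
states = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = make_parity(states)

# Pretend process 1 fails and its memory is lost.
lost = 1
restored = recover(states, parity, lost)
```

In this sketch a single parity buffer tolerates the loss of one process at a time; tolerating several simultaneous failures would require a different encoding or additional copies.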