Publication | Closed Access
Error recovery in shared memory multiprocessors using private caches
Citations: 81
References: 27
Year: 1990
Engineering; Error Recovery; Computer Architecture; Memory Model (Programming); Hardware Systems; Shared Memory; Concurrency (Computer Science); Systems Engineering; Fault Recovery; Parallel Computing; Memory Management; Private Caches; Computer Engineering; Distributed Systems; Computer Science; Operating Systems; Recovery Scheme; Processor Transient Faults; Parallel Programming; Real-time Systems; Asynchronous Systems
The problem of recovering from processor transient faults in shared memory multiprocessor systems is examined. A user-transparent checkpointing and recovery scheme using private caches is presented. Processes can recover from errors due to faulty processors by restarting from the checkpointed computation state. Implementation techniques using checkpoint identifiers and recovery stacks are examined as a means of reducing performance degradation in processor utilization during normal execution. This cache-based checkpointing technique prevents rollback propagation, provides rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions to take error latency into account are presented.
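The abstract gives no implementation details, so the following is only a toy sketch of the general idea behind cache-based checkpointing, not the paper's scheme. It assumes that data modified since the last checkpoint stays in the private cache, that a checkpoint commits those lines to shared memory together with a register snapshot, and that a checkpoint is taken before dirty data is exposed to another processor (one plausible way to avoid rollback propagation). The class `ToyProcessor` and all of its methods are hypothetical names. The checkpoint identifiers and recovery stacks mentioned in the abstract are not modeled; flushing every dirty line at each checkpoint is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class ToyProcessor:
    """Toy model of one processor with a private write-back cache.

    All names are hypothetical; this is an illustration, not the paper's design.
    """
    memory: dict                                          # shared memory: checkpointed state only
    registers: dict = field(default_factory=dict)         # live register state
    dirty: dict = field(default_factory=dict)             # address -> value written since last checkpoint
    saved_registers: dict = field(default_factory=dict)   # register snapshot from last checkpoint
    checkpoint_id: int = 0                                # number of checkpoints taken so far

    def read(self, addr):
        # A private dirty copy shadows the checkpointed copy in shared memory.
        return self.dirty.get(addr, self.memory.get(addr, 0))

    def write(self, addr, value):
        # Writes stay in the private cache until the next checkpoint.
        self.dirty[addr] = value

    def checkpoint(self):
        # Commit dirty lines to shared memory and snapshot the registers.
        self.memory.update(self.dirty)
        self.dirty.clear()
        self.saved_registers = dict(self.registers)
        self.checkpoint_id += 1

    def rollback(self):
        # Recovery: discard uncommitted lines and restore the register snapshot.
        self.dirty.clear()
        self.registers = dict(self.saved_registers)

    def serve_remote_read(self, addr):
        # Assumption: checkpoint before another processor can observe dirty data,
        # so a later rollback here never invalidates what that processor read.
        if addr in self.dirty:
            self.checkpoint()
        return self.memory.get(addr, 0)


def demo():
    shared = {0x10: 1}
    p = ToyProcessor(memory=shared)
    p.registers["pc"] = 100
    p.checkpoint()              # checkpointed state: mem[0x10] = 1, pc = 100
    p.write(0x10, 99)           # uncommitted update held in the private cache
    p.registers["pc"] = 140
    p.rollback()                # a transient fault is detected; restart from the checkpoint
    return p.read(0x10), p.registers["pc"], p.checkpoint_id
```

In this model recovery is fast because rollback only discards private dirty lines and reloads a register snapshot; shared memory never holds uncommitted data, so no other processor has to roll back.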