Checkpointing and Rollback-Recovery for Distributed Systems

TLDR

The paper addresses how to restore a distributed system to a consistent state after transient failures. It presents two distributed algorithms: one that creates consistent checkpoints and one that performs rollback-recovery. Both tolerate failures that occur during their own execution, and both minimize the number of additional processes forced to checkpoint or roll back. Each process keeps at most two checkpoints in stable storage, a requirement the paper shows to be minimal under general assumptions.

Abstract

We consider the problem of bringing a distributed system to a consistent state after transient failures. We address the two components of this problem by describing a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system to a consistent state. In contrast to previous algorithms, they tolerate failures that occur during their executions. Furthermore, when a process takes a checkpoint, a minimal number of additional processes are forced to take checkpoints. Similarly, when a process rolls back and restarts after a failure, a minimal number of additional processes are forced to roll back with it. Our algorithms require each process to store at most two checkpoints in stable storage. This storage requirement is shown to be minimal under general assumptions.
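
The two-checkpoint storage bound can be illustrated with a small sketch. The abstract does not say how the two slots are used, so the code below assumes one plausible arrangement: a committed ("permanent") checkpoint plus a provisional ("tentative") one that is either promoted or discarded once a coordinated round ends. The class and method names (CheckpointStore, take_tentative, commit, abort) are hypothetical, and the in-memory fields merely stand in for stable storage.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CheckpointStore:
    """Illustrative two-slot checkpoint store for a single process.

    Assumption: one slot holds the last committed checkpoint and the
    other holds a provisional checkpoint awaiting a commit/abort decision.
    """
    permanent: Optional[Any] = None  # last committed checkpoint
    tentative: Optional[Any] = None  # provisional checkpoint, if any

    def take_tentative(self, state: Any) -> None:
        # Save a new checkpoint without discarding the committed one,
        # so a failure during the round still leaves a safe state.
        self.tentative = state

    def commit(self) -> None:
        # The round succeeded: promote the provisional checkpoint.
        if self.tentative is not None:
            self.permanent = self.tentative
            self.tentative = None

    def abort(self) -> None:
        # The round failed: drop the provisional checkpoint.
        self.tentative = None

    def rollback_state(self) -> Optional[Any]:
        # Recovery always restarts from the committed checkpoint.
        return self.permanent

    def stored(self) -> int:
        # Number of checkpoints currently held (never more than two).
        return sum(c is not None for c in (self.permanent, self.tentative))


store = CheckpointStore(permanent={"step": 1})
store.take_tentative({"step": 2})
held_during_round = store.stored()   # both slots are occupied here
store.abort()
after_abort = store.rollback_state()
store.take_tentative({"step": 3})
store.commit()
after_commit = store.rollback_state()
```

Under this assumed arrangement, a process holds two checkpoints only while a round is in progress, and a rollback always has a committed checkpoint to return to even if the round itself fails.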
