Fault-tolerant broadcasts and multicasts: the problem of inconsistency and contamination

Abstract

An increasingly important paradigm for designing fault-tolerant applications for distributed systems is based on processes that communicate exclusively via fault-tolerant broadcasts and multicasts. Most broadcasts that are described in the literature, such as reliable broadcast, causal broadcast, atomic broadcast and the corresponding multicasts, specify the behavior of correct processes, but do not impose requirements on the behavior of faulty processes. Such specifications allow a process that fails during a broadcast to reach an "inconsistent" state (e.g., by omitting the delivery of a message), and to continue execution from that state. This faulty process may later broadcast messages that "contaminate" the correct processes. In this thesis, we argue that such inconsistency and contamination can complicate the design of applications, and we present fault-tolerant broadcast and multicast protocols that prevent inconsistency and contamination. We begin by formally defining a hierarchy of different types of process inconsistency; these definitions are general, and hence are valid for any broadcast specification. Intuitively, contamination is the "spread" of inconsistency from faulty processes to correct processes. We formalize this concept and show that only two forms of contamination arise from our hierarchy of types of inconsistency. Atomic broadcast and atomic multicast are powerful communication abstractions that are central to many systems (e.g., Isis, and IBM's HAS), and to Lamport's state machine approach to fault-tolerance. Using our general definitions of inconsistency and contamination, we derive necessary and sufficient conditions to prevent inconsistency and/or contamination when processes communicate using atomic broadcast. We also derive similar conditions for atomic multicast. Based on these conditions, we develop atomic broadcast protocols that prevent inconsistency and/or contamination. In general, the prevention of inconsistency is a stronger requirement (and more difficult and more expensive to enforce) than the prevention of contamination. We characterize a class of problems for which the prevention of contamination is as good as the prevention of inconsistency. We show that an application that solves a problem in this class under the simplifying assumption that both inconsistency and contamination are prevented remains correct even if it uses a (less expensive) broadcast protocol that only prevents contamination.