Publication | Open Access
Reliable communication in the presence of failures
Citations: 991
References: 15
Year: 1987
Cluster Computing, Availability, Engineering, Verification, Reliable Multicast Protocols, Software Engineering, Fault Tolerance, Communication, Fault-tolerant Messaging, Formal Verification, Reliability Engineering, Systems Engineering, Parallel Computing, Communication Facility, Reliability, Distributed Systems, Computer Science, Reliable Communication, Fault-tolerant Network, Distributed Computing, Cloud Computing, Isis System, Formal Methods, System Software
The paper presents the design and correctness of a fault‑tolerant communication facility for distributed systems. The facility implements a family of reliable multicast protocols that support fault‑tolerant process groups, providing high concurrency, application‑specific ordering, and consistent event ordering across failures, recoveries, migrations, and dynamic group changes. A causal‑delivery protocol is introduced as a valuable alternative to conventional asynchronous communication, and its use in the ISIS system demonstrates significant simplification of higher‑level algorithms.
This paper reports on the design and correctness of a communication facility for a distributed computer system. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local- and wide-area networks. These protocols attain high levels of concurrency while respecting application-specific delivery ordering constraints, and their cost and performance vary with the degree of ordering desired. In particular, a protocol that enforces causal delivery orderings is introduced and shown to be a valuable alternative to conventional asynchronous communication protocols. The facility also ensures that the processes belonging to a fault-tolerant process group will observe consistent orderings of events affecting the group as a whole, including process failures, recoveries, migration, and dynamic changes to group properties like member rankings. A review of several uses for the protocols in the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant simplification of higher-level algorithms made possible by our approach.
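To make the idea of causal delivery ordering concrete, here is a minimal sketch in Python. It is not the protocol described in the paper: it swaps in the well-known vector-clock technique, assumes a fixed group of `n` processes with no failures or membership changes, and uses hypothetical names (`Process`, `Message`, `multicast`, `receive`) chosen only for illustration. A receiver holds back a message until every message that causally precedes it has been delivered.

```python
class Message:
    """A multicast message stamped with the sender's vector clock (illustrative)."""

    def __init__(self, sender, clock, payload):
        self.sender = sender
        self.clock = clock      # tuple: sender's vector clock at send time
        self.payload = payload


class Process:
    """One group member; assumes a fixed group of n processes and no failures."""

    def __init__(self, pid, n):
        self.pid = pid
        self.n = n
        self.clock = [0] * n    # clock[k] = number of messages from k delivered here
        self.pending = []       # received but not yet deliverable
        self.delivered = []     # payloads in local delivery order

    def multicast(self, payload):
        # Count our own message and deliver it locally right away.
        self.clock[self.pid] += 1
        self.delivered.append(payload)
        return Message(self.pid, tuple(self.clock), payload)

    def _deliverable(self, m):
        # m must be the next message from its sender, and we must already
        # have delivered everything the sender had delivered when it sent m.
        next_from_sender = m.clock[m.sender] == self.clock[m.sender] + 1
        seen_dependencies = all(
            m.clock[k] <= self.clock[k] for k in range(self.n) if k != m.sender
        )
        return next_from_sender and seen_dependencies

    def receive(self, m):
        # Buffer the message, then deliver everything that has become deliverable.
        self.pending.append(m)
        progress = True
        while progress:
            progress = False
            for q in list(self.pending):
                if self._deliverable(q):
                    self.pending.remove(q)
                    self.clock[q.sender] += 1
                    self.delivered.append(q.payload)
                    progress = True


if __name__ == "__main__":
    p0, p1, p2 = (Process(i, 3) for i in range(3))
    m1 = p0.multicast("a")
    p1.receive(m1)
    m2 = p1.multicast("b")   # "b" causally follows "a"
    p2.receive(m2)           # arrives early, so it is held back
    p2.receive(m1)           # "a" is delivered, which releases "b"
    print(p2.delivered)
```

In the demo, process 2 receives the two messages in the opposite order from which they were sent, yet delivers them in causal order. Messages with no causal relationship carry no such constraint, which is one way to read the abstract's point that concurrency stays high when only the ordering the application needs is enforced.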