Path-based failure and evolution management

TLDR

The paper proposes managing failures and evolution in large, complex distributed systems through runtime request paths. Paths record component interactions and performance, and automated statistical analysis across many paths detects failures, diagnoses problems, and assesses evolution issues. The approach is explored in three real implementations, two of which service millions of requests per day, and draws on several years of experience with a high-volume production service. It strengthens failure detection, failure diagnosis, impact analysis, and understanding of system evolution, supported by a maintainable, extensible, and reusable architecture and several statistical analysis engines.

Abstract

We present a new approach to managing failures and evolution in large, complex distributed systems using runtime paths. We use the paths that requests follow as they move through the system as our core abstraction, and our macro approach focuses on component interactions rather than the details of the components themselves. Paths record component performance and interactions, are user- and request-centric, and occur in sufficient volume to enable statistical analysis, all in a way that is easily reusable across applications. Automated statistical analysis of multiple paths allows for the detection and diagnosis of complex failures and the assessment of evolution issues. In particular, our approach enables significantly stronger capabilities in failure detection, failure diagnosis, impact analysis, and understanding system evolution. We explore these capabilities with three real implementations, two of which service millions of requests per day. Our contributions include the approach; the maintainable, extensible, and reusable architecture; the various statistical analysis engines; and the discussion of our experience with a high-volume production service over several years.
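
To make the idea of statistical analysis over many paths concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `Path` record, its fields, the component names, and the ranking heuristic (fraction of failed requests among those that touched each component) are all assumptions chosen for illustration, standing in for the statistical engines the abstract mentions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Path:
    """Hypothetical record of one request's route through the system."""
    request_id: str
    components: List[str]  # components visited, in order
    latency_ms: float      # end-to-end latency observed for the request
    failed: bool           # whether the request failed from the user's view


def failure_rate_by_component(paths: List[Path]) -> Dict[str, float]:
    """Fraction of failed paths among the paths that touched each component."""
    seen: Dict[str, int] = defaultdict(int)
    failed: Dict[str, int] = defaultdict(int)
    for path in paths:
        # Count each component once per path, even if visited repeatedly.
        for component in set(path.components):
            seen[component] += 1
            if path.failed:
                failed[component] += 1
    return {c: failed[c] / seen[c] for c in seen}


def rank_suspects(paths: List[Path]) -> List[Tuple[str, float]]:
    """Order components by how strongly they co-occur with failed requests."""
    rates = failure_rate_by_component(paths)
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Illustrative data only; the component names are made up.
paths = [
    Path("r1", ["web", "auth", "db"], 12.0, False),
    Path("r2", ["web", "cache"], 3.0, False),
    Path("r3", ["web", "auth", "search"], 40.0, True),
    Path("r4", ["web", "search"], 35.0, True),
    Path("r5", ["web", "auth", "db"], 11.0, False),
]
rates = failure_rate_by_component(paths)
suspects = rank_suspects(paths)
```

With this sample data, `search` ranks first because every path that touched it failed, while `web` appears on all paths and so carries only the overall failure rate. The sketch works only from which components each request touched, not from component internals, which mirrors the focus on component interactions described above.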
