Concepedia

TLDR

Distributed systems composed of many communicating components are notoriously difficult to debug, especially for performance problems. The difficulty grows when the components are black boxes: vendor software supplied without source code, which often exceeds the expertise of typical solution-provider staff. The study aims to build tools that let programmers of varying skill levels isolate performance bottlenecks in such systems by passively collecting message-level traces, without requiring knowledge of node internals or message semantics. The approach uses two algorithms to infer the dominant causal paths: one infers inter-call causality from RPC timing data, and the other applies signal-processing techniques to the message traces. Both attribute delay to specific nodes on specific causal paths, and neither requires modifications to applications, middleware, or messages.

Abstract

Many interesting large-scale systems are distributed systems of multiple communicating components. Such systems can be very hard to debug, especially when they exhibit poor performance. The problem becomes much harder when systems are composed of "black-box" components: software from many different (perhaps competing) vendors, usually without source code available. Typical solutions-provider employees are not always skilled or experienced enough to debug these systems efficiently. Our goal is to design tools that enable modestly-skilled programmers (and experts, too) to isolate performance bottlenecks in distributed systems composed of black-box nodes.

We approach this problem by obtaining message-level traces of system activity, as passively as possible and without any knowledge of node internals or message semantics. We have developed two very different algorithms for inferring the dominant causal paths through a distributed system from these traces. One uses timing information from RPC messages to infer inter-call causality; the other uses signal-processing techniques. Our algorithms can ascribe delay to specific nodes on specific causal paths. Unlike previous approaches to similar problems, our approach requires no modifications to applications, middleware, or messages.
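The RPC-timing idea can be illustrated with a small sketch. This is not the paper's algorithm, only a simplified heuristic under stated assumptions: each call/return pair has already been matched in the trace, a call from node B is assumed to be caused by the most recently started call into B whose interval encloses it, and a node's own delay is its call duration minus the time spent in inferred child calls. The names `Call`, `infer_parent`, and `node_delays` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    caller: str
    callee: str
    t_call: float    # time the call message was observed
    t_return: float  # time the matching return message was observed


def infer_parent(child, calls):
    """Pick a likely cause for `child` using timing alone (an assumption).

    Candidates are calls into child's caller whose interval encloses the
    child's interval; the most recently started candidate is chosen.
    """
    candidates = [
        p for p in calls
        if p is not child
        and p.callee == child.caller
        and p.t_call <= child.t_call
        and child.t_return <= p.t_return
    ]
    return max(candidates, key=lambda p: p.t_call) if candidates else None


def node_delays(calls):
    """Attribute to each callee the time not spent waiting on child calls."""
    child_time = {c: 0.0 for c in calls}
    for c in calls:
        parent = infer_parent(c, calls)
        if parent is not None:
            child_time[parent] += c.t_return - c.t_call

    delays = {}
    for c in calls:
        own = (c.t_return - c.t_call) - child_time[c]
        delays[c.callee] = delays.get(c.callee, 0.0) + own
    return delays


# Hypothetical trace: a client calls a web tier, which calls auth and db.
trace = [
    Call("client", "web", 0.0, 10.0),
    Call("web", "auth", 1.0, 3.0),
    Call("web", "db", 4.0, 9.0),
]
delays = node_delays(trace)
print(delays)
```

For this trace the sketch attributes 3.0 time units to `web`, 2.0 to `auth`, and 5.0 to `db`, which points at `db` as the dominant contributor on the path. With concurrent requests several enclosing calls can be equally plausible parents, so a practical tool would need to weigh the candidates across many observed calls instead of picking one.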
