Titian - Concepedia

TLDR

Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is difficult and time-consuming because current systems provide little tooling, forcing programmers to spend many hours collecting evidence from logs and debugging by trial and error. The authors built Titian, a library that tracks data provenance through transformations in Apache Spark. Titian is integrated into Spark and supports data provenance at interactive speeds, with lineage-capture overhead that rarely exceeds 30% above baseline job execution time. Users can quickly identify the input data behind a bug or outlier result.

Abstract

Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. Today's DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance (tracking data through transformations) in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds, orders of magnitude faster than alternative solutions, while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.
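
To make the idea of data provenance concrete, the following is a minimal sketch in plain Python. It is not Titian's API or implementation, and it does not use Spark; the input records, the `parse` helper, and the `trace_back` function are all hypothetical names chosen for illustration. Each record carries the set of input line ids it was derived from, so an outlier in the output can be traced back to the inputs that produced it.

```python
from collections import defaultdict

# Hypothetical toy input: (line_id, text) pairs standing in for an input file.
lines = [
    (0, "sensor_a 21"),
    (1, "sensor_b 19"),
    (2, "sensor_a 9000"),  # suspicious reading
    (3, "sensor_b 20"),
]


def parse(text):
    """Split a line into (sensor name, integer reading)."""
    name, value = text.split()
    return name, int(value)


# Map-like stage: each output record keeps the set of input ids it came from.
parsed = [(parse(text), {line_id}) for line_id, text in lines]

# Reduce-by-key-like stage: sum readings per sensor and union the id sets.
totals = defaultdict(int)
lineage = defaultdict(set)
for (name, value), ids in parsed:
    totals[name] += value
    lineage[name] |= ids


def trace_back(key):
    """Return the input records that contributed to one output key."""
    return [(i, t) for i, t in lines if i in lineage[key]]


# Pick the largest total as the outlier and trace it back to its inputs.
outlier = max(totals, key=totals.get)
culprits = trace_back(outlier)
print(outlier, totals[outlier], culprits)
```

Carrying id sets alongside every record is the simplest possible design and would not scale to large jobs; it is used here only to show what a backward trace from an output to its inputs looks like.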
