A survey of data provenance in e-science

TLDR

Data management is increasingly complex due to large‑scale, loosely coupled grid applications and abundant storage, making metadata essential for disambiguating and reusing data products. This paper develops a taxonomy of data provenance characteristics and applies it to e‑science research, especially scientific workflow approaches. The taxonomy classifies provenance systems by their motivation, content, representation, storage, and dissemination methods. The survey identifies several open research problems in data provenance for e‑science.

Abstract

Data management is growing in complexity as large-scale applications take advantage of the loosely coupled resources brought together by grid middleware and by abundant storage capacity. Metadata describing the data products used in and generated by these applications is essential to disambiguate the data and enable reuse. Data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources. In this paper we create a taxonomy of data provenance characteristics and apply it to current research efforts in e-science, focusing primarily on scientific workflow approaches. The main aspect of our taxonomy categorizes provenance systems based on why they record provenance, what they describe, how they represent and store provenance, and how they disseminate it. The survey culminates with an identification of open research problems in the field.
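
As an illustration only, the five taxonomy dimensions named above (motivation, content, representation, storage, dissemination) could be modeled as a small record type, so that systems can be compared along one dimension at a time. The sketch below is not taken from the paper: the class name `ProvenanceProfile`, the helper functions, the system names, and every attribute value are hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ProvenanceProfile:
    """One provenance system described along the five taxonomy dimensions.

    Hypothetical structure; only the dimension names come from the abstract.
    """
    system: str          # name of the system being classified (placeholder)
    motivation: str      # why provenance is recorded
    content: str         # what the provenance describes
    representation: str  # how provenance is represented
    storage: str         # how provenance is stored
    dissemination: str   # how provenance is made available to users


def taxonomy_dimensions():
    """Return the dimension names, i.e. every field except the system name."""
    return [f.name for f in fields(ProvenanceProfile) if f.name != "system"]


def group_by(profiles, dimension):
    """Group system names by their value along a single taxonomy dimension."""
    if dimension not in taxonomy_dimensions():
        raise ValueError(f"unknown taxonomy dimension: {dimension}")
    groups = {}
    for profile in profiles:
        groups.setdefault(getattr(profile, dimension), []).append(profile.system)
    return groups


# Made-up entries for illustration; these values are NOT results of the survey.
example = [
    ProvenanceProfile("system_a", "reuse", "workflow runs",
                      "annotations", "relational database", "query interface"),
    ProvenanceProfile("system_b", "audit", "data products",
                      "annotations", "relational database", "graph browser"),
    ProvenanceProfile("system_c", "reuse", "data products",
                      "inversion", "file annotations", "query interface"),
]

if __name__ == "__main__":
    print(group_by(example, "storage"))
```

Grouping along a single dimension mirrors how a survey table might compare systems column by column. A real classification would likely need richer values than one string per dimension.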
