Analysis of Workload Behavior in Scientific and Historical Long-Term Data Repositories

TLDR

Archival systems are expanding beyond cheap tertiary storage to accommodate growing digital scientific, medical, and personal history data, yet designers often rely on outdated or unrelated workload models. The study aims to fill this knowledge gap by analyzing workload behavior across diverse archival systems to inform long‑term storage design. The authors examined a mix of scientific and historical archives with varying purposes, media types, and public or private access models. Private scientific archives show larger files but largely unchanged update rates, whereas public content archives exhibit frequent, sometimes unnecessary modifications and heavy access by indexing services and internal data management processes, highlighting opportunities to improve archival storage efficiency and performance.

Abstract

The scope of archival systems is expanding beyond cheap tertiary storage: scientific and medical data is increasingly digital, and the public has a growing desire to digitally record their personal histories. Driven by the increase in cost efficiency of hard drives, and the rise of the Internet, content archives have become a means of providing the public with fast, cheap access to long-term data. Unfortunately, designers of purpose-built archival systems are either forced to rely on workload behavior obtained from a narrow, anachronistic view of archives as simply cheap tertiary storage, or extrapolate from marginally related enterprise workload data and traditional library access patterns. To close this knowledge gap and provide relevant input for the design of effective long-term data storage systems, we studied the workload behavior of several systems within this expanded archival storage space. Our study examined several scientific and historical archives, covering a mixture of purposes, media types, and access models (that is, public versus private). Our findings show that, for more traditional private scientific archival storage, files have become larger, but update rates have remained largely unchanged. However, in the public content archives we observed, we saw behavior that diverges from the traditional “write-once, read-maybe” behavior of tertiary storage. Our study shows that the majority of such data is modified relatively frequently, sometimes unnecessarily, and that indexing services such as Google and internal data management processes may routinely access large portions of an archive, accounting for most of the accesses. Based on these observations, we identify areas for improving the efficiency and performance of archival storage systems.
