Concepedia

Publication | Closed Access

Infrastructure for supporting exploration and discovery in web archives

23

Citations

17

References

2014

Year

Abstract

Web archiving initiatives around the world capture ephemeral web content to preserve our collective digital memory. However, unlocking the potential of web archives requires tools that support exploration and discovery of captured content. These tools need to be scalable and responsive, and to this end we believe that modern "big data" infrastructure can provide a solid foundation. We present Warcbase, an open-source platform for managing web archives built on the distributed datastore HBase. Our system provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing. Relying on HBase for storage infrastructure simplifies the development of scalable and responsive applications. We describe a service that provides temporal browsing and an interactive visualization based on topic models that allows users to explore archived content.

References

YearCitations

Page 1