Publication | Closed Access
Infrastructure for supporting exploration and discovery in web archives
Citations: 23
References: 17
Year: 2014
Unknown Venue
Personal Digital Archiving, Engineering, Information Retrieval, Data Science, Database Support, Archiving, Web Information System, Web Archives, Data Integration, Present Warcbase, Web Science, Semantic Web, Distributed Data Store, Data Management, Digital Archive, Big Data
Web archiving initiatives around the world capture ephemeral web content to preserve our collective digital memory. However, unlocking the potential of web archives requires tools that support exploration and discovery of captured content. These tools need to be scalable and responsive, and to this end we believe that modern "big data" infrastructure can provide a solid foundation. We present Warcbase, an open-source platform for managing web archives built on the distributed datastore HBase. Our system provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing. Relying on HBase for storage infrastructure simplifies the development of scalable and responsive applications. We describe a service that provides temporal browsing and an interactive visualization based on topic models that allows users to explore archived content.
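The temporal browsing idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe Warcbase's actual schema or API, so everything here is an assumption made for illustration: the `CaptureStore` class, its method names, the URL-keyed layout, and the 14-digit integer timestamps are all hypothetical, and a plain in-memory dictionary stands in for HBase.

```python
from bisect import bisect_right
from collections import defaultdict


class CaptureStore:
    """Hypothetical in-memory stand-in for a URL-keyed, timestamp-versioned store."""

    def __init__(self):
        # url -> list of (timestamp, content) pairs, kept sorted by timestamp
        self._rows = defaultdict(list)

    def put(self, url, timestamp, content):
        """Record one capture of `url` taken at `timestamp`."""
        row = self._rows[url]
        row.append((timestamp, content))
        row.sort(key=lambda pair: pair[0])

    def timestamps(self, url):
        """List all capture times for `url`, oldest first."""
        return [ts for ts, _ in self._rows.get(url, [])]

    def get_at(self, url, timestamp):
        """Return the latest capture at or before `timestamp`, or None."""
        row = self._rows.get(url, [])
        idx = bisect_right([ts for ts, _ in row], timestamp)
        return row[idx - 1] if idx else None


store = CaptureStore()
store.put("http://example.org/", 20130101000000, "<html>v1</html>")
store.put("http://example.org/", 20140101000000, "<html>v2</html>")

# Browse the page as it looked in mid-2013: the 2013 capture is the closest earlier one.
capture = store.get_at("http://example.org/", 20130615000000)
```

Keeping each URL's captures sorted by time lets a lookup for "the page as of date T" reduce to a binary search. A distributed store with versioned cells could serve the same kind of query, but that mapping is not confirmed by the abstract.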