Publication | Closed Access
Efficient document analytics on compressed data
Citations: 45 | References: 29 | Year: 2018
Cluster Computing, Engineering, Map-reduce, Document Volumes, Text Mining, Information Retrieval, Data Science, Data Mining, Data Integration, Parallel Computing, Data Management, Lossless Compression, Efficient Document Analytics, High-performance Data Analytics, Knowledge Discovery, Compression Algorithm, Computer Science, Data-intensive Computing, Data Indexing, Document Analytics, Parallel Programming, Search Engine Indexing, Massive Data Processing, Big Data
Today's rapidly growing document volumes pose pressing challenges to modern document analytics, in both space usage and processing time. In this work, we propose the concept of compression-based direct processing to alleviate issues in both dimensions. The main idea is to enable document analytics directly on compressed data, without decompressing it first. We show how the concept can be materialized on Sequitur, a compression algorithm that produces hierarchical grammar-like representations. We discuss the major challenges in applying the idea to various document analytics tasks, and present a set of guidelines and supporting software modules that help developers apply compression-based direct processing effectively. Experiments show that our proposed techniques save 90.8% storage space and 77.5% memory usage, while speeding up data processing significantly, i.e., by 1.6X on sequential systems and 2.2X on distributed clusters, on average.
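To make the main idea concrete, the following is a minimal sketch of word counting performed directly on a hierarchical, grammar-like representation of the kind Sequitur produces. It is an illustration under stated assumptions, not the implementation described in the paper: the toy grammar `GRAMMAR`, the rule names `R0` to `R2`, and the helpers `word_counts` and `expand` are all hypothetical, and the grammar is written by hand rather than generated by Sequitur. The point it shows is that each rule is analyzed once and its result reused wherever the rule is referenced, so repeated content is never re-scanned.

```python
from collections import Counter

# Hypothetical toy grammar (hand-written, not actual Sequitur output).
# Each rule maps to a sequence of symbols; a symbol that is a key of the
# grammar is a reference to another rule, anything else is a plain word.
GRAMMAR = {
    "R0": ["R1", "R2", "R1", "end"],
    "R1": ["R2", "is", "R2"],
    "R2": ["a", "rose"],
}


def word_counts(grammar, root="R0"):
    """Count words without expanding the grammar into the full text."""
    memo = {}  # rule name -> Counter of the words that rule expands to

    def count(rule):
        if rule in memo:
            return memo[rule]  # reuse: the rule body is processed only once
        total = Counter()
        for symbol in grammar[rule]:
            if symbol in grammar:
                total.update(count(symbol))  # add the sub-rule's counts
            else:
                total[symbol] += 1  # terminal word
        memo[rule] = total
        return total

    return count(root)


def expand(grammar, root="R0"):
    """Baseline for comparison: fully decompress the grammar into words."""
    words = []
    for symbol in grammar[root]:
        if symbol in grammar:
            words.extend(expand(grammar, symbol))
        else:
            words.append(symbol)
    return words


if __name__ == "__main__":
    print(word_counts(GRAMMAR))
    print(" ".join(expand(GRAMMAR)))
```

In this sketch the counts computed on the compressed form match those obtained by decompressing first and counting the resulting words, while the work done is proportional to the size of the grammar rather than the size of the original text.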