Publication | Closed Access
Low-Latency Analytics on Colossal Data Streams with SummaryStore
Citations: 34
References: 44
Year: 2017
Unknown Venue
Keywords: Cluster Computing, Engineering, Machine Learning, Data Science, Streaming Engine, Colossal Data Streams, Management, Approximate Time-series Store, Data Integration, Streaming Algorithm, Computer Science, Data Stream Management, Data Streaming Architecture, Parallel Computing, Time-decayed Summaries, Streaming Data, Data Management, Big Data
SummaryStore is an approximate time-series store, designed for analytics, capable of storing large volumes of time-series data (~1 petabyte) on a single node; it preserves high degrees of query accuracy and enables near real-time querying at unprecedented cost savings. SummaryStore contributes time-decayed summaries, a novel abstraction for summarizing data streams, along with an ingest algorithm to continually merge the summaries for efficient range queries; in conjunction, it returns reliable error estimates alongside the approximate answers, supporting a range of machine learning and analytical workloads. We successfully evaluated SummaryStore using real-world applications for forecasting, outlier detection, and Internet traffic monitoring; it can summarize aggressively with low median errors, 0.1 to 10%, for different workloads. Under range-query microbenchmarks, it stored 1 PB of synthetic stream data (1024 1 TB streams) on a single node, using roughly 10 TB (100x compaction) with 95th-percentile error below 5% and median cold-cache query latency of 1.3 s (worst-case latency under 70 s).
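To make the idea of time-decayed summaries and continual merging concrete, the following is a minimal Python sketch. It is not SummaryStore's implementation. The abstract does not specify the decay function, the summary types, or the merge rule, so everything here is an assumption made for illustration: the hypothetical names `Window` and `DecayedStream`, the power-of-two merge policy (borrowed from exponential histograms), the use of only count and sum as per-window summaries, and the uniform-spread assumption used to answer range queries over evenly spaced integer timestamps. The sketch also omits the error estimates that the abstract says accompany each answer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Window:
    """Summary of a contiguous run of stream elements (hypothetical type)."""
    start: int    # timestamp of the first element covered (inclusive)
    end: int      # timestamp of the last element covered (inclusive)
    count: int    # number of elements summarized
    total: float  # sum of the summarized values


class DecayedStream:
    """Keeps fine windows for recent data and coarser ones for older data.

    Assumed merge rule: at most `max_per_size` windows of each element
    count; when exceeded, the two oldest windows of that count are merged.
    """

    def __init__(self, max_per_size: int = 2) -> None:
        self.max_per_size = max_per_size
        self.windows: List[Window] = []  # ordered oldest first

    def append(self, ts: int, value: float) -> None:
        # Every new element starts in its own single-element window.
        self.windows.append(Window(ts, ts, 1, value))
        self._compact()

    def _compact(self) -> None:
        size = 1
        while True:
            same = [i for i, w in enumerate(self.windows) if w.count == size]
            if len(same) <= self.max_per_size:
                return
            # Equal-count windows are adjacent, so merge the two oldest.
            a, b = same[0], same[1]
            wa, wb = self.windows[a], self.windows[b]
            self.windows[a] = Window(wa.start, wb.end,
                                     wa.count + wb.count,
                                     wa.total + wb.total)
            del self.windows[b]
            size *= 2  # the merge may overflow the next size class

    def query_sum(self, t0: int, t1: int) -> float:
        """Estimate the sum of values with timestamps in [t0, t1].

        Assumes values are spread uniformly inside each window, so a
        partially overlapped window contributes proportionally.
        """
        estimate = 0.0
        for w in self.windows:
            lo, hi = max(w.start, t0), min(w.end, t1)
            if lo > hi:
                continue  # no overlap with this window
            span = w.end - w.start + 1
            estimate += w.total * (hi - lo + 1) / span
        return estimate
```

Under these assumptions, a query over recent timestamps touches single-element windows and is exact, while a query that partially overlaps an old, merged window is only approximate. This is the accuracy-for-space trade that decayed summarization makes. The number of windows grows roughly logarithmically with stream length, which is the kind of compaction the abstract's figures reflect (1024 streams of 1 TB each is 1 PB, stored in roughly 10 TB, or about 100x).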