Understanding query performance in Accumulo

Abstract

Open-source, BigTable-like distributed databases provide a scalable storage solution for data-intensive applications. The simple key-value storage schema provides fast record ingest and retrieval, nearly independent of the quantity of data stored. However, real applications must support non-trivial queries that require careful key design and value indexing. We study an Apache Accumulo-based big data system designed for a network situational awareness application. We analyze the application's storage schema and data retrieval requirements, then characterize the corresponding Accumulo performance bottlenecks. Queries are shown to be communication-bound in some situations and server-bound in others. Inefficiencies in the open-source communication stack and filesystem limit network and I/O performance, respectively. Additionally, in some situations, parallel clients can contend for server-side resources. Maximizing data retrieval rates for practical queries requires effective key design, indexing, and client parallelization.
