A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces

TLDR

Similarity search in high‑dimensional vector spaces relies mainly on data‑space partitioning, yet its performance degrades with dimensionality, and quantitative analysis of this dimensional curse is scarce. The study analyzes partitioning and clustering techniques for HDVS similarity search and proposes an approximation‑based organization to accelerate the inevitable sequential scan. We introduce the VA‑file vector approximation scheme, evaluate it against R*-tree and X-tree indices, and describe the experimental setup for these comparisons. Our results show that partitioning methods exhibit linear complexity at high dimensionality and are outperformed by a simple scan beyond about ten dimensions, while the VA‑file scheme outperforms the tree‑based indices in practice.

Abstract

For similarity search in high-dimensional vector spaces (or ‘HDVSs’), researchers have proposed a number of new methods (or adaptations of existing methods) based, in the main, on data-space partitioning. However, the performance of these methods generally degrades as dimensionality increases. Although this phenomenon-known as the ‘dimensional curse’-is well known, little or no quantitative a.nalysis of the phenomenon is available. In this paper, we provide a detailed analysis of partitioning and clustering techniques for similarity search in HDVSs. We show formally that these methods exhibit linear complexity at high dimensionality, and that existing methods are outperformed on average by a simple sequential scan if the number of dimensions exceeds around 10. Consequently, we come up with an alternative organization based on approximations to make the unavoidable sequential scan as fast as possible. We describe a simple vector approximation scheme, called VA-file, and report on an experimental evaluation of this and of two tree-based index methods (an R*-tree and an X-tree).

References

Page 1

	Year	Citations

Page 1