An Approximation- Based Data Structure for Similarity Search

Abstract

Many similarity measures for multimedia retrieval, decision support, and data mining are based on underlying vector spaces of high dimensionality. Data-partitioning index methods for such spaces (e.g. grid-files, quad-trees, R-trees, X-trees, etc.) generally work well for low-dimensional spaces, but perform poorly as dimensionality increases---a phenomenon which has become known as the `dimensional curse&apos;. In this paper, we first provide an analysis of the nearest-neighbor search problem in highdimensional vector spaces. Under the assumptions of uniformity and independence, we establish bounds on the average performance of three important classes of data-partitioning techniques. We then introduce the vector-approximation file (VA-File), a method which overcomes the difficulties of high dimensionality by following not the data-partitioning approach of conventional index methods, but rather a filter-based approach. A VA-File contains a compact, geometric approximation for each vector. By...