On the "dimensionality curse" and the "self-similarity blessing"

TLDR

Spatial queries in high‑dimensional spaces, especially nearest‑neighbor searches, have long been considered infeasible due to the curse of dimensionality, yet this view assumes uniform, independent data, whereas real datasets are skewed and exhibit much lower intrinsic dimensionality. We demonstrate that for R‑tree‑like structures, search performance depends on the data’s intrinsic dimensionality, and that Hausdorff and correlation fractal dimensions provide accurate I/O predictions within one standard deviation across real and synthetic datasets.

Abstract

Spatial queries in high-dimensional spaces have been studied extensively. Among them, nearest neighbor queries are important in many settings, including spatial databases (Find the k closest cities) and multimedia databases (Find the k most similar images). Previous analyses have concluded that nearest-neighbor search is hopeless in high dimensions due to the notorious "curse of dimensionality". We show that this may be overpessimistic. We show that what determines the search performance (at least for R-tree-like structures) is the intrinsic dimensionality of the data set and not the dimensionality of the address space (referred to as the embedding dimensionality). The typical (and often implicit) assumption in many previous studies is that the data is uniformly distributed, with independence between attributes. However, real data sets overwhelmingly disobey these assumptions; rather, they typically are skewed and exhibit intrinsic ("fractal") dimensionalities that are much lower than their embedding dimension, e.g. due to subtle dependencies between attributes. We show how the Hausdorff and Correlation fractal dimensions of a data set can yield extremely accurate formulas that can predict the I/O performance to within one standard deviation on multiple real and synthetic data sets.

References

Page 1

	Year	Citations

Page 1