Substructure similarity search in graph databases

TLDR

Advanced database systems must efficiently search massive, complex structural data—especially in bioinformatics and chem‑informatics—because exact matching is too restrictive, making similarity search a vital operation. The paper investigates substructure similarity search in graph databases, aiming to design an effective feature‑set selection strategy that improves filtering and search performance, and shows that Grafil can be extended to other complex structures. Grafil transforms the edge relaxation ratio into a maximum allowed missing feature threshold and employs a multi‑filter composition strategy that uses complementary feature subsets to filter graphs without pairwise similarity computations. The study shows that both too few and too many features degrade filtering performance, but selecting feature sets with similar size and selectivity significantly improves filtering and search performance.

Abstract

Advanced database systems face a great challenge raised by the emergence of massive, complex structural data in bioinformatics, chem-informatics, and many other applications. The most fundamental support needed in these applications is the efficient search of complex structured data. Since exact matching is often too restrictive, similarity search of complex structures becomes a vital operation that must be supported efficiently.In this paper, we investigate the issues of substructure similarity search using indexed features in graph databases. By transforming the edge relaxation ratio of a query graph into the maximum allowed missing features, our structural filtering algorithm, called Grafil, can filter many graphs without performing pairwise similarity computations. It is further shown that using either too few or too many features can result in poor filtering performance. Thus the challenge is to design an effective feature set selection strategy for filtering. By examining the effect of different feature selection mechanisms, we develop a multi-filter composition strategy, where each filter uses a distinct and complementary subset of the features. We identify the criteria to form effective feature sets for filtering, and demonstrate that combining features with similar size and selectivity can improve the filtering and search performance significantly. Moreover, the concept presented in Grafil can be applied to searching approximate non-consecutive sequences, trees, and other complicated structures as well.

References

Page 1

	Year	Citations

Page 1