Concepedia

Abstract

Similarity join finds all pairs of objects (i, j) with similarity score s(i, j) greater than some specified threshold θ. This is a fundamental query problem in the database research community, and is used in many practical applications, such as duplicate detection, merge/purge, record linkage, object matching, and reference conciliation. In this paper, we propose a scalable approximation algorithm with an arbitrary accuracy for the similarity join problem with the SimRank similarity measure. The algorithm consists of two phases: filter and verification. The filter phase enumerates similar pair candidates, and the similarity of each candidate is then assessed in the verification phase. The scalability of the proposed algorithm is experimentally verified for large real networks. The complexity depends only on the number of similar pairs, but does not depend on all pairs O(n <sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sup> ). The proposed algorithm scales up to the network of 5M vertices and 70M edges. By comparing the state-of-the-art algorithms, it is about 10 times faster and it requires about 10 times smaller memory.

References

YearCitations

2003

6.4K

1973

5K

1969

2.4K

2002

1.9K

2002

1.7K

2003

1.2K

1995

801

2002

639

2005

522

2003

431

Page 1