Scalable SimRank join algorithm

Abstract

Similarity join finds all pairs of objects (i, j) whose similarity score s(i, j) exceeds a specified threshold θ. It is a fundamental query problem in database research and is used in many practical applications, such as duplicate detection, merge/purge, record linkage, object matching, and reference reconciliation. In this paper, we propose a scalable approximation algorithm with arbitrary accuracy for the similarity join problem under the SimRank similarity measure. The algorithm consists of two phases: filter and verification. The filter phase enumerates candidate similar pairs, and the verification phase then assesses the similarity of each candidate. The scalability of the proposed algorithm is verified experimentally on large real networks. The complexity depends only on the number of similar pairs, not on the number of all pairs, O(n²). The proposed algorithm scales up to a network of 5M vertices and 70M edges. Compared with state-of-the-art algorithms, it is about 10 times faster and requires about one-tenth of the memory.
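
To make the filter-and-verification structure concrete, here is a minimal Python sketch of a threshold join over SimRank. It is an illustration under stated assumptions, not the algorithm proposed in the paper: all function names are hypothetical, the decay factor 0.6 and the step and sample counts are arbitrary choices, the filter is a naive reachability check that can still be quadratic in the worst case, and verification uses the standard random-walk view of SimRank (the expected value of decay^τ, where τ is the first step at which two backward random walks meet), truncated at max_steps.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_in_neighbors(edges):
    """Map every vertex to the list of its in-neighbors."""
    in_nbrs = defaultdict(list)
    for u, v in edges:
        in_nbrs[v].append(u)
        in_nbrs.setdefault(u, [])
    return dict(in_nbrs)


def filter_candidates(in_nbrs, max_steps):
    """Filter phase (toy version, an assumption of this sketch).

    Keeps a pair only if backward walks of equal length <= max_steps
    from both vertices can reach a common vertex, which is necessary
    for the truncated SimRank score to be positive.
    """
    candidates = set()
    # frontier[v] = vertices reachable from v in exactly t backward steps
    frontier = {v: {v} for v in in_nbrs}
    for _ in range(max_steps):
        frontier = {
            v: {u for w in reach for u in in_nbrs[w]}
            for v, reach in frontier.items()
        }
        sources = defaultdict(set)
        for v, reach in frontier.items():
            for w in reach:
                sources[w].add(v)
        for group in sources.values():
            candidates.update(combinations(sorted(group), 2))
    return candidates


def estimate_simrank(in_nbrs, i, j, decay, max_steps, samples, rng):
    """Verification phase: Monte Carlo estimate of s(i, j) for i != j."""
    total = 0.0
    for _ in range(samples):
        a, b = i, j
        for t in range(1, max_steps + 1):
            if not in_nbrs[a] or not in_nbrs[b]:
                break  # a walk is stuck, so this sample never meets
            a = rng.choice(in_nbrs[a])
            b = rng.choice(in_nbrs[b])
            if a == b:
                total += decay ** t  # first meeting at step t
                break
    return total / samples


def simrank_join(edges, threshold, decay=0.6, max_steps=5, samples=200, seed=0):
    """Return {(i, j): estimated score} for pairs scoring above threshold."""
    in_nbrs = build_in_neighbors(edges)
    rng = random.Random(seed)
    result = {}
    for i, j in sorted(filter_candidates(in_nbrs, max_steps)):
        score = estimate_simrank(in_nbrs, i, j, decay, max_steps, samples, rng)
        if score > threshold:
            result[(i, j)] = score
    return result
```

For example, on the edge list [("a", "b"), ("a", "c"), ("d", "e")], vertices b and c share their only in-neighbor a, so the filter keeps just the pair (b, c), and every sampled pair of walks meets at step 1. The point of the split is that the comparatively expensive estimate runs only on pairs that survive the filter.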
