Publication | Closed Access
Efficient set joins on similarity predicates
326
Citations
23
References
2004
Year
Unknown Venue
EngineeringIntersect SizeCorpus LinguisticsText MiningNatural Language ProcessingString-searching AlgorithmInformation RetrievalData ScienceData MiningGeneral AlgorithmString ProcessingComputational LinguisticsKnowledge DiscoveryComputer ScienceDatabase TheorySet JoinsQuery OptimizationAutomated ReasoningCombinatorial Pattern MatchingSimilarity SearchSemantic SimilarityEfficient Set Joins
In this paper we present an efficient, scalable and general algorithm for performing set joins on predicates involving various similarity measures like intersect size, Jaccard-coefficient, cosine similarity, and edit-distance. This expands the existing suite of algorithms for set joins on simpler predicates such as, set containment, equality and non-zero overlap. We start with a basic inverted index based probing method and add a sequence of optimizations that result in one to two orders of magnitude improvement in running time. The algorithm folds in a data partitioning strategy that can work efficiently with an index compressed to fit in any available amount of main memory. The optimizations used in our algorithm generalize to several weighted and unweighted measures of partial word overlap between sets.
| Year | Citations | |
|---|---|---|
Page 1
Page 1