Publication | Closed Access
Efficient Similarity Join and Search on Multi-Attribute Data
35
Citations
29
References
2015
Year
Unknown Venue
EngineeringSimilarity MeasurePrefix Tree IndexSemantic WebText MiningString-searching AlgorithmInformation RetrievalData ScienceData MiningString ProcessingManagementData IntegrationData ManagementEfficient Similarity JoinPrefix TreeKnowledge DiscoveryComputer ScienceBig Data SearchQuery OptimizationSimilarity JoinSimilarity SearchBig Data
In this paper we study similarity join and search on multi- attribute data. Traditional methods on single-attribute data have pruning power only on single attributes and cannot efficiently support multi-attribute data. To address this problem, we propose a prefix tree index which has holis- tic pruning ability on multiple attributes. We propose a cost model to quantify the prefix tree which can guide the prefix tree construction. Based on the prefix tree, we devise a filter-verification framework to support similarity search and join on multi-attribute data. The filter step prunes a large number of dissimilar results and identifies some candi- dates using the prefix tree and the verification step verifies the candidates to generate the final answer. For similar- ity join, we prove that constructing an optimal prefix tree is NP-complete and develop a greedy algorithm to achieve high performance. For similarity search, since one prefix tree cannot support all possible search queries, we extend the cost model to support similarity search and devise a budget-based algorithm to construct multiple high-quality prefix trees. We also devise a hybrid verification algorithm to improve the verification step. Experimental results show our method significantly outperforms baseline approaches.
| Year | Citations | |
|---|---|---|
Page 1
Page 1