Publication | Closed Access
A Primitive Operator for Similarity Joins in Data Cleaning
Citations: 552
References: 13
Year: 2006
Venue: Unknown
Engineering, Similarity Measure, Primitive Operator, Semantic Web, Data Cleaning, Text Mining, Natural Language Processing, Information Retrieval, Data Science, Data Mining, String Processing, Data Integration, Data Management, Textual Similarity, Knowledge Discovery, Computer Science, Data Cleansing, Database Theory, New Primitive Operator, Similarity Search, Semantic Similarity
Data cleaning via similarity joins identifies close tuples using diverse similarity functions, but existing efficient implementations are tightly coupled to each specific function. The paper proposes a new primitive operator that serves as a foundation for implementing similarity joins across a wide range of string similarity functions and beyond textual similarity. Efficient implementations of this operator are then developed to support these diverse similarity joins. Experiments on real datasets show that similarity joins built with the operator are comparable to, and often substantially better than, prior customized implementations for specific similarity functions.
Data cleaning based on similarities involves identification of "close" tuples, where closeness is evaluated using a variety of similarity functions chosen to suit the domain and application. Current approaches for efficiently implementing such similarity joins are tightly tied to the chosen similarity function. In this paper, we propose a new primitive operator which can be used as a foundation to implement similarity joins according to a variety of popular string similarity functions, and notions of similarity which go beyond textual similarity. We then propose efficient implementations for this operator. In an experimental evaluation using real datasets, we show that the implementation of similarity joins using our operator is comparable to, and often substantially better than, previous customized implementations for particular similarity functions.
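The abstract does not spell out how the operator is defined, so the following is only an illustrative sketch of the general idea: one primitive that several similarity functions can be layered on. It assumes the primitive is a join on set overlap between token sets, and that strings are tokenized into q-grams. The names `qgrams`, `overlap_join`, and `jaccard_join` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """Tokenize a string into its set of q-grams (assumed tokenizer)."""
    s = s.lower()
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def overlap_join(left, right, min_overlap):
    """Hypothetical primitive: return (i, j, overlap) for every pair of
    token sets left[i], right[j] sharing at least min_overlap tokens."""
    # Inverted index from token to the right-side sets that contain it.
    index = defaultdict(list)
    for j, tokens in enumerate(right):
        for t in tokens:
            index[t].append(j)

    results = []
    for i, tokens in enumerate(left):
        # Count shared tokens per candidate without comparing all pairs.
        counts = defaultdict(int)
        for t in tokens:
            for j in index[t]:
                counts[j] += 1
        for j, c in sorted(counts.items()):
            if c >= min_overlap:
                results.append((i, j, c))
    return results


def jaccard_join(left_strings, right_strings, threshold, q=2):
    """Jaccard similarity join layered on the overlap primitive."""
    left = [qgrams(s, q) for s in left_strings]
    right = [qgrams(s, q) for s in right_strings]
    matches = []
    # Any pair with nonzero Jaccard similarity shares at least one token.
    for i, j, c in overlap_join(left, right, 1):
        union = len(left[i]) + len(right[j]) - c
        if c / union >= threshold:
            matches.append((left_strings[i], right_strings[j]))
    return matches


pairs = jaccard_join(["microsoft corp"], ["microsoft corp.", "oracle"], 0.8)
print(pairs)
```

In this sketch only `jaccard_join` knows about Jaccard similarity; a different function could reuse `overlap_join` unchanged by deriving its own overlap bound and post-filter, which is the kind of decoupling the abstract describes.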