Robust and efficient fuzzy match for online data cleaning

TLDR

High-quality data warehouses must validate and cleanse incoming tuples, which often requires fuzzy matching against reference tables when no exact match exists. This paper proposes a new similarity function that overcomes limitations of commonly used similarity functions, and develops an efficient fuzzy match algorithm built on it. Experiments on real datasets demonstrate the effectiveness of both techniques.

Abstract

To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function that overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.
