The merge/purge problem for large databases

TLDR

Commercial organizations routinely collect large databases for marketing and analysis. Correlating individuals who appear across these sources in inconsistent and often incorrect form is the merge/purge problem. This paper studies how to merge data from multiple sources as efficiently as possible while maximizing accuracy. It details the sorted neighborhood method, presents an alternative method based on clustering together with a comparative evaluation, and describes a multi-pass approach that computes the transitive closure over independent runs, each using a different key. Experiments show that the sorted neighborhood method may work well in practice but at great expense, and that the multi-pass approach improves accuracy.

Abstract

Many commercial organizations routinely gather large numbers of databases for various marketing and business analysis functions. The task is to correlate information from different databases by identifying distinct individuals who appear in a number of different databases, typically in an inconsistent and often incorrect fashion. The problem we study here is the task of merging data from multiple sources in as efficient a manner as possible, while maximizing the accuracy of the result. We call this the merge/purge problem. In this paper we detail the sorted neighborhood method that is used by some to solve merge/purge, and we present experimental results demonstrating that this approach may work well in practice, but at great expense. An alternative method based upon clustering is also presented, with a comparative evaluation against the sorted neighborhood method. Finally, we show a means of improving the accuracy of the results based upon a multi-pass approach that succeeds by computing the transitive closure over the results of independent runs, each of which considers an alternative primary key attribute.
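
The multi-pass idea in the abstract can be illustrated with a minimal Python sketch. It assumes the usual description of the sorted neighborhood method: sort the records on a key, then compare each record only with the few records next to it in sorted order (a fixed-size window). All names here (`sorted_neighborhood_pass`, `transitive_closure`, `multi_pass`, `toy_match`), the sample records, and the matching rule are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
def sorted_neighborhood_pass(records, key_fn, window, is_match):
    """One pass: sort by key, compare each record with the
    (window - 1) records that precede it in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key_fn(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def transitive_closure(n, pairs):
    """Group record indices connected by matched pairs (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def multi_pass(records, key_fns, window, is_match):
    """Run one independent pass per key, then close over all matched pairs."""
    pairs = set()
    for key_fn in key_fns:
        pairs |= sorted_neighborhood_pass(records, key_fn, window, is_match)
    return transitive_closure(len(records), pairs)


def toy_match(a, b):
    # Assumed toy rule: same zip code and either the same first or last name.
    return a["zip"] == b["zip"] and (
        a["first"] == b["first"] or a["last"] == b["last"]
    )


# Hypothetical sample data: records 0, 1 and 3 describe the same person.
records = [
    {"last": "Smith", "first": "John", "zip": "10027"},
    {"last": "Smyth", "first": "John", "zip": "10027"},
    {"last": "Jones", "first": "Mary", "zip": "10001"},
    {"last": "Smith", "first": "Jon", "zip": "10027"},
]

by_last = lambda r: r["last"]
by_first = lambda r: r["first"]

single = transitive_closure(
    len(records), sorted_neighborhood_pass(records, by_last, 2, toy_match)
)
combined = multi_pass(records, [by_last, by_first], 2, toy_match)
print(single)    # [[0, 3], [1], [2]]
print(combined)  # [[0, 1, 3], [2]]
```

In this toy run, records 1 and 3 are never matched directly. The pass sorted on last name links 0 and 3, the pass sorted on first name links 0 and 1, and the closure joins all three, which a single small-window pass on either key would miss.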
