Dynamic Record Blocking: Efficient Linking of Massive Databases in MapReduce

Abstract

Record linkage is the task of identifying which records in a database refer to the same entity. A standard machine learning approach to this problem is to train a model that assigns scores to pairs of records, where pairs scoring above a threshold are said to represent the same entity. However, it is too expensive to make pairwise comparisons among all records in large databases. "Blocking" is the process of grouping similar-seeming records into blocks that a machine learning component then explores exhaustively. In many blocking approaches, records are grouped together into blocks by shared properties that are indicators of duplication. However, when dealing with very large data sources, it is nearly impossible to determine any fixed set of properties at training time that will be optimal for the Zipfian distribution of values for these properties that we will encounter at run time. In this paper, we propose a novel Dynamic Blocking algorithm which automatically chooses the blocking properties in a data-driven way at execution time to efficiently determine which pairs of records in a data set should be examined as potential duplicates without creating the same pair across blocks. We demonstrate the viability of this algorithm for large data sets. We have scaled this system up to work on billions of records on an 80-node Hadoop cluster.
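The abstract does not spell out the algorithm, so the following is only a minimal single-machine sketch of the general idea it describes: blocks are formed on shared property values, and any block that turns out too large at run time is refined with an additional property instead of relying on a fixed key chosen in advance. The function names, the `max_block_size` parameter, the choice to drop oversized blocks once properties run out, and the sample records are all assumptions made for illustration. This is not the paper's MapReduce implementation.

```python
from collections import defaultdict
from itertools import combinations


def _group(ids, records, prop):
    """Group record ids by their value for one property; drop singleton groups."""
    groups = defaultdict(list)
    for rid in ids:
        value = records[rid].get(prop)
        if value is not None:
            groups[value].append(rid)
    return [g for g in groups.values() if len(g) > 1]


def dynamic_blocks(records, properties, max_block_size):
    """Return blocks (lists of record ids) small enough to compare exhaustively.

    records: dict of record id -> dict of property name -> value
    properties: ordered list of candidate blocking properties
    max_block_size: hypothetical cap on block size (an assumption of this sketch)
    """
    accepted = []
    # Work item: (record ids in the block, index of the last property used).
    stack = [(list(records), -1)]
    while stack:
        ids, last = stack.pop()
        if last >= 0 and len(ids) <= max_block_size:
            accepted.append(ids)
            continue
        # Oversized block: refine it with each property not yet used on this
        # path. If none remain, the block is dropped (an assumption here).
        for nxt in range(last + 1, len(properties)):
            for group in _group(ids, records, properties[nxt]):
                stack.append((group, nxt))
    return accepted


def candidate_pairs(blocks):
    """Enumerate each candidate pair once, even if it occurs in several blocks."""
    pairs = set()
    for ids in blocks:
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs
```

With five hypothetical records, four of which share the surname "smith", a cap of two forces the surname block to be refined by ZIP code, so only records that agree on both properties (or that fall into a small enough ZIP block) are paired. The set in `candidate_pairs` is a simple stand-in for the paper's requirement that the same pair is not produced across blocks.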
