Interactive deduplication using active learning

TLDR

Deduplication is essential when integrating data from multiple sources, but it is hard to design a function that correctly identifies duplicate records despite data inconsistencies, and most current systems rely on hand-coded rules. The authors present a learning-based deduplication system that trains a classifier to distinguish duplicates from non-duplicates and uses active learning to interactively discover challenging training pairs, removing the need to search manually for data inconsistencies. They also address the design issues involved in providing interactive response, rapid convergence, and interpretable output. Experiments on real-life datasets show that active learning significantly reduces the number of training instances required to reach high accuracy.

Abstract

Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can resolve when a pair of records refers to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method critically hinges on being able to provide a covering and challenging set of training pairs that bring out the subtleties of the deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in large lists. We present our design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning. Our experiments on real-life datasets show that active learning significantly reduces the number of instances needed to achieve high accuracy. We investigate various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.
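
To make the idea of interactively discovering challenging training pairs concrete, the following is a minimal sketch of one common active-learning strategy, uncertainty sampling. It is not taken from the paper and is not necessarily the authors' method. It assumes each record pair is reduced to a single similarity score and that the classifier is a simple threshold on that score; the names `similarity`, `fit_threshold`, `active_learn`, and the `oracle` callback (standing in for the human labeler) are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Illustrative similarity feature: character-level match ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fit_threshold(labeled):
    """Fit a one-feature classifier from {score: is_duplicate} labels.

    The threshold is the midpoint between the lowest-scoring labeled
    duplicate and the highest-scoring labeled non-duplicate.
    Requires at least one example of each class.
    """
    dup_scores = [s for s, is_dup in labeled if is_dup]
    non_scores = [s for s, is_dup in labeled if not is_dup]
    return (min(dup_scores) + max(non_scores)) / 2


def active_learn(scores, oracle, seed_indices, rounds):
    """Uncertainty sampling over pair scores.

    scores:       similarity score of each candidate pair
    oracle:       callable index -> bool, the (assumed) human labeler
    seed_indices: initially labeled pairs, covering both classes
    rounds:       number of additional labels to request
    Returns the final threshold and the dict of labels collected.
    """
    labels = {i: oracle(i) for i in seed_indices}
    for _ in range(rounds):
        threshold = fit_threshold([(scores[i], y) for i, y in labels.items()])
        unlabeled = [i for i in range(len(scores)) if i not in labels]
        if not unlabeled:
            break
        # The most challenging pair is the one closest to the decision boundary.
        query = min(unlabeled, key=lambda i: abs(scores[i] - threshold))
        labels[query] = oracle(query)
    threshold = fit_threshold([(scores[i], y) for i, y in labels.items()])
    return threshold, labels
```

In use, `scores` would be computed by applying `similarity` (or a richer set of features) to candidate record pairs, and `oracle` would prompt a user for a duplicate/non-duplicate decision. Each round asks for a label only on the pair the current classifier is least sure about, which is how this kind of scheme can reach an accurate boundary with far fewer labels than labeling pairs at random.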