Iterative Automated Record Linkage Using Mixture Models

TLDR

Record linkage seeks to link records belonging to the same person or entity quickly and accurately. Patterns of agreement and disagreement between record pairs can be modeled as a mixture of matches and nonmatches, so mixture models can partition pairs into probable links and nonlinks. The authors propose a procedure that uses marginal database information to select mixture models, identifies sets of records for clerical review, incorporates reviewed data into parameter estimates as they become available, and classifies pairs as links, nonlinks, or in need of further review. The procedure is illustrated on five datasets from the U.S. Bureau of the Census.

Abstract

The goal of record linkage is to link quickly and accurately records that correspond to the same person or entity. Whereas certain patterns of agreements and disagreements on variables are more likely among records pertaining to a single person than among records for different people, the observed patterns for pairs of records can be viewed as arising from a mixture of matches and nonmatches. Mixture model estimates can be used to partition record pairs into two or more groups that can be labeled as probable matches (links) and probable nonmatches (nonlinks). A method is proposed and illustrated that uses marginal information in the database to select mixture models, identifies sets of records for clerks to review based on the models and marginal information, incorporates clerically reviewed data, as they become available, into estimates of model parameters, and classifies pairs as links, nonlinks, or in need of further clerical review. The procedure is illustrated with five datasets from the U.S. Bureau of the Census. It appears to be robust to variations in record-linkage sites. The clerical review corrects classifications of some pairs directly and leads to changes in classification of others through reestimation of mixture models.

KEY WORDS: Administrative records; Census; Expectation–maximization; Expectation-conditional maximization; File matching; Latent-class models; Post-enumeration survey
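
The mixture idea in the abstract can be illustrated with a minimal sketch. The code below is not the authors' procedure: it assumes the simplest possible case, a two-class latent-class model in which each comparison field agrees or disagrees independently given match status, fitted by plain expectation-maximization. It does not use marginal database information for model selection and does not fold clerically reviewed pairs back into the estimates. All names (`fit_em`, `match_posterior`, `classify`, `simulate`), the starting values, and the review thresholds 0.1 and 0.9 are hypothetical choices made for illustration.

```python
import random

EPS = 1e-6  # keeps probabilities away from 0 and 1 (illustrative choice)


def _clip(x):
    return min(max(x, EPS), 1.0 - EPS)


def match_posterior(pattern, p, m, u):
    """P(match | agreement pattern) under conditional independence.

    p    : mixing proportion of matches
    m[j] : P(field j agrees | match)
    u[j] : P(field j agrees | nonmatch)
    """
    like_match, like_non = p, 1.0 - p
    for agrees, mj, uj in zip(pattern, m, u):
        like_match *= mj if agrees else 1.0 - mj
        like_non *= uj if agrees else 1.0 - uj
    return like_match / (like_match + like_non)


def fit_em(patterns, n_iter=100, p=0.1, m_init=0.9, u_init=0.1):
    """Fit the two-class mixture by EM; returns (p, m, u)."""
    n, k = len(patterns), len(patterns[0])
    m, u = [m_init] * k, [u_init] * k
    for _ in range(n_iter):
        # E-step: posterior match probability for every record pair.
        post = [match_posterior(g, p, m, u) for g in patterns]
        total = sum(post)
        non_total = max(n - total, EPS)
        # M-step: weighted agreement rates within each class.
        p = _clip(total / n)
        for j in range(k):
            agree_match = sum(w for w, g in zip(post, patterns) if g[j])
            agree_non = sum(1.0 - w for w, g in zip(post, patterns) if g[j])
            m[j] = _clip(agree_match / total)
            u[j] = _clip(agree_non / non_total)
    return p, m, u


def classify(prob, lower=0.1, upper=0.9):
    """Three-way decision: link, nonlink, or send to clerical review."""
    if prob >= upper:
        return "link"
    if prob <= lower:
        return "nonlink"
    return "review"


def simulate(n, p, m, u, seed=0):
    """Draw synthetic agreement patterns (assumed data, for demonstration only)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        probs = m if rng.random() < p else u
        pairs.append(tuple(int(rng.random() < q) for q in probs))
    return pairs


if __name__ == "__main__":
    pairs = simulate(2000, 0.2, [0.9] * 4, [0.1] * 4)
    p_hat, m_hat, u_hat = fit_em(pairs)
    for g in sorted(set(pairs)):
        prob = match_posterior(g, p_hat, m_hat, u_hat)
        print(g, round(prob, 3), classify(prob))
```

In this sketch, pairs whose posterior match probability falls between the two thresholds form the clerical review set. The reestimation step described in the abstract would correspond to fixing the class membership of reviewed pairs and rerunning the fit, which this example leaves out.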
