Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study

TLDR

Multivariate imputation by chained equations (MICE) is widely used in epidemiology, but its default parametric models may miss nonlinearities that random forest imputation can capture without requiring a specified regression form. The study compared parametric MICE with a random forest-based MICE algorithm in two simulation studies. In the first, 1,000 random samples of 2,000 patients were drawn from CALIBER stable angina patients with complete covariate data, and variables were artificially made missing at random to compare the bias and efficiency of the two imputation methods. In the second, the partially observed variable depended nonlinearly on the fully observed variables, and bias and confidence-interval coverage were assessed. In the first study, both methods produced unbiased estimates of (log) hazard ratios, but random forest MICE was more efficient and yielded narrower confidence intervals. In the nonlinear simulation, random forest MICE gave less biased estimates and better coverage. The results suggest that random forest imputation may be useful for complex epidemiologic data sets with missing data.

Abstract

Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The “true” imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001–2010) with complete data on all covariates. Variables were artificially made “missing at random,” and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.
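The simulation design described above rests on two mechanical steps: making a variable "missing at random" (missingness depends only on fully observed variables) and summarizing bias, confidence interval coverage, and interval width across replicates. The following is a minimal sketch of those two steps only, not the study's code. The function names, the logistic missingness model, and its `intercept` and `slope` parameters are illustrative assumptions, and the imputation step itself (parametric or random forest MICE) is deliberately left out.

```python
import numpy as np


def make_mar(x_full, x_partial, rng, intercept=-1.0, slope=1.0):
    """Blank out entries of x_partial with a probability that depends
    only on the fully observed x_full (a missing-at-random mechanism).

    The logistic form and its parameters are assumptions for illustration.
    """
    x_full = np.asarray(x_full, dtype=float)
    p_missing = 1.0 / (1.0 + np.exp(-(intercept + slope * x_full)))
    mask = rng.random(x_full.shape[0]) < p_missing
    out = np.asarray(x_partial, dtype=float).copy()
    out[mask] = np.nan
    return out


def bias_and_coverage(estimates, std_errors, true_value, z=1.96):
    """Summarize per-replicate estimates against a known true value.

    Returns (bias, coverage, mean_width), where coverage is the share of
    normal-approximation intervals that contain true_value.
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    bias = estimates.mean() - true_value
    coverage = np.mean((lower <= true_value) & (true_value <= upper))
    mean_width = np.mean(upper - lower)
    return bias, coverage, mean_width


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=2000)                 # fully observed variable
    y = x ** 2 + rng.normal(size=2000)        # depends on x nonlinearly
    y_obs = make_mar(x, y, rng)
    print("fraction missing:", np.isnan(y_obs).mean())
```

In a full simulation, each replicate would impute `y_obs` with the method under test, fit the analysis model, and pass the resulting estimates and standard errors to `bias_and_coverage`; a more efficient method would show a smaller `mean_width` at comparable coverage.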
