Missing value imputation methods for TCM medical data and its effect in the classifier accuracy

Abstract

Objective: Medical data mining is an active research area, but medical data often contain missing values, which complicates analysis. This work evaluates the performance of several imputation methods. Methods: We first simulate a data set with missing values by deleting some values from a complete data set. We then estimate the deleted values with three filling algorithms (KNN with Euclidean distance, KNN with the correlation coefficient, and mean imputation) and compare the accuracy of their estimates. Next, we apply the same algorithms to clinical data that contain missing values to obtain complete data, and build disease prediction models with the random forest algorithm and the classification and regression trees (CART) algorithm. By comparing the observed values with the predicted values, we examine the effect of each filling algorithm on prediction accuracy. Results: The accuracy of the three algorithms is compared under different missing rates. In the filling experiment, KNN based on the Pearson correlation coefficient performs clearly better than KNN based on the Euclidean metric and mean imputation. In the prediction model, the three filling algorithms rank in the same order as in the filling experiment, but the gap between them is not very significant.
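The filling experiment described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the details, so the following are all assumptions made for illustration: the function names, independent random masking of entries, the use of 1 - r as the Pearson-based distance, plain averaging of the k nearest neighbours' values, and root-mean-square error as the accuracy measure.

```python
import math
import random


def mean_impute(rows):
    """Fill each missing entry (None) with the mean of its column's observed values.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(sum(observed) / len(observed))
    return [[col_means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def _co_observed(a, b):
    """Pairs of values from columns observed in both rows."""
    return [(x, y) for x, y in zip(a, b) if x is not None and y is not None]


def euclidean_distance(a, b):
    """Root-mean-square difference over co-observed columns (smaller is closer)."""
    pairs = _co_observed(a, b)
    if not pairs:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))


def pearson_distance(a, b):
    """1 - Pearson r over co-observed columns (assumed form of the distance)."""
    pairs = _co_observed(a, b)
    if len(pairs) < 2:
        return math.inf
    mx = sum(x for x, _ in pairs) / len(pairs)
    my = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    if vx == 0 or vy == 0:
        return math.inf  # correlation is undefined for a constant row
    return 1.0 - cov / math.sqrt(vx * vy)


def knn_impute(rows, k, distance):
    """Fill each missing entry with the mean of that column over the k nearest rows."""
    fallback = mean_impute(rows)  # used when no usable neighbour exists
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            # Only rows that observe column j can donate a value.
            candidates = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [val for d, val in candidates[:k] if d != math.inf]
            filled[i][j] = sum(nearest) / len(nearest) if nearest else fallback[i][j]
    return filled


def mask_at_random(rows, rate, rng):
    """Hide each entry independently with probability `rate` (assumed mechanism)."""
    return [[None if rng.random() < rate else v for v in r] for r in rows]


def rmse(truth, masked, filled):
    """Root-mean-square error over the entries that were hidden."""
    errs = [
        (t - f) ** 2
        for tr, mr, fr in zip(truth, masked, filled)
        for t, m, f in zip(tr, mr, fr)
        if m is None
    ]
    return math.sqrt(sum(errs) / len(errs)) if errs else 0.0
```

To mimic the comparison under different missing rates, one could call `mask_at_random` on a complete data set with a seeded `random.Random`, fill the result with `mean_impute`, `knn_impute(..., distance=euclidean_distance)` and `knn_impute(..., distance=pearson_distance)`, and compare the three `rmse` values at each rate.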
