Feature Selection and Prediction Model for Type 2 Diabetes in the Chinese Population with Machine Learning

Abstract

Diabetes is a chronic disease characterized by hyperglycemia. With its incidence rising in recent years, diabetes affects a growing number of families. In 2017 alone, it caused 5 million deaths and accounted for $850 billion in global healthcare costs. In this paper, we propose a method to predict the prevalence of diabetes from a selected set of features in physical examination data. We used Fisher's score, recursive feature elimination (RFE) and a decision tree to select features. Random forest, logistic regression, support vector machine (SVM) and multilayer perceptron (MLP) classifiers were used to predict the prevalence of diabetes. EA and Fisher's score helped us to reduce the dimensionality of the data, and random forest classified diabetes accurately. Our results show that the highest accuracy (0.987) was achieved by random forest with 85 features. The prediction accuracy using Fisher's score with 19 features also reached 0.986. We finally selected 5 features with our method to form a new dataset for diabetes prediction: fasting plasma glucose, HbA1c, HDL, total cholesterol level and hypertension. On this dataset, the accuracy, precision, sensitivity, F1 score, Matthews correlation coefficient (MCC) and area under the curve (AUC) were 0.977, 0.968, 0.812, 0.883, 0.875 and 0.905, respectively. These results show that our method can be used to select features for a diabetes classifier and improve its performance, which will help clinicians identify diabetes quickly.
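As a rough illustration of two ingredients named in the abstract, the sketch below computes a Fisher score for each feature and derives accuracy, precision, sensitivity, F1 score and MCC from binary predictions. It is a minimal sketch under stated assumptions, not the authors' implementation: it uses one common form of the Fisher score (between-class scatter divided by within-class scatter), and the function names, feature names and toy values are hypothetical and are not taken from the study data.

```python
from math import sqrt


def fisher_score(values, labels):
    """Fisher score of a single feature: between-class scatter of the
    class means divided by the within-class scatter (assumed definition)."""
    overall_mean = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        group = [v for v, y in zip(values, labels) if y == cls]
        n = len(group)
        mean = sum(group) / n
        variance = sum((v - mean) ** 2 for v in group) / n
        between += n * (mean - overall_mean) ** 2
        within += n * variance
    # A feature with zero within-class scatter separates the classes perfectly.
    return between / within if within > 0 else float("inf")


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, sensitivity, F1 and MCC, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    pr_sum = precision + sensitivity
    f1 = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical toy data: 1 = diabetic, 0 = non-diabetic.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "fasting_glucose": [4.8, 5.1, 5.3, 5.0, 7.9, 8.4, 7.2, 9.0],
    "uninformative": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
}

# Rank features from most to least discriminative.
ranked = sorted(features, key=lambda name: fisher_score(features[name], labels),
                reverse=True)

# Hypothetical predictions from some classifier, scored against true labels.
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 1])
```

Keeping only the top-ranked entries of `ranked` mirrors the idea of shrinking the feature set before training a classifier. Reporting MCC and sensitivity alongside accuracy is useful when positive cases are rare, because accuracy alone can stay high even when many positive cases are missed.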
