New boosting approaches for improving cluster-based undersampling in problems with imbalanced data

Abstract

Class-imbalanced datasets are frequently encountered in areas such as health, security, and finance. Such datasets often bias the supervised learning models trained for the prediction task. One of the most successful techniques for handling imbalanced data is undersampling. Experiments demonstrate that cluster-based undersampling improves over random undersampling in many cases. In this paper, we propose three new boosting approaches to improve the performance of the cluster-based undersampling technique: (i) injecting unlabelled data into the training data for improved clustering; (ii) keeping the instances close to the cluster boundary and centroid while undersampling; and (iii) removing the majority samples in the neighborhood of minority data in each cluster. We evaluated our boosting methods on 49 standard benchmark datasets and analyzed their performance in terms of standard evaluation metrics. Experimental results suggest that these boosting techniques are promising and significantly improve over cluster-based undersampling strategies.
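As a rough illustration of approach (ii), the following minimal sketch clusters the majority class and, within each cluster, keeps the instances nearest the centroid together with those farthest from it (used here as a stand-in for "close to the cluster boundary"). It is not the implementation evaluated in the paper: the function names, the plain k-means routine, the size-proportional per-cluster quota, and the even split between centroid and boundary instances are all assumptions made for illustration.

```python
import math
import random


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on tuples of floats; returns (centroids, labels)."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, assign(points, centroids)


def undersample_majority(majority, n_keep, k=3, seed=0):
    """Keep roughly n_keep majority points, taken per cluster from the
    instances nearest the centroid and those farthest from it.

    Hypothetical helper: quota and split rules are illustrative assumptions.
    """
    centroids, labels = kmeans(majority, k, seed=seed)
    kept = []
    for c in range(k):
        members = [p for p, lab in zip(majority, labels) if lab == c]
        if not members:
            continue
        # Quota proportional to cluster size, at least one point per cluster.
        quota = max(1, round(n_keep * len(members) / len(majority)))
        quota = min(quota, len(members))
        # Sort by distance to the centroid: front = central, back = boundary.
        members.sort(key=lambda p: math.dist(p, centroids[c]))
        n_center = (quota + 1) // 2
        n_boundary = quota - n_center
        kept.extend(members[:n_center])
        if n_boundary:
            kept.extend(members[-n_boundary:])
    return kept
```

Because of per-cluster rounding, the number of retained instances is only approximately `n_keep`. The retained majority instances would then be combined with all minority instances to form the training set; approaches (i) and (iii) are not covered by this sketch.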
