Learning from imbalanced data sets with boosting and data generation

TLDR

Learning from imbalanced data sets, where one class dominates, is a major challenge for machine learning. This paper proposes an approach that combines boosting with data generation to improve classifier performance on two-class imbalanced data sets. The DataBoost-IM method identifies hard examples from both classes during boosting, generates synthetic examples separately for the majority and minority classes, adds them to the original training set, and rebalances the class distribution and the total class weights. Experiments on seventeen highly and moderately imbalanced data sets, using decision trees as base classifiers, show that DataBoost-IM compares well, in terms of F-measures, G-mean, and overall accuracy, with a base classifier, a standard boosting algorithm, and three advanced boosting-based algorithms, without sacrificing one class in favor of the other.

Abstract

Learning from imbalanced data sets, where the number of examples of one (majority) class is much higher than that of the others, presents an important challenge to the machine learning community. Traditional machine learning algorithms may be biased towards the majority class, thus producing poor predictive accuracy over the minority class. In this paper, we describe a new approach that combines boosting, an ensemble-based learning algorithm, with data generation to improve the predictive power of classifiers on imbalanced data sets consisting of two classes. In the DataBoost-IM method, hard examples from both the majority and minority classes are identified during execution of the boosting algorithm. Subsequently, the hard examples are used to separately generate synthetic examples for the majority and minority classes. The synthetic data are then added to the original training set, and the class distribution and the total weights of the different classes in the new training set are rebalanced. The DataBoost-IM method was evaluated, in terms of the F-measures, G-mean and overall accuracy, on seventeen highly and moderately imbalanced data sets using decision trees as base classifiers. Our results are promising and show that the DataBoost-IM method compares well with a base classifier, a standard benchmarking boosting algorithm and three advanced boosting-based algorithms for imbalanced data sets. The results indicate that our approach does not sacrifice one class in favor of the other, but produces high predictive accuracy on both the minority and majority classes.
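The evaluation measures named above can be illustrated with a minimal sketch. It assumes the standard definitions (F-measure as the harmonic mean of precision and recall, G-mean as the geometric mean of the two per-class recalls) and treats the minority class as the positive class. The abstract does not spell out its exact formulas, so these definitions, the function name imbalance_metrics and its argument names are illustrative assumptions rather than the authors' code.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Compute per-class F-measures, G-mean and overall accuracy.

    Assumption: the minority class is the positive class, so
    tp/fn count minority examples and tn/fp count majority examples.
    """
    def safe_div(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    def f_measure(precision, recall):
        # Harmonic mean of precision and recall.
        return safe_div(2 * precision * recall, precision + recall)

    recall_min = safe_div(tp, tp + fn)      # minority recall (sensitivity)
    precision_min = safe_div(tp, tp + fp)
    recall_maj = safe_div(tn, tn + fp)      # majority recall (specificity)
    precision_maj = safe_div(tn, tn + fn)

    return {
        "f_minority": f_measure(precision_min, recall_min),
        "f_majority": f_measure(precision_maj, recall_maj),
        # Geometric mean of the two per-class recalls.
        "g_mean": math.sqrt(recall_min * recall_maj),
        "accuracy": safe_div(tp + tn, tp + fn + fp + tn),
    }


if __name__ == "__main__":
    # Hypothetical counts: 50 minority and 950 majority examples,
    # with a classifier that labels every example as majority.
    print(imbalance_metrics(tp=0, fn=50, fp=0, tn=950))
```

With these hypothetical counts, the classifier that always predicts the majority class reaches an overall accuracy of 0.95 while its G-mean and minority F-measure are both 0. This is why overall accuracy alone says little on imbalanced data, and why a method that "does not sacrifice one class in favor of the other" needs to score well on all of these measures together.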
