Publication | Closed Access
C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling
Citations: 833
References: 11
Year: 2003
Unknown Venue
C4.5, combined with sampling schemes, is becoming the community standard for evaluating cost‑sensitive learning algorithms. The study reexamines over‑sampling and under‑sampling as strategies for handling class imbalance and misclassification costs in machine learning. The authors employ cost‑curve analysis to assess how over‑sampling and under‑sampling affect C4.5’s performance. They find that C4.5 with under‑sampling provides a solid baseline, though a cheapest‑class classifier may outperform it at modest costs, while over‑sampling shows minimal cost sensitivity.
This paper takes a new look at two sampling schemes commonly used to adapt machine learning algorithms to imbalanced classes and misclassification costs. It uses a performance analysis technique called cost curves to explore the interaction of over-sampling and under-sampling with the decision tree learner C4.5. C4.5 was chosen because, when combined with one of the sampling schemes, it is quickly becoming the community standard for evaluating new cost-sensitive learning algorithms. The paper shows that using C4.5 with under-sampling establishes a reasonable standard for algorithmic comparison. It is recommended, however, that the cheapest-class classifier be part of that standard, as it can be better than under-sampling for relatively modest costs. Over-sampling, by contrast, shows little sensitivity: there is often little difference in performance when misclassification costs are changed.
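The ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's experimental code: the function names are hypothetical, the samplers balance the classes fully (an assumption; other target ratios are possible), and the cost formulas follow the standard cost-curve definitions, in which a classifier's normalized expected cost is a straight line over the probability cost function PCF(+). The cheapest-class classifier is modeled here as whichever trivial classifier (always negative or always positive) costs less at a given PCF(+).

```python
import random


def _group_by_class(examples):
    """Group (features, label) pairs by label."""
    groups = {}
    for features, label in examples:
        groups.setdefault(label, []).append((features, label))
    return groups


def under_sample(examples, rng):
    """Randomly discard examples until every class matches the smallest class."""
    groups = _group_by_class(examples)
    target = min(len(members) for members in groups.values())
    sampled = []
    for members in groups.values():
        sampled.extend(rng.sample(members, target))  # without replacement
    return sampled


def over_sample(examples, rng):
    """Randomly duplicate examples until every class matches the largest class."""
    groups = _group_by_class(examples)
    target = max(len(members) for members in groups.values())
    sampled = []
    for members in groups.values():
        sampled.extend(members)
        sampled.extend(rng.choices(members, k=target - len(members)))  # with replacement
    return sampled


def pcf_positive(p_pos, cost_fn, cost_fp):
    """Probability cost function PCF(+): the x-axis of a cost curve."""
    weighted_pos = p_pos * cost_fn
    return weighted_pos / (weighted_pos + (1.0 - p_pos) * cost_fp)


def normalized_expected_cost(fn_rate, fp_rate, pcf):
    """Normalized expected cost of a classifier at a given PCF(+)."""
    return fn_rate * pcf + fp_rate * (1.0 - pcf)


def cheapest_class_cost(pcf):
    """Cost of the cheaper trivial classifier at a given PCF(+)."""
    always_negative = normalized_expected_cost(1.0, 0.0, pcf)
    always_positive = normalized_expected_cost(0.0, 1.0, pcf)
    return min(always_negative, always_positive)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy data: 90 negative and 10 positive examples.
    data = [(i, "neg") for i in range(90)] + [(i, "pos") for i in range(10)]
    print(len(under_sample(data, rng)), len(over_sample(data, rng)))
    print(cheapest_class_cost(pcf_positive(0.1, cost_fn=9.0, cost_fp=1.0)))
```

A learner's cost curve can be compared against `cheapest_class_cost` at each PCF(+) value; wherever the learner's normalized expected cost is higher, the trivial classifier is the better choice at that operating point.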