Using Coding-Based Ensemble Learning to Improve Software Defect Prediction

TLDR

Software defect prediction using static code attributes has attracted attention, but class imbalance makes prediction difficult, and conventional techniques such as sampling, cost‑sensitive learning, Bagging, and Boosting can lose important information, introduce errors, and overfit by altering the data distribution. This study proposes a novel approach that transforms imbalanced binary defect data into balanced multiclass data and trains a defect predictor using a specific coding scheme. The authors evaluated this approach across 14 NASA datasets using four classification algorithms, three coding schemes, and six conventional imbalance‑handling methods in a comprehensive experiment. Results indicate that the one‑against‑one coding scheme outperforms conventional methods on average.

Abstract

Using classification methods to predict software defect proneness with static code attributes has attracted a great deal of attention. The class-imbalance characteristic of software defect data makes the prediction much more difficult; thus, a number of methods have been employed to address this problem. However, these conventional methods, such as sampling, cost-sensitive learning, Bagging, and Boosting, can suffer from the loss of important information, unexpected mistakes, and overfitting because they alter the original data distribution. This paper presents a novel method that first converts the imbalanced binary-class data into balanced multiclass data and then builds a defect predictor on the multiclass data with a specific coding scheme. A thorough experiment with four different types of classification algorithms, three data coding schemes, and six conventional imbalance data-handling methods was conducted over 14 NASA datasets. The experimental results show that the proposed method with a one-against-one coding scheme is, on average, superior to the conventional methods.
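
The two steps named in the abstract (converting binary data into balanced multiclass data, then learning with a one-against-one coding scheme) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the majority class is divided or which base learners are used, so the random partition into minority-sized subsets, the nearest-centroid base learner, the toy data, and all function names (`split_majority`, `train_one_vs_one`, `predict`) are assumptions made only for illustration.

```python
import random
from collections import Counter
from itertools import combinations


def split_majority(X, y, minority_label=1, seed=0):
    """Relabel the data so every class has roughly the minority's size.

    Assumption: the majority (non-defective) samples are partitioned at
    random; the abstract does not specify the partitioning strategy.
    Returns (data, k): data is a list of (features, class_id) pairs where
    class 0 is the defective class and classes 1..k are majority subsets.
    """
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    majority = [x for x, label in zip(X, y) if label != minority_label]
    rng.shuffle(majority)
    k = max(1, round(len(majority) / max(1, len(minority))))
    data = [(x, 0) for x in minority]
    for i, x in enumerate(majority):
        data.append((x, 1 + i % k))
    return data, k


def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def train_one_vs_one(data):
    """Build one binary learner per pair of classes (one-against-one).

    Assumption: each pairwise learner is a nearest-centroid rule, chosen
    only to keep the sketch dependency-free.
    """
    by_class = {}
    for x, c in data:
        by_class.setdefault(c, []).append(x)
    cents = {c: centroid(rows) for c, rows in by_class.items()}
    return [(a, b, cents[a], cents[b]) for a, b in combinations(sorted(cents), 2)]


def predict(model, x):
    """Let the pairwise learners vote, then map back to binary labels.

    Returns 1 (defective) if class 0 wins the vote, otherwise 0.
    """
    votes = Counter()
    for a, b, ca, cb in model:
        votes[a if sq_dist(x, ca) <= sq_dist(x, cb) else b] += 1
    winner = votes.most_common(1)[0][0]
    return 1 if winner == 0 else 0


# Toy data (hypothetical): six non-defective modules, two defective ones.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 2], [2, 0], [10, 10], [9, 10]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

data, k = split_majority(X, y)
model = train_one_vs_one(data)
print(k, len(model), predict(model, [9, 9]), predict(model, [0, 0]))
```

In this toy case the six majority samples are divided into three subsets of two, so all four classes have the same size as the defective class, and no sample is discarded or duplicated. A prediction of any majority subset is mapped back to "non-defective", which is how the multiclass output becomes a binary defect prediction again.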
