Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning

TLDR

Feature-selection algorithms fall into two broad categories: wrappers and filters. Filters are much faster and therefore more practical for large databases, but most existing filter algorithms only work with discrete classification problems. The paper describes a fast, correlation-based filter algorithm that applies to both continuous and discrete problems. Used as a preprocessing step for naive Bayes, instance-based learning, decision trees, locally weighted regression, and model trees, it often outperforms the ReliefF attribute estimator. It also performs more feature selection than ReliefF, reducing data dimensionality by fifty percent in most cases, and the decision and model trees built from the preprocessed data are often significantly smaller.

Abstract

Algorithms for feature selection fall into two broad categories: wrappers that use the learning algorithm itself to evaluate the usefulness of features and filters that evaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most existing filter algorithms only work with discrete classification problems. This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems. The algorithm often outperforms the well-known ReliefF attribute estimator when used as a preprocessing step for naive Bayes, instance-based learning, decision trees, locally weighted regression, and model trees. It performs more feature selection than ReliefF does—reducing the data dimensionality by fifty percent in most cases. Also, decision and model trees built from the preprocessed data are often significantly smaller.
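The abstract does not spell out the heuristic, so the following Python sketch is only an illustration of what a correlation-based filter can look like, not the paper's implementation. It assumes numeric features and a numeric class, uses Pearson correlation as the correlation measure, scores a subset with the merit k*r_cf / sqrt(k + k(k-1)*r_ff) commonly associated with correlation-based feature selection, and uses a plain greedy forward search as a simple stand-in for whatever search strategy the paper uses. The function names (`pearson`, `merit`, `select_features`) and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def merit(subset, columns, target):
    """Assumed subset score: k*r_cf / sqrt(k + k(k-1)*r_ff).

    r_cf is the mean |feature-class| correlation and r_ff the mean
    |feature-feature| correlation over all pairs in the subset.
    """
    k = len(subset)
    r_cf = sum(abs(pearson(columns[f], target)) for f in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def select_features(columns, target):
    """Greedy forward search: add the best feature until merit stops improving."""
    selected, best = [], 0.0
    remaining = sorted(columns)
    while remaining:
        score, feature = max(
            (merit(selected + [f], columns, target), f) for f in remaining
        )
        if score <= best:
            break
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected


# Toy data (made up): "echo" is a noisier, redundant version of "signal".
target = [1, 2, 3, 4, 5, 6]
columns = {
    "signal": [1, 2, 3, 4, 6, 7],
    "echo": [2, 1, 4, 3, 7, 6],
    "noise": [3, 1, 3, 1, 3, 1],
}
selected = select_features(columns, target)
print(selected)  # → ['signal']
```

On this toy data the sketch keeps only `signal`: `echo` correlates with the class but is largely redundant with `signal`, and `noise` is only weakly related to the class, so adding either one lowers the subset merit. Because the score needs only pairwise correlations and never trains the downstream learner, it shows why a filter of this kind is cheap compared with a wrapper.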
