Correcting Noisy Data

Abstract

Inductive learning aims at constructing a generalized description of a given set of data, so that future similar instances can be classified correctly. The performance on this task depends crucially on the quality of the data. We investigate here an approach to handling noise in the training data by identifying possible noisy attributes and/or class in each instance, and replacing such values with more appropriate ones. The resulting data set would preserve much of the original information, but conform more to the ideal noise-free case. A classifier built from this corrected data should have a higher predictive power. We make use of the interdependence among attributes and between the attributes and the class to predict the value of one attribute using the rest of the attributes together with the class value. These predictions serve as candidates for possible adjustments to a training instance that has been misclassified. We selectively adjust some of the attribute values accordingly to obtain a better fit of the polished instance to the classifier. Preliminary experimentation suggests that this is a viable approach to noise reduction and correction.
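
To make the idea concrete, the following is a minimal sketch of the polishing loop described above, not the implementation evaluated in this work. It assumes categorical attributes with the class stored in the last column, and it substitutes a leave-one-out nearest-neighbour vote (Hamming distance, k = 3) for whatever learner is actually used. The names `predict_column` and `polish` are hypothetical. For brevity it adjusts at most one attribute per misclassified instance and never changes the class value.

```python
from collections import Counter

def predict_column(rows, target, query, skip=None, k=3):
    """Predict column `target` of `query` by majority vote among the k rows
    nearest in Hamming distance over all other columns (row `skip` is excluded).
    Assumption: a simple k-NN stands in for the real predictor."""
    def distance(row):
        return sum(1 for j, (a, b) in enumerate(zip(row, query))
                   if j != target and a != b)
    candidates = [r for i, r in enumerate(rows) if i != skip]
    nearest = sorted(candidates, key=distance)[:k]
    return Counter(r[target] for r in nearest).most_common(1)[0][0]

def polish(rows, k=3):
    """Return a copy of `rows` (last column = class) in which misclassified
    instances have one suspect attribute value replaced, when that helps."""
    class_col = len(rows[0]) - 1
    polished = [list(r) for r in rows]
    for i, row in enumerate(rows):
        # Correctly classified instances are left untouched.
        if predict_column(rows, class_col, row, skip=i, k=k) == row[class_col]:
            continue
        for j in range(class_col):
            # Predict attribute j from the other attributes plus the class.
            proposal = predict_column(rows, j, row, skip=i, k=k)
            if proposal == row[j]:
                continue
            trial = list(row)
            trial[j] = proposal
            # Keep the adjustment only if the instance is now classified correctly.
            if predict_column(rows, class_col, trial, skip=i, k=k) == row[class_col]:
                polished[i] = trial
                break
    return polished

# Toy data: the class follows the second attribute; the last row has that
# attribute corrupted (0 instead of 1).
rows = [(1, 1, 0, 1)] * 4 + [(1, 0, 0, 0)] * 4 + [(1, 0, 0, 1)]
polished = polish(rows)
print(polished[8])  # [1, 1, 0, 1]
```

In this toy run only the last instance is misclassified; the prediction for its second attribute disagrees with the stored value, and adopting the predicted value makes the instance fit the classifier, so that single value is replaced while all other rows are preserved.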