Prediction error estimation: a comparison of resampling methods

TLDR

Genomic studies collect thousands of features on relatively few samples in order to build classifiers, a process that involves feature selection, model selection, and prediction assessment. The authors compare resampling methods for estimating the true prediction error of a model when feature selection is performed. In small studies with thousands of candidate features, the resubstitution and simple split-sample estimates are seriously biased. In these small samples, LOOCV, 10-fold CV, and the .632+ bootstrap have the smallest bias for diagonal discriminant analysis, nearest neighbor, and classification trees, while LOOCV and 10-fold CV have the smallest bias for linear discriminant analysis. LOOCV, 5- and 10-fold CV, and the .632+ bootstrap have the lowest mean square error, although the .632+ bootstrap is quite biased in small samples with strong signal-to-noise ratios. Differences among the methods shrink as sample size increases. Full results and R code are available in Molinaro et al. (2005) (http://linus.nci.nih.gov/brb/TechReport.htm).

Abstract

In genomic studies, thousands of features are collected on relatively few samples. One of the goals of these studies is to build classifiers to predict the outcome of future observations. There are three inherent steps to this process: feature selection, model selection and prediction assessment. With a focus on prediction assessment, we compare several methods for estimating the 'true' prediction error of a prediction model in the presence of feature selection. For small studies where features are selected from thousands of candidates, the resubstitution and simple split-sample estimates are seriously biased. In these small samples, leave-one-out cross-validation (LOOCV), 10-fold cross-validation (CV) and the .632+ bootstrap have the smallest bias for diagonal discriminant analysis, nearest neighbor and classification trees. LOOCV and 10-fold CV have the smallest bias for linear discriminant analysis. Additionally, LOOCV, 5- and 10-fold CV, and the .632+ bootstrap have the lowest mean square error. The .632+ bootstrap is quite biased in small sample sizes with strong signal-to-noise ratios. Differences in performance among resampling methods are reduced as the number of specimens available increases. A complete compilation of results and R code for simulations and analyses are available in Molinaro et al. (2005) (http://linus.nci.nih.gov/brb/TechReport.htm).
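
The bias of the resubstitution estimate under feature selection can be illustrated with a small simulation. The sketch below is not the authors' R code. It assumes pure-noise data (no feature is related to the class label, so the true error of any classifier is 0.5), a hypothetical selection rule that keeps the k features with the largest absolute difference in class means, and a nearest-centroid classifier as a simple stand-in for the classifiers studied in the paper. All function names and parameter values are illustrative. Resubstitution selects features and evaluates on the same samples, whereas LOOCV repeats the feature selection inside every fold.

```python
import random


def top_features(X, y, k):
    """Return indices of the k features with the largest |mean(class 0) - mean(class 1)|."""
    idx0 = [i for i, c in enumerate(y) if c == 0]
    idx1 = [i for i, c in enumerate(y) if c == 1]

    def score(j):
        m0 = sum(X[i][j] for i in idx0) / len(idx0)
        m1 = sum(X[i][j] for i in idx1) / len(idx1)
        return abs(m0 - m1)

    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def fit_centroids(X, y, feats):
    """Compute each class centroid, restricted to the selected features."""
    cents = {}
    for c in (0, 1):
        rows = [X[i] for i in range(len(X)) if y[i] == c]
        cents[c] = [sum(r[j] for r in rows) / len(rows) for j in feats]
    return cents


def predict(x, feats, cents):
    """Assign x to the class whose centroid is closer in squared Euclidean distance."""
    def dist(c):
        return sum((x[j] - m) ** 2 for j, m in zip(feats, cents[c]))

    return 0 if dist(0) <= dist(1) else 1


def resubstitution_error(X, y, k):
    """Select features, train, and test on the same samples (optimistically biased)."""
    feats = top_features(X, y, k)
    cents = fit_centroids(X, y, feats)
    wrong = sum(predict(x, feats, cents) != c for x, c in zip(X, y))
    return wrong / len(y)


def loocv_error(X, y, k):
    """Leave-one-out CV with feature selection redone inside every fold."""
    wrong = 0
    for i in range(len(y)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        feats = top_features(X_tr, y_tr, k)      # the held-out sample is never seen here
        cents = fit_centroids(X_tr, y_tr, feats)
        wrong += predict(X[i], feats, cents) != y[i]
    return wrong / len(y)


def simulate(seed, n=20, p=500, k=10):
    """Pure-noise data: n samples, p standard-normal features, balanced labels."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return resubstitution_error(X, y, k), loocv_error(X, y, k)


if __name__ == "__main__":
    resub, loocv = simulate(seed=0)
    print(f"resubstitution: {resub:.2f}  LOOCV: {loocv:.2f}  (true error: 0.50)")
```

Under these assumptions the resubstitution estimate typically falls far below 0.5, because the selected features were chosen to separate exactly the samples being scored. The LOOCV estimate varies from run to run but is not systematically optimistic in this way. Moving the `top_features` call outside the LOOCV loop would reintroduce the selection bias.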
