Analysis of k-Fold Cross-Validation over Hold-Out Validation on Colossal Datasets for Quality Classification

TLDR

Training a model requires enough data to avoid under-training without over-fitting, and accuracy should be maximized without excessive training time. In practice this leads to k-fold cross-validation on small datasets and hold-out validation on large ones. This study compares the two validation schemes and explores whether k-fold cross-validation can replace hold-out validation on large datasets. The authors evaluated both schemes on four large datasets to assess their suitability for quality classification. Results indicate that, up to a certain data-size threshold, k-fold cross-validation with an appropriately chosen k can be used in place of hold-out validation for quality classification.

Abstract

When training a model on a dataset, the training procedure must be chosen with care. The model needs enough instances to train on without over-fitting to them; with too few instances, it is not trained properly and gives poor results at test time. Accuracy is central to classification, and one should always strive for the highest accuracy, provided it does not come at the cost of unreasonable computation time. On small datasets, the usual choices are k-fold cross-validation with a large value of k (but smaller than the number of instances) or leave-one-out cross-validation, whereas on colossal datasets the first thought is generally to use hold-out validation. This article studies the differences between the two validation schemes and analyzes the possibility of using k-fold cross-validation instead of hold-out validation even on large datasets. Experiments on four large datasets show that, up to a certain threshold, k-fold cross-validation with k varied according to the number of instances can indeed be used in place of hold-out validation for quality classification.
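The difference between the two validation schemes can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic one-dimensional data, the toy threshold classifier, the 70/30 hold-out ratio, the choice k = 10, and the helper names `holdout_split`, `kfold_splits`, `fit_threshold`, and `evaluate`.

```python
import random


def holdout_split(n, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) for one shuffled hold-out split."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


def kfold_splits(n, k, seed=0):
    """Yield (train_idx, test_idx) for each of k folds.

    Every index appears in exactly one test fold.
    """
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def fit_threshold(xs, ys):
    """Toy classifier (illustrative only): midpoint between the class means.

    Assumes both classes (0 and 1) are present in the training data.
    """
    class0 = [x for x, y in zip(xs, ys) if y == 0]
    class1 = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2


def evaluate(xs, ys, train, test):
    """Fit on the train indices and return accuracy on the test indices."""
    t = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
    correct = sum((xs[i] > t) == (ys[i] == 1) for i in test)
    return correct / len(test)


# Synthetic, balanced two-class data (an assumption, not the paper's datasets).
rng = random.Random(42)
ys = [i % 2 for i in range(1000)]
xs = [rng.gauss(2.0 * y, 1.0) for y in ys]

# Hold-out: one model is trained, one accuracy estimate is produced.
train_idx, test_idx = holdout_split(len(xs), test_fraction=0.3)
holdout_acc = evaluate(xs, ys, train_idx, test_idx)

# k-fold: k models are trained, and the k accuracies are averaged.
fold_accs = [evaluate(xs, ys, tr, te) for tr, te in kfold_splits(len(xs), k=10)]
kfold_acc = sum(fold_accs) / len(fold_accs)

print(f"hold-out accuracy: {holdout_acc:.3f}")
print(f"10-fold mean accuracy: {kfold_acc:.3f}")
```

The sketch makes the cost trade-off visible: hold-out fits the model once, while k-fold fits it k times but tests on every instance exactly once, which is why training time grows with both k and dataset size.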
