Publication | Open Access
A Reality Check for Data Snooping
Citations: 1.8K | References: 49 | Year: 2000
Keywords: Privacy Protection, Engineering, Information Security, Information Leakage, Verification, Data-centric Security, Information Forensics, Software Analysis, Formal Verification, Data Integrity, Data Science, Privacy-preserving Communication, Data Snooping, Reality Check, Data Management, Specification Search, Statistics, Runtime Verification, Predictive Analytics, Knowledge Discovery, Data Privacy, Computer Science, Privacy Leakage, Data Security, Cryptography
Data snooping occurs when a given set of data is used more than once for purposes of inference or model selection. When such data reuse occurs, there is always the possibility that any satisfactory results obtained may simply be due to chance rather than to any merit inherent in the method yielding the results. This problem is practically unavoidable in the analysis of time-series data, as typically only a single history measuring a given phenomenon of interest is available for analysis. It is widely acknowledged by empirical researchers that data snooping is a dangerous practice to be avoided, but in fact it is endemic. The main problem has been a lack of sufficiently simple practical methods capable of assessing the potential dangers of data snooping in a given situation. Our purpose here is to provide such methods by specifying a straightforward procedure for testing the null hypothesis that the best model encountered in a specification search has no predictive superiority over a given benchmark model. This permits data snooping to be undertaken with some degree of confidence that one will not mistake results that could have been generated by chance for genuinely good results.
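The test described above compares the best of many models against a benchmark while accounting for the full specification search. As a minimal sketch of the idea, the following Python function computes a bootstrap p-value for the null that no model in the search beats the benchmark. Note the assumptions: the function name and interface are hypothetical, the input is a matrix of per-period performance differentials (benchmark loss minus model loss), and a simple moving-block bootstrap stands in for the stationary bootstrap used in the paper.

```python
import numpy as np

def reality_check_pvalue(diffs, n_boot=1000, block=10, seed=0):
    """Bootstrap p-value in the spirit of the Reality Check (sketch only).

    diffs : (T, k) array; diffs[t, j] is the period-t performance
            differential of model j over the benchmark (positive means
            model j did better that period).
    Returns (statistic, p_value).
    """
    rng = np.random.default_rng(seed)
    T, k = diffs.shape
    fbar = diffs.mean(axis=0)            # average differential per model
    stat = np.sqrt(T) * fbar.max()       # test statistic: best model's scaled mean

    n_blocks = int(np.ceil(T / block))
    boot_stats = np.empty(n_boot)
    for b in range(n_boot):
        # moving-block bootstrap: resample blocks of consecutive periods
        starts = rng.integers(0, T - block + 1, size=n_blocks)
        idx = np.concatenate([np.arange(s, s + block) for s in starts])[:T]
        fbar_b = diffs[idx].mean(axis=0)
        # recenter around the sample mean so the bootstrap imposes the null
        boot_stats[b] = np.sqrt(T) * (fbar_b - fbar).max()

    return stat, float((boot_stats >= stat).mean())
```

If the differentials are pure noise (no model genuinely beats the benchmark), large p-values are typical even though the single best model's in-sample average looks favorable; that gap is precisely the data-snooping effect the test corrects for.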