Publication | Open Access
Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research
Citations: 56
References: 0
Year: 2021
Benchmark datasets play a central role in the organization of machine learning research. They coordinate researchers around shared research problems and serve as a measure of progress towards shared goals. Despite the foundational role of benchmarking practices in this field, relatively little attention has been paid to the dynamics of benchmark dataset use and reuse, within or across machine learning subcommunities. In this paper, we dig into these dynamics. We study how dataset usage patterns differ across machine learning subcommunities and across time from 2015-2020. We find increasing concentration on fewer and fewer datasets within task communities, significant adoption of datasets from other tasks, and concentration across the field on datasets that have been introduced by researchers situated within a small number of elite institutions. Our results have implications for scientific evaluation, AI ethics, and equity/access within the field.