Large-Scale Datasets for Going Deeper in Image Understanding

TLDR

Computer vision and machine learning have advanced by exploiting big data, yet progress remains constrained by the limited volume, versatility, and diversity of existing datasets. The study addresses four computer vision tasks: human-centered scene classification, attribute-based zero-shot learning, human keypoint detection, and image Chinese captioning. The authors introduce four large-scale datasets, one per task, annotated as appropriate with labels, bounding boxes, attributes, keypoints, or captions, to bridge the semantic gap between low-level images and high-level concepts. Baseline experiments show that the tasks remain challenging on the new datasets.

Abstract

Recently, extensive efforts have been devoted to computer vision and machine learning, exploiting big data to explore many practical applications. However, these research fields are still quite limited, not only by the sheer volume but also by the versatility and diversity of the available datasets. In this paper, we target four challenging yet important computer vision tasks, namely human-centered scene classification, attribute-based zero-shot learning (recognition), human keypoint detection, and image Chinese captioning. Four novel large-scale datasets are collected and annotated to facilitate these tasks of deeper image understanding. Labels, bounding boxes, attributes, keypoints, and captions are annotated in the corresponding datasets. These rich annotations bridge the semantic gap between low-level images and high-level concepts. Baseline methods have been implemented and compared in extensive experiments, which show that these learning tasks remain challenging on our datasets.
