HCP: A Flexible CNN Framework for Multi-Label Image Classification

TLDR

CNNs perform strongly on single-label image classification, but multi-label images remain challenging because of complex object layouts and limited multi-label training data. This work introduces Hypotheses-CNN-Pooling (HCP), a flexible deep CNN framework that takes an arbitrary number of object segment hypotheses as input, passes each through a shared CNN, and aggregates the outputs with max pooling to produce multi-label predictions. HCP requires no ground-truth bounding-box annotations for training, is robust to noisy or redundant hypotheses, needs no explicit hypothesis labels, can be pre-trained on a large single-label dataset such as ImageNet, and naturally outputs multi-label predictions. Experiments on Pascal VOC2007 and VOC2012 show HCP outperforming prior state-of-the-art methods; on VOC2012, HCP reaches 84.2% mAP alone and 90.3% after fusion with a complementary method based on hand-crafted features, more than 7% above the previous state of the art.

Abstract

Convolutional Neural Networks (CNNs) have demonstrated promising performance in single-label image classification tasks. However, how CNNs best cope with multi-label images remains an open problem, mainly due to the complex underlying object layouts and insufficient multi-label training images. In this work, we propose a flexible deep CNN infrastructure, called Hypotheses-CNN-Pooling (HCP), in which an arbitrary number of object segment hypotheses are taken as inputs, a shared CNN is connected with each hypothesis, and the CNN outputs from the different hypotheses are aggregated with max pooling to produce the final multi-label predictions. Unique characteristics of this flexible deep CNN infrastructure include: 1) no ground-truth bounding box information is required for training; 2) the whole HCP infrastructure is robust to possibly noisy and/or redundant hypotheses; 3) no explicit hypothesis label is required; 4) the shared CNN may be pre-trained with a large-scale single-label image dataset, e.g. ImageNet; and 5) it naturally outputs multi-label prediction results. Experimental results on the Pascal VOC2007 and VOC2012 multi-label image datasets demonstrate the superiority of the proposed HCP infrastructure over other state-of-the-art methods. In particular, on the VOC2012 dataset the mAP reaches 84.2% with HCP alone and 90.3% after fusion with our complementary result in [47] based on hand-crafted features, which outperforms the state of the art by a large margin of more than 7%.
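
The aggregation step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the "shared CNN" here is a hypothetical stand-in (a single random linear layer followed by a softmax), the hypotheses are random arrays assumed to be already resized to a common shape, and the names `make_shared_cnn` and `hcp_predict` are invented for illustration. Only the structure follows the text: one shared model scores every hypothesis, and the per-hypothesis score vectors are fused by a label-wise max.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def make_shared_cnn(in_dim, num_labels, seed=0):
    """Hypothetical stand-in for the shared CNN: one linear layer + softmax.

    A real system would use a deep CNN (e.g. pre-trained on ImageNet);
    this placeholder only provides a fixed mapping from a hypothesis to
    a per-label score vector.
    """
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(in_dim, num_labels))
    bias = np.zeros(num_labels)

    def shared_cnn(hypothesis):
        # The same weights are applied to every hypothesis ("shared").
        return softmax(hypothesis.reshape(-1) @ weights + bias)

    return shared_cnn


def hcp_predict(hypotheses, shared_cnn):
    """Score each hypothesis with the shared model, then max-pool per label."""
    scores = np.stack([shared_cnn(h) for h in hypotheses])  # (num_hyp, num_labels)
    return scores.max(axis=0)  # (num_labels,)


# Toy usage: 5 hypotheses of shape 8x8x3 and 20 labels (assumed sizes).
rng = np.random.default_rng(1)
hypotheses = [rng.normal(size=(8, 8, 3)) for _ in range(5)]
cnn = make_shared_cnn(in_dim=8 * 8 * 3, num_labels=20)
image_scores = hcp_predict(hypotheses, cnn)
```

Because the max is taken independently for each label, the image-level score for a label is driven by whichever hypothesis responds most strongly to it. Adding duplicate hypotheses or reordering them leaves the result unchanged, which is one way to read the robustness to redundant hypotheses claimed above.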
