X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning

Abstract

3D dense captioning aims to describe individual objects in 3D scenes in natural language, where the scenes are usually represented as RGB-D scans or point clouds. However, by exploiting only single-modal information, e.g., point clouds, previous approaches fail to produce faithful descriptions. Although aggregating 2D features into point clouds may be beneficial, it introduces an extra computational burden, especially in the inference phase. In this study, we investigate cross-modal knowledge transfer using a Transformer for 3D dense captioning, namely X-Trans2Cap. The proposed X-Trans2Cap effectively boosts the performance of single-modal 3D captioning through knowledge distillation enabled by a teacher-student framework. In practice, during the training phase, the teacher network exploits the auxiliary 2D modality and guides the student network, which takes only point clouds as input, through feature consistency constraints. Owing to the well-designed cross-modal feature fusion module and the feature alignment in the training phase, X-Trans2Cap readily acquires the rich appearance information embedded in 2D images. Thus, a more faithful caption can be generated using only point clouds during inference. Qualitative and quantitative results confirm that X-Trans2Cap outperforms the previous state of the art by a large margin, i.e., about +21 and +16 CIDEr points on the ScanRefer and Nr3D datasets, respectively.
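
The teacher-student training described above can be illustrated with a minimal sketch. The abstract does not specify the form of the feature consistency constraint, so the mean squared error used here, the function names `feature_consistency_loss` and `training_loss`, and the `align_weight` parameter are all assumptions made for illustration, not the authors' implementation. Features are plain Python lists so the example runs without any deep learning library.

```python
from typing import List

Features = List[List[float]]  # one feature vector per detected object (assumed layout)


def feature_consistency_loss(student_feats: Features, teacher_feats: Features) -> float:
    """Mean squared error between student and teacher object features.

    Assumption: MSE stands in for the paper's feature consistency
    constraint, whose exact form the abstract does not give.
    """
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must describe the same objects")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_feats, teacher_feats):
        if len(s_vec) != len(t_vec):
            raise ValueError("feature dimensions must match")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count if count else 0.0


def training_loss(
    student_caption_loss: float,
    teacher_caption_loss: float,
    student_feats: Features,
    teacher_feats: Features,
    align_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Combine both captioning losses with the alignment term.

    The teacher sees 2D and 3D inputs and the student sees only point
    clouds; at inference only the student would be used.
    """
    align = feature_consistency_loss(student_feats, teacher_feats)
    return student_caption_loss + teacher_caption_loss + align_weight * align


if __name__ == "__main__":
    student = [[0.0, 0.0]]
    teacher = [[1.0, 3.0]]
    print(training_loss(2.0, 1.0, student, teacher, align_weight=0.5))
```

In this sketch the alignment term pulls the student's point-cloud features toward the teacher's fused features, which is how the student could pick up appearance cues without requiring 2D input at inference time.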
