Publication | Closed Access
Not All Pixels Are Matched
Citations: 69 | References: 33 | Year: 2022
Engineering, Machine Learning, Metric Learning, Image Analysis, Data Science, Pattern Recognition, Video Transformer, Machine Vision, Feature Learning, Demosaicing, Computer Science, Video Understanding, Human Image Synthesis, Deep Learning, Computer Vision, Visible-infrared Person Re-identification, Image Processor, Scene Understanding, Part Aware Parsing, Image Resolution
Visible-Infrared Person Re-Identification (VI-ReID) has become an emerging task for night-time surveillance systems. To reduce the cross-modality discrepancy, previous works either align features via metric learning or synthesize cross-modality images with Generative Adversarial Networks. However, feature-level alignment ignores the heterogeneous data itself, while generative frameworks suffer from low generation quality, limiting their applications. In this paper, we propose a dense contrastive learning framework (DCLNet), which performs pixel-to-pixel dense alignment on intermediate representations rather than the final deep feature. At its core is a new loss function that pulls views of positive pixels with the same semantic information closer in the shallow representation space, while pushing views of negative pixels apart. It naturally provides additional dense supervision and captures fine-grained pixel correspondence, reducing the modality gap from a new perspective. To implement it, a Part Aware Parsing (PAP) module and a Semantic Rectification Module (SRM) are introduced to learn and refine a semantic-guided mask, allowing us to efficiently find positive pairs while requiring only instance-level supervision. Extensive experiments on the public SYSU-MM01 and RegDB datasets demonstrate the superiority of our pipeline over state-of-the-art methods. Code is available at https://github.com/sunhz0117/DCLNet.
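The dense pixel-to-pixel contrastive objective described in the abstract can be sketched as follows. This is a minimal NumPy illustration of a pixel-level InfoNCE-style loss, assuming L2-normalized pixel features from the two modalities and integer part labels produced by a semantic parsing mask; the function name, shapes, and simplifications are hypothetical and this is not the authors' implementation:

```python
import numpy as np

def dense_contrastive_loss(feat_vis, feat_ir, mask_vis, mask_ir, tau=0.1):
    """Hypothetical sketch of a dense cross-modality contrastive loss.

    feat_vis, feat_ir : (N, C) L2-normalized pixel features (N = H*W pixels)
                        from the visible and infrared branches.
    mask_vis, mask_ir : (N,) integer semantic part labels; pixels sharing a
                        label across modalities are treated as positive pairs.
    tau               : softmax temperature.
    """
    # Cosine similarity between every cross-modality pixel pair.
    sim = feat_vis @ feat_ir.T / tau                         # (N, N)
    # Positive pairs: same semantic part label across modalities.
    pos = mask_vis[:, None] == mask_ir[None, :]              # (N, N) bool
    # Row-wise log-softmax: raise positives, suppress negatives.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Average negative log-likelihood over each pixel's positive set.
    n_pos = np.maximum(pos.sum(axis=1), 1)
    per_pixel = -(log_prob * pos).sum(axis=1) / n_pos
    return per_pixel.mean()
```

Pixels with no positive match contribute zero here; the actual DCLNet loss and its interaction with the PAP and SRM modules are defined in the paper and the linked repository.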