A Baseline Investigation: Transformer-based Cross-view Baseline for Text-based Person Search

Abstract

This paper investigates a transformer-based baseline for text-based person search. Existing methods usually treat visual and textual features as independent entities in order to speed up inference. However, the attention paid to a given image should vary with the text it is matched against. We therefore adopt a commonly used framework with a fused feature as the baseline, which overcomes the misalignment problem introduced by fixed features, and investigate it thoroughly. Moreover, we propose Cross-View Matching (CVM) to provide challenging positive text-image pairs that enable the model to learn cross-view meta-information. Furthermore, we suggest a novel evaluation process that reduces inference time and GPU memory demand. Experiments are conducted on the CUHK-PEDES, ICFG-PEDES, and RSTPReid benchmarks. Through extensive parameter analysis, the potential of a transformer-based framework is fully explored. Although the proposed scheme is a simple framework, it achieves significant performance improvements over other state-of-the-art methods.
