Publication | Closed Access
CLIP4Hashing: Unsupervised Deep Hashing for Cross-Modal Video-Text Retrieval
Citations: 32 | References: 26 | Year: 2022 | Venue: Unknown
Keywords: Image Analysis, Information Retrieval, Data Science, Machine Learning, Pattern Recognition, Dynamic Weighting Strategy, Engineering, Text-to-image Retrieval, Computer Science, Video Understanding, Multimedia Search, Deep Learning, Fast Retrieval, Video Retrieval, Perceptual Hashing, Computer Vision, Unsupervised Deep Hashing, Cross-modal Video-text Retrieval
With the ever-increasing multimedia data on the Web, cross-modal video-text retrieval has attracted considerable attention in recent years. Deep cross-modal hashing approaches utilize the Hamming space to achieve fast retrieval. However, most existing algorithms have difficulty seeking or constructing a well-defined joint semantic space. In this paper, an unsupervised deep cross-modal video-text hashing approach (CLIP4Hashing) is proposed, which mitigates the difficulty of bridging different modalities in the Hamming space by building a single hashing net on top of the pre-trained CLIP model. The approach is enhanced by two novel techniques, a dynamic weighting strategy and a min-max hashing layer, which are found to be the main sources of the performance gain. Compared with conventional deep cross-modal hashing algorithms, CLIP4Hashing does not require data-specific hyper-parameters. Through evaluation on three challenging video-text benchmark datasets, we demonstrate that CLIP4Hashing significantly outperforms existing state-of-the-art hashing algorithms. Additionally, with larger bit sizes (e.g., 2048 bits), CLIP4Hashing can even deliver competitive performance compared with results based on non-hashing features.
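The abstract mentions a min-max hashing layer that binarizes continuous features into Hamming-space codes for fast retrieval. The paper's exact formulation is not given here, so the following is only a minimal illustrative sketch, assuming per-vector min-max normalization followed by thresholding; the function names and the 0.5 threshold are hypothetical choices, not the authors' specification.

```python
import numpy as np

def min_max_hash(features: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Illustrative min-max hashing: scale each feature vector to [0, 1]
    with min-max normalization, then binarize at `threshold`.
    (Assumed behavior; the CLIP4Hashing layer may differ in detail.)"""
    f_min = features.min(axis=1, keepdims=True)
    f_max = features.max(axis=1, keepdims=True)
    normalized = (features - f_min) / (f_max - f_min + 1e-8)  # avoid div by zero
    return (normalized >= threshold).astype(np.uint8)

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Retrieval in the Hamming space compares codes by counting differing bits."""
    return int(np.count_nonzero(a != b))

# Example: hash two (hypothetical) embedding vectors and compare their codes.
video_feat = np.array([[0.2, 0.8, 0.5, 0.9]])
text_feat = np.array([[0.1, 0.7, 0.6, 0.8]])
video_code = min_max_hash(video_feat)
text_code = min_max_hash(text_feat)
dist = hamming_distance(video_code[0], text_code[0])
```

Binarized codes make nearest-neighbor search cheap: Hamming distance reduces to XOR-and-popcount, which is why hashing methods trade some accuracy for retrieval speed.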