CCL: Cross-modal Correlation Learning With Multigrained Fusion by Hierarchical Network

Abstract

Cross-modal retrieval has become a highlighted research topic for retrieval across multimedia data such as image and text. A two-stage learning framework is widely adopted by most existing methods based on deep neural network (DNN): <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">The first learning stage is to generate separate representation for each modality and <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">the second learning stage is to get the cross-modal common representation. However the existing methods have three limitations: 1) In <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">the first learning stage they only model intramodality correlation but ignore intermodality correlation with rich complementary context. 2) In <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">the second learning stage they only adopt shallow networks with single-loss regularization but ignore the intrinsic relevance of intramodality and intermodality correlation. 3) Only original instances are considered while the complementary fine-grained clues provided by their patches are ignored. For addressing the above problems this paper proposes a cross-modal correlation learning (CCL) approach with multigrained fusion by hierarchical network and the contributions are as follows: 1) In <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">the first learning stage CCL exploits multilevel association with joint optimization to preserve the complementary context from intramodality and intermodality correlation simultaneously. 2) In <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">the second learning stage a multitask learning strategy is designed to adaptively balance the intramodality semantic category constraints and intermodality pairwise similarity constraints. 3) CCL adopts multigrained modeling which fuses the coarse-grained instances and fine-grained patches to make cross-modal correlation more precise. Comparing with 13 state-of-the-art methods on 6 widely-used cross-modal datasets the experimental results show our CCL approach achieves the best performance.

References

Page 1

	Year	Citations

Page 1