Publication | Closed Access
Textual Context-Aware Dense Captioning With Diverse Words
Citations: 63
References: 35
Year: 2023
Natural Language Processing, Complex Visual Scenes, Machine Learning, Engineering, Dense Captioning Architecture, Textual Context-aware Dense, Standard Dense, Computational Linguistics, Multi-modal Summarization, Text-to-image Retrieval, Visual Grounding, Vision Language Model, Visual Question Answering, Deep Learning, Computer Vision, Machine Translation, Speech Recognition
Dense captioning generates detailed natural-language descriptions of complex visual scenes. Despite promising progress, existing methods still have two broad limitations: 1) the vast majority of prior work considers only visual contextual cues during captioning and ignores potentially important textual context; 2) current imbalanced learning mechanisms limit the diversity of the vocabulary learned from the dictionary, resulting in low language-learning efficiency. To address these limitations, we propose an end-to-end enhanced dense captioning architecture, the Enhanced Transformer Dense Captioner (ETDC), which obtains textual context from surrounding regions and dynamically diversifies the vocabulary bank during captioning. Concretely, we first propose the Textual Context Module (TCM), which is integrated into each self-attention layer of the Transformer decoder to capture the surrounding textual context. Moreover, we exploit the class information of object context and propose a Dynamic Vocabulary Frequency Histogram (DVFH) re-sampling strategy that balances words of different frequencies during training. The proposed method is evaluated on the standard dense captioning datasets and surpasses state-of-the-art methods in terms of mean Average Precision (mAP).
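As a rough illustration of the general idea behind frequency-aware re-sampling, the sketch below builds a word-frequency histogram over region captions and gives captions that contain rarer words a higher sampling probability. It is not the paper's DVFH strategy, whose details are not given in this abstract; the function names `word_histogram` and `caption_sampling_weights`, the `power` parameter, and the mean-inverse-frequency scoring rule are all assumptions made for illustration.

```python
from collections import Counter


def word_histogram(captions):
    """Count how often each (lowercased, whitespace-split) word occurs."""
    counts = Counter()
    for caption in captions:
        counts.update(caption.lower().split())
    return counts


def caption_sampling_weights(captions, histogram, power=1.0):
    """Return one sampling probability per caption.

    Hypothetical scoring rule (an assumption, not taken from the paper):
    a caption's score is the mean of (1 / word_count) ** power over its
    words, so captions with rarer words are sampled more often.
    """
    scores = []
    for caption in captions:
        words = caption.lower().split()
        if not words:
            scores.append(0.0)  # empty captions are never sampled
            continue
        inv = [(1.0 / histogram[w]) ** power for w in words]
        scores.append(sum(inv) / len(inv))
    total = sum(scores)
    if total == 0:
        # Fall back to uniform sampling when every caption is empty.
        return [1.0 / len(captions)] * len(captions) if captions else []
    return [s / total for s in scores]


captions = ["a red car", "a red car", "a zebra grazing"]
hist = word_histogram(captions)
weights = caption_sampling_weights(captions, hist)
```

In this toy example the caption containing the rare words "zebra" and "grazing" receives a larger weight than the two repeated captions. To make such a scheme dynamic, the histogram could be recomputed from the captions actually drawn so far and the weights refreshed periodically; the resulting list can be passed as the `weights` argument of Python's `random.choices`.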