Publication | Closed Access
Text-Guided Neural Network Training for Image Recognition in Natural Scenes and Medicine
Citations: 40
References: 61
Year: 2019
Convolutional Neural Network, Engineering, Machine Learning, Neural Network Training, Natural Language Processing, Multimodal LLM, Image Classification, Image Analysis, Semantic Information, Data Science, Visual Grounding, Pattern Recognition, Text Recognition, Visual Question Answering, Natural Scenes, Radiology, Health Sciences, Image Recognition, Machine Vision, Medical Imaging, Visual Diagnosis, Vision Language Model, Computer Science, Deep Learning, Medical Image Computing, Computer Vision, Biomedical Imaging, Convolutional Neural Networks, Computer-aided Diagnosis, Medical Image Analysis
Convolutional neural networks (CNNs) are widely recognized as the foundation of machine vision systems. Conventionally, CNNs are taught to understand images using training images with human-annotated labels and no additional instructions. In this article, we take a new direction and explore guidance from text for neural network training. We present two versions of attention mechanisms that facilitate interactions between visual and semantic information and encourage CNNs to distill visual features effectively by leveraging semantic features. In contrast to dedicated text-image joint embedding methods, our method realizes asynchronous training and inference behavior: a trained model can classify images irrespective of text availability. This characteristic substantially improves the model's scalability to multiple (multimodal) vision tasks. We also apply the proposed method to medical imaging, where it learns from richer clinical knowledge and achieves attention-based interpretable decision-making. With comprehensive validation on two natural and two medical datasets, we demonstrate that our method can effectively make use of semantic knowledge to improve CNN performance. Our method achieves substantial improvement on medical image datasets. Meanwhile, it achieves promising performance for multi-label image classification and caption-image retrieval, as well as excellent performance for phrase-based and multi-object localization on public benchmarks.
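The abstract does not specify the two attention mechanisms, so the following is only a minimal sketch of the general idea: a text embedding scores image regions, and the model still produces a pooled feature when no text is supplied. The function names, the scaled dot-product scoring, and the uniform fallback used when text is absent are all assumptions made for illustration, not the authors' design.

```python
import math
from typing import List, Optional

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(regions: List[Vector], text: Optional[Vector]) -> Vector:
    """Weight each image region by its similarity to a text embedding.

    Assumption: when no text is available, fall back to uniform weights
    (plain average pooling), so classification still works without text.
    """
    if text is None:
        return [1.0 / len(regions)] * len(regions)
    scale = math.sqrt(len(text))  # scaled dot-product scoring (assumed)
    scores = [sum(r * t for r, t in zip(region, text)) / scale
              for region in regions]
    return softmax(scores)


def pool(regions: List[Vector], text: Optional[Vector] = None) -> Vector:
    """Attention-weighted sum of region features."""
    weights = attention_weights(regions, text)
    dim = len(regions[0])
    return [sum(w * region[d] for w, region in zip(weights, regions))
            for d in range(dim)]


# Toy data: three 2-D region features and one 2-D text embedding.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [2.0, 0.0]

guided = attention_weights(regions, text)    # text available
unguided = attention_weights(regions, None)  # text missing
pooled_without_text = pool(regions)
```

In this toy setup the text embedding points along the first feature axis, so the regions that are active on that axis receive more weight than the one that is not, while the text-free path treats all regions equally.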