Visual to Text: Survey of Image and Video Captioning

Abstract

Visual data such as images and videos are now easily accessible, and they play critical roles in many real-world applications such as surveillance. This creates demand for automatic visual understanding and content summarization, and the research community has been working to improve these capabilities. It also poses a major challenge: understanding the semantics of visual content and automatically translating it into human language. A critical issue in developing such automatic translation systems is how to bridge the gap between low-level features and high-level semantic information. The issue becomes even more serious because a large number of videos are captured under unconstrained conditions by nonprofessional users. New techniques are therefore required to address these difficulties and narrow the semantic gap effectively. These considerations motivate us to survey the state-of-the-art techniques in the visual-to-text topic. We systematically discuss existing methods, popular datasets, technical difficulties, and promising future directions. In particular, we classify existing methods by the mechanism they use to link visual information (including both images and videos) with text descriptions, and we emphasize the latest advances in deep-learning-based approaches. Quantitative evaluations of representative approaches on benchmark datasets are also presented and discussed. Finally, we outline promising research directions on this topic.
