Video Sentiment Analysis with Bimodal Information-augmented Multi-Head Attention

Abstract

Humans express feelings and emotions through different channels. Language, for example, conveys different sentiments under different visual-acoustic contexts. To understand human intentions precisely and to reduce the misunderstandings caused by ambiguity and sarcasm, we should consider multimodal signals, including textual, visual and acoustic signals. The crucial challenge is to fuse the features of different modalities for sentiment analysis. To fuse the information carried by different modalities effectively and better predict sentiment, we design a novel multi-head attention based fusion network, inspired by the observation that the interactions between any two modalities differ and do not contribute equally to the final sentiment prediction. By assigning reasonable attention to the acoustic-visual, acoustic-textual and visual-textual features and exploiting a residual structure, we attend to the significant features. We conduct extensive experiments on four public multimodal datasets, one in Chinese and three in English. The results show that our approach outperforms existing methods and can explain the contributions of bimodal interactions among multiple modalities.
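
The fusion idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `bimodal_attention_fusion`, the randomly initialised projection matrices, the head count and the feature dimension are all assumptions made for illustration. The sketch assumes the three bimodal features (acoustic-visual, acoustic-textual, visual-textual) have already been computed as vectors of equal length, applies standard scaled dot-product multi-head attention across them, and adds the attended output back to the input as a residual connection.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bimodal_attention_fusion(av, at, vt, num_heads=2, seed=0):
    """Hypothetical fusion of three bimodal feature vectors.

    av, at, vt: 1-D arrays of equal length d (assumed precomputed
    acoustic-visual, acoustic-textual and visual-textual features).
    Returns the fused features, shape (3, d), and the attention
    weights, shape (num_heads, 3, 3).
    """
    rng = np.random.default_rng(seed)
    x = np.stack([av, at, vt])  # (3, d): one row per bimodal feature
    n, d = x.shape
    if d % num_heads != 0:
        raise ValueError("feature size must be divisible by num_heads")
    dh = d // num_heads

    # Random projections stand in for learned weights (assumption).
    wq, wk, wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

    def split_heads(m):
        # (3, d) -> (num_heads, 3, dh)
        return m.reshape(n, num_heads, dh).transpose(1, 0, 2)

    q, k, v = split_heads(x @ wq), split_heads(x @ wk), split_heads(x @ wv)

    # Scaled dot-product attention over the three bimodal features.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)  # (num_heads, 3, 3)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, 3, dh)

    # Concatenate the heads and add the residual connection.
    attended = heads.transpose(1, 0, 2).reshape(n, d)
    fused = x + attended
    return fused, weights
```

In a sketch like this, each row of `weights[h]` shows how strongly one bimodal feature attends to the other two under head `h`, which is the kind of quantity one could inspect to compare the contributions of the bimodal interactions.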
