Exploring Deep Multimodal Fusion of Text and Photo for Hate Speech Classification

TLDR

Social network interactions are generally positive, yet users can encounter hate speech, bullying, and verbal abuse, and most platforms have explicit policies against hate speech because it fosters intimidation and exclusion and may in some cases promote real-world violence. This paper investigates whether deep multimodal techniques can automatically detect hate speech, extending prior research that mostly relies on the text signal alone. The authors present several fusion approaches that integrate text and photo signals. Augmenting text with image embeddings immediately improves performance, and adding attention-based fusion yields further gains.

Abstract

Interactions among users on social network platforms are usually positive, constructive, and insightful. However, people are sometimes exposed to objectionable content such as hate speech, bullying, and verbal abuse. Most social platforms have explicit policies against hate speech because it creates an environment of intimidation and exclusion, and in some cases may promote real-world violence. Because users' interactions on today's social networks involve multiple modalities, such as text, images, and videos, in this paper we explore the challenge of automatically identifying hate speech with deep multimodal technologies, extending previous research, which mostly focuses on the text signal alone. We present a number of fusion approaches to integrate text and photo signals. We show that augmenting text with image embedding information immediately leads to a boost in performance, while applying additional attention fusion methods brings further improvement.
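
The two kinds of fusion named in the abstract can be illustrated with a minimal sketch. It is not the authors' architecture, which the abstract does not specify. It assumes that each post already has a fixed-length text embedding and photo embedding, that "augmenting text with image embedding" can be approximated by simple concatenation, and that attention fusion can be approximated by scaled dot-product weights over the two modalities. The function names and the `query` vector are hypothetical, introduced only for illustration.

```python
import math


def concat_fusion(text_emb, image_emb):
    """Early fusion (assumed form): join both embeddings into one vector."""
    return list(text_emb) + list(image_emb)


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fusion(text_emb, image_emb, query):
    """Attention fusion (assumed form): weight each modality by its
    scaled dot-product score against a hypothetical query vector."""
    if not (len(text_emb) == len(image_emb) == len(query)):
        raise ValueError("embeddings and query must share one dimension")
    scale = math.sqrt(len(query))
    scores = [
        sum(q * x for q, x in zip(query, emb)) / scale
        for emb in (text_emb, image_emb)
    ]
    weights = softmax(scores)
    # Weighted sum of the two modality vectors, element by element.
    fused = [weights[0] * t + weights[1] * i
             for t, i in zip(text_emb, image_emb)]
    return fused, weights
```

In a trained model the query and the embeddings would be learned jointly with the classifier. Here they are plain lists so that the arithmetic stays visible.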
