HP-BERT: A Fine-Tuned BERT Model for Detecting Hinduphobia and Sentiment Analysis

Abstract

HP-BERT is a BERT-based language model fine-tuned to detect Hinduphobic content on Twitter. The model was developed using the "Hinduphobic COVID-19 X (Twitter) Dataset", which contains over 8,000 tweets collected during the COVID-19 pandemic (November 2019 to December 2022). The dataset includes 2,000 manually labeled tweets, with additional annotations generated using the GPT-3.5 Turbo API. HP-BERT employs a multi-stage fine-tuning strategy that includes additional training on the SenWave dataset to strengthen its sentiment analysis capabilities. The model is further adapted to Hinglish (Hindi-English) data, making it well suited to Indian social media content. HP-BERT is designed to identify Hinduphobic discourse, analyze sentiment polarity, and characterize the emotional tone and context of online discussions. Its applications include detecting toxic language, understanding user behavior, and studying the spread of Hinduphobia during and after the COVID-19 pandemic. HP-BERT has been tested on multiple datasets, including the Global COVID-19 Twitter dataset, which covers six countries (Australia, Brazil, India, Indonesia, Japan, and the United Kingdom). The model performs robustly in detecting Hinduphobia and abusive language, and it contributes to the study of social media dynamics and hate speech detection. HP-BERT is publicly available to support further research in sentiment analysis, hate speech detection, and computational social science.