EthosAI@CHiPSAL2026: Hate and Sentiment Understanding in Low-Resource Memes Using a Multimodal Approach

Proceedings of the Second Workshop on Challenges in Processing South Asian Languages (CHiPSAL2026)

Abstract

Memes have become a popular way for people to share opinions and emotions on social media, but they are also often used to spread hate and negative sentiment. In this paper, we present our multimodal approach to the CHiPSAL 2026 shared task on multimodal hate and sentiment detection in Nepali memes, which includes two subtasks: hate detection and sentiment analysis. Since memes usually combine text and images, we first experimented with different unimodal models for text and images separately. After identifying the two best-performing text and image models, we combined them using different fusion techniques. The results show that multimodal models outperform unimodal ones, highlighting that both textual and visual information are important for understanding the context of memes. On the test dataset, the multimodal model that combines sentence-transformers/LaBSE for text and ResNet-18 for images using the weighted fusion technique achieved a macro F1 score of 0.6614 for Subtask A, and the model that combines sentence-transformers/LaBSE for text and DeiT-Base for images using the simple fusion technique achieved a macro F1 score of 0.4839 for Subtask B.
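The abstract names two fusion strategies and the macro F1 metric without defining them here. As a rough illustration only, the sketch below assumes that "simple fusion" means concatenating the text and image feature vectors and that "weighted fusion" means a convex combination of two same-length vectors; both readings are assumptions, and the function names and the `alpha` parameter are hypothetical. The macro F1 function follows the standard definition (unweighted mean of per-class F1).

```python
from typing import List, Sequence


def simple_fusion(text_vec: Sequence[float], image_vec: Sequence[float]) -> List[float]:
    """Concatenate text and image features (assumed meaning of 'simple fusion')."""
    return list(text_vec) + list(image_vec)


def weighted_fusion(
    text_vec: Sequence[float], image_vec: Sequence[float], alpha: float = 0.5
) -> List[float]:
    """Convex combination of two same-length vectors (assumed 'weighted fusion').

    alpha is a hypothetical weight on the text modality; 1 - alpha goes to the image.
    """
    if len(text_vec) != len(image_vec):
        raise ValueError("weighted fusion needs vectors of equal length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * t + (1.0 - alpha) * v for t, v in zip(text_vec, image_vec)]


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support and no predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    text_features = [1.0, 0.0]   # toy stand-ins for projected text embeddings
    image_features = [0.0, 1.0]  # toy stand-ins for projected image embeddings
    print(simple_fusion(text_features, image_features))
    print(weighted_fusion(text_features, image_features, alpha=0.75))
    print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```

In practice the two encoders produce vectors of different sizes, so a weighted combination of this kind would first require projecting both to a common dimension; concatenation has no such requirement. Because macro F1 averages per-class scores without weighting by class frequency, it penalizes poor performance on minority classes.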