linus@CHiPSAL 2026: Multimodal Hate Speech and Sentiment Detection in Low-Resource Memes Using Late-Fusion Hybrid Architecture
Proceedings of the Second Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2026)
Abstract
The growing volume of memes shared on social media poses serious challenges for automated moderation, especially in low-resource and code-mixed languages such as Nepali. In this paper, we present our system for the CHiPSAL 2026 Shared Task on Multimodal Hate and Sentiment Understanding in Low-Resource Memes. We propose a late-fusion hybrid architecture that combines OpenAI’s CLIP vision encoder (ViT-B/32) with a domain-specific Nepali language model (NepBERTa) to capture both visual and linguistic information. To address data scarcity, we introduce a cross-task label mapping and data augmentation strategy that shares examples between the hate speech and sentiment datasets. With controlled hyperparameter settings and class-balanced loss optimization, our framework achieved a Macro F1 score of 0.8052 on Subtask A (Hate Speech Detection) and 0.6881 on Subtask B (Sentiment Analysis) in the official CodaBench evaluation, demonstrating the effectiveness of the proposed multimodal approach.
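For readers unfamiliar with the reported metric, the following is a minimal sketch of how Macro F1 is conventionally computed: the per-class F1 scores are averaged with equal weight, so minority classes count as much as majority ones. The function name `macro_f1` and the toy labels are illustrative assumptions; this is not the official CodaBench scoring script.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = set(y_true) | set(y_pred)
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the true class but was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, made-up labels (0 = non-hate, 1 = hate), for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
score = macro_f1(y_true, y_pred)
```

Because each class contributes equally to the average, a model that ignores a rare class is penalized heavily, which is why the metric suits imbalanced hate speech and sentiment data.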