HasNat@CHiPSAL 2026: Multimodal Hate Speech Detection in Low-Resource Nepali Memes Using Aligned Vision–Language Models
Proceedings of the Second Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2026)
Abstract
Memes are widely used for communication on social media but are increasingly exploited to spread hate and harmful stereotypes. Detecting hate speech in memes is particularly challenging because meaning is conveyed jointly through images and embedded text, and the problem becomes more complex in low-resource languages such as Nepali. In this work, we participate in Subtask A of the CHiPSAL 2026 Shared Task, which focuses on hate speech detection in Nepali-only memes. We benchmark three multimodal vision–language backbones, ViT-B-32 (OpenCLIP), AltCLIP, and BLIP2+mT5, under controlled preprocessing and augmentation settings. Our best-performing system uses AltCLIP to extract aligned text and image representations, followed by a late-fusion classifier trained with stratified 5-fold cross-validation to address class imbalance. The proposed model achieves a macro F1-score of 0.66 on the validation set. Experimental results highlight the effectiveness of aligned vision–language representations and show that the effects of preprocessing and augmentation strategies are model-dependent in low-resource multimodal hate speech detection.
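As an illustration of the evaluation protocol named above, the following is a minimal sketch of stratified fold assignment, macro F1 scoring, and one simple late-fusion rule. It is not the system's implementation: the function names (`stratified_folds`, `macro_f1`, `late_fusion`), the weighted-average fusion rule, and the weight `w_text` are assumptions made for illustration, and the per-modality probabilities are assumed to come from classifiers trained on precomputed text and image embeddings.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds so each fold keeps similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so the minority class counts equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def late_fusion(p_text, p_image, w_text=0.5):
    """Hypothetical fusion rule: weighted average of per-modality hate probabilities."""
    return [w_text * pt + (1.0 - w_text) * pi for pt, pi in zip(p_text, p_image)]


if __name__ == "__main__":
    # Toy imbalanced label set: 80 non-hate (0) and 20 hate (1) samples.
    toy_labels = [0] * 80 + [1] * 20
    for fold in stratified_folds(toy_labels, k=5):
        print(len(fold), sum(toy_labels[i] for i in fold))
```

Macro F1 averages the per-class scores without weighting by class frequency, which is why it is a common choice when the hate class is much smaller than the non-hate class.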