Unigoa@CHiPSAL 2026: Early vs Late Fusion for Multimodal Hate and Sentiment Detection in Nepali Memes

Proceedings of the Second Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2026)

Abstract

Internet memes pose significant challenges for automatic content moderation because their meaning depends on the interaction of visual and textual cues, on sarcasm, and on cultural context. This paper describes our participation in the CHiPSAL 2026 shared task on multimodal hate and sentiment understanding in Nepali memes. The task consists of two subtasks: binary hate speech detection and three-class sentiment classification. We investigate both early-fusion and late-fusion multimodal architectures. Our primary system is a late-fusion dual-encoder architecture that combines XLM-RoBERTa for multilingual text representation with CLIP for visual encoding. As a baseline, we also evaluate an early-fusion, ViLT-based joint vision–language transformer that uses NepBERTa tokenization. Experimental results show that the late-fusion models consistently outperform the early-fusion architectures, particularly on code-mixed memes containing Devanagari Nepali and Roman-script English text. Our best system achieves a Macro-F1 of 0.6564 for hate speech detection and 0.4859 for sentiment classification. We also analyze the challenges that multilingual code-mixing, sarcasm, and implicit sentiment pose in low-resource multimodal settings.
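
To make the late-fusion idea concrete, the following is a minimal sketch rather than our released implementation. It assumes that each modality has already been encoded into a fixed-length vector (toy lists stand in for XLM-RoBERTa and CLIP outputs) and that fusion is plain concatenation followed by a single linear classification head; the function names, the concatenation strategy, and the toy weights are all illustrative assumptions. An early-fusion model such as ViLT would instead feed text tokens and image patches jointly into one transformer.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion_predict(
    text_emb: Sequence[float],
    image_emb: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Hypothetical late-fusion head: concatenate, project, normalize.

    text_emb / image_emb are stand-ins for the outputs of two separate,
    independently run encoders. weights has one row per class.
    """
    # Late fusion: the modalities meet only after each has been encoded.
    fused = list(text_emb) + list(image_emb)
    for row in weights:
        if len(row) != len(fused):
            raise ValueError("each weight row must match the fused dimension")
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy example: 2-dim "text" vector, 1-dim "image" vector, 2 classes.
toy_text = [0.5, -1.0]
toy_image = [2.0]
toy_weights = [[1.0, 0.0, 0.0],   # class 0 looks only at the text feature
               [0.0, 0.0, 1.0]]   # class 1 looks only at the image feature
toy_bias = [0.0, 0.0]
probs = late_fusion_predict(toy_text, toy_image, toy_weights, toy_bias)
```

In this toy setting the image feature dominates, so the second class receives the higher probability. One consequence of this design is that each encoder can use the tokenizer and pretraining best suited to its modality, which may matter for code-mixed text.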