LREC 2026 workshop

Unigoa@CHiPSAL 2026: Early vs Late Fusion for Multimodal Hate and Sentiment Detection in Nepali Memes

Proceedings of the Second workshop on Challenges in Processing South Asian Languages (CHiPSAL2026)

DOI:10.63317/37ooapnyzhbx

Abstract

Internet memes pose significant challenges for automatic content moderation due to the interaction of visual and textual cues, sarcasm, and cultural context. In this work, we participate in the CHiPSAL 2026 shared task on multimodal hate and sentiment understanding in Nepali memes. The task consists of two subtasks: binary hate speech detection and three-class sentiment classification. We investigate both early-fusion and late-fusion multimodal architectures. Our primary system employs a late-fusion dual-encoder architecture combining XLM-RoBERTa for multilingual text representation and CLIP for visual encoding. We further evaluate an early-fusion ViLT-based joint vision–language transformer using NepBERTa tokenization as a baseline. Experimental results show that late-fusion models consistently outperform early-fusion architectures, particularly for code-mixed memes containing Devanagari Nepali and Roman-script English text. Our best system achieves a Macro-F1 of 0.6564 for hate speech detection and 0.4859 for sentiment classification. We provide analysis highlighting the challenges of multilingual code-mixing, sarcasm, and implicit sentiment in low-resource multimodal settings.
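The late-fusion idea described in the abstract can be illustrated with a minimal sketch. It assumes that each modality is encoded separately into a fixed-size vector and that the two vectors are concatenated before a single linear classification layer; the abstract does not specify the fusion operator or the classifier, so this is one common variant and not necessarily the authors' implementation. The class name `LateFusionHead`, the embedding sizes, and the random weights are hypothetical placeholders, and plain lists stand in for the XLM-RoBERTa and CLIP outputs.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LateFusionHead:
    """Hypothetical linear classifier over concatenated modality embeddings.

    Assumption: fusion is plain concatenation of one text vector and one
    image vector, each produced by its own encoder.
    """

    def __init__(self, text_dim, image_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.text_dim = text_dim
        self.image_dim = image_dim
        in_dim = text_dim + image_dim
        # Untrained placeholder weights; a real system would learn these.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(num_classes)
        ]
        self.bias = [0.0] * num_classes

    def predict_proba(self, text_emb, image_emb):
        if len(text_emb) != self.text_dim or len(image_emb) != self.image_dim:
            raise ValueError("embedding size does not match the head")
        # Late fusion: the modalities meet only here, after separate encoding.
        fused = list(text_emb) + list(image_emb)
        logits = [
            sum(w * x for w, x in zip(row, fused)) + b
            for row, b in zip(self.weights, self.bias)
        ]
        return softmax(logits)


# Stand-ins for encoder outputs (arbitrary sizes chosen for illustration).
text_embedding = [0.2, -0.1, 0.4, 0.0]
image_embedding = [0.3, 0.1, -0.2]

# One head per subtask: binary hate detection, three-class sentiment.
hate_head = LateFusionHead(text_dim=4, image_dim=3, num_classes=2)
sentiment_head = LateFusionHead(text_dim=4, image_dim=3, num_classes=3)

hate_probs = hate_head.predict_proba(text_embedding, image_embedding)
sentiment_probs = sentiment_head.predict_proba(text_embedding, image_embedding)
```

By contrast, an early-fusion model such as ViLT feeds text tokens and image patches into one shared transformer, so the modalities interact from the first layer and no separate per-modality embeddings exist to concatenate.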

Details

Paper ID
lrec2026-ws-chipsal-22
Pages
pp. 229-236
BibKey
fondekar-etal-2026-unigoa
Editors
Kengatharaiyer Sarveswaran, Ashwini Vaidya
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the Second workshop on Challenges in Processing South Asian Languages (CHiPSAL2026)
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Ashweta Fondekar
  • Milind Shivolkar
  • Jyoti Pawar
