Cuet Yet Another Baseline@CHiPSAL LREC 2026: Shared Task on Multimodal Sentiment Understanding in Low-Resource Memes

Proceedings of the Second Workshop on Challenges in Processing South Asian Languages (CHiPSAL2026)

Abstract

Memes are a common way to express feelings such as humor and sarcasm, as well as diverse viewpoints. Identifying sentiment in memes is becoming increasingly complex, particularly in low-resource languages such as Nepali, where memes often combine images, text, and code-mixed language. However, existing multimodal methods for sentiment analysis of Nepali memes appear to be insufficient. In this paper, we present our system for Subtask B (Sentiment Analysis) of the Shared Task on Multimodal Hate and Sentiment Understanding in Low-Resource Memes@CHiPSAL LREC 2026. We implement various unimodal models for text, such as XLM-RoBERTa-large, MuRIL-base, and Twitter-XLM-R. Moreover, we incorporate BLIP-2 captions to enhance visual-text understanding and adopt a multimodal approach that fuses textual embeddings, image embeddings, caption embeddings, and similarity scores. The fused features are processed through cross-attention and a dense neural network for classification, with focal loss and class weighting used to improve performance. Our approach achieved a macro F1 score of 0.50, securing 7th place and highlighting the importance of cross-modal interaction and large-scale pretrained vision-language models for robust sentiment analysis of memes.
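To illustrate how focal loss and class weighting can be combined, the following is a minimal sketch in plain Python and not the implementation used in our system. The function names, the focusing parameter `gamma=2.0`, the inverse-frequency weighting scheme, and the three-class toy labels are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def inverse_frequency_weights(labels, num_classes):
    """Assumed weighting scheme: w_c = N / (num_classes * count_c).

    Rare classes receive larger weights; a class with no examples gets 0.
    """
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    n = len(labels)
    return [n / (num_classes * c) if c else 0.0 for c in counts]


def weighted_focal_loss(logits, target, class_weights, gamma=2.0):
    """Focal loss for one example: -w_t * (1 - p_t)^gamma * log(p_t).

    With gamma = 0 and unit weights this reduces to cross-entropy.
    gamma = 2.0 is a commonly used default, assumed here.
    """
    p_t = softmax(logits)[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # Hypothetical imbalanced label set with three sentiment classes.
    train_labels = [0, 0, 0, 0, 1, 1, 2]
    weights = inverse_frequency_weights(train_labels, num_classes=3)
    logits = [2.5, 0.3, -1.0]  # hypothetical classifier output
    for target in range(3):
        print(target, weighted_focal_loss(logits, target, weights))
```

In this sketch the factor `(1 - p_t) ** gamma` shrinks the contribution of examples the classifier already predicts confidently, while the per-class weight scales up the loss on under-represented classes, so both mechanisms push training toward hard and rare examples.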