Team Oryu@CHiPSAL 2026: Integrating Text and Vision Transformers for Multimodal Hate Speech Detection in Memes

Proceedings of the Second Workshop on Challenges in Processing South Asian Languages (CHiPSAL2026)

Abstract

With the proliferation of multimodal content on social media platforms, automated hate speech detection has become a significant challenge, especially in meme-based communication, where meaning arises from the interaction between text and image. In such cases, unimodal techniques cannot capture the full semantics of a post. To address this, we propose and implement a late-fusion multimodal hate speech detection framework for the CHiPSAL shared task. The framework uses XLM-RoBERTa for multilingual text representation and a Vision Transformer (ViT) for visual representation. The two representations are fused by a fully connected classification head and used for binary hate speech detection. The findings suggest that combining the two modalities effectively captures the features of each and improves hate speech detection: the system obtains a Macro F1-score of 0.66 and ranks 5th on the leaderboard. Transformer-based multimodal fusion therefore serves as a reliable baseline for hate speech detection in low-resource, multilingual, meme-based communication scenarios.
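The late-fusion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract states only that the two representations are fused by a fully connected classification head, so the concatenation of the embeddings, the single hidden layer with ReLU, the sigmoid output, the 0.5 threshold, and all names and dimensions below are assumptions made for illustration. Random vectors stand in for the XLM-RoBERTa and ViT embeddings, and the weights are untrained.

```python
import math
import random


def linear(x, weights, bias):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in x]


def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases; a trained model would learn these."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def late_fusion_predict(text_emb, image_emb, hidden, out):
    """Concatenate both embeddings (assumed fusion), then score with the head."""
    fused = list(text_emb) + list(image_emb)
    h = relu(linear(fused, *hidden))
    logit = linear(h, *out)[0]
    return sigmoid(logit)


rng = random.Random(0)
TEXT_DIM, IMAGE_DIM, HIDDEN_DIM = 8, 8, 4  # toy sizes chosen for illustration

hidden = init_layer(TEXT_DIM + IMAGE_DIM, HIDDEN_DIM, rng)
out = init_layer(HIDDEN_DIM, 1, rng)

# Placeholders for the pooled text and image encoder outputs (hypothetical).
text_emb = [rng.gauss(0, 1) for _ in range(TEXT_DIM)]
image_emb = [rng.gauss(0, 1) for _ in range(IMAGE_DIM)]

prob = late_fusion_predict(text_emb, image_emb, hidden, out)
label = int(prob >= 0.5)  # 1 = hate, 0 = non-hate (assumed label convention)
print(f"probability={prob:.3f} label={label}")
```

Because each encoder produces its embedding independently and the modalities meet only in the classification head, either encoder could be swapped out without changing the other, which is the usual motivation for late fusion.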