Digilians at NakbaVirality Shared Task: Bidirectional Cross-Attention for Multimodal Virality Prediction
Proceedings of the 2nd International Workshop on Nakba Narratives as Language Resources @ LREC 2026
Abstract
The NakbaVirality shared task focuses on multimodal virality prediction using a dataset of 2,600 multilingual posts collected from X and Reddit. In this work, we propose a multimodal architecture that combines XLM-RoBERTa for text encoding with a Vision Transformer (ViT) for image representation. The extracted features are aligned through bidirectional cross-attention, in which text features attend to image features and vice versa, to capture interactions between the two modalities. To address the class imbalance in the dataset, we apply focal loss, class weighting, and targeted data augmentation for the minority class. We also use layer-wise learning rate scheduling to stabilize fine-tuning of the pretrained encoders. The proposed system achieves an accuracy of 0.6009 on the hidden test set, ranking 4th among 29 participating teams (107 total submissions). These results suggest that cross-modal attention mechanisms are effective for modeling multimodal signals in high-stakes discourse.
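
To make the class-imbalance handling concrete, the following is a minimal sketch of a class-weighted focal loss in plain Python. It is an illustration only: the function name `focal_loss`, the focusing parameter `gamma=2.0`, and the example class weights are assumptions, since the abstract does not state the values used, and the actual system presumably computes the loss on tensors inside a deep learning framework.

```python
import math


def focal_loss(logits, target, class_weights, gamma=2.0):
    """Class-weighted focal loss for a single example (illustrative sketch).

    logits        -- list of raw class scores
    target        -- index of the true class
    class_weights -- per-class weights (assumed values, e.g. inverse frequency)
    gamma         -- focusing parameter; 2.0 is an assumed default
    """
    # Numerically stable log-softmax for the true class.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    log_p_t = logits[target] - log_norm
    p_t = math.exp(log_p_t)

    # (1 - p_t)^gamma down-weights examples the model already classifies well,
    # while class_weights[target] up-weights the minority class.
    return -class_weights[target] * (1.0 - p_t) ** gamma * log_p_t


def mean_focal_loss(batch_logits, targets, class_weights, gamma=2.0):
    """Average the per-example focal loss over a batch."""
    losses = [
        focal_loss(logits, target, class_weights, gamma)
        for logits, target in zip(batch_logits, targets)
    ]
    return sum(losses) / len(losses)
```

With `gamma=0.0` and unit class weights, the expression reduces to ordinary cross-entropy. Larger `gamma` values shift the training signal toward hard, misclassified examples, which is the reason focal loss is commonly paired with class weighting on imbalanced data.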