Human-Centered Multimodal Fusion for Sexism Detection in Memes with Eye-Tracking, Heart Rate, and EEG Signals

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

Abstract

The automated detection of sexism in memes is challenging because of multimodal ambiguity, cultural nuance, and the use of humor to provide plausible deniability. As a result, content-only models often fail to capture the complexity of human perception. To address this limitation, we introduce and validate a human-centered paradigm that augments standard content features with physiological data. We created a novel resource by recording Eye-Tracking (ET), Heart Rate (HR), and Electroencephalography (EEG) signals from 16 subjects (8 per experiment) while they viewed 3,984 memes from the EXIST 2025 dataset. Our statistical analysis reveals significant physiological differences in how subjects process sexist versus non-sexist content. Sexist memes were associated with higher cognitive load (evidenced by increased fixation counts and longer reaction times) and with differences in EEG spectral power across the Alpha, Beta, and Gamma frequency bands. This pattern, commonly linked in previous research to increased attentional engagement and cognitive effort during visual processing, suggests that sexist memes may elicit more demanding neural activity than non-sexist ones. Building on these findings, we propose a multimodal fusion model that integrates these physiological signals with enriched textual-visual features derived from a Vision-Language Model (VLM). Our final model achieves an AUC of 0.794 in binary sexism detection, a statistically significant 3.4% improvement over a strong VLM-based baseline. The fusion of physiological data proves particularly effective for nuanced and ambiguous cases, raising the F1-score for the most challenging fine-grained category, *Misogyny and Non-Sexual Violence*, by 26.3%. Our work demonstrates that human physiological responses provide a robust, objective signal of perception that can significantly enhance the accuracy and human-awareness of automated systems for countering online sexism.
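
To make the evaluation setting concrete, the following is a minimal sketch of how a content-based score and a physiology-based score for each meme could be combined and then scored with AUC. It is not the fusion model proposed in this paper, whose architecture is not described in the abstract: the weighted-average late fusion, the function names `late_fusion` and `auc`, the `weight` parameter, and all numeric values are assumptions chosen purely for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive item receives the
    higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def late_fusion(content_scores, physio_scores, weight=0.5):
    """Hypothetical late fusion: a weighted average of two per-meme scores.
    `weight` is the share given to the physiological score (an assumption,
    not a value taken from the paper)."""
    return [(1.0 - weight) * c + weight * p
            for c, p in zip(content_scores, physio_scores)]


if __name__ == "__main__":
    # Toy values invented for illustration only; they are not results
    # from the dataset described above.
    labels = [1, 1, 1, 0, 0, 0]               # 1 = sexist, 0 = non-sexist
    content = [0.9, 0.4, 0.6, 0.5, 0.3, 0.7]  # e.g. a content-only score
    physio = [0.7, 0.8, 0.5, 0.2, 0.4, 0.3]   # e.g. a physiology-only score
    fused = late_fusion(content, physio)
    print("content-only AUC:", auc(labels, content))
    print("fused AUC:", auc(labels, fused))
```

The pairwise definition of AUC is used here because it needs no external libraries; for realistic dataset sizes, a library implementation would be the more practical choice.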