Multi-Source Emotion Annotation in Children’s Language: When LLM Consensus Diverges from Human Judgment

Proceedings of Computational Affective Science (CAS) @ LREC 2026

DOI:10.63317/39tk69v6ca4p

Abstract

Automated emotion annotation increasingly relies on inter-LLM agreement as a proxy for label quality. We test this assumption on 2,106 clause-level segments from interviews with French-speaking children (ages 6-11) about parental roles, a setting where affect is often implicit rather than lexically explicit. Using a 500-segment expert gold standard, we show that internal consensus can be seriously misleading: Dawid-Skene, a probabilistic label aggregation method, estimates GPT-5.2 valence accuracy at 90.7%, whereas evaluation against the human gold standard yields 71.0%, revealing substantial overestimation driven by a shared neutralization bias. Conversely, Dawid-Skene underestimates Claude Sonnet 4, reversing the model ranking. Majority Vote, Dawid-Skene, and MACE produce near-identical consensus labels, suggesting that the main source of error lies in shared annotator bias rather than in the aggregation rule itself. We release the expert gold subset and the probabilistic corpus to support future work. Our results show that high inter-LLM agreement cannot replace external human validation for affect annotation.
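
The effect described in the abstract can be illustrated with a minimal sketch. It uses plain majority voting rather than Dawid-Skene, and the labels are hypothetical toy data, not taken from the released corpus. The annotator names, the label set (`neg`, `neu`, `pos`), and both helper functions are assumptions made for illustration. Two of the three annotators share a tendency to label affective segments as neutral, so the consensus inherits that bias, and accuracy measured against the consensus ranks the annotators differently from accuracy measured against gold.

```python
from collections import Counter

def majority_vote(labels_per_item):
    """Return the most frequent label for each item."""
    return [Counter(labels).most_common(1)[0][0] for labels in labels_per_item]

def accuracy(predicted, reference):
    """Fraction of positions where the two label sequences agree."""
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)

# Hypothetical toy data: six segments, three annotators.
gold        = ["neg", "neg", "pos", "neu", "pos", "neu"]
annotator_a = ["neu", "neu", "pos", "neu", "neu", "neu"]  # neutralizes affect
annotator_b = ["neu", "neu", "pos", "neu", "pos", "neu"]  # neutralizes affect
annotator_c = ["neg", "neg", "pos", "neu", "pos", "pos"]  # mostly matches gold

# Consensus label per segment, computed without any access to gold.
consensus = majority_vote(list(zip(annotator_a, annotator_b, annotator_c)))

annotators = {"a": annotator_a, "b": annotator_b, "c": annotator_c}
for name, labels in annotators.items():
    vs_consensus = accuracy(labels, consensus)
    vs_gold = accuracy(labels, gold)
    print(f"annotator {name}: vs consensus {vs_consensus:.2f}, vs gold {vs_gold:.2f}")
```

In this toy setting, annotator b agrees with the consensus on every segment yet is wrong on two gold labels, while annotator c looks weakest against the consensus and strongest against gold. No change of aggregation rule can correct this, because the two biased annotators outvote the third on exactly the segments where the bias applies.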

Details

Paper ID
lrec2026-ws-cas-11
Pages
pp. 125-135
BibKey
said-etal-2026-multi
Editors
Christopher Bagdon, Krishnapriya Vishnubhotla, Kristen A. Lindquist, Lyle Ungar, Roman Klinger, Saif M. Mohammad
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of Computational Affective Science (CAS) @ LREC 2026
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Farida Said

  • Jeanne Villaneau
