LLMs as Annotators: Evaluating Model–Human Alignment in Detecting Contentious Language in Historical Corpora
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Historical texts often contain terminology that reflects outdated or harmful social values. Identifying such contentious terms is essential for the Galleries, Libraries, Archives, and Museums (GLAM) community, but manual annotation requires cultural expertise and is difficult to scale. This study evaluates whether large language models (LLMs) can support this process by aligning with human judgments of contentiousness in historical Dutch text. Using the Dutch Contentious Contexts Corpus (ConConCor), we formalize the task as context-dependent binary classification and compare two LLMs across multiple prompt configurations and evaluation scenarios. The models achieve near-human agreement on explicit cases but diverge when contextual or historical reasoning is required. Analysis of disagreement patterns shows that LLMs capture overtly harmful expressions yet tend to over-predict contentiousness for identity-related and colonial terms and to under-predict it for semantically shifted or figurative uses. These findings suggest that LLMs can serve as auxiliary annotators for sensitive-language detection in historical materials, provided that human oversight and contextual interpretation remain central to annotation workflows.