LREC 2026 workshop

Corpus-Linguists’ Little Helpers? Evaluating LLMs for Linguistic Annotation: The Case of Sensationalist Headlines Corpus

Proceedings of the SIGUL 2026 Joint Workshop with ELE, EURALI, and DCLRL "Towards Inclusivity and Equality: Language Resources and Technologies for Under-Resourced and Endangered Languages"

DOI:10.63317/3zefdrpowzjr

Abstract

Manual annotation of pragmastylistic features in sensationalist media is a resource-intensive bottleneck for corpus-based research, particularly for lower-resource languages. This paper evaluates whether Large Language Models (LLMs) can reliably automate this process. We benchmark two proprietary models, OpenAI’s GPT-5 and Google’s Gemini 2.5 Pro, on annotating eight sensationalist linguistic and orthographic features within a corpus of 508 Serbian celebrity magazine headlines. Our methodology involves a systematic comparison of five prompting strategies: zero-shot, few-shot (1, 3, and 5 examples), and chain-of-thought. Results demonstrate that LLMs can achieve high alignment with a manually curated gold standard, reaching a peak macro-F1 score of 98.76%. Notably, the most effective and cost-efficient configuration was GPT-5 using a simple zero-shot prompt. Qualitative error analysis reveals that remaining inaccuracies are systematic, primarily involving pragmatic conventions, discourse scope, and quoted speech. We conclude that LLMs are viable for first-pass annotation of well-defined features in Serbian, though implicit and genre-dependent cues require further study. To support reproducibility and future research on underrepresented languages, we provide our full prompting setup, evaluation procedures, and a detailed cost comparison.

Details

Paper ID
lrec2026-ws-sigul-03
Pages
pp. 33-41
BibKey
bago-etal-2026-corpus
Editors
Atul Kr. Ojha, Sakriani Sakti, Claudia Soria, Maite Melero, John P. McCrae, Constantine Lignos, Chao-Hong Liu, German Rigau Claramunt, Georg Rehm
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the SIGUL 2026 Joint Workshop with ELE, EURALI, and DCLRL "Towards Inclusivity and Equality: Language Resources and Technologies for Under-Resourced and Endangered Languages"
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Petra Bago

  • Virna Karlić
