Evaluating the Adaptability of Large Language Models to Linguistic Variation
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Large language models (LLMs) are often assumed to generalize easily across linguistic contexts, yet their ability to adapt to genre variation remains underexplored. This study addresses this gap through a French Named Entity Recognition (NER) task conducted on NEM.fr, a multi-genre corpus spanning 11 text types, from juridical and encyclopedic prose to poetry, political speech, and online discourse, and annotated with gold named entities (NEs). We evaluate the reasoning-oriented model DeepSeek R1 across six prompting configurations (zero-, one-, and few-shot, each with and without chain-of-thought reasoning), while keeping the annotation scheme, prompting format, and evaluation pipeline constant to isolate the role of genre. Performance is measured using both strict and fuzzy F1-based metrics. The results show that prompting choices have little effect once the model has learned the task format, whereas genre differences strongly influence outcomes: fuzzy F1 scores range from about 0.85 in formal genres to below 0.20 in informal ones. Even under tightly controlled conditions, LLM behaviour proves highly sensitive to textual regularity and stylistic variation, highlighting genre as a key factor in assessing model robustness.
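To make the distinction between strict and fuzzy F1 concrete, the following is a minimal illustrative sketch. The abstract does not specify the exact fuzzy-matching criterion used in the paper; the code below assumes a partial span overlap with a matching entity type, which is only one common convention.

```python
# Illustrative sketch of strict vs. fuzzy span matching for NER evaluation.
# Assumption: "fuzzy" credits a prediction whose span overlaps a gold span of
# the same type; the paper's actual criterion may differ.

def f1(n_correct, n_pred, n_gold):
    """Micro F1 from counts of correct, predicted, and gold entities."""
    if n_pred == 0 or n_gold == 0:
        return 0.0
    p, r = n_correct / n_pred, n_correct / n_gold
    return 2 * p * r / (p + r) if (p + r) else 0.0

def count_matches(pred, gold, fuzzy=False):
    """Count predicted entities (start, end, type) that match gold entities.

    Strict: span boundaries and type must be identical.
    Fuzzy (assumed): spans overlap and types are identical.
    """
    matched, used = 0, set()
    for ps, pe, pt in pred:
        for i, (gs, ge, gt) in enumerate(gold):
            if i in used or pt != gt:
                continue
            exact = (ps, pe) == (gs, ge)
            overlap = ps < ge and gs < pe
            if exact or (fuzzy and overlap):
                matched += 1
                used.add(i)
                break
    return matched

# Toy example: the second predicted span only partially overlaps the gold span.
gold = [(0, 2, "PER"), (5, 8, "LOC")]
pred = [(0, 2, "PER"), (6, 8, "LOC")]
for mode in (False, True):
    c = count_matches(pred, gold, fuzzy=mode)
    print("fuzzy" if mode else "strict", round(f1(c, len(pred), len(gold)), 2))
# Under these assumptions: strict F1 = 0.5, fuzzy F1 = 1.0
```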