Large Language Models Are Good Term Extractors: A Systematic Evaluation
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
This paper systematically evaluates modern large language models for automatic term extraction (ATE), examining GPT-5 and Mistral across four domains and three languages using the ACTER corpus. The study compares model sizes, evaluates reasoning-enhanced variants, and tests prompting strategies aligned with human annotation guidelines. Beyond extracting term lists, the models provide term labels, confidence scores, and terminology management remarks. Current large language models achieve F1 scores of 0.36–0.72; while seemingly low, these scores are competitive with supervised approaches and approach the human inter-annotator agreement ceiling of 0.59. Larger models outperform smaller variants, and reasoning-enhanced models show modest further improvements. Qualitative error analysis reveals that the evaluation methodology partly misrepresents model capabilities: many extractions classified as errors are defensible boundary judgements, and apparent hallucinations are predominantly (though not exclusively) valid normalisations. Limitations remain in fine-grained categorisation and in handling overly general expressions. However, the convergence of model scores with each other and with human inter-annotator agreement suggests that, for high-resource languages, basic ATE may no longer be the bottleneck in terminology management pipelines, and that research should shift toward downstream tasks such as definition generation and ontology construction.