LREC-COLING 2024 (Main Conference)

Evaluating the Quality of a Corpus Annotation Scheme Using Pretrained Language Models

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/2zy52xp7m2v5

Abstract

Pretrained language models and large language models are increasingly used to assist in a great variety of natural language tasks. In this work, we explore their use in evaluating the quality of alternative corpus annotation schemes. For this purpose, we analyze two alternative annotations of the Turkish BOUN treebank, versions 2.8 and 2.11, in the Universal Dependencies framework using large language models. Given a suitable prompt generated from the treebank annotations, the large language models are asked to recover the surface forms of the sentences. Based on the idea that large language models capture the characteristics of languages, we expect the better annotation scheme to yield sentences that are recovered with higher success. The experiments conducted on a subset of the treebank show that the new annotation scheme (2.11) results in a recovery success rate about 2 percentage points higher. All the code developed for this work is available at https://github.com/boun-tabi/eval-ud.
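The evaluation idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline or prompt: a prompt is built from a CoNLL-U annotation, an LLM is (hypothetically) asked to reconstruct the surface sentence, and the recovery is scored against the gold sentence. The helper names, the prompt wording, and the simple token-overlap score are assumptions for illustration only.

```python
# Sketch of the evaluation idea (hypothetical, not the paper's exact code):
# build a prompt from a CoNLL-U annotation, have an LLM recover the
# surface sentence, and score the recovery against the gold sentence.

def conllu_to_prompt(conllu: str) -> str:
    """Turn token-level UD annotations into a sentence-recovery prompt."""
    lines = [l for l in conllu.strip().splitlines() if l and not l.startswith("#")]
    tokens = []
    for line in lines:
        cols = line.split("\t")
        # CoNLL-U columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, ...
        tokens.append(f"lemma={cols[2]}, upos={cols[3]}, feats={cols[5]}, deprel={cols[7]}")
    annotation = "\n".join(tokens)
    return ("Reconstruct the original sentence from these Universal Dependencies "
            f"annotations (one token per line):\n{annotation}\nSentence:")

def recovery_score(gold: str, predicted: str) -> float:
    """Fraction of gold tokens recovered in position (a simple proxy metric)."""
    gold_toks, pred_toks = gold.split(), predicted.split()
    matches = sum(g == p for g, p in zip(gold_toks, pred_toks))
    return matches / len(gold_toks)

# Toy two-token Turkish example ("I read the book") in CoNLL-U format.
example = """# text = Kitabı okudum
1\tKitabı\tkitap\tNOUN\t_\tCase=Acc\t2\tobj\t_\t_
2\tokudum\toku\tVERB\t_\tTense=Past\t0\troot\t_\t_"""

prompt = conllu_to_prompt(example)
# In the real setting the prompt is sent to an LLM; here we stub its response.
llm_output = "Kitabı okudum"
print(recovery_score("Kitabı okudum", llm_output))  # 1.0
```

Under this scheme, the annotation version (2.8 vs. 2.11) that leads to higher average recovery scores over the sampled sentences is taken as the better scheme.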

Details

Paper ID
lrec2024-main-0577
Pages
pp. 6504-6514
BibKey
akkurt-etal-2024-evaluating
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20–25 May 2024

Authors

  • Furkan Akkurt
  • Onur Gungor
  • Büşra Marşan
  • Tunga Gungor
  • Balkiz Ozturk Basaran
  • Arzucan Özgür
  • Susan Uskudarli

Links