
Quality and Appropriateness of Large Text Datasets for Irish NLP

Proceedings of the SIGUL 2026 Joint Workshop with ELE, EURALI, and DCLRL "Towards Inclusivity and Equality: Language Resources and Technologies for Under-Resourced and Endangered Languages"

DOI:10.63317/3sxe9j64u492

Abstract

The value of high-quality datasets for training essential language tools has long been recognised for NLP research. Despite the importance of such datasets, most language data available for training consists of large, automatically curated corpora, often scraped from web content. The quality of such datasets is often an unknown factor. This presents a problem for already low-resourced languages (such as Irish), as existing datasets may not provide adequate, representative language data for training effective models. This paper examines existing monolingual and parallel Irish text corpora to evaluate the quality of the language data, through manual review, automatic metrics, and LLMs as judges.

Details

Paper ID
lrec2026-ws-sigul-14
Pages
pp. 126-142
BibKey
walsh-etal-2026-quality
Editors
Atul Kr. Ojha, Sakriani Sakti, Claudia Soria, Maite Melero, John P. McCrae, Constantine Lignos, Chao-Hong Liu, German Rigau Claramunt, Georg Rehm
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the SIGUL 2026 Joint Workshop with ELE, EURALI, and DCLRL "Towards Inclusivity and Equality: Language Resources and Technologies for Under-Resourced and Endangered Languages"
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Abigail Walsh
  • Mark Andrade
  • Jane Lauren Adkins
  • Ornait O'Connell
  • Éanna O'Connor
  • Ellen Rushe
  • Brian Davis
