
LLM-Assisted Spanish Dialect Corpus Construction

Proceedings of the SIGUL 2026 Joint Workshop with ELE, EURALI, and DCLRL "Towards Inclusivity and Equality: Language Resources and Technologies for Under-Resourced and Endangered Languages"

DOI:10.63317/3zod9jyperib

Abstract

This study presents a multi-dialect, pragmatically annotated Spanish corpus designed to address persistent gaps in the representation of regional varieties and communicative functions in existing linguistic and NLP resources. The corpus focuses on Spanish dialects spoken in the Americas, selecting one representative dialect per country and incorporating a single neutral Castilian variety for comparative purposes. Dialects are organized into five regional groups: Mexican, Central American, Caribbean, South American, and Rioplatense Spanish. Corpus development follows a multi-stage workflow in which a seed lexicon, composed of openly licensed material from sources such as Wikipedia and Project Gutenberg together with curated random and synthetic data, is used to initiate LLM-based text generation. Each base sentence is expanded into dialect-specific variants and annotated with pragmatic and domain labels, producing a fully parallel dataset that supports cross-dialect comparison. A multi-stage correction pipeline combining automated scripts, controlled LLM-based editing, and manual review ensures syntactic well-formedness and dialectal authenticity while eliminating language-switching and hallucination errors. The final version of the corpus covers 20 dialects and contains 40,000 annotated sentences, released in both JSON and plain-text formats for use in a wide range of NLP tasks.

Details

Paper ID
lrec2026-ws-sigul-16
Pages
pp. 153-159
BibKey
ramirezvidal-etal-2026-llm
Editors
Atul Kr. Ojha, Sakriani Sakti, Claudia Soria, Maite Melero, John P. McCrae, Constantine Lignos, Chao-Hong Liu, German Rigau Claramunt, Georg Rehm
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the SIGUL 2026 Joint Workshop with ELE, EURALI, and DCLRL "Towards Inclusivity and Equality: Language Resources and Technologies for Under-Resourced and Endangered Languages"
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Jessica Claribel RAMIREZ VIDAL

  • Hiroki Ouchi

  • Sakriani Sakti