
Evaluating Data Augmentation Strategies for Training Spanish Misspelling Detection Models

Proceedings of the Third Workshop on Computation and Written Language (CAWL 2026) @ LREC 2026

DOI:10.63317/3mw3y4ovzwsz

Abstract

This paper evaluates three data augmentation strategies for training misspelling detection models in Spanish. Using the Spanish CORRSIC corpus of naturally occurring misspellings, we compare three misspelling generation methods: random perturbations, keyboard-based errors, and a statistical model derived from empirical edit patterns encoded as weighted finite-state transducers. We also analyze two word selection strategies (random and length-based) and two augmentation configurations designed to balance data diversity and reduce spurious correlations. The statistical model produces the misspellings most similar to real data, with the lowest Jensen–Shannon divergence (0.148 nats) from the empirical distribution. In downstream detection experiments, performance improves with training size, and differences between word selection strategies remain minimal. Overall, the results highlight the value of statistically grounded misspelling generation for realistic and effective data augmentation in Spanish spell-checking tasks.
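As a rough illustration of the similarity measure named in the abstract, the sketch below computes the Jensen–Shannon divergence in nats between two distributions of edit operations. The abstract does not say how the distributions were built, so the edit labels, the counts, and the function name js_divergence are hypothetical, and this is not the authors' implementation.

```python
import math
from collections import Counter


def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence between two count tables.

    Natural logarithms are used, so the result is in nats and lies
    between 0 (identical distributions) and ln 2 (disjoint support).
    """
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    jsd = 0.0
    for key in set(p_counts) | set(q_counts):
        p = p_counts.get(key, 0) / p_total
        q = q_counts.get(key, 0) / q_total
        m = 0.5 * (p + q)  # mixture distribution
        # Terms with zero probability contribute nothing to the sum.
        if p > 0:
            jsd += 0.5 * p * math.log(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log(q / m)
    return jsd


# Hypothetical edit-operation counts (made up for illustration only).
empirical = Counter({"sub:b>v": 40, "del:h": 30, "sub:s>c": 20, "ins:e": 10})
generated = Counter({"sub:b>v": 25, "del:h": 25, "sub:s>c": 25, "ins:e": 25})

print(f"JSD = {js_divergence(empirical, generated):.3f} nats")
```

A lower value means the generated misspellings follow an edit distribution closer to the empirical one, which is the sense in which the abstract ranks the three generation methods.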

Details

Paper ID
lrec2026-ws-cawl-07
Pages
pp. 71-78
BibKey
castillosancho-etal-2026-evaluating
Editors
Kyle Gorman
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the Third Workshop on Computation and Written Language (CAWL 2026) @ LREC 2026
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Manuel Castillo-Sancho

  • Jordi Porta

  • Asunción Gómez-Pérez
