Register Sensitivity in Scalar MT Evaluation: Evidence from Spanish–Basque Informal Discourse
Proceedings of the SIGUL 2026 Joint Workshop with ELE, EURALI, and DCLRL "Towards Inclusivity and Equality: Language Resources and Technologies for Under-Resourced and Endangered Languages"
Abstract
Automatic scalar metrics are widely used for machine translation (MT) evaluation, yet their behavior under sociolinguistic variation remains underexplored, particularly in under-resourced and minority-language contexts. We present a small, controlled empirical analysis of reference-based evaluation in Spanish–Basque informal discourse. Register is operationalized as indexical density, capturing dialectal forms, informal lexicon, code-switching, and orthographic stylization. Across two MT systems and prompting conditions, sentence-level scores from chrF++, COMET-DA, and XCOMET-XL show a consistent negative association with indexical density under the original informal reference. In a reference-perturbation design that holds MT outputs constant while replacing the informal reference with a standardized Batua version, scores increase systematically, particularly for high-density items, and the density–score association weakens. These results provide controlled evidence that evaluation outcomes in this setting depend in part on the register of the reference translation. In minority-language and informal domains, reference design choices may therefore influence how translation quality is measured and interpreted.
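The reference-perturbation comparison described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the abstract does not say which association measure is used, so Spearman rank correlation is assumed here; the function names (`rank`, `spearman`, `perturbation_summary`) are hypothetical; and the toy numbers are invented for illustration and are not results from the study. Sentence-level metric scores are taken as given inputs rather than computed.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = rank(x), rank(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def perturbation_summary(density, scores_informal, scores_standard):
    """Compare one fixed set of MT outputs scored against two references.

    density         -- indexical density per item (assumed precomputed)
    scores_informal -- scores against the original informal reference
    scores_standard -- scores against the standardized reference
    """
    deltas = [s - i for i, s in zip(scores_informal, scores_standard)]
    return {
        "rho_informal": spearman(density, scores_informal),
        "rho_standard": spearman(density, scores_standard),
        "mean_delta": mean(deltas),
    }


# Invented toy values, for illustration only (not data from the paper).
density = [0, 1, 2, 3, 4]
scores_informal = [0.80, 0.70, 0.60, 0.50, 0.40]
scores_standard = [0.80, 0.72, 0.75, 0.70, 0.74]

summary = perturbation_summary(density, scores_informal, scores_standard)
print(summary)
```

Because the MT outputs are identical in both conditions, any difference between the two correlations, or any positive mean score change, is attributable to the reference alone, which is the point of the design.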