Mining Naturally Romanized Seed Corpora without Romanizations

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

Abstract

While the Latin script is used informally by speakers of many languages with different native scripts, high-quality Latin-script corpora for such languages that reflect natural romanizations are scarce and often difficult to collect. In this work, we propose a method for mining romanized corpora in languages for which we have no pre-existing samples of naturally romanized text, focusing on Tigrinya as a test case. First, we examine the efficacy of learning romanizations for a language from observed romanizations in other languages that use the same native script. We then extrinsically assess such methods by using a romanization model trained on Amharic data to bootstrap coverage of romanized Tigrinya in a language identification system. Manual evaluation by three Tigrinya speakers (two L1 and one L2) suggests that our method extracts romanized Tigrinya text with acceptably high precision.