Confusable Characters as Endangered Language Markers: The Case of North Caucasus Writing Systems
The Abkhaz-Adyghe and Nakh-Daghestanian language families encompass 35 living languages whose highly complex phonology gives them arguably the most complex modern Cyrillic orthographies. Online data in these languages displays idiosyncratic patterns, the most prevalent of which is the use of confusable characters in input methods. This work studies one such character, the letter palochka, which is shared by most of the writing systems in question. We investigate whether patterns that include variants of this character can act as markers of these languages in large-scale web-crawled data. We use GlotLID, a wide-coverage off-the-shelf language identification (LID) model, to label paragraph-level web text that contains a palochka confusable, and we estimate the effect of confusable character normalization on the quality of GlotLID’s predictions in 14 supported North Caucasian languages. According to GlotLID, normalization significantly increases recall (the discovery of new language data) for some languages while degrading it for others. However, manual evaluation reveals that, because of GlotLID prediction errors, only 41% of the resulting wins and 46% of the losses are accurate. We argue that, although these patterns provide useful signals, higher-precision LID approaches tailored to these long-tail languages are needed to improve the quality of mined data.
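To make the idea of confusable character normalization concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes an illustrative confusable set (Latin `I`, Latin `l`, digit `1`, the vertical bar, and Cyrillic `І`, U+0406) and a simple heuristic: a confusable is rewritten to palochka (U+04C0) only when it is directly adjacent to a character in the basic Cyrillic block. The names `CONFUSABLES` and `normalize_palochka` are hypothetical, and the paper's actual character list and rules may differ.

```python
import re

PALOCHKA = "\u04C0"  # CYRILLIC LETTER PALOCHKA

# Assumed, illustrative set of look-alikes; not taken from the paper.
CONFUSABLES = "Il1|\u0406"

# Basic Cyrillic block, used to decide whether a look-alike sits in Cyrillic context.
_CYR = "\u0400-\u04FF"
_CLASS = re.escape(CONFUSABLES)

# Match a confusable preceded by a Cyrillic character, or followed by one.
_PATTERN = re.compile(rf"(?<=[{_CYR}])[{_CLASS}]|[{_CLASS}](?=[{_CYR}])")


def normalize_palochka(text: str) -> str:
    """Replace palochka look-alikes that touch a Cyrillic character with U+04C0."""
    return _PATTERN.sub(PALOCHKA, text)


if __name__ == "__main__":
    # Words typed with Latin "I" or digit "1" in place of palochka.
    for sample in ["хIума", "г1а", "Internet", "2021"]:
        print(repr(sample), "->", repr(normalize_palochka(sample)))
```

The adjacency heuristic leaves purely Latin words and standalone numbers untouched, but it can still misfire, for example on a digit written directly next to a Cyrillic suffix, so a real pipeline would likely need language-specific rules.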