Dialectometry and Evaluation of the ePark Corpus for Low-Resource Formosan Language Dialects
Formosan languages are a critically endangered branch of the Austronesian family spoken in Taiwan, and many of their dialects remain poorly understood and computationally understudied. Subgrouping relationships in these languages are often contested and unresolved. We provide the first evaluation of the ePark corpus as a dialectal NLP resource, identifying its strengths and gaps for future NLP work, and present the first large-scale corpus-based computational analysis of dialect similarity across all officially recognized Formosan languages. We use the ePark corpus to analyze 42 dialects in 16 Formosan languages. Using word-level TF-IDF cosine similarity, Jaccard similarity over shared vocabulary, and Levenshtein distance, we quantify pairwise dialectal relationships within the Amis, Atayal, Seediq, Bunun, Paiwan, Rukai, and Puyuma languages. We find that simple lexical similarity methods can recover and confirm linguistically established dialectal subgroupings. We also find that in multiple cases these metrics diverge, offering insight into contested subgroupings such as Mantauran Rukai. This work establishes a scalable methodological framework for dialectometry in low-resource languages, demonstrates the value of the ePark corpus for Formosan NLP research, and encourages future work in NLP on Formosan dialects.
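The three similarity measures named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the smoothed IDF variant, and the toy token lists are all assumptions made for illustration, and the tokens are invented placeholders rather than ePark data.

```python
import math
from collections import Counter


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity between the vocabularies (sets of word types)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def levenshtein(s, t):
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def tfidf_vectors(docs):
    """Word-level TF-IDF vectors for a dict {dialect name: token list}.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, one common variant.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    idf = {w: math.log((1 + n) / (1 + c)) + 1 for w, c in df.items()}
    return {
        name: {w: tf * idf[w] for w, tf in Counter(tokens).items()}
        for name, tokens in docs.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data: invented placeholder tokens, not taken from ePark.
docs = {
    "dialect_a": ["tama", "ina", "lima", "mata"],
    "dialect_b": ["tama", "ina", "rima", "mata"],
    "dialect_c": ["ama", "tina", "lima", "maca"],
}
vectors = tfidf_vectors(docs)
sim_ab = cosine(vectors["dialect_a"], vectors["dialect_b"])
sim_ac = cosine(vectors["dialect_a"], vectors["dialect_c"])
jac_ab = jaccard(docs["dialect_a"], docs["dialect_b"])
dist_lima_rima = levenshtein("lima", "rima")
```

In this toy setup, `dialect_a` and `dialect_b` share most of their vocabulary, so both the cosine and Jaccard scores rank them closer than `dialect_a` and `dialect_c`. Levenshtein distance works at the character level, which is one reason it can disagree with the vocabulary-overlap measures: near-cognates such as the placeholder pair `lima` and `rima` count as entirely different words for Jaccard but are only one edit apart.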