Dialectometry and Evaluation of the ePark Corpus for Low-Resource Formosan Language Dialects

Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective

Abstract

Formosan languages are a critically endangered branch of the Austronesian family spoken in Taiwan. Many of their dialects remain poorly understood and computationally understudied, and their subgrouping relationships are often contested. We provide the first evaluation of the ePark corpus as a dialectal NLP resource, identifying its strengths and gaps for future work, and present the first large-scale, corpus-based computational analysis of dialect similarity across all officially recognized Formosan languages. Using the ePark corpus, we analyze 42 dialects across 16 Formosan languages. We quantify pairwise dialectal relationships within the Amis, Atayal, Seediq, Bunun, Paiwan, Rukai, and Puyuma languages using word-level TF-IDF cosine similarity, Jaccard similarity over shared vocabulary, and Levenshtein distance. We find that these simple lexical similarity methods recover linguistically established dialectal subgroupings. In several cases the metrics diverge, offering insight into contested subgroupings such as that of Mantauran Rukai. This work establishes a scalable methodological framework for dialectometry in low-resource languages, demonstrates the value of the ePark corpus for Formosan NLP research, and encourages future NLP work on Formosan dialects.
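As an illustration of the three lexical measures named in the abstract, the following is a minimal Python sketch rather than the authors' implementation. The function names, the smoothed IDF formula (the form scikit-learn uses by default), and the toy token lists are all assumptions made for the example; the tokens are invented placeholders, not ePark data or real Formosan words.

```python
import math
from collections import Counter


def jaccard(vocab_a, vocab_b):
    """Jaccard similarity between two vocabularies (sets of word types)."""
    a, b = set(vocab_a), set(vocab_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty vocabularies are identical
    return len(a & b) / len(a | b)


def levenshtein(s, t):
    """Edit distance between two strings, keeping only one previous row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def tfidf_vectors(docs):
    """Word-level TF-IDF per dialect, treating each dialect as one document.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    """
    n = len(docs)
    df = Counter(w for tokens in docs.values() for w in set(tokens))
    idf = {w: math.log((1 + n) / (1 + c)) + 1 for w, c in df.items()}
    return {
        name: {w: tf * idf[w] for w, tf in Counter(tokens).items()}
        for name, tokens in docs.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical token lists for three unnamed dialects (placeholders only).
toy_dialects = {
    "dialect_a": ["abcd", "efgh", "abcd", "ijkl"],
    "dialect_b": ["abcd", "efgi", "ijkl"],
    "dialect_c": ["mnop", "qrst"],
}

vectors = tfidf_vectors(toy_dialects)
tfidf_ab = cosine(vectors["dialect_a"], vectors["dialect_b"])
jaccard_ab = jaccard(toy_dialects["dialect_a"], toy_dialects["dialect_b"])
edit_ab = levenshtein("efgh", "efgi")
```

Computing the three scores side by side for each dialect pair is what makes divergence visible: TF-IDF cosine is weighted by token frequency, Jaccard looks only at shared word types, and Levenshtein distance captures form-level differences between individual words.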