Wancho Dialectometry: Community-created data and the Living Dictionaries project
Community-created lexical resources for under-documented languages represent an underexplored data source for computational dialectology. This study evaluates the viability of such data for dialectometric analysis, using the Wancho (Glottocode: wanc1238) Living Dictionaries project as a case study. Wancho is a Tibeto-Burman language of the Southwestern Patkaian branch, spoken primarily in Longding District, Arunachal Pradesh, India. The dictionary is notable for being entirely community-built and speaker-facing, and, uniquely among resources for Northeast India, it incorporates dialect-specific forms spanning village-level geolects and clanlects. We extract dialectal data via automated web scraping and apply a series of preprocessing steps to address inconsistencies in transcription, language labelling, and concept assignment. Pairwise linguistic distances are then computed using Sound Class Alignment (SCA; List 2010), which captures phonological similarity more accurately than raw edit distance by incorporating articulatory feature structure. The resulting distance matrix is analysed through UPGMA hierarchical clustering and NeighborNet split network inference. Despite the dataset’s uneven dialect coverage and absence of systematic cognate coding, SCA-based distances recover the traditional Upper/Middle/Lower Wancho distinction and correctly situate transitional varieties. These results hold even for dialects with as few as a dozen attested forms. We show that, unlike Bayesian phylogenetic inference, which is poorly suited to data of this density and distribution, SCA is a reliable metric. Our findings suggest that SCA distance is robust to the kinds of noise and sparsity characteristic of community-generated lexical data, and that such resources constitute a viable, if imperfect, input for automated dialectometric workflows, particularly in contexts where fieldwork-based data collection is not currently feasible.
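The distance-then-cluster workflow described above can be illustrated with a minimal sketch. It is not the study's implementation and does not reproduce the SCA model: as a simplified stand-in, it collapses segments into a small, invented set of sound classes, takes a normalised edit distance over those classes, averages over the concepts two dialects share, and builds a tree with a naive UPGMA. The class map, the dialect labels `A`, `B`, `C`, the concept keys, and all word forms are hypothetical placeholders, not Wancho data.

```python
from itertools import combinations

# Toy sound-class map: an assumption for illustration, NOT the SCA class inventory.
SOUND_CLASSES = {
    "p": "P", "b": "P", "f": "P", "v": "P",
    "t": "T", "d": "T", "k": "K", "g": "K",
    "s": "S", "z": "S", "m": "M", "n": "N",
    "l": "R", "r": "R",
    "a": "V", "e": "V", "i": "V", "o": "V", "u": "V",
}

def to_classes(form):
    """Map each character to its sound class; unknown symbols map to themselves."""
    return [SOUND_CLASSES.get(ch, ch) for ch in form]

def edit_distance(a, b):
    """Plain Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def form_distance(f1, f2):
    """Edit distance over sound classes, normalised by the longer form's length."""
    c1, c2 = to_classes(f1), to_classes(f2)
    longest = max(len(c1), len(c2))
    return edit_distance(c1, c2) / longest if longest else 0.0

def dialect_distance(d1, d2):
    """Mean form distance over concepts attested in both dialects.

    Using only shared concepts is one simple way to cope with uneven coverage;
    returns None when the two dialects share no concept.
    """
    shared = sorted(set(d1) & set(d2))
    if not shared:
        return None
    return sum(form_distance(d1[c], d2[c]) for c in shared) / len(shared)

def upgma(names, dist):
    """Naive UPGMA: repeatedly merge the two clusters with the smallest
    average leaf-to-leaf distance. Returns the tree as nested tuples."""
    clusters = [(name, [name]) for name in names]  # (subtree, member leaves)

    def avg(i, j):
        pairs = [(a, b) for a in clusters[i][1] for b in clusters[j][1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2), key=lambda ij: avg(*ij))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Hypothetical dialects with invented forms and uneven concept coverage.
dialects = {
    "A": {"c1": "pata", "c2": "kuma"},
    "B": {"c1": "bada", "c2": "guma"},
    "C": {"c1": "sili", "c2": "kuma", "c3": "ton"},
}
# Every pair in this toy set shares at least one concept, so no distance is None.
dist = {
    frozenset((x, y)): dialect_distance(dialects[x], dialects[y])
    for x, y in combinations(dialects, 2)
}
tree = upgma(list(dialects), dist)
```

In this toy set, the invented forms `pata` and `bada` differ only in voicing, so they are two edits apart on raw characters but identical once mapped to classes, which is the kind of contrast the abstract draws between raw edit distance and a sound-class-based measure. A real workflow would use an established SCA implementation and a proper segment tokeniser rather than this character-level approximation.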