Paper Information

lrec2026-ws-dialres-12

Wancho Dialectometry: Community-created data and the Living Dictionaries project

Abstract

Community-created lexical resources for under-documented languages represent an underexplored data source for computational dialectology. This study evaluates the viability of such data for dialectometric analysis, using the Wancho (Glottocode: wanc1238) Living Dictionaries project as a case study. Wancho is a Tibeto-Burman language of the Southwestern Patkaian branch, spoken primarily in Longding District, Arunachal Pradesh, India. The dictionary is notable for being entirely community-built and speaker-facing and, uniquely among resources for Northeast India, it incorporates dialect-specific forms spanning village-level geolects and clanlects. We extract dialectal data via automated web scraping and apply a series of preprocessing steps to address inconsistencies in transcription, language labelling, and concept assignment. Pairwise linguistic distances are then computed using Sound Class Alignment (SCA, List 2010), which captures phonological similarity more accurately than raw edit distance by incorporating articulatory feature structure. The resulting distance matrix is analysed through UPGMA hierarchical clustering and NeighborNet split network inference. Despite the dataset’s uneven dialect coverage and absence of systematic cognate coding, SCA-based distances recover the traditional Upper/Lower/Middle Wancho distinction and correctly situate transitional varieties. These results hold even for dialects with as few as a dozen attested forms. We show that, unlike Bayesian phylogenetic inference, which is poorly suited to data of this density and distribution, SCA proves to be a reliable metric. Our findings suggest that SCA distance is robust to the kinds of noise and sparsity characteristic of community-generated lexical data, and that such resources constitute a viable, if imperfect, input for automated dialectometric workflows, particularly in contexts where fieldwork-based data collection is not currently feasible.
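
The distance-then-cluster step described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions and is not the authors' implementation: normalized Levenshtein distance stands in for SCA (the abstract notes that SCA is more accurate than raw edit distance), the variety labels `V1` to `V4` and all word forms are invented placeholders rather than Wancho data, and the function names are hypothetical. Distances are averaged only over concepts that both varieties attest, which is one simple way to cope with uneven coverage.

```python
from itertools import combinations


def edit_distance(a, b):
    """Plain Levenshtein distance (a stand-in here, NOT the SCA method)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def variety_distance(forms_x, forms_y):
    """Mean length-normalized distance over concepts attested in both varieties."""
    shared = sorted(set(forms_x) & set(forms_y))
    if not shared:
        raise ValueError("no shared concepts to compare")
    total = sum(
        edit_distance(forms_x[c], forms_y[c]) / max(len(forms_x[c]), len(forms_y[c]))
        for c in shared
    )
    return total / len(shared)


def upgma(names, dist):
    """Average-linkage (UPGMA) clustering; returns the tree as nested tuples.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    """
    clusters = [(name, [name]) for name in names]  # (subtree, leaf list)

    def average(pair):
        (_, leaves_a), (_, leaves_b) = clusters[pair[0]], clusters[pair[1]]
        total = sum(dist[frozenset((x, y))] for x in leaves_a for y in leaves_b)
        return total / (len(leaves_a) * len(leaves_b))

    while len(clusters) > 1:
        # Merge the two clusters with the smallest mean leaf-to-leaf distance.
        i, j = min(combinations(range(len(clusters)), 2), key=average)
        (tree_a, leaves_a), (tree_b, leaves_b) = clusters[i], clusters[j]
        merged = ((tree_a, tree_b), leaves_a + leaves_b)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Invented placeholder forms (not real Wancho data); V3 lacks one concept.
toy = {
    "V1": {"water": "tapa", "fire": "miku", "stone": "lono"},
    "V2": {"water": "tapa", "fire": "mika", "stone": "lono"},
    "V3": {"water": "sabo", "fire": "wegu"},
    "V4": {"water": "sabo", "fire": "wegi", "stone": "hari"},
}

dist = {
    frozenset((x, y)): variety_distance(toy[x], toy[y])
    for x, y in combinations(toy, 2)
}
tree = upgma(list(toy), dist)
print(tree)
```

In a real workflow the `edit_distance` call would be replaced by an SCA-based distance, and the resulting matrix could also be passed to a NeighborNet implementation; neither is shown here.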