Evaluating Phonetically Weighted and Unweighted Distance Measures in Dialectometry

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

Abstract

This paper compares phonetically weighted and unweighted string distance measures in dialectometry, examining how explicit phonetic modeling affects the quantitative representation of linguistic similarity. Using narrow IPA transcriptions from the German REDE corpus, we evaluate nine measures (Levenshtein distance, bigram and trigram overlap, cosine distance, Jaro-Winkler, Jaccard similarity, the Herrgen-Schmidt measure, and the Relative Identity Value) through correlational analysis, distributional comparison, stabilization testing, and multidimensional scaling. The phonetically weighted Herrgen-Schmidt measure consistently achieves the most balanced distance dispersion, the earliest stabilization, and the highest linguistic plausibility. Unweighted edit-based measures reproduce the same topological structure in compressed form, whereas distributional and overlap-based metrics introduce systematic scale distortions through exaggeration or compression. These findings establish explicit phonetic weighting as a principled and analytically efficient extension of standard dialectometric procedures: it enhances resolution and interpretive precision without altering the underlying relational geometry of dialect classifications.
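
As a minimal illustration of the contrast between unweighted and phonetically weighted edit distances, the following Python sketch computes a Levenshtein-style distance over sequences of segments with a pluggable substitution cost. It is a sketch under stated assumptions, not the Herrgen-Schmidt measure or any procedure used in this paper: the feature table `TOY_FEATURES`, the function names, and the example segment sequences are hypothetical and chosen only to show how graded substitution costs can separate pairs that an unweighted measure treats as equally distant.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical toy feature table (place, manner, voicing). Illustrative only;
# this is NOT the feature system of the Herrgen-Schmidt measure.
TOY_FEATURES: Dict[str, Tuple[str, str, str]] = {
    "p": ("bilabial", "plosive", "voiceless"),
    "b": ("bilabial", "plosive", "voiced"),
    "t": ("alveolar", "plosive", "voiceless"),
    "d": ("alveolar", "plosive", "voiced"),
    "s": ("alveolar", "fricative", "voiceless"),
}


def unit_cost(x: str, y: str) -> float:
    """Unweighted substitution cost: 0 for identical segments, 1 otherwise."""
    return 0.0 if x == y else 1.0


def toy_feature_cost(x: str, y: str) -> float:
    """Assumed weighting: share of differing features in TOY_FEATURES.

    Segments missing from the table fall back to the unweighted cost.
    """
    if x == y:
        return 0.0
    if x in TOY_FEATURES and y in TOY_FEATURES:
        fx, fy = TOY_FEATURES[x], TOY_FEATURES[y]
        differing = sum(1 for a, b in zip(fx, fy) if a != b)
        return differing / len(fx)
    return 1.0


def edit_distance(
    a: Sequence[str],
    b: Sequence[str],
    sub_cost: Callable[[str, str], float] = unit_cost,
    indel_cost: float = 1.0,
) -> float:
    """Dynamic-programming edit distance over segment sequences."""
    # prev[j] holds the cost of aligning the processed prefix of a with b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + indel_cost,          # delete x
                    curr[j - 1] + indel_cost,      # insert y
                    prev[j - 1] + sub_cost(x, y),  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


# Hypothetical segment sequences, not taken from the REDE corpus.
form_b = ["b", "a", "t"]
form_p = ["p", "a", "t"]
form_s = ["s", "a", "t"]

unweighted_bp = edit_distance(form_b, form_p)
unweighted_bs = edit_distance(form_b, form_s)
weighted_bp = edit_distance(form_b, form_p, toy_feature_cost)
weighted_bs = edit_distance(form_b, form_s, toy_feature_cost)
```

Under the unit cost, both pairs receive the same distance, because each differs by a single substitution. Under the toy feature cost, the pair differing only in voicing is assigned a smaller distance than the pair differing in place, manner, and voicing, which is the kind of added resolution that phonetic weighting is meant to provide.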