
LexiPhon: A Collection of Phonetically Transcribed Lexicons from Wikipedia

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/2ju5wvz6x3mw

Abstract

We introduce LexiPhon, an open-source dataset of phonetically transcribed lexicons for 87 languages, derived from Wikipedia data with automated grapheme-to-phoneme (G2P) transcription, together with the open-source software used to create it. Each lexicon provides transcriptions generated by up to three G2P methods, crowdsourced transcriptions from WikiPron (Lee et al., 2020) where available, word frequencies calculated from Wikipedia, word lengths, and phonological neighborhood densities. Because manual validation is not possible, we introduce an internal validation metric based on phonological feature edit distance to ensure that transcriptions are consistent within languages. This dataset fills a gap in the existing space of phonetic lexicons: it covers many more words per language than existing multilingual word lists, and more languages than existing lexicon datasets. The dataset and the software used to create it are freely available on OSF at https://osf.io/rd9ma/overview?view_only=398802df19ad488ab7da7e7798cd7aca.
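
The abstract does not state how phonological neighborhood density is computed. A common definition counts, for each word, the other words in the lexicon whose transcription differs by exactly one phoneme substitution, insertion, or deletion. The Python sketch below assumes that definition. The function names and the toy lexicon are hypothetical illustrations and are not taken from the LexiPhon software.

```python
def one_edit_apart(a, b):
    """Return True if phoneme sequences a and b differ by exactly one
    substitution, insertion, or deletion (identical sequences return False)."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) sequence
    # Skip the shared prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Substitution: everything after the mismatch must agree.
        return a[i + 1:] == b[i + 1:]
    # Insertion/deletion: dropping b[i] must make the remainders agree.
    return a[i:] == b[i + 1:]


def neighborhood_density(lexicon):
    """Map each word to its number of one-edit neighbors in the lexicon.

    lexicon: dict from orthographic word to a tuple of phoneme symbols.
    This is a quadratic-time sketch, suitable only for small lexicons.
    """
    items = list(lexicon.items())
    return {
        word: sum(one_edit_apart(phones, other) for w, other in items if w != word)
        for word, phones in items
    }


# Hypothetical toy lexicon (assumed transcriptions, for illustration only).
toy = {
    "cat": ("k", "æ", "t"),
    "bat": ("b", "æ", "t"),
    "cab": ("k", "æ", "b"),
    "at": ("æ", "t"),
    "dog": ("d", "ɔ", "ɡ"),
}
density = neighborhood_density(toy)
```

In this toy lexicon, "cat" has three neighbors ("bat", "cab", "at") and "dog" has none. Under this assumed definition, homophones with identical transcriptions are not counted as neighbors of each other.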

Details

Paper ID
lrec2026-main-448
Pages
pp. 5689-5700
BibKey
doucette-etal-2026-lexiphon
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 - 16 May 2026

Authors

  • Amanda Doucette

  • Timothy J. O'Donnell

  • Morgan Sonderegger