LexiPhon: A Collection of Phonetically Transcribed Lexicons from Wikipedia
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
We introduce LexiPhon, an open-source dataset of phonetically transcribed lexicons for 87 languages, derived from Wikipedia data using automated grapheme-to-phoneme (G2P) transcription, together with the open-source software used to create it. Each lexicon provides transcriptions generated by up to three G2P methods, crowdsourced transcriptions from WikiPron (Lee et al., 2020) where available, word frequencies calculated from Wikipedia, word lengths, and phonological neighborhood densities. Because manual validation is not possible, we introduce an internal validation metric based on phonological feature edit distance to check that transcriptions are consistent within each language. LexiPhon fills a gap among existing phonetic lexicons: it offers a much larger set of words per language than existing multilingual word lists and covers more languages than existing lexicon datasets. The dataset and the software used to create it are freely available on OSF at https://osf.io/rd9ma/overview?view_only=398802df19ad488ab7da7e7798cd7aca.
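To make the phonological neighborhood density measure concrete, the following is a minimal Python sketch. It assumes the common definition in which two words are neighbors when their phoneme sequences differ by exactly one substitution, insertion, or deletion; the abstract does not state which definition LexiPhon uses, so this may differ from the released software. The function names and the toy lexicon are hypothetical and are not taken from the dataset.

```python
def is_neighbor(a, b):
    """Return True if phoneme tuples a and b differ by exactly one
    substitution, insertion, or deletion (assumed neighbor definition)."""
    if a == b:
        return False  # identical transcriptions (homophones) are not counted
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) sequence
    # Skip the shared prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # One substitution: everything after the mismatch must match.
        return a[i + 1:] == b[i + 1:]
    # One insertion into a: the rest of a must match b after skipping one phoneme.
    return a[i:] == b[i + 1:]


def neighborhood_density(lexicon):
    """Map each word to the number of other words that are its neighbors.

    lexicon: dict from orthographic word to a sequence of phoneme strings.
    This is a quadratic-time illustration, not suited to large lexicons.
    """
    forms = {word: tuple(phones) for word, phones in lexicon.items()}
    return {
        word: sum(
            is_neighbor(phones, other)
            for other_word, other in forms.items()
            if other_word != word
        )
        for word, phones in forms.items()
    }


# Hypothetical toy lexicon with hand-written transcriptions.
toy_lexicon = {
    "cat": ["k", "æ", "t"],
    "bat": ["b", "æ", "t"],
    "at": ["æ", "t"],
    "cab": ["k", "æ", "b"],
    "dog": ["d", "ɔ", "g"],
}
densities = neighborhood_density(toy_lexicon)
```

In this toy lexicon, "cat" has three neighbors ("bat", "at", "cab"), while "dog" has none. Comparing every pair of words is quadratic in lexicon size, so a lexicon of realistic size would need an indexed approach instead.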