Back to Main Conference 2026
LREC 2026main

Simple Additions, Substantial Gains: Expanding Scripts, Languages, and Lineage Coverage in URIEL+

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/5e6qapi74t97

Abstract

The URIEL+ linguistic knowledge base supports multilingual research by encoding languages through geographic, genetic, and typological vectors. However, data sparsity (e.g. missing feature types, incomplete language entries, and limited genealogical coverage) remains prevalent. This limits the usefulness of URIEL+ in cross-lingual transfer, particularly for supporting low-resource languages. To address this sparsity, we extend URIEL+ by introducing script vectors to represent writing system properties for 7,488 languages, integrating Glottolog to add 18,710 additional languages, and expanding lineage imputation for 26,449 languages by propagating typological and script features across genealogies. These improvements reduce feature sparsity by 14% for script vectors, increase language coverage by up to 19,015 languages (1,007%), and boost imputation quality metrics by up to 35%. Our benchmark on cross-lingual transfer tasks (oriented around low-resource languages) shows occasionally divergent performance compared to URIEL+, with performance gains up to 6% in certain setups.

Details

Paper ID
lrec2026-main-863
Pages
pp. 11045-11059
BibKey
shipton-etal-2026-simple
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 16 May 2026

Authors

  • MS

    Mason Shipton

  • YN

    York Hay Ng

  • AK

    Aditya Khan

  • PH

    Phuong H. Hoang

  • XL

    Xiang Lu

  • AD

    A. Seza Dogruoz

  • AL

    Annie En-Shiun Lee

Links