Glossed Data in Northern Interior Salish
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
The Northern Interior subgroup of the Salish language family, spoken in the Pacific Northwest of North America, comprises three languages: St’át’imcets, nɬeʔkepmxcín, and Secwepemctsín. Owing to the effects of colonization, each has only a small number of first-language (L1) speakers remaining, though language revitalization efforts are ongoing. This work introduces the first compiled and cleaned datasets for these languages, usable in natural language processing (NLP) projects. The data is in glossed format, with transcriptions in the language, translations into English, and linguistic segmentations and glosses that provide a detailed breakdown of meaning. Extensive data cleaning was conducted to make the formatting consistent both within each language and across the three. This paper presents the glossed data standards that were developed and recounts the cleaning process. Scripts that help automate parts of the data preparation process are included. Finally, this work strives to keep the interconnectedness of language and community as a central consideration.