Glossed Data in Northern Interior Salish

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/2isngefy6ags

Abstract

The Northern Interior subgroup of the Salish language family, spoken in the Pacific Northwest of North America, comprises three languages: St’át’imcets, nɬeʔkepmxcín, and Secwepemctsín. Each has only a small number of first-language (L1) speakers remaining due to the effects of colonization, though language revitalization efforts are ongoing. This work introduces the first compiled and cleaned datasets for these languages, usable in natural language processing (NLP) projects. The data is in glossed format, with transcriptions in the language, translations into English, and linguistic segmentations and glosses that provide a detailed breakdown of meaning. To achieve consistently formatted data within each language and across all three, extensive data cleaning was conducted. This paper presents the glossed data standards that were developed and recounts the cleaning process. Scripts that help to automate parts of the data preparation process are included. Finally, this work strives to keep the interconnectedness of language and community as a central consideration.

Details

Paper ID
lrec2026-main-278
Pages
pp. 3490-3495
BibKey
stacey-2026-glossed
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 to 16 May 2026

Authors

  • Anna Stacey

Links