Evaluating Large Language Models for Text-to-Gloss Translation in Kazakh-Russian Sign Language: A Pilot Study
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Conceptual glossing is a systematic linguistic transformation in which a model must preserve meaning, grammatical integrity, and punctuation while recasting natural language into a more structured form. The purpose of this study is to assess the accuracy and reliability of glosses produced by large language models by comparing them with human-annotated standards and investigating whether the models preserve essential linguistic characteristics. By identifying the strengths and weaknesses of each model, we aim to determine which architectures are best suited to structured language tasks such as glossing. This could reduce the manual labor required of expert linguistic annotators while maintaining high-quality output, and could help deaf signers with weak reading skills by converting written paragraphs into glosses that are more comprehensible and natural to them. Text-to-gloss translation converts written or spoken language into sign language glosses, enhancing accessibility for the Deaf and Hard of Hearing (DHH) community. This pilot study evaluates four large language models (LLMs), GPT-4-turbo, Grok 3, DeepSeek-V3, and Gemini 2.0 Flash, on generating conceptual glosses in Kazakh-Russian Sign Language (K-RSL), an under-resourced sign language. Using a dataset of 250 Russian sentences with expert-annotated K-RSL glosses, we assess performance with METEOR, BLEU, BERTScore, and WER. Results show that DeepSeek-V3 excels on complex texts (METEOR: 0.426 for K-RSL word order, 0.377 for fairytale paragraphs), while Gemini 2.0 Flash performs strongly on short sentences (METEOR: 0.602). These findings demonstrate the potential of LLMs to automate gloss production, reducing manual annotation effort and aiding DHH individuals with reading comprehension. Remaining challenges include K-RSL's distinctive grammar and limited datasets. This is the first study to apply LLMs to K-RSL glossing and to examine the feasibility of automated gloss production.