
Automatic Segmentation of Classical Tibetan Texts into Autochthonous and Allochthonous Regions

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/2iyfjjv9boc6

Abstract

We introduce a new computational framework for segmenting Classical Tibetan texts into autochthonous and allochthonous regions, distinguishing between indigenous Tibetan compositions and translated materials, primarily from Sanskrit sources. To support this task, we release the first annotated Tibetan corpus for ALLO/AUTO segmentation and evaluate several multilingual encoders, including mBERT and XLM-R, fine-tuned for sequence labeling. Our best model achieves strong alignment with expert annotations, showing that multilingual representations can effectively capture philological boundaries in low-resource settings. This work contributes new resources and methods for computational philology and sheds light on the linguistic markers that trace the intercultural transmission of Buddhist thought in Tibet.
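To make the sequence-labeling framing concrete, here is a minimal sketch of how per-token predictions could be collapsed into contiguous regions. It is an illustration only, not the authors' implementation: the label strings "ALLO" and "AUTO" follow the abstract, while the one-label-per-token scheme, the function name `labels_to_segments`, and the example label sequence are assumptions made for this example.

```python
from itertools import groupby


def labels_to_segments(labels):
    """Collapse per-token labels into (label, start, end) runs.

    `start` is inclusive and `end` is exclusive, both counted in tokens.
    Hypothetical helper: the paper's actual label scheme may differ.
    """
    segments = []
    position = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)  # number of consecutive tokens with this label
        segments.append((label, position, position + length))
        position += length
    return segments


# Invented example sequence: one label per token, as a tagger might predict.
token_labels = ["AUTO", "AUTO", "ALLO", "ALLO", "ALLO", "AUTO"]
print(labels_to_segments(token_labels))
```

Representing regions as half-open token ranges keeps adjacent segments from overlapping and makes it straightforward to compare predicted boundaries against annotated ones.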

Details

Paper ID
lrec2026-main-079
Pages
pp. 1017-1030
BibKey
bilitski-etal-2026-automatic
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 - 16 May 2026

Authors

  • Guy Bilitski
  • Lev Shechter
  • Sonam Jamtsho
  • Nir Marciano
  • Nicola Bajetta
  • Rebecca Sunden
  • Omri Drori
  • Kai Golan Hashiloni
  • Orr Zwebner
  • Asaf Shina
  • Orna Almogi
  • Dorji Wangchuk
  • Kfir Bar

Links