FormosanMT: A Multilingual Parallel Corpus of the Formosan Language Family

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/3ut33xim33nt

Abstract

While the quality of machine translation (MT) between widely spoken languages has improved dramatically in recent years, training robust MT systems for languages with fewer resources remains a challenge. Endangered languages, which often lack the speaker population and written tradition needed to create text resources, are at a particular disadvantage. Developing robust MT architectures for very low-resource settings is hampered by the lack of suitable parallel corpora. To address this challenge, we introduce FormosanMT, a set of MT-ready parallel corpora for the Formosan family of endangered languages indigenous to Taiwan. Together, the corpora total nearly 500,000 Formosan-Mandarin and Formosan-English sentence pairs. We share scripts for extracting these corpora from public sources, along with customizable tools for filtering, normalizing, and partitioning the data. In addition, we provide a new tokenizer for Traditional Chinese writing that is compatible with the popular No Language Left Behind (NLLB) MT architecture, along with updated and improved code for fine-tuning NLLB for any low-resource language pair. Finally, we distribute our fully trained NLLB and OpenNMT models for translating the Formosan languages to and from both Mandarin and English. In addition to serving as a valuable resource for the Formosan language speaker communities, our data, code, and models will be available to NLP researchers working on endangered and low-resource language MT.

Details

Paper ID
lrec2026-main-273
Pages
pp. 3445-3455
BibKey
scheppat-etal-2026-formosanmt
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 - 16 May 2026

Authors

  • Hunter Scheppat
  • Joshua K. Hartshorne
  • Sema Koc
  • Éric Le Ferrand
  • Emily Prud'hommeaux
