LREC 2026 main

MaitH 1.0: A Parallel Corpus and Baseline for Low-Resource Maithili-Hindi Translation

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/4otutrpimz7y

Abstract

Maithili is one of the 22 official languages recognized in the Indian Constitution. The literature of Maithili is rich; however, due to current socio-political changes, the language is on the verge of extinction. Therefore, it is crucial to develop corpora for low-resource Indic languages like Maithili to ensure that the dream of “No Language Left Behind” (NLLB) is realized. With this in mind, we contribute a corpus (1,05,600 sentences) containing both manually curated and synthetically generated sentences. Additionally, we propose a strong baseline on the Maithili-Hindi pair using multilingual pretrained models such as IndicTrans2, mBART50, mT5, and NLLB-200 distilled. We evaluate the translation systems using standard performance metrics, including BLEU, CHRF2, TER, COMET, METEOR, and BERTScore. Comparative experiments against the existing NLLB dataset (5,50,300 sentence pairs) demonstrate that our proposed dataset consistently yields superior translation quality. These results show that, even with a smaller corpus, high-quality, task-specific data significantly enhances translation accuracy for low-resource Indian languages such as Maithili.

Details

Paper ID
lrec2026-main-676
Pages
pp. 8567-8576
BibKey
dubey-etal-2026-maith
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 to 16 May 2026

Authors

  • Kamanksha Prasad Dubey

  • Chandresh Maurya

  • Kumar Padmanabh
