AssamLegalTrans: A Parallel Corpus, Benchmark and Analysis for English-Assamese Machine Translation of Legal Judgments

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/5q53i6shk3nm

Abstract

In India, higher-court judgments are written in English, the official language for this purpose, which creates a language barrier for citizens who are not proficient in English. Machine Translation (MT) offers a scalable solution, but progress for low-resource languages such as Assamese is significantly limited by the lack of legal-domain data. To address this gap, we introduce the first English-Assamese parallel corpus for the translation of Indian court judgments. The dataset consists of over 55,000 manually translated and validated sentence pairs drawn from over 500 judgments of the Gauhati High Court and the Supreme Court of India. Using this dataset, we perform a comprehensive evaluation of state-of-the-art multilingual models, including NLLB-200 and Sarvam-Translate, in both zero-shot and fine-tuned settings, and compare their performance against commercial systems. Our experiments show that fine-tuning on our legal-domain dataset significantly improves translation quality. We also conduct a thorough error analysis that identifies key challenges in legal translation: translating legal terms precisely, transliterating named entities correctly, expanding abbreviations, and transforming sentence structure, such as changing passive voice to active voice, when translating from English to Assamese. By releasing a publicly available dataset and examining these specific challenges, this work offers a reproducible foundation and a clear path toward more accurate and reliable legal machine translation systems, which will help improve access to justice for Assamese speakers.

Details

Paper ID
lrec2026-main-386
Pages
pp. 4921-4930
BibKey
singh-etal-2026-assamlegaltrans
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 to 16 May 2026

Authors

  • Telem Joyson Singh
  • Hemanta Baruah
  • Sanasam Ranbir Singh
  • Anindita Talukdar
  • Nasrin Shahnaz
  • Okram Jimmy Singh
  • Priyankoo Sarmah
  • Pallav Kumar Dutta
  • Sukumar Nandi
  • Pranab Duara

Links