Back to Main Conference 2026
LREC 2026main

AYN: A Tiny Yet Competitive Indian Legal Language Model Pretrained from Scratch

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/3c62f5f9g7q2

Abstract

Decoder-only Large Language Models (LLMs) are currently the model of choice for many Natural Language Processing (NLP) applications. Through instruction fine-tuning and prompting approaches, such LLMs have been efficiently used to solve both general and domain-specific tasks. However, they are costly to train and, to a certain extent, costly to use as well, and one can wonder whether LLMs can be replaced by domain-specific Tiny Language Models (TLMs), which typically contain less than 100M parameters. We address this question in this study by comparing the performance of an 88M TLM pretrained from scratch for 185 A100 hours on a specific domain with a domain-specific tokenizer (here, the Indian legal domain) with LLMs of various sizes between 1B and 8B for solving domain-specific tasks. We show in particular that our legal TLM, Ayn, can indeed outperform LLMs up to 80 times larger on the legal case judgment prediction task, rival LLMs up to 30 times larger on the summarization task, and still be competitive with these larger LLMs on general tasks.

Details

Paper ID
lrec2026-main-839
Pages
pp. 10699-10722
BibKey
niyogi-etal-2026-ayn
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 16 May 2026

Authors

  • MN

    Mitodru Niyogi

  • EG

    Eric Gaussier

  • AB

    Arnab Bhattacharya

Links