
Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI: 10.63317/3kg84h4wnshx

Abstract

Despite the widespread adoption of Large Language Models (LLMs), their strongest capabilities remain largely confined to a small number of high-resource languages with abundant training data. Recently, continual pre-training (CPT) has emerged as a means of adapting these models to low-resource regional dialects. In this paper, we study the use of CPT for dialect learning under tight data and compute budgets. Using low-rank adaptation (LoRA) and compute-efficient continual pre-training, we adapt three LLMs to the Québec French dialect using a very small dataset and benchmark them on the COLE suite. Our experiments demonstrate improvements on the minority-dialect benchmarks with minimal regression on the prestige-language benchmarks, while updating only around 1% of model parameters. Analysis of the results demonstrates that gains are highly contingent on corpus composition. These findings indicate that CPT with parameter-efficient fine-tuning (PEFT) can narrow the dialect gap through cost-effective and sustainable language resource creation, expanding high-quality LLM access to minority linguistic communities. To support reproducibility and broaden access, we release the first Québec French LLMs on Hugging Face.
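The abstract describes LoRA-based, compute-efficient continual pre-training. The sketch below illustrates what such a setup can look like using the Hugging Face transformers and peft libraries; it is a minimal, hypothetical example, and the base model name, target modules, and hyperparameters are illustrative assumptions, not the configuration used in the paper.

    # Minimal sketch of parameter-efficient continual pre-training with LoRA.
    # Assumptions: Hugging Face transformers + peft are installed; the base
    # model and all hyperparameters below are illustrative, not the paper's.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base_model = "Qwen/Qwen2.5-0.5B"  # hypothetical base model
    model = AutoModelForCausalLM.from_pretrained(base_model)
    tokenizer = AutoTokenizer.from_pretrained(base_model)

    # Attach low-rank adapters to the attention projections; at a small rank,
    # the trainable parameters stay on the order of 1% of the full model.
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # reports the trainable fraction

    # Continual pre-training then continues the causal language-modelling
    # objective on the dialect corpus (e.g., with transformers.Trainer),
    # updating only the adapter weights while the base model stays frozen.

Because the base weights stay frozen and only the small adapter matrices are trained, a setup of this kind is consistent with the abstract's observation of minimal regression on the prestige-language benchmarks.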

Details

Paper ID: lrec2026-main-840
Pages: pp. 10723–10734
BibKey: khan-etal-2026-low
Editor: N/A
Publisher: European Language Resources Association (ELRA)
ISSN: 2522-2686
ISBN: 978-2-493814-49-4
Conference: The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location: Palma, Mallorca, Spain
Date: 11–16 May 2026

Authors

  • Eeham Khan
  • Firas Saidani
  • Owen Van Esbroeck
  • Richard Khoury
  • Leila Kosseim
