Benchmark Data Contamination in Underrepresented Languages: A Comprehensive Analysis Using Brazilian Data

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/39wbjvajnh7t

Abstract

Large Language Models (LLMs) are typically evaluated using standardized benchmarks to enable consistent performance measurement and model comparison. However, the reliability of these benchmarks can be undermined by data contamination, which occurs when evaluation items are inadvertently included in training corpora. While this issue has been investigated primarily in high-resource languages such as English and Chinese, its impact on underrepresented languages, such as Brazilian Portuguese, remains understudied. In this paper, we present one of the first systematic investigations of benchmark data contamination (BDC) in an underrepresented language setting, using Brazilian Portuguese as a case study. Using validated methodologies from the literature, we evaluate specialized and multilingual models across four benchmarks: BLUEX, ENEM Challenge, OAB Exams, and HealthQA-BR. Our approach applies TS-Guessing to detect contamination via memorized knowledge, alongside a 50-character n-gram similarity strategy to identify benchmark items leaked into training data. Our results provide consistent evidence of contamination, revealing that models with stronger memorization and retrieval abilities tend to achieve artificially inflated benchmark scores. Our contributions include: (i) classifying models according to their contamination risk, (ii) identifying the benchmarks most affected by data leakage, and (iii) reporting contaminated training corpora.
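
The abstract names a 50-character n-gram similarity strategy but does not say how the comparison is implemented. The following is a minimal sketch of one plausible reading, assuming exact matching of 50-character windows after lowercasing and whitespace normalization. The function names `char_ngrams` and `is_leaked` are hypothetical, and the matching rule is an assumption, not the authors' procedure.

```python
def char_ngrams(text: str, n: int = 50) -> set[str]:
    """Return all n-character windows of the lowercased, whitespace-normalized text."""
    # Assumption: normalization is lowercasing plus collapsing runs of whitespace.
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        return set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def is_leaked(benchmark_item: str, corpus_document: str, n: int = 50) -> bool:
    """Flag the item if any of its n-character windows occurs verbatim in the document."""
    item_grams = char_ngrams(benchmark_item, n)
    doc_grams = char_ngrams(corpus_document, n)
    return not item_grams.isdisjoint(doc_grams)
```

Under this sketch, a benchmark question copied into a training document is flagged as soon as one 50-character window matches, while items shorter than 50 characters produce no windows and are never flagged.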

Details

Paper ID: lrec2026-main-374
Pages: pp. 4765-4777
BibKey: vilar-etal-2026-benchmark
Editor: N/A
Publisher: European Language Resources Association (ELRA)
ISSN: 2522-2686
ISBN: 978-2-493814-49-4
Conference: The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location: Palma, Mallorca, Spain
Date: 11 May 2026 to 16 May 2026

Authors

  • Iriedson Souto Maior de Moraes Vilar
  • David Candeia Maia
  • João Brunet
  • Fabio Morais
  • Leandro Balby Marinho

Links