
How Many Samples Do We Need? A Toolkit for Power-Aware Evaluation Design

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/4j37zxirsi26

Abstract

If datasets are the telescopes of our field, then statistical power is their resolution, i.e., their ability to reveal a true difference in model performance when one exists. Many NLP evaluations are underpowered, leading to overstated claims of improvement. This paper introduces sk-power, an open-source Python library that helps researchers and practitioners design well-powered evaluations. Built with familiar scikit-learn-style abstractions, sk-power enables users to simulate evaluation scenarios, estimate minimum detectable effects, and assess the reliability of reported gains. We also illustrate what can go wrong when power analysis is not carried out. Our goal is to position power analysis as a first-class, practical step in evaluation planning.
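The kind of question the abstract raises ("how many samples do we need?") can be illustrated with a textbook two-proportion power calculation. The sketch below uses only the Python standard library and does not reflect sk-power's actual API; the accuracy values, significance level, and target power are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist

def samples_per_system(p1: float, p2: float,
                       alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate test-set size (per system) needed to detect a true
    accuracy difference p2 - p1 with a two-sided two-proportion z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Detecting a 2-point accuracy gain (0.85 -> 0.87) at 80% power needs
# several thousand test examples, while a 10-point gain needs only a
# few hundred -- small benchmarks can reliably reveal only large effects.
print(samples_per_system(0.85, 0.87))  # roughly 4,700 per system
print(samples_per_system(0.80, 0.90))
```

An evaluation run on fewer examples than this is underpowered: a real improvement of that size would often fail to reach significance, and significant results would tend to overstate the true effect.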

Details

Paper ID
lrec2026-main-353
Pages
pp. 4507–4513
BibKey
basile-etal-2026-how
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Angelo Basile
  • Areg Mikael Sarvazyan
  • José Ángel González
