Dynaword: From One-shot to Continuously Developed Datasets

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/4x9cdkge22vb

Abstract

Large-scale datasets are foundational for research and development in natural language processing. However, current approaches face three key challenges: (1) reliance on ambiguously licensed sources restricting use, sharing, and derivative works; (2) static dataset releases that prevent community contributions and diminish longevity; and (3) quality assurance processes restricted to publishing teams rather than leveraging community expertise. To address these limitations, we introduce two contributions: the Dynaword approach and Danish Dynaword. The Dynaword approach is a framework for creating large-scale, open datasets that can be continuously updated through community collaboration. Danish Dynaword is a concrete implementation that validates this approach and demonstrates its potential. Danish Dynaword contains over five times as many tokens as comparable releases, is exclusively openly licensed, and has received multiple contributions across industry, the public sector and research institutions. The repository includes light-weight tests to ensure data formatting, quality, and documentation, establishing a sustainable framework for ongoing community contributions and dataset evolution.

Details

Paper ID
lrec2026-main-301
Pages
pp. 3782-3793
BibKey
enevoldsen-etal-2026-dynaword
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 to 16 May 2026

Authors

  • Kenneth Enevoldsen
  • Kristian Nørgaard Jensen
  • Jan Kostkan
  • Balázs Szabó
  • Márton Kardos
  • Kirsten Vad
  • Johan Heinsen
  • Andrea Blasi Núñez
  • Gianluca Barmina
  • Jacob Nielsen
  • Rasmus Larsen
  • Rob van der Goot
  • Peter Vahlstrup
  • Per Møldrup Dalum
  • Desmond Elliott
  • Lukas Galke Poech
  • Peter Schneider-Kamp
  • Kristoffer Nielbo

Links