LREC 2026 (main)

Evaluating Embedding Models on Danish Historical Newspapers: A Corpus and Benchmark Resource

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/2hzf85aou6dw

Abstract

We present an enriched dataset of almost five million Danish historical newspaper articles from the late seventeenth to the nineteenth century, augmented with semantic embeddings and an annotated subset, to enable semi-automated classification as well as thematic and linguistic exploration. Using three historical benchmark tasks that evaluate Danish and multilingual embedding models on this corpus, we discuss how the choice of embedding model depends on the type of task, and we enrich the corpus with embeddings from the best-performing model overall. As a showcase experiment, we examine the distribution of article categories across the three subgenres observed in the corpus. This experiment highlights the potential of the corpus and its article-level embeddings for further exploration and analysis of the Danish historical mediascape. The resource is freely available for research use and aims to foster reproducible, data-driven studies of language and culture in the Danish nineteenth century.

Details

Paper ID: lrec2026-main-287
Pages: pp. 3577-3589
BibKey: lassche-etal-2026-evaluating
Editor: N/A
Publisher: European Language Resources Association (ELRA)
ISSN: 2522-2686
ISBN: 978-2-493814-49-4
Conference: The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location: Palma, Mallorca, Spain
Date: 11 May 2026 to 16 May 2026

Authors

  • Alie Lassche
  • Pascale Feldkamp
  • Yuri Bizzoni
  • Katrine Baunvig
  • Kristoffer Nielbo
  • Johan Heinsen

Links