LREC 2026 workshop

Optimized for AI: Curating the Icelandic Gigaword Corpus for Stable LLM Training

Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora

DOI:10.63317/3uatbht8mdrf

Abstract

The Icelandic Gigaword Corpus (IGC) is a primary resource for Icelandic NLP; its current version contains 2.7 billion words of curated text. The IGC is traditionally distributed in TEI-XML, a hierarchical format that allows for rich linguistic annotation and metadata. However, this format introduces significant friction for modern machine learning workflows. Even high-quality curated corpora have been found to contain "unwanted" text sequences, such as fragmented lists or repetitive boilerplate, that may trigger instabilities during the training of large language models. In this paper, we present a new processing pipeline designed to optimize the IGC for AI development. We describe a filtering approach focused on training stability, including fuzzy deduplication to reduce the risk of data leakage, with the aim of providing high-quality data for stable model convergence. Furthermore, we introduce a new JSONL distribution format that bridges the gap between TEI-XML and machine-actionable data, facilitating easier access and safer training for models intended to work with Icelandic.
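
To make the two ideas in the abstract concrete, the following is a minimal sketch of fuzzy deduplication followed by JSONL serialization. It is not the pipeline described in the paper: the abstract does not say which similarity measure or record schema is used, so the word-shingle Jaccard measure, the 0.8 threshold, the function names, and the `id` and `text` fields are all assumptions made for illustration. The pairwise comparison is quadratic in the number of documents; at the scale of a multi-billion-word corpus, an approximate method such as MinHash with locality-sensitive hashing would typically be used instead.

```python
import json
import re


def shingles(text, n=3):
    """Return the set of lowercase word n-grams (shingles) in a text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Short texts: treat the whole word sequence as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup_fuzzy(docs, threshold=0.8):
    """Keep a document only if it is below the similarity threshold
    against every document kept so far (hypothetical threshold value)."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc["text"])
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def to_jsonl(docs):
    """Serialize documents as JSONL: one JSON object per line.
    ensure_ascii=False keeps Icelandic characters readable."""
    return "\n".join(json.dumps(d, ensure_ascii=False) for d in docs)


# Hypothetical records; the field names are assumptions, not the IGC schema.
docs = [
    {"id": "a", "text": "Þetta er stutt dæmi um texta í málheildinni"},
    {"id": "b", "text": "Þetta er stutt dæmi um texta í málheildinni."},
    {"id": "c", "text": "Allt annar texti sem fjallar um veðrið í dag"},
]
unique_docs = dedup_fuzzy(docs)
jsonl = to_jsonl(unique_docs)
```

Here the second record differs from the first only by a trailing full stop, so it is treated as a near-duplicate and dropped, while the third is kept. Writing one self-contained JSON object per line is what lets training code stream the data without parsing a hierarchical XML tree.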

Details

Paper ID
lrec2026-ws-cmlc-06
Pages
pp. 49-56
BibKey
daason-etal-2026-optimized
Editors
Piotr Bański, Dawn Knight, Marc Kupietz, Andreas Witt, Alina Wróblewska
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Jón Friðrik Daðason

  • Steinþór Steingrímsson
