LREC 2022 main

Evolving Large Text Corpora: Four Versions of the Icelandic Gigaword Corpus

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/26n296jsgidz

Abstract

The Icelandic Gigaword Corpus was first published in 2018. Since then, new versions have been published annually, containing new texts from additional sources as well as from previous sources. This paper describes the evolution of the corpus in its first four years. All versions are made available under permissive licenses, and with each new version the texts are annotated with the latest and most accurate tools. We show how the corpus has grown by almost 50% from the first version to the fourth and how it was restructured to better accommodate different metadata for different subcorpora. Furthermore, other services have been set up to support different use cases of the corpus. These include a keyword-in-context concordance tool, an n-gram viewer, a word frequency database, and pre-trained word embeddings.

Details

Paper ID
lrec2022-main-254
Pages
pp. 2371-2381
BibKey
barkarson-etal-2022-evolving
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
79-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20 June 2022 to 25 June 2022

Authors

  • Starkaður Barkarson

  • Steinþór Steingrímsson

  • Hildur Hafsteinsdóttir

Links