
The Icelandic Parsed Historical Corpus (IcePaHC)

Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012)

DOI:10.63317/2icn3f8r7r2r

Abstract

We describe the background for and building of IcePaHC, a one-million-word parsed historical corpus of Icelandic which has just been finished. This corpus, which is completely free and open, contains fragments of 60 texts ranging from the late 12th century to the present. We describe the text selection and text collecting process and discuss the quality of the texts and their conversion to modern Icelandic spelling. We explain why we chose to use a phrase-structure, Penn-style annotation scheme and briefly describe the syntactic annotation process. We also describe a spin-off project which is only in its beginning stages: a parsed historical corpus of Faroese. Finally, we advocate the importance of an open source policy as regards language resources.

Details

Paper ID
lrec2012-main-228
Pages
pp. 1977-1984
BibKey
rognvaldsson-etal-2012-icelandic
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-7-7
Conference
Eighth International Conference on Language Resources and Evaluation
Location
Istanbul, Turkey
Date
21-27 May 2012

Authors

  • Eiríkur Rögnvaldsson
  • Anton Karl Ingason
  • Einar Freyr Sigurðsson
  • Joel Wallenberg
