Back to Main Conference 2008
LREC 2008main

The JOS Morphosyntactically Tagged Corpus of Slovene

Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008)

DOI:10.63317/2ihb3qk4dmmi

Abstract

The JOSmorphosyntactic resources for Slovene consist of the specifications, lexicon, and two corpora: jos100k, a 100,000 word balanced monolingual sampled corpus annotated with hand validated morphosyntactic descriptions (MSDs) and lemmas, and jos1M, the 1 million-word partially hand validated corpus. The two corpora have been sampled from the 600M-word Slovene reference corpus FidaPLUS. The JOS resources have a standardised encoding, with the MULTEXT-East-type morphosyntactic specifications and the corpora encoded according to the Text Encoding Initiative Guidelines P5. JOS resources are available as a dataset for research under the Creative Commons licence and are meant to facilitate developments of HLT for Slovene.

Details

Paper ID
lrec2008-main-451
Pages
N/A
BibKey
erjavec-krek-2008-jos
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-4-0
Conference
Sixth International Conference on Language Resources and Evaluation
Location
Marrakech, Morocco
Date
28 May 2008 30 May 2008

Authors

  • TE

    Tomaž Erjavec

  • SK

    Simon Krek

Links