Rarity of Words in a Language and in a Corpus

Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000)

DOI:10.63317/2xpyqb3hdx6n

Abstract

A simple method presented last year (Hlavacova & Rychly, 1999) automatically distinguishes between rare and common words that have the same frequency in a language corpus. The method introduces two terms: reduced frequency and rarity. Rarity was proposed as a measure of how rare or common a word is in a language. This article examines rarity in more detail. Its value was calculated for several different corpora and the results were compared. Two experiments were carried out on real data from the Czech National Corpus. The results of the first show that reordering the texts in the corpus does not influence the rarity of words with a high corpus frequency. The second experiment compares the rarity of the same words in two corpora of different sizes.
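
The abstract names reduced frequency but does not define it, and it gives no formula for rarity. As a rough illustration only, the sketch below assumes one commonly described definition of reduced frequency: the corpus is cut into f equal-sized segments, where f is the raw frequency of the word, and the reduced frequency is the number of segments that contain the word at least once. The function name `reduced_frequency` and the toy token lists are hypothetical and are not taken from the paper; rarity itself is deliberately left out because its definition is not stated here.

```python
def reduced_frequency(tokens, word):
    """Count equal-sized corpus segments that contain `word` at least once.

    Assumed definition (not stated in the abstract): the corpus is split
    into f segments, where f is the raw frequency of `word`.
    """
    positions = [i for i, token in enumerate(tokens) if token == word]
    f = len(positions)
    if f == 0:
        return 0
    n = len(tokens)
    # Map each occurrence to the index (0 .. f-1) of the segment it falls in.
    occupied_segments = {i * f // n for i in positions}
    return len(occupied_segments)


# Two toy corpora in which "a" has the same raw frequency (4).
spread = ["a", "b", "a", "b", "a", "b", "a", "b"]
clustered = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(reduced_frequency(spread, "a"))     # 4: occurrences are evenly dispersed
print(reduced_frequency(clustered, "a"))  # 2: occurrences are bunched together
```

Under this assumed definition, two words with the same raw frequency can receive different reduced frequencies depending on how their occurrences are dispersed, which is the kind of distinction the abstract describes.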

Details

Paper ID
lrec2000-main-222
Pages
N/A
BibKey
hlavacova-2000-rarity
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
N/A
Conference
Second International Conference on Language Resources and Evaluation
Location
Athens, Greece
Date
31 May 2000 to 2 June 2000

Authors

  • Jaroslava Hlaváčová
