The Hungarian National Corpus
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002)
Abstract
The paper reports on the development of the Hungarian National Corpus, which was completed at the end of 2001 after four years' effort. The HNC is designed to be a balanced reference corpus of current written Hungarian consisting of 150 million words. The paper first discusses basic design issues concerning the composition of the corpus. The HNC adopts a fairly pragmatic approach, focusing on five major text types. The second half of the paper contains details of the annotation and tagging system used.