
Principled Hidden Tagset Design for Tiered Tagging of Hungarian

Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000)

DOI:10.63317/57gcid77wihh

Abstract

For highly inflectional languages, the number of morpho-syntactic descriptions (MSDs) required to cover the content of a word-form lexicon tends to rise rapidly, reaching a thousand or more distinct codes. For the automatic disambiguation of arbitrary written texts, such large tagsets raise many problems, ranging from the implementation issues of making a tagger work with so large a tagset to the more theoretical difficulty of training-data sparseness. Tiered tagging is one way to alleviate this problem by reformulating it as follows: starting from a large set of MSDs, design a reduced tagset, the C-tagset, that is manageable for current tagging technology. We describe the details of the reduced tagset design for Hungarian, where the MSD-set cardinality is several thousand. Designing a manageable C-tagset therefore calls for a severe reduction in the number of MSD features, a process that requires careful evaluation of the features.
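The reduction step described in the abstract can be illustrated with a minimal sketch. It assumes positional MSD strings in which each character encodes one feature, and it uses one plausible evaluation criterion: a feature may be dropped only if the full MSD can still be looked up from the word form and its reduced tag. The codes, the lexicon, and the function names below are invented for illustration and are not taken from the paper or from the actual Hungarian tagset.

```python
from collections import defaultdict


def reduce_msd(msd, keep):
    """Project a positional MSD onto the kept feature positions (the C-tag)."""
    return "".join(msd[i] for i in keep if i < len(msd))


def build_ctag_map(msds, keep):
    """Group MSDs by the reduced tag they collapse to."""
    groups = defaultdict(set)
    for msd in msds:
        groups[reduce_msd(msd, keep)].add(msd)
    return dict(groups)


def unrecoverable(lexicon, keep):
    """Find word forms where one C-tag still maps to several MSDs.

    For such words the dropped features cannot be restored by a
    simple lexicon lookup on (word form, C-tag).
    """
    clashes = {}
    for word, msds in lexicon.items():
        by_ctag = defaultdict(set)
        for msd in msds:
            by_ctag[reduce_msd(msd, keep)].add(msd)
        bad = {c: sorted(m) for c, m in by_ctag.items() if len(m) > 1}
        if bad:
            clashes[word] = bad
    return clashes


# Hypothetical toy lexicon: word form -> set of possible MSDs.
# Assumed positions: 0 = category, 1 = type, 2 = number, 3 = case.
TOY_LEXICON = {
    "alpha": {"Ncsn", "Ncsa"},
    "beta": {"Ncpn", "Vmip"},
    "gamma": {"Ncsn"},
}

if __name__ == "__main__":
    all_msds = set().union(*TOY_LEXICON.values())
    for keep in [(0, 2), (0, 2, 3)]:
        ctags = build_ctag_map(all_msds, keep)
        print(keep, len(ctags), "C-tags;", "clashes:", unrecoverable(TOY_LEXICON, keep))
```

In this toy setting, keeping only category and number collapses four MSDs into three C-tags but leaves the two readings of "alpha" indistinguishable, whereas also keeping the case position removes the clash at the cost of a larger tagset. That is the kind of trade-off a feature evaluation has to weigh.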

Details

Paper ID
lrec2000-main-188
Pages
N/A
BibKey
tufis-etal-2000-principled
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
N/A
Conference
Second International Conference on Language Resources and Evaluation
Location
Athens, Greece
Date
31 May 2000 - 2 June 2000

Authors

  • Dan Tufiş
  • Péter Dienes
  • Csaba Oravecz
  • Tamás Váradi
