LREC 2026 Workshop

Managing Growth in a National Corpus: The Hungarian National Corpus 3.0 (MNSZ3)

Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora

DOI:10.63317/46aafswuwv4m

Abstract

The third generation of the Hungarian National Corpus (MNSZ3) aims to provide a large-scale, curated, and well-described corpus resource needed for the sustainable digital presence of Hungarian. Building on the domain structure and proportions of MNSZ2 (v2.0.5; 1.04 billion running words), the project targets a substantial increase in scale while also strengthening the coverage and metadata description of Hungarian language use outside Hungary. MNSZ3 retains the six traditional domains of the earlier corpus—press, fiction, scientific, official, personal, and transcribed spoken language—and is planned to reach approximately 10 billion tokens. This paper presents the motivation and design principles of the project, outlines the practical decisions and procedures used in data collection and cleaning, and discusses the annotation strategy developed for large-scale processing. In planning the linguistic analysis, we build on the complementary strengths of HuSpaCy and e-magyar: HuSpaCy provides the unified and efficient UD-oriented processing backbone, while e-magyar (emMorph) is preserved as an explicit additional layer for morphology and lemmatisation.

Details

Paper ID
lrec2026-ws-cmlc-14
Pages
pp. 84-90
BibKey
ligetinagy-etal-2026-managing
Editors
Piotr Bański, Dawn Knight, Marc Kupietz, Andreas Witt, Alina Wróblewska
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora
Location
Palma, Mallorca, Spain
Date
11-16 May 2026

Authors

  • Noémi Ligeti-Nagy
  • Enikő Héja
  • Ágnes Bánfi
  • Flóra Földesi
  • Bence Sárossy
  • Boglárka Skrabák
  • Tamás Váradi
  • Gábor Prószéky
