LREC 2026 Workshop

Hellenic National Corpus: The Current State

Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora

DOI:10.63317/492hm739pspc

Abstract

The Hellenic National Corpus (HNC) is an integrated online environment offering access to standard Modern Greek language material and to related analysis tools. The corpus has been developed in two main phases and currently comprises over 97 million words, exclusively of written language, sourced from printed resources or scraped from the internet. The material has been automatically lemmatized and morphologically annotated, and a subset of 100,000 words has been further manually corrected in order to produce a freely downloadable error-free corpus. Through the dedicated platform, users have access to concordances, morphological analysis of words, and statistical information (frequency) at the word, lemma, part-of-speech and n-gram levels. Future steps include expanding the material along both the historical and the coverage dimensions: the inclusion of material from older phases of the language is foreseen, as well as the addition of dialectal material alongside the standard language.

Details

Paper ID
lrec2026-ws-cmlc-07
Pages
pp. 57-62
BibKey
gavriilidou-etal-2026-hellenic
Editors
Piotr Bański, Dawn Knight, Marc Kupietz, Andreas Witt, Alina Wróblewska
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora
Location
Palma, Mallorca, Spain
Date
11-16 May 2026

Authors

  • Maria Gavriilidou
  • Nikolaos Sidiropoulos

Links