
CorEGe-PT: Compiling a Large Corpus of Academic Texts in Portuguese

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/3wm6ywh8gzxm

Abstract

This paper describes the creation of CorEGe-PT, a large-scale corpus of academic texts in Portuguese extracted from the institutional repository of a Portuguese university. Its compilation methodology, which combined automatic and manual procedures, is detailed, together with the challenges faced and the solutions proposed. The process included a thorough analysis of the metadata, which will be publicly released together with the documents, extracted in Markdown format. CorEGe-PT covers five areas of knowledge and, with over 34,000 documents and 1B tokens, is the largest corpus of its kind in Portuguese. It will enable in-depth linguistic studies while providing data for adapting Large Language Models to academic Portuguese and related tasks.

Details

Paper ID
lrec2026-main-118
Pages
pp. 1533-1543
BibKey
kuhn-etal-2026-corege
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 to 16 May 2026

Authors

  • Tanara Zingano Kuhn
  • José Matos
  • Bruno Neves
  • Daniela Pereira
  • Elisabete Cação
  • Ivo Simões
  • Jacinto Estima
  • Delfim Leão
  • Hugo Goncalo Oliveira
