Back to Main Conference 2004
LREC 2004main

Introducing the La Repubblica Corpus: A Large, Annotated, TEI(XML)-compliant Corpus of Newspaper Italian

Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)

DOI:10.63317/5ih5p32xxoy6

Abstract

This paper describes the La Repubblica corpus, currently being developed at the SSLMIT of the University of Bologna. The corpus is a very large collection of newspaper text, currently amounting to 175 million words, but expected to grow to 400 million before the end of 2004. When completed, it will contain all the articles published between 1985 and 2000 by the national daily La Repubblica. The paper discusses the techniques used to extract the text, tokenize it and annotate it (basic TEI annotation, POS tagging, genre/topic categorization), it presents examples of how it can be used, and gives details of the ways in which interested users can access it. The paper concludes with a discussion of current and future developments, and of weak and strong points of this resource.

Details

Paper ID
lrec2004-main-128
Pages
N/A
BibKey
baroni-etal-2004-introducing
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-1-6
Conference
Fourth International Conference on Language Resources and Evaluation
Location
Lisbon, Portugal
Date
26 May 2004 28 May 2004

Authors

  • MB

    Marco Baroni

  • SB

    Silvia Bernardini

  • FC

    Federica Comastri

  • LP

    Lorenzo Piccioni

  • AV

    Alessandra Volpi

  • GA

    Guy Aston

  • MM

    Marco Mazzoleni

Links