CodE Alltag: A German-Language E-Mail Corpus

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

DOI:10.63317/5e86cjc7xaj5

Abstract

We introduce CODE ALLTAG, a text corpus composed of German-language e-mails. It is divided into two partitions: the first, CODE ALLTAG_XL, is a bulk-size collection drawn from an openly accessible e-mail archive (roughly 1.5M e-mails), whereas the second, CODE ALLTAG_S+d, is much smaller (fewer than a thousand e-mails) but comes with demographic data for the author of each e-mail. CODE ALLTAG thus currently constitutes the largest e-mail corpus ever built. In this paper, we describe, for both parts, the solicitation process for gathering e-mails, present descriptive statistical properties of the corpus, and, for CODE ALLTAG_S+d, provide a compilation of demographic features of the e-mail donors.

Details

Paper ID
lrec2016-main-404
Pages
pp. 2543-2550
BibKey
krieg-holz-etal-2016-code
Editors
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-9-1
Conference
Tenth International Conference on Language Resources and Evaluation
Location
Portorož, Slovenia
Date
23 - 28 May 2016

Authors

  • Ulrike Krieg-Holz
  • Christian Schuschnig
  • Franz Matthies
  • Benjamin Redling
  • Udo Hahn
