Data Collection and IPR in Multilingual Parallel Corpora. Dutch Parallel Corpus

Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010)

DOI:10.63317/54p45ohhda9x

Abstract

After three years of work, the Dutch Parallel Corpus (DPC) project has come to an end. The finalized corpus is a ten-million-word, high-quality, sentence-aligned bidirectional parallel corpus of Dutch, English and French, with Dutch as the central language. In this paper we present the corpus and formulate some basic data collection principles based on the work carried out for the project. Building a corpus is a difficult and time-consuming task, especially when copyright has to be cleared for every text sample included. The DPC is balanced across five text types (literature, journalistic texts, instructive texts, administrative texts and texts treating external communication) and four translation directions (Dutch-English, English-Dutch, Dutch-French and French-Dutch). Copyright was cleared for all of the text material. The data collection process required the involvement of different text providers, which resulted in four different licence agreements being drawn up. Problems such as an unknown source language, copyright issues and changes to the corpus design are discussed in detail and illustrated with examples, so as to be of help to future corpus compilers.

Details

Paper ID
lrec2010-main-137
Pages
N/A
BibKey
de-clercq-perez-2010-data
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-6-7
Conference
Seventh International Conference on Language Resources and Evaluation
Location
Valletta, Malta
Date
17 May 2010 to 23 May 2010

Authors

  • Orphée De Clercq

  • Maribel Montero Perez

Links