MultiUN: A Multilingual Corpus from United Nation Documents

Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010)

DOI:10.63317/5co5op3ipz42

Abstract

This paper describes the acquisition, preparation and properties of a corpus extracted from the official documents of the United Nations (UN). The corpus is available in all six official languages of the UN and contains around 300 million words per language. We describe the methods we used for crawling, document formatting, and sentence alignment. The corpus also includes a common test set for machine translation. We present the results of a French-Chinese machine translation experiment performed on this corpus.

Details

Paper ID
lrec2010-main-473
Pages
N/A
BibKey
eisele-chen-2010-multiun
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-6-7
Conference
Seventh International Conference on Language Resources and Evaluation
Location
Valletta, Malta
Date
17 May 2010 to 23 May 2010

Authors

  • Andreas Eisele

  • Yu Chen
