Prague Czech-English Dependency Treebank. Syntactically Annotated Resources for Machine Translation
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)
Abstract
This paper introduces the Prague Czech-English Dependency Treebank (PCEDT), a new Czech-English parallel resource suitable for experiments in structural machine translation. We describe the process of building the core parts of the resources: a bilingual syntactically annotated corpus and translation dictionaries. A part of the Penn Treebank has been translated into Czech and its annotation transformed into a dependency annotation scheme. The annotation of Czech was done automatically from plain text. A subset of corresponding Czech and English sentences has been annotated by humans. The resources, which are being created at Charles University in Prague, are scheduled for release as a Linguistic Data Consortium data collection in 2004. First experiments in Czech-English machine translation using these data have already been carried out.