Back to Main Conference 2006
LREC 2006main

Multilingual parallel treebanking: a lean and flexible approach

Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006)

DOI:10.63317/5jvmnd7v6y2y

Abstract

We propose a bootstrapping approach to creating a phrase-level alignment over a sentence-aligned parallel corpus, reporting concrete treebank annotation work performed on a sample of sentence tuples from the Europarl corpus, currently for English, French, German, and Spanish. The manually annotated seed data will be used as the basis for automatically labelling the rest of the corpus. Some preliminary experiments addressing the bootstrapping aspects are presented.The representation format for syntactic correspondence across parallel text that we propose as the starting point for a process of successive refinement emphasizes correspondences of major constituents that realize semantic arguments or modifiers; language-particular details of morphosyntactic realization are intentionally left largely unlabelled. We believe this format is a good basis for training NLPtools for multilingual application contexts in which consistency across languages is more central than fine-grained details in specific languages (in particular, syntax-based statistical Machine Translation).

Details

Paper ID
lrec2006-main-383
Pages
N/A
BibKey
kuhn-jellinghaus-2006-multilingual
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-2-4
Conference
Fifth International Conference on Language Resources and Evaluation
Location
Genoa, Italy
Date
24 May 2006 26 May 2006

Authors

  • JK

    Jonas Kuhn

  • MJ

    Michael Jellinghaus

Links