
MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora

Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)

DOI:10.63317/3jcg5oyv4jwz

Abstract

The paper presents the third edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specifications, defining the features that describe word-level syntactic annotations; medium-scale morphosyntactic lexica; and annotated parallel, comparable, and speech corpora. The most important component is the linguistically annotated corpus consisting of Orwell's novel "1984" in the English original and translations. The resources are the results of several EU projects: MULTEXT-East (produced linked resources for Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian and English), TELRI (added resources for Lithuanian, Croatian, Serbian, and Russian; first release), and CONCEDE (validation, re-encoding; partial re-release). This paper presents the third release of the resources, which brings together the first two, makes them available in TEI P4 XML, and introduces further extensions, e.g. the specification for Resian, a dialect of Slovene. This dataset, unique in terms of languages and the wealth of encoding, is extensively documented and freely available for research purposes. The paper presents the component resources, reviews some research undertaken on the basis of the first two editions, and discusses future plans.

Details

Paper ID
lrec2004-main-078
Pages
N/A
BibKey
erjavec-2004-multext
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-1-6
Conference
Fourth International Conference on Language Resources and Evaluation
Location
Lisbon, Portugal
Date
26-28 May 2004

Authors

  • Tomaž Erjavec
