Back to Main Conference 2014
LREC 2014main

HamleDT 2.0: Thirty Dependency Treebanks Stanfordized

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014)

DOI:10.63317/2ygvvxmvsj9h

Abstract

We present HamleDT 2.0 (HArmonized Multi-LanguagE Dependency Treebank). HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a treebank annotation style that became popular in recent years. We use the newest basic Universal Stanford Dependencies, without added language-specific subtypes. We describe both of the annotation styles, including adjustments that were necessary to make, and provide details about the conversion process. We also discuss the differences between the two styles, evaluating their advantages and disadvantages, and note the effects of the differences on the conversion. We regard the stanfordization as generally successful, although we admit several shortcomings, especially in the distinction between direct and indirect objects, that have to be addressed in future. We release part of HamleDT 2.0 freely; we are not allowed to redistribute the whole dataset, but we do provide the conversion pipeline.

Details

Paper ID
lrec2014-main-699
Pages
pp. 2334-2341
BibKey
rosa-etal-2014-hamledt
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-8-4
Conference
Ninth International Conference on Language Resources and Evaluation
Location
Reykjavik, Iceland
Date
26 May 2014 31 May 2014

Authors

  • RR

    Rudolf Rosa

  • JM

    Jan Mašek

  • DM

    David Mareček

  • MP

    Martin Popel

  • DZ

    Daniel Zeman

  • Zdeněk Žabokrtský

Links