DiNoS: Creating a Data-Driven German Noun Phrase Lexicon from Universal Dependencies
Proceedings of the Workshop on Structured Linguistic Data and Evaluation (SLiDE)
Abstract
To foster large-scale investigations of noun phrase (NP) inflection in German, this paper introduces DiNoS (Distributional Noun Structure), a data-driven lexicon of NP heads that records statistical information on their dependents and on the morphosyntactic features of their in-context occurrences. We make available the source code for extracting NPs from CoNLL-U treebanks; it includes rule-based heuristics that improve feature annotation coverage, and it ensures a homogeneous lemmatisation strategy across treebanks. The resulting JSON-based lexicon is suitable for no-code interaction by non-experts and is additionally supported by a toolkit for automatically computing and accessing various statistical overviews. In this paper, we present the heuristics employed to extract NP datasets from two German Universal Dependencies treebanks, the Hamburg Dependency Treebank (HDT) and GSD. In addition, we provide a preview of the properties of the emerging DiNoS lexica and discuss some implications of noun and determiner word-form ambiguity for NP complexity.
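To make the kind of extraction described above concrete, the following is a minimal sketch, not the DiNoS implementation. It reads CoNLL-U text, treats every NOUN or PROPN token as an NP head, and counts its morphosyntactic features and a few dependent types per lemma. The function names (`parse_conllu`, `build_lexicon`), the JSON field names (`count`, `dependents`, `feats`), the chosen dependency relations and the sample sentence are all illustrative assumptions; the actual DiNoS schema and heuristics may differ.

```python
import json
from collections import Counter, defaultdict

# Hypothetical one-sentence sample in CoNLL-U format (10 tab-separated columns).
SAMPLE = (
    "# text = Der alte Hund schläft.\n"
    "1\tDer\tder\tDET\tART\tCase=Nom|Gender=Masc|Number=Sing\t3\tdet\t_\t_\n"
    "2\talte\talt\tADJ\tADJA\tCase=Nom|Gender=Masc|Number=Sing\t3\tamod\t_\t_\n"
    "3\tHund\tHund\tNOUN\tNN\tCase=Nom|Gender=Masc|Number=Sing\t4\tnsubj\t_\t_\n"
    "4\tschläft\tschlafen\tVERB\tVVFIN\tNumber=Sing|Person=3\t0\troot\t_\t_\n"
    "5\t.\t.\tPUNCT\t$.\t_\t4\tpunct\t_\t_\n"
    "\n"
)


def parse_conllu(text):
    """Yield each sentence as a list of token dicts (plain word lines only)."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line terminates the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        feats = {}
        if cols[5] != "_":
            feats = dict(pair.split("=", 1) for pair in cols[5].split("|"))
        sentence.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "feats": feats,
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],
        })
    if sentence:
        yield sentence


def build_lexicon(sentences, np_deprels=("det", "amod", "nummod")):
    """Aggregate per-lemma counts of features and selected dependents.

    The set of relations in `np_deprels` is an assumption for this sketch.
    """
    lexicon = defaultdict(
        lambda: {"count": 0, "dependents": Counter(), "feats": Counter()}
    )
    for sent in sentences:
        for tok in sent:
            if tok["upos"] not in ("NOUN", "PROPN"):
                continue
            entry = lexicon[tok["lemma"]]
            entry["count"] += 1
            for key, value in tok["feats"].items():
                entry["feats"][f"{key}={value}"] += 1
            # Direct dependents of the head that bear one of the chosen relations.
            for dep in sent:
                if dep["head"] == tok["id"] and dep["deprel"] in np_deprels:
                    label = f"{dep['deprel']}:{dep['form'].lower()}"
                    entry["dependents"][label] += 1
    return lexicon


lexicon = build_lexicon(parse_conllu(SAMPLE))
print(json.dumps(lexicon, indent=2, ensure_ascii=False))
```

Keying entries by lemma and storing plain counters keeps the output directly serialisable as JSON, so it can be inspected without any code; a fuller pipeline would add the coverage and lemmatisation heuristics that this sketch leaves out.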