Back to Main Conference 2006
LREC 2006main

Preprocessing and Tokenisation Standards in DELPH-IN Tools

Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006)

DOI:10.63317/3n5zp7yqbdn4

Abstract

We discuss preprocessing and tokenisation standards within DELPH-IN, a large scale open-source collaboration providing multiple independent multilingual shallow and deep processors. We discuss (i) a component-specific XML interface format which has been used for some time to interface preprocessor results to the PET parser, and (ii) our implementation of a more generic XML interface format influenced heavily by the (ISO working draft) Morphosyntactic Annotation Framework (MAF). Our generic format encapsulates the information which may be passed from the preprocessing stage to a parser: it uses standoff-annotation, a lattice for the representation of structural ambiguity, intra-annotation dependencies and allows for highly structured annotation content. This work builds on the existing Heart of Gold middleware system, and previous work on Robust Minimal Recursion Semantics (RMRS) as part of an inter-component interface. We give examples of usage with a number of the DELPH-IN processing components and deep grammars.

Details

Paper ID
lrec2006-main-115
Pages
N/A
BibKey
waldron-etal-2006-preprocessing
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-2-4
Conference
Fifth International Conference on Language Resources and Evaluation
Location
Genoa, Italy
Date
24 May 2006 26 May 2006

Authors

  • BW

    Benjamin Waldron

  • AC

    Ann Copestake

  • US

    Ulrich Schäfer

  • BK

    Bernd Kiefer

Links