Back to Main Conference 2012
LREC 2012main

Syntactic annotation of spontaneous speech: application to call-center conversation data

Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012)

DOI:10.63317/43xtqkc3yiw8

Abstract

This paper describes the syntactic annotation process of the DECODA corpus. This corpus contains manual transcriptions of spoken conversations recorded in the French call-center of the Paris Public Transport Authority (RATP). Three levels of syntactic annotation have been performed with a semi-supervised approach: POS tags, Syntactic Chunks and Dependency parses. The main idea is to use off-the-shelf NLP tools and models, originaly developped and trained on written text, to perform a first automatic annotation on the manually transcribed corpus. At the same time a fully manual annotation process is performed on a subset of the original corpus, called the GOLD corpus. An iterative process is then applied, consisting in manually correcting errors found in the automatic annotations, retraining the linguistic models of the NLP tools on this corrected corpus, then checking the quality of the adapted models on the fully manual annotations of the GOLD corpus. This process iterates until a certain error rate is reached. This paper describes this process, the main issues raising when adapting NLP tools to process speech transcriptions, and presents the first evaluations performed with these new adapted tools.

Details

Paper ID
lrec2012-main-397
Pages
pp. 1338-1342
BibKey
bazillon-etal-2012-syntactic
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-7-7
Conference
Eighth International Conference on Language Resources and Evaluation
Location
Istanbul, Turkey
Date
21 May 2012 27 May 2012

Authors

  • TB

    Thierry Bazillon

  • MD

    Melanie Deplano

  • FB

    Frederic Bechet

  • AN

    Alexis Nasr

  • BF

    Benoit Favre

Links