Back to Main Conference 2010
LREC 2010main

Constructing the CODA Corpus: A Parallel Corpus of Monologues and Expository Dialogues

Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010)

DOI:10.63317/2gy24a54qyg6

Abstract

We describe the construction of the CODA corpus, a parallel corpus of monologues and expository dialogues. The dialogue part of the corpus consists of expository, i.e., information-delivering rather than dramatic, dialogues written by several acclaimed authors. The monologue part of the corpus is a paraphrase in monologue form of these dialogues by a human annotator. The annotator-written monologue preserves all information present in the original dialogue and does not introduce any new information that is not present in the original dialogue. The corpus was constructed as a resource for extracting rules for automated generation of dialogue from monologue. Using authored dialogues allows us to analyse the techniques used by accomplished writers for presenting information in the form of dialogue. The dialogues are annotated with dialogue acts and the monologues with rhetorical structure. We developed annotation and translation guidelines together with a custom-developed tool for carrying out translation, alignment and annotation of the dialogues. The final parallel CODA corpus consists of 1000 dialogue turns that are tagged with dialogue acts and aligned with monologue that expresses the same information and has been annotated with rhetorical structure relations.

Details

Paper ID
lrec2010-main-079
Pages
N/A
BibKey
stoyanchev-piwek-2010-constructing
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-6-7
Conference
Seventh International Conference on Language Resources and Evaluation
Location
Valletta, Malta
Date
17 May 2010 23 May 2010

Authors

  • SS

    Svetlana Stoyanchev

  • PP

    Paul Piwek

Links