LREC 2022 main

Design and Evaluation of the Corpus of Everyday Japanese Conversation

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/2f5r3h7e2q5d

Abstract

We have constructed the Corpus of Everyday Japanese Conversation (CEJC) and published it in March 2022. The CEJC is designed to contain various kinds of everyday conversations in a balanced manner to capture their diversity. The CEJC features not only audio but also video data to facilitate precise understanding of the mechanism of real-life social behavior. The publication of a large-scale corpus of everyday conversations that includes video data is a new approach. The CEJC contains 200 hours of speech, 577 conversations, about 2.4 million words, and a total of 1675 conversants. In this paper, we present an overview of the corpus, including the recording method and devices, structure of the corpus, formats of video and audio files, transcription, and annotations. We then report some results of the evaluation of the CEJC in terms of conversant and conversation attributes. We show that the CEJC includes a good balance of adult conversants in terms of gender and age, as well as a variety of conversations in terms of conversation forms, places, activities, and numbers of conversants.

Details

Paper ID
lrec2022-main-599
Pages
pp. 5587-5594
BibKey
koiso-etal-2022-design
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
979-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20 June 2022 - 25 June 2022

Authors

  • Hanae Koiso
  • Haruka Amatani
  • Yasuharu Den
  • Yuriko Iseki
  • Yuichi Ishimoto
  • Wakako Kashino
  • Yoshiko Kawabata
  • Ken’ya Nishikawa
  • Yayoi Tanaka
  • Yasuyuki Usuda
  • Yuka Watanabe

Links