Towards Processing of the Oral History Interviews and Related Printed Documents

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Abstract

In this paper, we describe the initial stages of our project, whose goal is to create an integrated archive of recordings, scanned documents, and photographs that will be accessible online and will provide multifaceted search capabilities (spoken content, biographical information, relevant time period, etc.). The recordings contain retrospective interviews with witnesses of the totalitarian regimes in Czechoslovakia; the vocabulary used in such interviews includes many archaic words and named entities that are now quite rare in everyday speech. The scanned documents consist of text materials and photographs, mainly from the home archives of the interviewees or the archive of the State Security. These documents are usually typewritten or even handwritten and are of very poor optical quality. To build the integrated archive, we will mainly employ methods of automatic speech recognition (ASR), automatic indexing and search in the recognized recordings and, to a certain extent, optical character recognition (OCR). Other natural language processing techniques, such as topic detection, are planned for the later stages of the project. This paper focuses on processing the speech data with ASR and the scanned typewritten documents with OCR, and describes the initial experiments.

Details

Paper ID

lrec2018-main-331

Pages

N/A

DOI

10.63317/2joemn8ghqdi

BibKey

zajic-etal-2018-towards

Editors

Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

979-10-95546-00-9

Conference

Eleventh International Conference on Language Resources and Evaluation

Location

Miyazaki, Japan

Date

7 - 12 May 2018

Authors

Zbyněk Zajíc
Lucie Skorkovská
Petr Neduchal
Pavel Ircing
Josef V. Psutka
Marek Hrúz
Aleš Pražák
Daniel Soutner
Jan Švec
Lukáš Bureš
Luděk Müller
