LREC 2004 main

Making Monolingual Corpora Comparable: a Case Study of Bulgarian and Croatian

Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)

DOI:10.63317/4ftorx69ehby

Abstract

This paper describes the first steps towards the creation of a Bulgarian-Croatian comparable corpus. It is built on two newspaper subcorpora drawn from larger reference corpora of Bulgarian and Croatian. Initially, we rely on similarity parameters that are more extralinguistically oriented but methodologically cleaner, such as specific topics, a pre-defined time span, and data size. The idea of 'light' and 'hard' comparable corpora is introduced; at this stage we aim to produce a 'light' bilingual comparable corpus. The algorithm for identifying lexical similarity and aligning linguistic units is presented, and the initial experiments are outlined.

Details

Paper ID: lrec2004-main-323
Pages: N/A
BibKey: bekavac-etal-2004-making
Editor: N/A
Publisher: European Language Resources Association (ELRA)
ISSN: 2522-2686
ISBN: 2-9517408-1-6
Conference: Fourth International Conference on Language Resources and Evaluation
Location: Lisbon, Portugal
Date: 26-28 May 2004

Authors

  • Božo Bekavac
  • Petya Osenova
  • Kiril Simov
  • Marko Tadić
