A Closer Look at Clustering Bilingual Comparable Corpora

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/2ry2txxpkoje

Abstract

In this paper, we study the problem of clustering comparable corpora, building on the observation that such corpora can contain different types of clusters: monolingual clusters, comprising documents in a single language, and bilingual or multilingual clusters, comprising documents written in different languages. Based on a state-of-the-art deep variant of Kmeans, we propose new clustering models fully adapted to comparable corpora and illustrate their behavior on several bilingual collections (in English, French, German and Russian) created from Wikipedia.

Details

Paper ID
lrec2024-main-0012
Pages
pp. 133-142
BibKey
laskina-etal-2024-closer
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20 May 2024 to 25 May 2024

Authors

  • Anna Laskina

  • Eric Gaussier

  • Gaelle Calvary

Links