A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity Detection

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

Abstract

In this paper we describe our effort to create a dataset for evaluating cross-language textual similarity detection. We review preexisting corpora and their limitations, and explain the resources we gathered to overcome these limitations and build our enriched dataset. The proposed dataset is multilingual, includes cross-language alignment at different granularities (from chunk to document), is based on both parallel and comparable corpora, and contains both human- and machine-translated texts. Moreover, it includes texts written by multiple types of authors (from average writers to professionals). Using this dataset, we conduct a systematic and rigorous evaluation of several state-of-the-art cross-language textual similarity detection methods, and we review and discuss the results. Finally, the dataset and scripts are made publicly available on GitHub: http://github.com/FerreroJeremy/Cross-Language-Dataset.

Details

Paper ID

lrec2016-main-657

Pages

pp. 4162-4169

DOI

10.63317/36kswfbxh3gk

BibKey

ferrero-etal-2016-multilingual

Editors

Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

978-2-9517408-9-1

Conference

Tenth International Conference on Language Resources and Evaluation

Location

Portorož, Slovenia

Date

23 - 28 May 2016

Authors

Jérémy Ferrero
Frédéric Agnès
Laurent Besacier
Didier Schwab