LREC-COLING 2024 Main

Monolingual Paraphrase Detection Corpus for Low Resource Pashto Language at Sentence Level

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/5ivff8cddd57

Abstract

Paraphrase detection is the task of identifying whether two sentences are semantically equivalent. It plays an important role in maintaining the integrity of written work, for example in plagiarism detection and text reuse detection. To date, researchers have focused on developing large corpora for English, and no research has addressed sentence-level paraphrase detection in the low-resource Pashto language. To bridge this gap, we introduce the first fully manually annotated Pashto sentential paraphrase detection corpus, collected from authentic cases in journalism and covering 10 domains, including Sports, Health, and Environment. Our proposed corpus contains 6,727 sentence pairs: 3,687 paraphrased and 3,040 non-paraphrased. Experimental findings reveal that our corpus is sufficient to train XLM-RoBERTa to accurately detect paraphrased sentence pairs in Pashto, with an F1 score of 84%. To compare our corpus with those in other languages, we also applied our fine-tuned model to Indonesian and English paraphrase datasets in a zero-shot manner, achieving F1 scores of 82% and 78%, respectively. This result indicates that the quality of our corpus is comparable to commonly used datasets. It is a pioneering contribution to the field. We will release a subset of 1,800 instances from our corpus, free from any licensing issues.

Details

Paper ID
lrec2024-main-1011
Pages
pp. 11574-11581
BibKey
ali-etal-2024-monolingual
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20–25 May 2024

Authors

  • Iqra Ali

  • Hidetaka Kamigaito

  • Taro Watanabe
