LREC 2022 Main Conference

SciPar: A Collection of Parallel Corpora from Scientific Abstracts

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/3jdy6o7bkxhj

Abstract

This paper presents SciPar, a new collection of parallel corpora created from openly available metadata of bachelor theses, master theses and doctoral dissertations hosted in institutional repositories, university digital libraries and national archives. We first describe how we harvested and processed metadata from 86 repositories, most of them European, to extract bilingual titles and abstracts, and then how we mined high-quality sentence pairs across a wide range of scientific areas and sub-disciplines. In total, the resource includes 9.17 million segment alignments in 31 language pairs and is publicly available via the ELRC-SHARE repository. The bilingual corpora in this collection could prove valuable in various applications, such as cross-lingual plagiarism detection or adapting Machine Translation systems for the translation of scientific texts and academic writing in general, especially for language pairs that include English.

Details

Paper ID
lrec2022-main-284
Pages
pp. 2652-2657
BibKey
roussis-etal-2022-scipar
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
979-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20-25 June 2022

Authors

  • Dimitrios Roussis
  • Vassilis Papavassiliou
  • Prokopis Prokopidis
  • Stelios Piperidis
  • Vassilis Katsouros

Links