Back to Main Conference 2016
LREC 2016main

Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor’s Love Affair

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

DOI:10.63317/272h8o2ttq9m

Abstract

This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the extracted bitexts from the Croatian top-level domain ".hr" and the Slovene top-level domain ".si", and extrinsically on the English-Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. We finally present parallel datasets collected with our approach for the English-Croatian, English-Finnish, English-Serbian and English-Slovene language pairs.

Details

Paper ID
lrec2016-main-471
Pages
pp. 2949-2956
BibKey
ljubesic-etal-2016-producing
Editors
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-9-1
Conference
Tenth International Conference on Language Resources and Evaluation
Location
Portorož, Slovenia
Date
23 - 28 May 2016

Authors

  • NL

    Nikola Ljubešić

  • ME

    Miquel Esplà-Gomis

  • AT

    Antonio Toral

  • SR

    Sergio Ortiz Rojas

  • FK

    Filip Klubička

Links