Back to Main Conference 2014
LREC 2014main

caWaC – A web corpus of Catalan and its application to language modeling and machine translation

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014)

DOI:10.63317/24fez48mgv7y

Abstract

In this paper we present the construction process of a web corpus of Catalan built from the content of the .cat top-level domain. For collecting and processing data we use the Brno pipeline with the spiderling crawler and its accompanying tools. To the best of our knowledge the corpus represents the largest existing corpus of Catalan containing 687 million words, which is a significant increase given that until now the biggest corpus of Catalan, CuCWeb, counts 166 million words. We evaluate the resulting resource on the tasks of language modeling and statistical machine translation (SMT) by calculating LM perplexity and incorporating the LM in the SMT pipeline. We compare language models trained on different subsets of the resource with those trained on the Catalan Wikipedia and the target side of the parallel data used to train the SMT system.

Details

Paper ID
lrec2014-main-647
Pages
pp. 1728-1732
BibKey
ljubesic-toral-2014-cawac
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-8-4
Conference
Ninth International Conference on Language Resources and Evaluation
Location
Reykjavik, Iceland
Date
26 May 2014 31 May 2014

Authors

  • NL

    Nikola Ljubešić

  • AT

    Antonio Toral

Links