Back to Main Conference 2014
LREC 2014main

Identification of Multiword Expressions in the brWaC

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014)

DOI:10.63317/2ifk59osp55n

Abstract

Although corpus size is a well known factor that affects the performance of many NLP tasks, for many languages large freely available corpora are still scarce. In this paper we describe one effort to build a very large corpus for Brazilian Portuguese, the brWaC, generated following the Web as Corpus kool initiative. To indirectly assess the quality of the resulting corpus we examined the impact of corpus origin in a specific task, the identification of Multiword Expressions with association measures, against a standard corpus. Focusing on nominal compounds, the expressions obtained from each corpus are of comparable quality and indicate that corpus origin has no impact on this task.

Details

Paper ID
lrec2014-main-429
Pages
pp. 728-735
BibKey
boos-etal-2014-identification
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-8-4
Conference
Ninth International Conference on Language Resources and Evaluation
Location
Reykjavik, Iceland
Date
26 May 2014 31 May 2014

Authors

  • RB

    Rodrigo Boos

  • KP

    Kassius Prestes

  • AV

    Aline Villavicencio

Links