
How Specialized are Specialized Corpora? Behavioral Evaluation of Corpus Representativeness for Maltese.

Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010)

DOI:10.63317/4awanv6bvpd6

Abstract

In this paper we bring to light a novel intersection between corpus linguistics and behavioral data that can be employed as an evaluation metric for resources for low-density languages, drawing on well-established psycholinguistic factors. Using the low-density language Maltese as a test case, we highlight the challenges that face researchers developing resources for languages with sparsely available data and identify a key empirical link between corpus and psycholinguistic research as a tool to evaluate corpus resources. Specifically, we compare two robust variables identified in the psycholinguistic literature: word frequency (as measured in a corpus) and word familiarity (as measured in a rating task). We then apply statistical methods to evaluate the extent to which familiarity ratings predict corpus frequency for verbs in the Maltese corpus from three angles: 1) token frequency, 2) frequency distributions, and 3) morpho-syntactic type (binyan). This research provides a multidisciplinary approach to corpus development and evaluation, in particular for less-resourced languages that lack wide access to diverse language data.

Details

Paper ID
lrec2010-main-454
Pages
N/A
BibKey
francom-etal-2010-specialized
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-6-7
Conference
Seventh International Conference on Language Resources and Evaluation
Location
Valletta, Malta
Date
17-23 May 2010

Authors

  • Jerid Francom

  • Amy LaCross

  • Adam Ussishkin
