
A Framework for Evaluating the Suitability of Non-English Corpora for Language Engineering

Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)

DOI:10.63317/47vg5kbx8dcn

Abstract

In this paper we develop a framework for fast profiling and quality verification of datasets for language engineering and information retrieval research. Profiling begins with an initial tokenization of the corpus to produce a frequency list, from which some basic statistics are derived. Manual sampling is carried out to detect obvious discrepancies. Two diagnostic tests are performed to check sparseness-related measures. The behaviour of function words is traced to gauge the homogeneity of their distribution across documents.
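The profiling steps named in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say which statistics or diagnostic tests are used, so the regex tokenizer, the type/token and hapax ratios, and the document-frequency measure for function words are all assumptions chosen for the example, and the names `tokenize`, `profile`, and `document_frequency` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on runs of Unicode word characters.
    # Real non-English corpora may need script-specific tokenization.
    return re.findall(r"\w+", text.lower())


def profile(documents):
    # Build a corpus-wide frequency list and derive a few basic statistics.
    freq = Counter()
    for doc in documents:
        freq.update(tokenize(doc))
    n_tokens = sum(freq.values())
    n_types = len(freq)
    # Hapax legomena: types that occur exactly once (a common sparseness cue).
    hapaxes = sum(1 for count in freq.values() if count == 1)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "hapax_ratio": hapaxes / n_types if n_types else 0.0,
        "frequency_list": freq.most_common(),
    }


def document_frequency(documents, word):
    # Fraction of documents that contain `word` at least once; a crude
    # stand-in for tracing how evenly a function word is distributed.
    word = word.lower()
    hits = sum(1 for doc in documents if word in set(tokenize(doc)))
    return hits / len(documents) if documents else 0.0


# Toy corpus, purely illustrative.
docs = ["the cat sat on the mat", "the dog barked", "a bird sang"]
stats = profile(docs)
the_spread = document_frequency(docs, "the")
```

A function word that appears in only a small fraction of documents, or a very high hapax ratio, would be the kind of signal that prompts the manual sampling step the abstract mentions.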

Details

Paper ID
lrec2004-main-288
Pages
N/A
BibKey
sarkar-de-roeck-2004-framework
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-1-6
Conference
Fourth International Conference on Language Resources and Evaluation
Location
Lisbon, Portugal
Date
26 May 2004 to 28 May 2004

Authors

  • Avik Sarkar

  • Anne De Roeck
