Back to Main Conference 2002
LREC 2002main

Measuring corpus homogeneity using a range of measures for inter-document distance

Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002)

DOI:10.63317/5avq3qndtjqm

Abstract

With the ever more widespread use of corpora in language research, it is becoming increasingly important to be able to describe and compare corpora. The analysis of corpus homogeneity is preliminary to any quantitative approach to corpora comparison. We describe a method for text analysis based only on document-internal linguistic features, and a set of related homogeneity measures based on inter-document distance. We present a preliminary experiment to validate the hypothesis that in the presence of a homogeneous corpus the subcorpus that is necessary to train an NLP system is smaller than the one required if a heterogeneous corpus is used.Overhead projector

Details

Paper ID
lrec2002-main-232
Pages
N/A
BibKey
cavaglia-2002-measuring
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
N/A
Conference
Third International Conference on Language Resources and Evaluation
Location
Las Palmas, Spain
Date
29 May 2002 31 May 2002

Authors

  • GC

    Gabriela Cavaglià

Links