Back to Main Conference 2000
LREC 2000main

A Flexible Infrastructure for Large Monolingual Corpora

Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000)

DOI:10.63317/5fz67iuvztv2

Abstract

In this paper we describe a flexible and portable infrastructure for setting up large monolingual language corpora. The approach is based on collecting a large amount of monolingual text from various sources. The input data is processed on the basis of a sentence-based text segmentation algorithm. We describe the entry structure of the corpus database as well as various query types and tools for information extraction. Among them, the extraction and usage of sentence-based word collocations is discussed in detail. Finally we give an overview of different application for this language resource. A WWW interface allows for public access to most of the data and information extraction tools (http://wortschatz.uni-leipzig.de).

Details

Paper ID
lrec2000-main-172
Pages
N/A
BibKey
quasthoff-wolff-2000-flexible
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
N/A
Conference
Second International Conference on Language Resources and Evaluation
Location
Athens, Greece
Date
31 May 2000 2 June 2000

Authors

  • UQ

    Uwe Quasthoff

  • CW

    Christian Wolff

Links