A Web-based Text Corpora Development System

Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000)

DOI:10.63317/3qdqmpgveewi

Abstract

One of the most important starting points for any NLP endeavor is the construction of text corpora of appropriate size and quality. This paper presents a web-based text corpora development system that addresses both the size and the quality of these corpora. The problem of size is addressed by using the Internet as a practically limitless source of texts. To ensure quality, the system enriches the text with the information needed for further use, treating morpho-syntactic annotation, lexical ambiguity resolution, and diacritic restoration in an integrated manner. Although the system currently targets Romanian texts, it can be adapted to other languages, provided that appropriate auxiliary resources are available.

Details

Paper ID
lrec2000-main-079
Pages
N/A
BibKey
bohus-boldea-2000-web
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
N/A
Conference
Second International Conference on Language Resources and Evaluation
Location
Athens, Greece
Date
31 May 2000 – 2 June 2000

Authors

  • Dan Bohuş

  • Marian Boldea
