LREC 2002, Main Conference

Producing a Large-scale Encyclopedic Corpus over the Web

Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002)

DOI:10.63317/2hyengvpq66q

Abstract

Encyclopedias, which describe general/technical terms, are valuable language resources (LRs). As with other types of LRs relying on human introspection and supervision, constructing encyclopedias is quite expensive. To resolve this problem, we automatically produced a large-scale encyclopedic corpus over the World Wide Web. We first searched the Web for pages containing a term in question. Then we used linguistic patterns and HTML structures to extract text fragments describing the term. Finally, we organized extracted term descriptions based on domains. The resultant corpus contains approximately 100,000 terms. We also evaluated the quality of 2,000 test terms, and found that correct descriptions were obtained for 65% of test terms.

Details

Paper ID: lrec2002-main-338
Pages: N/A
BibKey: fujii-etal-2002-producing
Editor: N/A
Publisher: European Language Resources Association (ELRA)
ISSN: 2522-2686
ISBN: N/A
Conference: Third International Conference on Language Resources and Evaluation
Location: Las Palmas, Spain
Date: 29-31 May 2002

Authors

  • Atsushi Fujii
  • Katunobu Itou
  • Tetsuya Ishikawa

Links