
A Large-Scale Web Data Collection as a Natural Language Processing Infrastructure

Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008)

DOI:10.63317/2566vxbysbsf

Abstract

In recent years, language resources acquired from the Web have been released, and these data have improved the performance of applications in several NLP tasks. Although language resources organized by web page are useful for NLP tasks and applications such as knowledge acquisition, document retrieval, and document summarization, no such resources have been released so far. In this paper, we propose a data format for the results of web page processing, and a search engine infrastructure that makes it possible to share approximately 100 million Japanese web pages. By obtaining these web data, NLP researchers can begin their own processing immediately, without having to analyze the web pages themselves.

Details

Paper ID
lrec2008-main-417
Pages
N/A
BibKey
shinzato-etal-2008-large
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-4-0
Conference
Sixth International Conference on Language Resources and Evaluation
Location
Marrakech, Morocco
Date
28 May 2008 to 30 May 2008

Authors

  • Keiji Shinzato

  • Daisuke Kawahara

  • Chikara Hashimoto

  • Sadao Kurohashi
