Language Specific and Topic Focused Web Crawling

Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006)

DOI:10.63317/4wsshahr4n8u

Abstract

We describe an experiment on automatically collecting large language- and topic-specific corpora with a focused Web crawler. Our crawler combines efficient crawling techniques with a common text classification tool. Given a sample corpus of medical documents, we automatically extract query phrases and then acquire seed URLs with a standard search engine. Starting from these seed URLs, the crawler builds a new, large collection consisting only of documents that satisfy both the language model and the topic model. Manual analysis of the acquired English and German medical corpora reveals the high accuracy of the crawler. However, there are significant differences between the two languages.
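The crawl loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the classification tool, so a stopword-ratio check stands in for the language model and a keyword count stands in for the topic model. All names (`matches_language`, `matches_topic`, `focused_crawl`, `TOY_WEB`), the word lists, and the thresholds are hypothetical, and an in-memory dictionary replaces real HTTP fetching. Seed URLs are assumed to be given already; the query-phrase extraction and search-engine step is not shown.

```python
from collections import deque

# Hypothetical stand-ins for the language and topic models (assumptions).
ENGLISH_STOPWORDS = {"the", "of", "and", "in", "is", "to", "a"}
MEDICAL_TERMS = {"patient", "therapy", "diagnosis", "clinical", "disease"}


def tokens(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    return [t.strip(".,;:()").lower() for t in text.split()]


def matches_language(text, stopwords=ENGLISH_STOPWORDS, min_ratio=0.2):
    """Crude language check: share of tokens that are target-language stopwords."""
    toks = tokens(text)
    if not toks:
        return False
    hits = sum(1 for t in toks if t in stopwords)
    return hits / len(toks) >= min_ratio


def matches_topic(text, terms=MEDICAL_TERMS, min_hits=2):
    """Crude topic check: number of distinct topic terms found in the text."""
    return len(set(tokens(text)) & terms) >= min_hits


def focused_crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl that keeps, and expands, only accepted pages.

    `fetch(url)` must return (text, links) or None if the page is unavailable.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    corpus = {}
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        # A page must satisfy both models; rejected pages are not expanded.
        if not (matches_language(text) and matches_topic(text)):
            continue
        corpus[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return corpus


# Toy in-memory "web": url -> (text, outgoing links).
TOY_WEB = {
    "a": ("The patient is in therapy and the diagnosis is clear.", ["b", "c", "d"]),
    "b": ("The weather in the city is nice and the sun is out.", ["e"]),
    "c": ("Der Patient bekommt eine Therapie gegen die Krankheit.", ["e"]),
    "d": ("The clinical course of the disease is mild in a patient.", []),
    "e": ("The patient and the therapy: a clinical diagnosis.", []),
}

corpus = focused_crawl(["a"], TOY_WEB.get)
print(sorted(corpus))
```

In this toy run, page `b` is English but off-topic and page `c` is German, so both are discarded and their links are never followed. Page `e` is therefore never visited even though it would pass both checks, which shows the trade-off of expanding only accepted pages.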

Details

Paper ID
lrec2006-main-125
Pages
N/A
BibKey
medelyan-etal-2006-language
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-2-4
Conference
Fifth International Conference on Language Resources and Evaluation
Location
Genoa, Italy
Date
24-26 May 2006

Authors

  • Olena Medelyan
  • Stefan Schulz
  • Jan Paetzold
  • Michael Poprat
  • Kornél Markó