LREC 2006 - Proceedings sorted by papers

Title	Language Specific and Topic Focused Web Crawling
Authors	O. Medelyan, S. Schulz, J. Paetzold, M. Poprat, K. Markó
Abstract	We describe an experiment on automatically collecting large language- and topic-specific corpora by using a focused Web crawler. Our crawler combines efficient crawling techniques with a common text classification tool. Given a sample corpus of medical documents, we automatically extract query phrases and then acquire seed URLs with a standard search engine. Starting from these seed URLs, the crawler builds a new large collection consisting only of documents that satisfy both the language and the topic model. The manual analysis of the acquired English and German medical corpora reveals the high accuracy of the crawler. However, there are significant differences between the two languages.
Keywords
Full paper	Language Specific and Topic Focused Web Crawling
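The crawl loop outlined in the abstract (start from seed URLs, keep only documents that pass both a language filter and a topic filter) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the in-memory `TOY_WEB`, the stopword-ratio language check, the keyword-overlap topic check, and the choice to follow links only from accepted pages are all assumptions made for the example. The paper itself uses a text classification tool in place of these stand-in filters.

```python
from collections import deque

# Hypothetical toy "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a": ("the patient received therapy for the infection", ["b", "c", "d"]),
    "b": ("der patient erhielt eine therapie", ["d"]),
    "c": ("the football match ended in a draw", []),
    "d": ("the clinical diagnosis of the disease was confirmed", ["a"]),
}

# Stand-in models (assumptions): a real system would use trained classifiers.
ENGLISH_STOPWORDS = {"the", "of", "for", "in", "was", "a"}
MEDICAL_TERMS = {"patient", "therapy", "infection",
                 "clinical", "diagnosis", "disease"}


def is_english(text, min_ratio=0.2):
    """Crude language check: share of tokens that are English stopwords."""
    tokens = text.split()
    if not tokens:
        return False
    hits = sum(token in ENGLISH_STOPWORDS for token in tokens)
    return hits / len(tokens) >= min_ratio


def is_medical(text, min_hits=2):
    """Crude topic check: number of distinct medical terms in the text."""
    return len(MEDICAL_TERMS & set(text.split())) >= min_hits


def focused_crawl(seeds, fetch, accept):
    """Breadth-first crawl that keeps only pages for which accept(text) holds.

    Assumption: links are followed only from accepted pages.
    """
    queue = deque(seeds)
    seen = set(seeds)
    kept = []
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:          # unreachable or unknown URL
            continue
        text, links = page
        if not accept(text):      # fails the language or topic filter
            continue
        kept.append(url)
        for link in links:
            if link not in seen:  # avoid revisiting pages
                seen.add(link)
                queue.append(link)
    return kept


corpus = focused_crawl(
    seeds=["a"],
    fetch=TOY_WEB.get,
    accept=lambda text: is_english(text) and is_medical(text),
)
print(corpus)
```

In this toy run the German page and the off-topic English page are fetched but discarded, so only pages passing both filters end up in the collection.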
