Back to Main Conference 2008
LREC 2008main

Automatic Extraction of Textual Elements from News Web Pages

Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008)

DOI:10.63317/4st3qvisjpjj

Abstract

In this paper we present an algorithm for automatic extraction of textual elements, namely titles and full text, associated with news stories in news web pages. We propose a supervised machine learning classification technique based on the use of a Support Vector Machine (SVM) classifier to extract the desired textual elements. The technique uses internal structural features of a webpage without relying on the Document Object Model to which many content authors fail to adhere. The classifier uses a set of features which rely on the length of text, the percentage of hypertext, etc. The resulting classifier is nearly perfect on previously unseen news pages from different sites. The proposed technique is successfully employed in Alzoa.com, which is the largest Arabic news aggregator on the web.

Details

Paper ID
lrec2008-main-300
Pages
N/A
BibKey
ibrahim-etal-2008-automatic
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-4-0
Conference
Sixth International Conference on Language Resources and Evaluation
Location
Marrakech, Morocco
Date
28 May 2008 30 May 2008

Authors

  • HI

    Hossam Ibrahim

  • KD

    Kareem Darwish

  • AM

    Abdel-Rahim Madany

Links