Bayesian Semantics Incorporation to Web Content for Natural Language Information Retrieval

Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)

Abstract

This work addresses information retrieval over Web content using natural language queries. Current markup languages and formalisms do not provide adequate mechanisms for effective and accurate analysis of Web content; instead, they describe the content in a human-centric way. As a result, Internet search engines cannot handle natural language queries. Other approaches rely on grammar markup labels that attempt to fully match an unforeseen query. In this paper, we present the theoretical and implementation issues of a novel statistical framework for Web content analysis and information retrieval using natural language. The framework is based on Bayesian networks, a tool for knowledge representation and reasoning under uncertainty. The Web page designer identifies the lexical items that carry useful information and labels each with its semantic interpretation, chosen from a predefined set of domain categories. This knowledge is used to learn the structure and the parameters of a Bayesian network. When a user’s query arrives, the network is used to return the pages whose semantic content is most closely related to the query.
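To make the retrieval pipeline described above concrete, the following is a minimal illustrative sketch, not the framework's actual implementation. It assumes the simplest possible Bayesian network, a naive Bayes structure in which a category node is the single parent of every lexical-item node, whereas the framework learns the network structure from data. The page names, lexical items, categories, the Laplace smoothing constant `alpha`, and the page scoring rule are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical designer annotations: page -> [(lexical item, semantic category)].
ANNOTATIONS = {
    "flights.html": [("flight", "travel"), ("ticket", "travel"), ("airport", "travel")],
    "loans.html": [("loan", "finance"), ("interest", "finance"), ("ticket", "finance")],
}


def train(annotations, alpha=1.0):
    """Estimate the parameters of a naive Bayes network from labelled items.

    The structure is fixed here (category -> lexical item); this is an
    assumption of the sketch, not a learned structure.
    """
    cat_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for items in annotations.values():
        for word, cat in items:
            cat_counts[cat] += 1
            word_counts[cat][word] += 1
            vocab.add(word)
    return cat_counts, word_counts, vocab, alpha


def posterior(model, query):
    """Return P(category | known query words) with Laplace smoothing."""
    cat_counts, word_counts, vocab, alpha = model
    total = sum(cat_counts.values())
    # Words never annotated by a designer carry no evidence and are skipped.
    words = [w for w in query.lower().split() if w in vocab]
    log_scores = {}
    for cat, n in cat_counts.items():
        log_p = math.log(n / total)
        denom = sum(word_counts[cat].values()) + alpha * len(vocab)
        for w in words:
            log_p += math.log((word_counts[cat][w] + alpha) / denom)
        log_scores[cat] = log_p
    # Normalise in log space for numerical stability.
    peak = max(log_scores.values())
    unnorm = {c: math.exp(s - peak) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


def rank_pages(model, annotations, query):
    """Rank pages by the posterior mass of the categories they are labelled with."""
    post = posterior(model, query)
    scores = {}
    for page, items in annotations.items():
        cats = Counter(cat for _, cat in items)
        n = sum(cats.values())
        scores[page] = sum(post[c] * k / n for c, k in cats.items())
    return sorted(scores, key=scores.get, reverse=True)


model = train(ANNOTATIONS)
ranking = rank_pages(model, ANNOTATIONS, "is there a flight from the airport")
print(ranking)
```

In this toy setting, the query words that match annotated lexical items shift the posterior toward one category, and pages labelled mostly with that category are returned first. A learned network structure could additionally capture dependencies between lexical items that the naive Bayes assumption ignores.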