Combining Bayesian and Support Vector Machines Learning to automatically complete Syntactical Information for HPSG-like Formalisms
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002)
Abstract
Learning Bayesian Belief Networks (BBN) from corpora and incorporating the extracted inferential knowledge into a Support Vector Machines (SVM) classifier has been applied to the automatic acquisition of verb subcategorization frames for Modern Greek. We have used minimal linguistic resources, namely basic morphological tagging and phrase chunking, to demonstrate that verb subcategorization, which is of great significance for developing robust natural-language human-computer interaction systems, can be acquired from large corpora without any general-purpose syntactic parser. Moreover, by exploiting the abundance of unlabeled data in text corpora alongside a small number of labeled examples, we avoid the expensive task of annotating the entire training set while increasing the performance of the subcategorization-frame learner. We argue that a classifier built from BBN and SVM is well suited to identifying verb subcategorization frames, and empirical results support this claim. Performance has been evaluated methodically on two different corpora, one balanced and one domain-specific, in order to assess the unbiased behavior of the trained models. Even limited training data prove sufficient for satisfactory results: we achieve precision exceeding 90% in identifying subcategorization frames that were not known beforehand. The valid frames obtained have been used to fill the subcategorization field of verb entries in an HPSG-like lexicon within the LKB grammar development environment.
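The general idea of the abstract — a Bayesian model trained on a few labeled examples confidently labels unlabeled corpus data, and an SVM is then trained on the enlarged set — can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: a Naive Bayes model (the simplest Bayesian network) stands in for the learned BBN, a Pegasos-style linear SVM stands in for a full SVM package, and the chunk-cue feature names, the data, and the 0.85 confidence threshold are all invented.

```python
import random

class NaiveBayes:
    """Binary-feature Naive Bayes with Laplace smoothing (classes 0 and 1)."""
    def fit(self, X, y):
        self.prior = {c: y.count(c) / len(y) for c in (0, 1)}
        self.like = {}
        for c in (0, 1):
            rows = [x for x, lab in zip(X, y) if lab == c]
            # P(feature_j = 1 | class c), smoothed
            self.like[c] = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                            for j in range(len(X[0]))]
        return self

    def prob_pos(self, x):
        """P(class = 1 | x) under the independence assumption."""
        score = {}
        for c in (0, 1):
            p = self.prior[c]
            for j, v in enumerate(x):
                p *= self.like[c][j] if v else 1.0 - self.like[c][j]
            score[c] = p
        return score[1] / (score[0] + score[1])

def train_svm(X, y, lam=0.01, epochs=300, seed=0):
    """Linear SVM via the (averaged) Pegasos stochastic subgradient method.
    X: vectors with a bias term appended by the caller; y: labels in {-1, +1}."""
    rng = random.Random(seed)
    d = len(X[0])
    w, w_sum, t = [0.0] * d, [0.0] * d, 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]      # regularization step
            if margin < 1.0:                              # hinge-loss violation
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
            w_sum = [s + wj for s, wj in zip(w_sum, w)]
    return [s / t for s in w_sum]                         # averaged iterate

# Toy data: each vector encodes hypothetical chunk cues around a verb,
# e.g. [NP chunk follows, PP chunk follows, clause follows]; label 1 means
# the verb licenses the candidate frame. All values are made up.
labeled_X = [[1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0], [0, 0, 0]]
labeled_y = [1, 1, 1, 0, 0, 0]
unlabeled_X = [[1, 1, 1], [0, 0, 0], [1, 0, 0]]

nb = NaiveBayes().fit(labeled_X, labeled_y)

# Keep only unlabeled examples the Bayesian model labels confidently;
# the rest are left out rather than risk polluting the training set.
pseudo = []
for x in unlabeled_X:
    p = nb.prob_pos(x)
    if p >= 0.85:
        pseudo.append((x, 1))
    elif p <= 0.15:
        pseudo.append((x, 0))

train_X = labeled_X + [x for x, _ in pseudo]
train_y = labeled_y + [lab for _, lab in pseudo]
w = train_svm([x + [1.0] for x in train_X],
              [1 if lab == 1 else -1 for lab in train_y])

def svm_predict(x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x + [1.0])) > 0 else 0
```

The confidence gate is the point of the combination: the Bayesian model abstains on ambiguous unlabeled examples (here, the third one), so the SVM trains only on labeled data plus high-confidence pseudo-labels.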