LREC 2008 Proceedings

Summary of the paper

Title	Subdomain Sensitive Statistical Parsing using Raw Corpora
Authors	Barbara Plank and Khalil Sima’an
Abstract	Modern statistical parsers are trained on large annotated corpora (treebanks). These treebanks usually consist of sentences addressing different subdomains (e.g., sports, politics, music), which implies that the statistics gathered by current statistical parsers are mixtures of subdomains of language use. In this paper, we present a method that exploits raw subdomain corpora gathered from the web to introduce subdomain sensitivity into a given parser. We employ statistical techniques for creating an ensemble of domain-sensitive parsers and explore methods for combining their predictions. Our experiments show that introducing domain sensitivity by exploiting raw corpora can improve over a strong, state-of-the-art baseline.
Language	Single language
Topics	Parsing Systems, Statistical methods, Acquisition, Machine Learning
Full paper	Subdomain Sensitive Statistical Parsing using Raw Corpora
Slides	Subdomain Sensitive Statistical Parsing using Raw Corpora
Bibtex	@InProceedings{PLANK08.120,
  author = {Barbara Plank and Khalil Sima’an},
  title = {Subdomain Sensitive Statistical Parsing using Raw Corpora},
  booktitle = {Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)},
  year = {2008},
  month = {may},
  date = {28-30},
  address = {Marrakech, Morocco},
  editor = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-4-0},
  note = {http://www.lrec-conf.org/proceedings/lrec2008/},
  language = {english}
}