Back to Main Conference 2010
LREC 2010main

Efficiently Extract Rrecurring Tree Fragments from Large Treebanks

Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010)

DOI:10.63317/3whev4f9uxen

Abstract

In this paper we describe FragmentSeeker, a tool which is capable to identify all those tree constructions which are recurring multiple times in a large Phrase Structure treebank. The tool is based on an efficient kernel-based dynamic algorithm, which compares every pair of trees of a given treebank and computes the list of fragments which they both share. We describe two different notions of fragments we will use, i.e. standard and partial fragments, and provide the implementation details on how to extract them from a syntactically annotated corpus. We have tested our system on the Penn Wall Street Journal treebank for which we present quantitative and qualitative analysis on the obtained recurring structures, as well as provide empirical time performance. Finally we propose possible ways our tool could contribute to different research fields related to corpus analysis and processing, such as parsing, corpus statistics, annotation guidance, and automatic detection of argument structure.

Details

Paper ID
lrec2010-main-420
Pages
N/A
BibKey
sangati-etal-2010-efficiently
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-6-7
Conference
Seventh International Conference on Language Resources and Evaluation
Location
Valletta, Malta
Date
17 May 2010 23 May 2010

Authors

  • FS

    Federico Sangati

  • WZ

    Willem Zuidema

  • RB

    Rens Bod

Links