LREC 2010 (main conference)

New Tools for Web-Scale N-grams

Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010)

DOI:10.63317/44fpr4ygd2xg

Abstract

While the web provides a fantastic linguistic resource, collecting and processing data at web-scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types or clusters of distributionally similar words. An alternative to processing web-scale text directly is to use the information provided in an N-gram corpus. An N-gram corpus is an efficient compression of large amounts of text. An N-gram corpus states how often each sequence of words (up to length N) occurs. We propose tools for working with enhanced web-scale N-gram corpora that include richer levels of source annotation, such as part-of-speech tags. We describe a new set of search tools that make use of these tags, and collectively lower the barrier for lexical learning and ambiguity resolution at web-scale. They will allow novel sources of information to be applied to long-standing natural language challenges.
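The abstract's definition of an N-gram corpus (a table stating how often each word sequence of up to length N occurs) can be illustrated with a minimal sketch. This is not the authors' tooling: the function name `count_ngrams`, the whitespace tokenization, and the toy two-sentence corpus are all assumptions made for illustration, and a real web-scale corpus would also apply frequency cutoffs and, as the paper proposes, carry annotations such as part-of-speech tags.

```python
from collections import Counter


def count_ngrams(sentences, max_n):
    """Count every word sequence of length 1..max_n.

    `sentences` is an iterable of token lists. N-grams are counted
    within a sentence only and never span a sentence boundary.
    (Hypothetical helper, not part of the tools described in the paper.)
    """
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of width n across the sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy corpus, assumed for illustration only.
corpus = [
    "the cat sat on the mat".split(),
    "the cat ate".split(),
]
counts = count_ngrams(corpus, max_n=3)

print(counts[("the",)])               # 3
print(counts[("the", "cat")])         # 2
print(counts[("cat", "sat", "on")])   # 1
```

The resulting table is much smaller than the text it summarizes once repeated sequences are merged, which is the sense in which the abstract calls an N-gram corpus a compression of the underlying text.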

Details

Paper ID
lrec2010-main-158
Pages
N/A
BibKey
lin-etal-2010-new
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-6-7
Conference
Seventh International Conference on Language Resources and Evaluation
Location
Valletta, Malta
Date
17 May 2010 – 23 May 2010

Authors

  • Dekang Lin
  • Kenneth Church
  • Heng Ji
  • Satoshi Sekine
  • David Yarowsky
  • Shane Bergsma
  • Kailash Patil
  • Emily Pitler
  • Rachel Lathbury
  • Vikram Rao
  • Kapil Dalwani
  • Sushant Narsale
