LREC 2014 main

Collecting Natural SMS and Chat Conversations in Multiple Languages: The BOLT Phase 2 Corpus

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014)

DOI:10.63317/4z5z6wk53p8e

Abstract

The DARPA BOLT Program develops systems capable of allowing English speakers to retrieve and understand information from informal foreign language genres. Phase 2 of the program required large volumes of naturally occurring informal text (SMS) and chat messages from individual users in multiple languages to support evaluation of machine translation systems. We describe the design and implementation of a robust collection system capable of capturing both live and archived SMS and chat conversations from willing participants. We also discuss the challenges of recruitment at a time when potential participants have acute and growing concerns about their personal privacy in the realm of digital communication, and we outline the techniques adopted to confront those challenges. Finally, we review the properties of the resulting BOLT Phase 2 Corpus, which comprises over 6.5 million words of naturally occurring chat and SMS in English, Chinese and Egyptian Arabic.

Details

Paper ID
lrec2014-main-071
Pages
pp. 1699-1704
BibKey
song-etal-2014-collecting
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-8-4
Conference
Ninth International Conference on Language Resources and Evaluation
Location
Reykjavik, Iceland
Date
26-31 May 2014

Authors

  • Zhiyi Song
  • Stephanie Strassel
  • Haejoong Lee
  • Kevin Walker
  • Jonathan Wright
  • Jennifer Garland
  • Dana Fore
  • Brian Gainor
  • Preston Cabe
  • Thomas Thomas
  • Brendan Callahan
  • Ann Sawyer
