
The Gulf of Guinea Creole Corpora

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014)

DOI:10.63317/534q4zpksbyn

Abstract

We present the process of building linguistic corpora of the Portuguese-related Gulf of Guinea creoles, a cluster of four historically related languages: Santome, Angolar, Principense and Fa d’Ambô. We faced the typical difficulties of languages lacking an official status, such as the lack of a standard spelling, language variation, the lack of basic language instruments, and small data sets, which comprise data from the late 19th century to the present. To tackle these problems, the compiled written and transcribed spoken data collected during fieldwork trips were adapted to a normalized spelling that was applied to the four languages. For the corpus compilation we followed corpus linguistics standards. We recorded metadata for each file and added morphosyntactic information based on a part-of-speech tag set designed to deal with the specificities of these languages. The corpora of three of the four creoles are already available and searchable via a web interface.

Details

Paper ID
lrec2014-main-376
Pages
pp. 523-529
BibKey
hagemeijer-etal-2014-gulf
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-8-4
Conference
Ninth International Conference on Language Resources and Evaluation
Location
Reykjavik, Iceland
Date
26-31 May 2014

Authors

  • Tjerk Hagemeijer
  • Michel Généreux
  • Iris Hendrickx
  • Amália Mendes
  • Abigail Tiny
  • Armando Zamora

Links