Speech data collection in an under-resourced language within a multilingual context

Molapo, Raymond; Barnard, Etienne; de Wet, Febe

Date

2014

Authors

Molapo, Raymond

Barnard, Etienne

de Wet, Febe

Researcher ID

21021287 - Barnard, Etienne

Publisher

International Research Institute MICA

Abstract

In this paper, we present an end-to-end solution to the development of an automatic speech recognition (ASR) system for typical under-resourced languages, where the target language is likely to be influenced by one or more embedded foreign languages. We first describe the collection and processing of a text corpus crawled from the World Wide Web using the Rapid Language Adaptation Toolkit. In particular, we highlight the challenges faced when foreign languages are embedded within the matrix language. Thereafter, we discuss our speech data collection efforts in under-resourced environments. Finally, we report on a strategy called transliteration, which helps to improve the recognition results of our grapheme-based ASR system in the presence of embedded-language words.

Keywords

Under-resourced languages, Matrix language, Transliteration, Grapheme-based ASR

Citation

Molapo, R. et al. 2014. Speech data collection in an under-resourced language within a multilingual context. (In: 4th International Workshop on Spoken Language Technologies for Under-resourced Languages, St Petersburg, Russia, 14-16 May. p. 238-242).

URI

http://hdl.handle.net/10394/17362

Collections

Faculty of Engineering
Faculty of Natural and Agricultural Sciences
