Speech data collection in an under-resourced language within a multilingual context
Abstract
In this paper, we present an end-to-end solution to the development of an automatic speech recognition (ASR) system in
typical under-resourced languages, where the target language is likely to be influenced by one or more embedded foreign languages. We first describe the collection and processing of the text corpus crawled from the World Wide Web using the
Rapid Language Adaptation Toolkit. In particular, we highlight the challenges faced when foreign languages are embedded
within the matrix language. Thereafter, we discuss our speech data collection efforts in under-resourced environments.
Finally, we report on a strategy called transliteration that helps to improve the recognition results of our grapheme-based ASR system in the presence of embedded-language words.