Automatic diacritization of Arabic for acoustic modeling in speech recognition

TLDR

Automatic recognition of Arabic dialectal speech is difficult because dialects are primarily spoken varieties, few resources exist, and most available acoustic data are transcribed without diacritics, which encode essential pronunciation information such as short vowels. The study investigates procedures that automatically insert the missing diacritics into transcriptions so that existing training data can be used. These procedures combine acoustic information with morphological and contextual constraints and are evaluated against manually diacritized transcriptions. The study also shows how the accuracy of the automatic diacritization affects the recognition performance of acoustic models trained on the resulting data.

Abstract

Automatic recognition of Arabic dialectal speech is a challenging task because Arabic dialects are essentially spoken varieties. Only a few dialectal resources are available to date; moreover, most available acoustic data collections are transcribed without diacritics. Such transcriptions omit essential pronunciation information about a word, such as short vowels. In this paper we investigate various procedures that enable us to use such training data by automatically inserting the missing diacritics into the transcription. These procedures use acoustic information in combination with different levels of morphological and contextual constraints. We evaluate their performance against manually diacritized transcriptions. In addition, we demonstrate the effect of their accuracy on the recognition performance of acoustic models trained on automatically diacritized training data.
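
To make the evaluation step concrete, the following is a minimal Python sketch of how an automatically diacritized transcription might be scored against a manually diacritized one. It is an illustration only: the abstract does not state which metric the paper uses, so the word-level error rate, the function names, and the three-word example are all assumptions introduced here.

```python
# Hypothetical scoring sketch; the metric and names are assumptions,
# not taken from the paper.

# Arabic diacritic marks: fathatan (U+064B) through sukun (U+0652).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(word: str) -> str:
    """Return the undiacritized form of a word."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def diacritization_word_error_rate(hypothesis: list[str],
                                   reference: list[str]) -> float:
    """Fraction of words whose diacritized form differs from the reference.

    Both transcriptions must contain the same words once diacritics are
    removed, since diacritization only adds marks to an existing script.
    """
    if len(hypothesis) != len(reference):
        raise ValueError("transcriptions differ in length")
    for hyp, ref in zip(hypothesis, reference):
        if strip_diacritics(hyp) != strip_diacritics(ref):
            raise ValueError("undiacritized forms do not match")
    if not reference:
        return 0.0
    errors = sum(hyp != ref for hyp, ref in zip(hypothesis, reference))
    return errors / len(reference)


# The same consonant skeleton k-t-b with two different vowelizations.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "he wrote"
KUTIBA = "\u0643\u064F\u062A\u0650\u0628\u064E"  # "it was written"

manual = [KATABA, KATABA, KUTIBA]
automatic = [KATABA, KUTIBA, KUTIBA]
error_rate = diacritization_word_error_rate(automatic, manual)
```

The example also shows why the missing marks matter for acoustic modeling: the two vowelizations share one written form when undiacritized but are pronounced differently.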
