Unlimited vocabulary speech recognition for agglutinative languages

TLDR

A word-based lexicon cannot practically cover an agglutinative language: words are built by attaching several prefixes and suffixes to roots, and together with compounding and inflection this yields millions of distinct yet still frequent word forms. Splitting these forms automatically is not trivial, and rule-based morphological analyzers, which rely on handcrafted rules, have an out-of-vocabulary problem of their own. The authors apply a recently proposed, fully automatic and largely language- and vocabulary-independent method to build sub-word lexica for three agglutinative languages. With these lexica they build a large-vocabulary speech recognizer for each language, and each outperforms its word-based reference system.

Abstract

It is practically impossible to build a word-based lexicon for speech recognition in agglutinative languages that would cover all the relevant words. The problem is that words are generally built by concatenating several prefixes and suffixes to the word roots. Together with compounding and inflections, this leads to millions of different but still frequent word forms. Due to inflections, ambiguity, and other phenomena, it is also not trivial to automatically split the words into meaningful parts. Rule-based morphological analyzers can perform this splitting, but because they depend on handcrafted rules, they also suffer from an out-of-vocabulary problem. In this paper we apply a recently proposed, fully automatic, and rather language- and vocabulary-independent way to build sub-word lexica for three different agglutinative languages. We also demonstrate language portability by building a successful large-vocabulary speech recognizer for each language, and we show superior recognition performance compared to the corresponding word-based reference systems.
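
The coverage argument can be illustrated with a small sketch. It is not the method used in the paper: the sub-word lexicon below is written by hand, whereas the paper derives its lexica automatically, and the Finnish-like word forms, the names `segment` and `oov_rate`, and both toy lexica are assumptions made for illustration only. The sketch shows only why a handful of sub-word units can cover word forms that a word-based lexicon has never seen.

```python
def segment(word, morphs):
    """Return one segmentation of `word` into units from `morphs`, or None."""
    # best[i] holds a segmentation of word[:i], or None if none was found.
    best = [None] * (len(word) + 1)
    best[0] = []
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is not None and piece in morphs:
                best[end] = best[start] + [piece]
                break
    return best[len(word)]


def oov_rate(words, is_covered):
    """Fraction of `words` that the lexicon cannot represent."""
    missing = [w for w in words if not is_covered(w)]
    return len(missing) / len(words)


# Hypothetical toy data (assumption, not taken from the paper).
WORD_LEXICON = {"talo", "talossa", "auto", "autossani"}   # whole word forms
MORPH_LEXICON = {"talo", "auto", "ssa", "ni", "kin"}      # hand-written sub-word units
UNSEEN_FORMS = ["talossani", "autossa", "talossanikin"]

word_oov = oov_rate(UNSEEN_FORMS, lambda w: w in WORD_LEXICON)
subword_oov = oov_rate(UNSEEN_FORMS, lambda w: segment(w, MORPH_LEXICON) is not None)

print("word-based OOV rate:", word_oov)
print("sub-word OOV rate:", subword_oov)
```

In this toy setup every unseen form is out of vocabulary for the four-entry word lexicon, while the five sub-word units cover all of them. A real system would additionally need a language model over the sub-word units and a way to rejoin them into words, which the sketch leaves out.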
