IRSTLM: an open source toolkit for handling large scale language models

TLDR

Research in speech recognition and machine translation is boosting the use of large‑scale n‑gram language models. We present an open‑source toolkit that permits efficient handling of language models with billions of n‑grams on conventional machines. The IRSTLM toolkit supports distribution of n‑gram collection and smoothing over a computer cluster, compression through probability quantization, and lazy‑loading of huge language models from disk. IRSTLM has been successfully deployed with the Moses toolkit for statistical machine translation and with the FBK‑irst speech recognition system, and its efficiency was demonstrated on a speech transcription task of Italian political speeches using a language model of 1.1 billion four‑grams.

Abstract

Research in speech recognition and machine translation is boosting the use of large-scale n-gram language models. We present an open source toolkit that permits efficient handling of language models with billions of n-grams on conventional machines. The IRSTLM toolkit supports distribution of n-gram collection and smoothing over a computer cluster, language model compression through probability quantization, and lazy-loading of huge language models from disk. IRSTLM has so far been successfully deployed with the Moses toolkit for statistical machine translation and with the FBK-irst speech recognition system. Efficiency of the tool is reported on a speech transcription task of Italian political speeches using a language model of 1.1 billion four-grams.
