Concepedia

TLDR

Deep neural networks for large-vocabulary continuous speech recognition (LVCSR) achieve high accuracy but are slow to train because they contain 10–50 million parameters, most of which reside in the final weight layer that maps to many output targets. This work proposes a low-rank matrix factorization of that final weight layer to reduce model size. The factorization is applied to DNNs used for both acoustic and language modeling. Across three LVCSR tasks spanning 50–400 hours, the low-rank approach cuts parameters by 30–50%, yielding a comparable reduction in training time with no significant loss in recognition accuracy.

Abstract

While Deep Neural Networks (DNNs) have achieved tremendous success on large vocabulary continuous speech recognition (LVCSR) tasks, training these networks is slow. One reason is that DNNs have a large number of trainable parameters (i.e., 10–50 million). Because networks are trained with a large number of output targets to achieve good performance, the majority of these parameters are in the final weight layer. In this paper, we propose a low-rank matrix factorization of the final weight layer. We apply this low-rank technique to DNNs for both acoustic modeling and language modeling. We show on three different LVCSR tasks, ranging from 50 to 400 hours, that a low-rank factorization reduces the number of parameters of the network by 30–50%. This results in a roughly equivalent reduction in training time, without a significant loss in final recognition accuracy, compared to a full-rank representation.
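
As a rough illustration of why factorizing the final layer shrinks the model, the sketch below replaces a dense weight matrix of shape (hidden, targets) with two smaller factors of shape (hidden, rank) and (rank, targets), so the weight count drops from hidden × targets to rank × (hidden + targets). The layer sizes, the rank, and the function names are assumptions chosen for illustration and are not taken from the paper; the weights are random rather than trained.

```python
import numpy as np


def full_rank_params(hidden: int, targets: int) -> int:
    """Weights in a dense final layer of shape (hidden, targets), biases ignored."""
    return hidden * targets


def low_rank_params(hidden: int, targets: int, rank: int) -> int:
    """Weights after factorizing into (hidden, rank) and (rank, targets)."""
    return rank * (hidden + targets)


def factorized_forward(h: np.ndarray, A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Apply the two factors in sequence without forming the full matrix A @ B."""
    return (h @ A) @ B


# Illustrative sizes only (assumed, not values reported in the paper).
HIDDEN, TARGETS, RANK = 1024, 2048, 128

rng = np.random.default_rng(0)
A = rng.standard_normal((HIDDEN, RANK))    # hypothetical first factor
B = rng.standard_normal((RANK, TARGETS))   # hypothetical second factor
h = rng.standard_normal((4, HIDDEN))       # a batch of 4 hidden activations

logits = factorized_forward(h, A, B)       # pre-softmax outputs, shape (4, TARGETS)
full = full_rank_params(HIDDEN, TARGETS)
low = low_rank_params(HIDDEN, TARGETS, RANK)

print(f"full-rank final-layer weights: {full}")
print(f"low-rank final-layer weights:  {low}")
print(f"final-layer reduction:         {1 - low / full:.1%}")
```

The saving printed here applies to the final layer alone. The 30–50% figure in the abstract is for the whole network, where the remaining layers are left at full rank, so the two numbers are not directly comparable.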
