Fast Compression and Optimization of Deep Learning Models for Natural Language Processing

Abstract

Recurrent neural networks (RNNs) and convolutional neural networks (CNNs) now play a major role in many natural language tasks, such as text document categorization, part-of-speech tagging, chatbots, language modeling, and language translation. RNNs often consist of a few stacked layers that occupy several megabytes of memory, and the same holds for CNNs. In many domains, such as automatic speech recognition, real-time inference is crucial to achieving a satisfactory quality of service. Compressing layer parameters and outputs into suitable precision formats and applying pruning can reduce the storage and computation cycles required on embedded devices, which in turn can drastically reduce power consumption and memory requirements. In this article, we apply pruning and quantization to deep learning models used for sentiment analysis, language modeling, and language translation, in each case with only minor degradation of the performance metric compared to the full floating-point version. We also present our attention-based modification of a language modeling network, which achieved state-of-the-art perplexity results and significantly shortened training time.
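As an illustration of the two compression steps named in the abstract, the following is a minimal sketch of magnitude-based pruning followed by symmetric 8-bit linear quantization on a flat list of weights. It is not the authors' implementation: the 50% sparsity level, the int8 target format, and the function names magnitude_prune, quantize_int8 and dequantize are assumptions chosen for the example.

```python
import random


def magnitude_prune(weights, sparsity):
    """Return a copy with the smallest-magnitude fraction `sparsity` set to 0.0."""
    k = int(sparsity * len(weights))
    # Indices of the k weights with the smallest absolute value.
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric linear quantization to integer codes in [-127, 127].

    Returns (codes, scale) such that code * scale approximates the weight.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to floating-point approximations."""
    return [c * scale for c in codes]


# Hypothetical stand-in for one layer's weights (assumed, not from the paper).
random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(1000)]

w_pruned = magnitude_prune(w, sparsity=0.5)   # assumed 50% sparsity
codes, scale = quantize_int8(w_pruned)        # assumed int8 target format
w_restored = dequantize(codes, scale)
```

In this sketch, pruned weights map to the integer code 0, so sparsity survives quantization, and the per-weight reconstruction error is bounded by roughly half the quantization step `scale`.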
