Improving the speed of neural networks on CPUs

Year: 2011 | Citations: 674 | References: 10

TLDR

Large deep neural networks with tens of millions of parameters now enable real‑time applications, but their size imposes a heavy computational burden that typically forces reliance on GPUs. This tutorial aims to show how modern x86 CPUs can be leveraged to dramatically reduce the computational cost of such networks. The authors detail data‑layout optimizations, batching strategies, and the use of SSE2, SSSE3, and SSE4 fixed‑point instructions, achieving up to a three‑fold speedup over an optimized floating‑point baseline. Applying these techniques to a real‑time hybrid HMM/NN speech‑recognition system yields a 10‑fold speedup over an unoptimized baseline and a 4‑fold speedup over an aggressively optimized floating‑point baseline, with no loss in accuracy, and the methods generalize to training and offer a viable software alternative to specialized hardware.
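The batching strategy mentioned above can be illustrated with a toy sketch. The idea is that scoring many input frames at once turns many matrix-vector products into one matrix-matrix product, so each row of the weight matrix is streamed from memory once per batch instead of once per frame. This is a minimal pure-Python illustration of that memory-reuse argument, not the paper's implementation (which uses SIMD and tuned data layouts); all function names here are illustrative.

```python
def forward_unbatched(W, frames):
    """One matrix-vector product per frame: every row of W is
    re-read from memory for every single input frame."""
    return [[sum(w * x for w, x in zip(row, f)) for row in W] for f in frames]

def forward_batched(W, frames):
    """One matrix-matrix product for the whole batch: each weight row
    is loaded once and reused across all frames, amortizing memory
    traffic over the batch (the cache-friendly pattern batching buys)."""
    out = [[0.0] * len(W) for _ in frames]
    for i, row in enumerate(W):            # each row of W touched once...
        for t, f in enumerate(frames):     # ...then applied to every frame
            out[t][i] = sum(w * x for w, x in zip(row, f))
    return out
```

Both functions compute the same outputs; only the loop order (and hence the memory-access pattern) differs, which is exactly the property that makes batching profitable on cache-based CPUs.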

Abstract

Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. The sheer size of these networks can represent a challenging computational burden, even for modern CPUs. For this reason, GPUs are routinely used instead to train and run such networks. This paper is a tutorial for students and researchers on some of the techniques that can be used to reduce this computational cost considerably on modern x86 CPUs. We emphasize data layout, batching of the computation, the use of SSE2 instructions, and particularly leverage SSSE3 and SSE4 fixed-point instructions which provide a 3× improvement over an optimized floating-point baseline. We use speech recognition as an example task, and show that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline at no cost in accuracy. The techniques described extend readily to neural network training and provide an effective alternative to the use of specialized hardware.
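The fixed-point idea behind the SSSE3/SSE4 speedup can be sketched in scalar form: weights and activations are linearly mapped onto small signed integers, the dot product is accumulated entirely in integer arithmetic, and the result is rescaled to floating point once at the end. The toy code below is my own illustrative sketch of that scheme, not the paper's SIMD implementation, and the function names are assumptions; on real hardware, instructions such as SSSE3's `pmaddubsw` perform many of these integer multiply-accumulates per cycle.

```python
def quantize(values, num_bits=8):
    """Linearly map floats onto signed fixed-point integers (toy version)."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = (2 ** (num_bits - 1) - 1) / max_abs   # e.g. 127 / max|v| for 8 bits
    return [round(v * scale) for v in values], scale

def quantized_dot(weights, activations):
    """Dot product accumulated entirely with integer multiply-adds,
    rescaled back to floating point once at the very end."""
    qw, sw = quantize(weights)
    qx, sx = quantize(activations)
    acc = sum(a * b for a, b in zip(qw, qx))      # integer MACs: the cheap inner loop
    return acc / (sw * sx)                         # one float rescale per dot product
```

The accuracy claim in the abstract rests on the observation that the small rounding error introduced per weight is negligible relative to the noise the network already tolerates, while the integer inner loop vectorizes far more densely than a float one.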
