Publication | Closed Access
Towards Memory Friendly Long-Short Term Memory Networks (LSTMs) on Mobile GPUs
Citations: 19
References: 36
Year: 2018
Venue: Unknown
Keywords: Mobile GPUs, Memory-Friendly LSTM, Engineering, Machine Learning, Intelligent Personal Assistants, Computer Architecture, Recurrent Neural Network, Social Sciences, Language Processing, Data Science, Computing Systems, Memory, Adaptive Memory, Parallel Computing, Large AI Model, Computer Engineering, Computer Science, Mobile Computing, Deep Learning, Memory Architecture, Storage (Memory), LSTM Cells, In-Memory Computing
Intelligent Personal Assistants (IPAs) with natural language processing (NLP) capabilities are increasingly popular on today's mobile devices. Recurrent neural networks (RNNs), especially one of their variants, Long-Short Term Memory networks (LSTMs), are becoming the core machine learning technique in NLP-based IPAs. With the continuously improving performance of mobile GPUs, local processing has become a promising solution to the large data transmission and privacy issues induced by the cloud-centric computation of IPAs. However, LSTMs exhibit a quite inefficient memory access pattern when executed on mobile GPUs, due to redundant data movements and limited off-chip bandwidth. In this study, we explore memory-friendly LSTMs on mobile GPUs by hierarchically reducing off-chip memory accesses. To address the redundant data movements, we propose inter-cell-level optimizations that intelligently parallelize the originally sequentially executed LSTM cells (basic units in RNNs, corresponding to neurons in CNNs) to improve data locality across cells with negligible accuracy loss. To relax the pressure on limited off-chip memory bandwidth, we propose intra-cell-level optimizations that dynamically skip the loads and computations of rows in the weight matrices that make trivial contributions to the outputs. We also introduce a lightweight module to the GPU architecture for runtime row skipping in weight matrices. Moreover, our techniques are equipped with thresholds that provide a unique tuning space for performance-accuracy trade-offs directly guided by user preferences. The experimental results show that our optimizations achieve substantial improvements in both performance and power with user-imperceptible accuracy loss, and that they exhibit strong scalability as the input data set grows. Our user study also shows that the designed system delivers an excellent user experience.
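The intra-cell row-skipping idea described above can be illustrated with a small sketch. The scoring rule below (bounding each row's dot product by its precomputed L1 norm times the input's max magnitude) and the function name are assumptions for illustration only, not the paper's exact runtime criterion or hardware mechanism:

```python
import numpy as np

def row_skipping_matvec(W, x, threshold, row_norms=None):
    """Skip weight-matrix rows whose output contribution is provably small.

    Upper-bounds each row's dot product via |W_i . x| <= ||W_i||_1 * ||x||_inf
    and skips rows whose bound falls below `threshold`, leaving their outputs
    at zero. This bounding heuristic is an illustrative assumption, not the
    exact criterion implemented by the paper's GPU module.
    """
    if row_norms is None:
        row_norms = np.abs(W).sum(axis=1)  # precomputable offline, once per matrix
    bound = row_norms * np.abs(x).max()    # cheap per-input score, O(rows)
    keep = bound >= threshold              # rows worth loading and computing
    y = np.zeros(W.shape[0], dtype=W.dtype)
    y[keep] = W[keep] @ x                  # loads and compute only for kept rows
    return y, keep
```

With `threshold = 0` this reduces to a full matrix-vector product; raising the threshold trades accuracy for fewer off-chip weight loads, which mirrors the user-tunable performance-accuracy knob described in the abstract.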