EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding

Abstract

Generative tasks such as text generation and question answering are essential to mobile applications. Because these tasks are inherently privacy-sensitive, there is strong demand to execute them on-device. Today, such tasks rely heavily on Large Language Models (LLMs), but scarce device memory severely limits how far these models can scale. We present EdgeLLM, an efficient on-device LLM inference system for models whose sizes exceed the device's memory capacity. EdgeLLM is built atop speculative decoding, which delegates most tokens to a smaller, memory-resident (draft) LLM. EdgeLLM integrates three novel techniques: (1) Instead of generating a token tree of fixed width and depth, EdgeLLM uses compute-efficient branch navigation and verification: it paces the progress of different branches according to their probability of being accepted, which avoids wasting compute on wrong branches, and it verifies all branches at once. (2) It uses a self-adaptive fallback strategy that promptly initiates verification when the smaller LLM generates an incorrect token. (3) To avoid blocking generation, EdgeLLM continues to generate tokens speculatively while the large LLM is verifying, using a compute-IO pipeline. In extensive experiments, EdgeLLM achieves token generation speeds up to 9.3× faster than existing engines.
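
To make the draft-then-verify idea concrete, the following is a minimal sketch of greedy speculative decoding with a confidence-triggered fallback. It is not EdgeLLM's implementation: the function name `speculative_decode`, the parameters `max_draft_len` and `fallback_threshold`, and the toy models are all assumptions made for illustration, and the sketch covers neither the token tree nor the compute-IO pipeline. A real system would verify all drafted tokens in one batched forward pass of the large model; here that pass is simulated with a loop.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical model interface: maps a token prefix to (next_token, confidence).
Model = Callable[[List[Token]], Tuple[Token, float]]


def speculative_decode(
    prompt: List[Token],
    draft: Model,
    target: Model,
    max_new_tokens: int,
    max_draft_len: int = 4,           # assumed cap on tokens drafted per round
    fallback_threshold: float = 0.5,  # assumed confidence cutoff for early fallback
) -> Tuple[List[Token], int]:
    """Return (generated tokens, number of target verification rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        # Draft phase: the small model proposes tokens until it becomes
        # unsure (fallback) or reaches the per-round cap.
        drafted: List[Token] = []
        while len(drafted) < max_draft_len:
            tok, conf = draft(out + drafted)
            drafted.append(tok)
            if conf < fallback_threshold:
                break

        # Verification phase: the large model checks every drafted position.
        # This loop stands in for a single batched forward pass.
        rounds += 1
        accepted: List[Token] = []
        for tok in drafted:
            expected, _ = target(out + accepted)
            if expected != tok:
                accepted.append(expected)  # replace the first wrong token
                break
            accepted.append(tok)
        else:
            # Every drafted token matched, so the same pass yields one extra token.
            bonus, _ = target(out + accepted)
            accepted.append(bonus)
        out.extend(accepted)

    generated = out[len(prompt):len(prompt) + max_new_tokens]
    return generated, rounds


def toy_target(prefix: List[Token]) -> Tuple[Token, float]:
    # Toy "large" model: counts upward modulo 10, always confident.
    return (prefix[-1] + 1) % 10, 1.0


def toy_draft(prefix: List[Token]) -> Tuple[Token, float]:
    # Toy "small" model: agrees with the target except on multiples of 4,
    # where it emits a wrong token with low confidence.
    nxt = (prefix[-1] + 1) % 10
    if nxt % 4 == 0:
        return (nxt + 1) % 10, 0.3
    return nxt, 0.9


tokens, rounds = speculative_decode([0], toy_draft, toy_target, max_new_tokens=8)
print(tokens, rounds)
```

Under greedy verification the output is identical to what the target model alone would produce; the gain is that the large model is invoked once per round rather than once per token. In this toy run, eight tokens are generated with two verification rounds.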
