An efficient large language model inference method for bandwidth-constrained digital signal processors

Computer Engineering & Science, 2026, Vol. 48, Issue (4): 599-607. doi: 10.3969/j.issn.1007-130X.2026.04.004

High Performance Computing

CHEN Yang, YANG Xi, SU Huayou, CHEN Kangkang

(1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073;
2. National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China)

Received: 2024-08-15    Revised: 2024-12-25    Online: 2026-04-25    Published: 2026-04-29

Abstract

With the rise of large language models (LLMs), the parameter counts of neural network models have grown exponentially, reaching hundreds of billions or even trillions. This places heavy demands on the computing power and memory bandwidth of the devices that run model inference. To achieve high-performance LLM inference on low-bandwidth devices, this study targets bandwidth-constrained, long-vector digital signal processor (DSP) architectures and designs and implements efficient LLM inference methods. It proposes a tensor shape-aware low-precision matrix multiplication method that fully exploits the DSP's computational capability while reducing memory access pressure. It also introduces a data dependency-based operator fusion method to minimize the transfer of intermediate temporary data, and a deferred operator execution method to improve the core execution efficiency of DSP devices. Experimental results show that the approach effectively improves the inference performance of large models on bandwidth-constrained DSP devices. Compared with conventional implementations, the optimized inference method achieves speedups of 1.4 to 2.3 times. Furthermore, compared with multi-core ARM CPUs and Intel Xeon Gold CPUs, both of which have higher memory bandwidth, it achieves LLM inference speedups of 2.5 times and 1.5 times, respectively, with the same number of cores.

Key words: digital signal processor (DSP), large language model (LLM), bandwidth-constrained device, inference
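
The link the abstract draws between low-precision weights and reduced memory access pressure can be illustrated with a back-of-the-envelope bound. The sketch below is not the authors' method: it assumes a decode step that reads every weight once per generated token, and the model size, bandwidth, and FP16/INT8 precisions are hypothetical values chosen only for illustration.

```python
def min_decode_seconds_per_token(num_params, bytes_per_weight, bandwidth_bytes_per_s):
    """Bandwidth-only lower bound on per-token decode time.

    Assumes every weight is read from memory exactly once per token
    and ignores compute time, activations, and caching effects.
    """
    return num_params * bytes_per_weight / bandwidth_bytes_per_s


# Hypothetical figures for illustration only (not taken from the paper).
NUM_PARAMS = 7_000_000_000   # assumed 7B-parameter model
BANDWIDTH = 40 * 10**9       # assumed 40 GB/s of memory bandwidth

fp16 = min_decode_seconds_per_token(NUM_PARAMS, 2, BANDWIDTH)  # 2 bytes per weight
int8 = min_decode_seconds_per_token(NUM_PARAMS, 1, BANDWIDTH)  # 1 byte per weight

print(f"FP16 lower bound: {fp16:.3f} s/token")
print(f"INT8 lower bound: {int8:.3f} s/token")
print(f"Ratio: {fp16 / int8:.1f}x")
```

Under these assumptions, halving the bytes per weight halves the bandwidth-bound time per token, which is why weight precision matters most on devices where memory bandwidth, not arithmetic throughput, is the limiting resource.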

CHEN Yang, YANG Xi, SU Huayou, CHEN Kangkang. An efficient large language model inference method for bandwidth-constrained digital signal processors[J]. Computer Engineering & Science, 2026, 48(4): 599-607.
