• Journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2026, Vol. 48 ›› Issue (4): 599-607.

• High Performance Computing •

An efficient large language model inference method for bandwidth-constrained digital signal processors

CHEN Yang, YANG Xi, SU Huayou, CHEN Kangkang

  (1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073;
    2. National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China)
  • Received: 2024-08-15 Revised: 2024-12-25 Online: 2026-04-25 Published: 2026-04-29

Abstract: With the rise of large language models (LLMs), the parameter scale of neural network models has grown exponentially, reaching hundreds of billions or even trillions of parameters. This places heavy demands on the computing power and memory bandwidth of the devices that run model inference. To achieve high-performance LLM inference on low-bandwidth devices, this study targets bandwidth-constrained, long-vector digital signal processor (DSP) architectures and designs and implements efficient LLM inference methods. It proposes a tensor shape-aware low-precision matrix multiplication method that fully exploits the DSP's computational capability while reducing memory access pressure. It also introduces a data dependency-based operator fusion method to minimize the transfer of intermediate temporary data, and employs a deferred operator execution method to improve the execution efficiency of the DSP cores. Experimental results show that this approach effectively improves the inference performance of large models on bandwidth-constrained DSP devices. Compared with conventional implementations, the optimized inference method achieves a speedup of 1.4 to 2.3 times. Furthermore, with the same number of cores, it achieves speedups of 2.5 times and 1.5 times over multi-core ARM CPUs and Intel Xeon Gold CPUs with higher memory bandwidth, respectively.
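
As a rough illustration of why low-precision operands and fused operators reduce memory traffic, the following Python sketch quantizes the operands of a linear layer to int8 and applies dequantization and the bias add in the same loop as the dot product, so no intermediate accumulator vector is written out. It is not the paper's DSP implementation: the function names, the symmetric per-row quantization scheme, and the scalar loop structure are all assumptions made for this example.

```python
# Illustrative sketch only (assumed scheme, not the paper's DSP kernels):
# symmetric int8 quantization plus a fused matmul + dequantize + bias step.

def quantize_int8(row):
    """Symmetric per-row quantization: returns (int8-range values, scale)."""
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale


def fused_int8_linear(x, w_rows, bias):
    """Approximate y = W x + bias using int8 operands.

    x: K floats; w_rows: N rows of K floats; bias: N floats.
    The integer dot product, dequantization, and bias add share one loop,
    so the integer accumulators are never stored as a separate vector.
    """
    xq, x_scale = quantize_int8(x)
    y = []
    for row, b in zip(w_rows, bias):
        # Weights are quantized here for brevity; a real deployment would
        # typically quantize them once, offline.
        wq, w_scale = quantize_int8(row)
        acc = sum(a * c for a, c in zip(xq, wq))  # integer accumulation
        y.append(acc * x_scale * w_scale + b)     # fused epilogue
    return y


def reference_linear(x, w_rows, bias):
    """Full-precision reference used to check the approximation."""
    return [sum(a * c for a, c in zip(x, row)) + b
            for row, b in zip(w_rows, bias)]


if __name__ == "__main__":
    x = [0.5, -1.0, 0.25, 2.0]
    w = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.4, -0.6, 0.8]]
    bias = [0.1, -0.2]
    print(fused_int8_linear(x, w, bias))
    print(reference_linear(x, w, bias))
```

Storing operands as int8 moves a quarter of the bytes that float32 would, and folding the epilogue into the dot-product loop avoids one round trip of temporary data through memory. Both effects are the kind of saving the abstract attributes to its low-precision and fusion methods, though the actual kernels are vectorized and shape-aware in ways this scalar sketch does not capture.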


Key words: digital signal processor (DSP), large language model (LLM), bandwidth-constrained device, inference