• A journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2026, Vol. 48 ›› Issue (4): 599-607.

• High Performance Computing •

An efficient large language model inference method for bandwidth-constrained digital signal processors

CHEN Yang, YANG Xi, SU Huayou, CHEN Kangkang

  1. (1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
    2. National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China)

  • Received: 2024-08-15  Revised: 2024-12-25  Online: 2026-04-25  Published: 2026-04-29
  • Funding:
    Stable support project for national defense science and technology key laboratories of the State Administration of Science, Technology and Industry for National Defense (WDZC20235250102)


Abstract: With the rise of large language models (LLMs), the parameter scale of neural network models has grown exponentially, reaching the order of hundreds of billions or even trillions, and model inference places enormous demands on the compute power and memory bandwidth of computing devices. To achieve high-performance LLM inference on low-bandwidth devices, this paper designs and implements an efficient LLM inference method for bandwidth-constrained, long-vector digital signal processor (DSP) architectures. It proposes a tensor-shape-aware low-precision matrix multiplication method that fully exploits the DSP's compute capability while reducing memory-access pressure; an operator fusion method based on data dependencies that reduces the transfer of intermediate temporary data; and a deferred operator execution method that improves kernel execution efficiency on the DSP. Experiments show that the approach effectively improves the inference performance of large models on bandwidth-constrained DSP devices: the optimized inference method achieves a 1.4x to 2.3x speedup over a plain implementation, and, with the same number of cores, achieves LLM inference speedups of 2.5x and over 1.2x compared with a multi-core ARM CPU and an Intel Xeon Gold CPU, respectively, both of which have higher memory bandwidth.
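The abstract describes the low-precision matrix multiplication and operator fusion ideas only at a high level. As an illustrative sketch (not the paper's actual implementation, and with all names hypothetical), the bandwidth argument can be seen in a per-row int8-quantized GEMV where dequantization is fused into the dot product, so no float32 copy of the weights is ever written back to memory:

```python
def quantize_rows(w_fp32):
    """Quantize each weight row to int8 with one float scale per row.

    Weights stored as int8 cost 1 byte each instead of 4, cutting the
    weight traffic that dominates bandwidth-bound LLM decoding (GEMV,
    batch size 1).
    """
    q_rows, scales = [], []
    for row in w_fp32:
        amax = max(abs(v) for v in row) or 1.0  # avoid divide-by-zero
        scale = amax / 127.0
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def gemv_int8_fused(q_rows, scales, x):
    """y = W @ x with dequantization fused into the dot product.

    The int8 weights are rescaled inside the accumulation, so no
    intermediate float32 weight tensor is materialized in memory.
    """
    return [scale * sum(w * xi for w, xi in zip(row, x))
            for row, scale in zip(q_rows, scales)]


# Tiny demo: a 2x2 weight matrix applied to a vector.
q, s = quantize_rows([[2.0, 3.0], [-1.0, 0.5]])
y = gemv_int8_fused(q, s, [1.0, 2.0])  # exact (unquantized) result: [8.0, 0.0]
```

On a real long-vector DSP the inner loop would be vectorized and the per-row scale applied once per accumulator, but the memory-traffic saving shown here is the same: the weights cross the bandwidth-limited bus in int8 form only.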


Key words: digital signal processor (DSP); large language model (LLM); bandwidth-constrained device; inference


