• A journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2026, Vol. 48 ›› Issue (4): 599-607.

• High Performance Computing •

An efficient large language model inference method for bandwidth-constrained digital signal processors

CHEN Yang, YANG Xi, SU Huayou, CHEN Kangkang

  1. (1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
    2. National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China)

  • Received: 2024-08-15  Revised: 2024-12-25  Online: 2026-04-25  Published: 2026-04-29
  • Funding:
    Stable support project for national defense science and technology key laboratories of the State Administration of Science, Technology and Industry for National Defense (WDZC20235250102)


Abstract: With the rise of large language models (LLMs), the parameter scale of neural network models has grown exponentially, reaching the order of hundreds of billions or even trillions, and model inference places enormous demands on the compute power and memory bandwidth of computing devices. To achieve high-performance LLM inference on low-bandwidth devices, this paper designs and implements an efficient LLM inference method for bandwidth-constrained, long-vector digital signal processor (DSP) architectures. It proposes a tensor-shape-aware low-precision matrix multiplication method that fully exploits the DSP's compute capability while reducing memory-access pressure; an operator fusion method based on data dependencies that reduces the transfer of intermediate temporary data; and a deferred operator execution method that improves kernel execution efficiency on the DSP. Experiments show that the approach effectively improves the inference performance of large models on bandwidth-constrained DSP devices: the optimized inference method achieves a 1.4x to 2.3x speedup over a plain implementation, and, with the same number of cores, achieves LLM inference speedups of 2.5x and over 1.2x compared with a multi-core ARM CPU and an Intel Xeon Gold CPU, respectively, both of which have higher memory bandwidth.
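The abstract describes the low-precision matrix multiplication and operator fusion ideas only at a high level. As an illustrative sketch (not the paper's actual implementation, and with all names hypothetical), the bandwidth argument can be seen in a per-row int8-quantized GEMV where dequantization is fused into the dot product, so no float32 copy of the weights is ever written back to memory:

```python
def quantize_rows(w_fp32):
    """Quantize each weight row to int8 with one float scale per row.

    Weights stored as int8 cost 1 byte each instead of 4, cutting the
    weight traffic that dominates bandwidth-bound LLM decoding (GEMV,
    batch size 1).
    """
    q_rows, scales = [], []
    for row in w_fp32:
        amax = max(abs(v) for v in row) or 1.0  # avoid divide-by-zero
        scale = amax / 127.0
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def gemv_int8_fused(q_rows, scales, x):
    """y = W @ x with dequantization fused into the dot product.

    The int8 weights are rescaled inside the accumulation, so no
    intermediate float32 weight tensor is materialized in memory.
    """
    return [scale * sum(w * xi for w, xi in zip(row, x))
            for row, scale in zip(q_rows, scales)]


# Tiny demo: a 2x2 weight matrix applied to a vector.
q, s = quantize_rows([[2.0, 3.0], [-1.0, 0.5]])
y = gemv_int8_fused(q, s, [1.0, 2.0])  # exact (unquantized) result: [8.0, 0.0]
```

On a real long-vector DSP the inner loop would be vectorized and the per-row scale applied once per accumulator, but the memory-traffic saving shown here is the same: the weights cross the bandwidth-limited bus in int8 form only.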


Key words: digital signal processor (DSP); large language model (LLM); bandwidth-constrained device; inference


