• A publication of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (4): 571-581.

• High Performance Computing •

  • Funding:
    National Defense Science and Technology Key Laboratory Fund (2022-KJWPDL-11); Independent Innovation Science Fund (22-ZZCX-002)

A hardware offloading structure and method for ultra-long vector reduction operations based on direct memory access and dynamic shared buffers

XU Jinbo, DAI Yi, JIAN Jie

  1. (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
  • Received: 2023-12-29  Revised: 2024-03-28  Online: 2025-04-25  Published: 2025-04-17



Abstract: MPI (Message Passing Interface) collective communication improves system performance by organizing multiple processes across multiple computing nodes to cooperatively complete a series of communication operations. Among these operations, reductions over ultra-long operand vectors are widely used in high-performance computing and AI (artificial intelligence) workloads. This paper proposes a hardware offloading structure and method for ultra-long vector reduction operations based on DMA (Direct Memory Access) and a dynamic shared buffer. A dedicated hardware communication-sequence trigger mechanism controls the collective-communication offloading flow; a DMA transfer protocol raises the software-to-hardware transfer efficiency of the reduction operands; an on-chip dynamic shared buffer structure caches large numbers of operands flexibly and efficiently; and an on-chip ALU (Arithmetic Logic Unit) array performs the reduction computation directly inside the network chip. Experimental results show clear speedups over both the non-offloaded MPI approach and the original offloading scheme of the Tianhe system, and the speedup grows as the reduction vector length increases.


Key words: collective communication, reduce, direct memory access, dynamic shared buffer, hardware offloading