• A publication of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (4): 571-581.

• High Performance Computing •

  • Funding:
    National Defense Science and Technology Key Laboratory Fund (2022-KJWPDL-11); Independent Innovation Science Fund (22-ZZCX-002)

A hardware offloading structure and method for ultra-long vector reduction operations based on direct memory access and dynamic shared buffers

XU Jinbo, DAI Yi, JIAN Jie

  1. (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
  • Received: 2023-12-29  Revised: 2024-03-28  Online: 2025-04-25  Published: 2025-04-17



Abstract: MPI (Message Passing Interface) collective communication improves system performance by organizing multiple processes across multiple computing nodes to cooperatively complete a series of communication operations. Among these operations, reductions over ultra-long operand vectors are widely used in high-performance computing and AI (artificial intelligence) workloads. This paper proposes a hardware offloading structure and method for ultra-long vector reduction operations based on DMA (Direct Memory Access) and a dynamic shared buffer. A dedicated hardware communication-sequence trigger mechanism controls the collective-communication offloading flow; a DMA transfer protocol raises the software-to-hardware transfer efficiency of the reduction operands; an on-chip dynamic shared buffer structure caches large numbers of operands flexibly and efficiently; and an on-chip ALU (Arithmetic Logic Unit) array performs the reduction computation directly inside the network chip. Experimental results show clear speedups over both the non-offloaded MPI approach and the original offloading scheme of the Tianhe system, and the speedup grows as the reduction vector length increases.


Key words: collective communication, reduce, direct memory access, dynamic shared buffer, hardware offloading