• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (4): 571-581.

• High Performance Computing •

A hardware offloading structure and method for ultra long vector reduction operation based on direct memory access and dynamic shared buffer

XU Jinbo, DAI Yi, JIAN Jie

  1. (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
  • Received: 2023-12-29 Revised: 2024-03-28 Online: 2025-04-25 Published: 2025-04-17

Abstract: MPI (Message Passing Interface) collective communication improves system performance by coordinating multiple processes across multiple computing nodes to jointly complete a series of communication operations. Among these operations, reductions over ultra-long operand vectors are widely used in high-performance computing and AI (Artificial Intelligence) workloads. This paper proposes a hardware offloading structure and method for ultra-long vector reduction based on DMA (Direct Memory Access) and dynamic shared buffers. A dedicated hardware communication sequence trigger mechanism controls the offloading of collective communication to hardware. The DMA transfer protocol improves the efficiency of moving reduction operands between software and hardware. An on-chip dynamic shared buffer provides flexible and efficient buffering for large numbers of operands, and an on-chip ALU (Arithmetic Logic Unit) array performs the computation directly inside the network chip. Experimental results show significant speedups over both non-offloaded MPI methods and the original offloading method used in Tianhe, especially for longer reduction vectors.
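
As a rough software illustration of what a reduction over long vectors computes, the sketch below combines equally long vectors element by element while working through them in fixed-size chunks, so that only a bounded amount of intermediate storage is live at any time. It is not the hardware structure proposed in the paper: the function name `chunked_reduce`, the parameter `buffer_elems`, and the chunking policy are all assumptions made for illustration only.

```python
from operator import add


def chunked_reduce(vectors, buffer_elems, op=add):
    """Element-wise reduction of equally long vectors, one chunk at a time.

    Hypothetical illustration: `buffer_elems` stands in for a bounded
    buffer capacity; it is not a parameter taken from the paper.
    """
    if buffer_elems <= 0:
        raise ValueError("buffer_elems must be positive")
    if not vectors:
        raise ValueError("at least one vector is required")
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all vectors must have the same length")

    result = []
    for start in range(0, length, buffer_elems):
        stop = min(start + buffer_elems, length)
        # The accumulator holds at most buffer_elems partial results.
        acc = list(vectors[0][start:stop])
        for v in vectors[1:]:
            # Combine the matching chunk of the next contributor.
            acc = [op(a, b) for a, b in zip(acc, v[start:stop])]
        result.extend(acc)
    return result


if __name__ == "__main__":
    contributions = [
        [1, 2, 3, 4, 5],
        [10, 20, 30, 40, 50],
        [100, 200, 300, 400, 500],
    ]
    print(chunked_reduce(contributions, buffer_elems=2))
```

Passing `op=max` or another binary function gives other reductions of the same shape. The chunk size affects only how much intermediate state is held at once, not the result.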


Key words: collective communication, reduce, direct memory access, dynamic shared buffer, hardware offloading