A hardware offloading structure and method for ultra long vector reduction operation based on direct memory access and dynamic shared buffer

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (4): 571-581.

• High Performance Computing •

XU Jinbo, DAI Yi, JIAN Jie

(College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)

Received: 2023-12-29 Revised: 2024-03-28 Online: 2025-04-25 Published: 2025-04-17

Abstract

MPI (Message Passing Interface) collective communication improves system performance by coordinating multiple processes across multiple computing nodes to complete a series of communication operations together. Among these operations, reductions over ultra-long operand vectors are widely used in high-performance computing and AI (Artificial Intelligence) workloads. This paper proposes a hardware offloading structure and method for ultra-long vector reduction based on DMA (Direct Memory Access) and dynamic shared buffers. A dedicated hardware communication-sequence trigger mechanism controls the offloading of collective communication to hardware. The DMA transmission protocol is used to make the transfer of reduction operands between software and hardware more efficient. An on-chip dynamic shared buffer structure provides flexible and efficient caching for large numbers of operands. An on-chip ALU (Arithmetic Logic Unit) array performs the reduction computations directly inside the network chip. Experimental results show significant acceleration over both non-offloaded MPI methods and the original offloading method used in Tianhe, especially for longer reduction vectors.
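As a rough software illustration of the idea summarized above, the following Python sketch reduces several equal-length operand vectors chunk by chunk, staging each chunk in a small pool of buffer blocks that is shared on demand. It is not the authors' hardware design: the names `SharedBufferPool` and `chunked_reduce`, the block-based allocation policy, and the chunk size are all assumptions made for illustration. Staging a chunk stands in for the DMA transfer of operands, and the element-wise combine stands in for the on-chip ALU array.

```python
from collections import deque
from operator import add


class SharedBufferPool:
    """Hypothetical model of a buffer whose blocks are handed out on demand."""

    def __init__(self, num_blocks):
        self.free = deque(range(num_blocks))  # indices of currently free blocks
        self.blocks = [None] * num_blocks

    def alloc(self, data):
        # Claim any free block and store the operand chunk in it.
        if not self.free:
            raise MemoryError("no free buffer block")
        idx = self.free.popleft()
        self.blocks[idx] = data
        return idx

    def release(self, idx):
        # Hand the chunk back to the caller and return the block to the pool.
        data = self.blocks[idx]
        self.blocks[idx] = None
        self.free.append(idx)
        return data


def chunked_reduce(vectors, chunk_len, pool, op=add):
    """Element-wise reduce equal-length vectors, one chunk at a time."""
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all operand vectors must have the same length")
    result = []
    for start in range(0, length, chunk_len):
        # Stage one chunk per contributor (assumed stand-in for a DMA transfer).
        handles = [pool.alloc(v[start:start + chunk_len]) for v in vectors]
        # Combine the staged chunks (assumed stand-in for the ALU array).
        acc = pool.release(handles[0])
        for h in handles[1:]:
            chunk = pool.release(h)
            acc = [op(a, b) for a, b in zip(acc, chunk)]
        result.extend(acc)
    return result
```

For example, `chunked_reduce([[1, 2, 3], [10, 20, 30]], 2, SharedBufferPool(2))` sums the two vectors element by element while never holding more than two chunks in the pool. Because blocks are released as soon as a chunk has been combined, the vector length is bounded by host memory rather than by the pool size, which is the property that makes chunked staging attractive for very long operands.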

Key words: collective communication, reduce, direct memory access, dynamic shared buffer, hardware offloading

XU Jinbo, DAI Yi, JIAN Jie. A hardware offloading structure and method for ultra long vector reduction operation based on direct memory access and dynamic shared buffer[J]. Computer Engineering & Science, 2025, 47(4): 571-581.

