面向结构矩阵的可扩展并行矩阵乘算法框架

计算机工程与科学 ›› 2024, Vol. 46 ›› Issue (9): 1529-1538.

面向结构矩阵的可扩展并行矩阵乘算法框架

李胜国1，廖霞2，于恒彪1，黄春1，姜浩1，逯喜燕1，王华林3，成礼智3

(1.国防科技大学计算机学院，湖南长沙 410073;2.清华大学计算机科学与技术系，北京 100084；
3.国防科技大学理学院，湖南长沙 410073)

收稿日期:2023-01-04 修回日期:2023-07-04 出版日期:2024-09-25 发布日期:2024-09-19
基金资助:
国家重点研发计划（2021YFB0300101）

A scalable parallel structured matrix multiplication algorithm framework

LI Sheng-guo1,LIAO Xia2,YU Heng-biao1,HUANG Chun1,JIANG Hao1,LU Xi-yan1,WANG Hua-lin3,CHENG Li-zhi3

(1.College of Computer Science and Technology,National University of Defense Technology,Changsha 410073;
2.Department of Computer Science and Technology,Tsinghua University,Beijing 100084;
3.College of Science and Technology,National University of Defense Technology,Changsha 410073,China)

Received:2023-01-04 Revised:2023-07-04 Online:2024-09-25 Published:2024-09-19

摘要/Abstract

摘要： 摘要：结构矩阵在科学计算和工程应用中具有重要作用，例如Cauchy、Toeplitz、Vandermonde和Hankel矩阵等。虽然这些矩阵都是稠密的，但只需要O(n)个参数(生成元)就可以表示，其中n为矩阵的维数。提出了面向结构矩阵的可扩展并行矩阵乘算法框架，利用矩阵生成元显式地构造各进程的局部矩阵块，从而减少通信开销；同时利用矩阵块的数值低秩性，进一步降低计算开销。因此，该算法框架可同时降低计算量和通信量，适用于Cannon、Fox和PUMMA等矩阵乘算法。在天河2巨型机上进行了大量的数值测试，测试结果表明，该算法可获得相对ScaLAPACK中的PDGEMM函数的8.96倍加速。

关键词: 结构矩阵, 矩阵乘法, FFT, Cauchy矩阵, Toeplitz矩阵, 分布式并行

Abstract: Structured matrices play an important role in scientific computing and engineering applications, such as Cauchy, Toeplitz, Vandermonde, and Hankel matrices. Although these matrices are dense, they can be expressed with only O(n) parameters (generators), where n is the dimension of the matrix. The core idea of the algorithm in this paper is to use matrix generators to explicitly construct local matrix blocks of each process, thereby reducing communication overhead. Additionally, by leveraging the numerical low-rank property of these matrix blocks. This paper further minimize computational overhead. Consequently, the proposed parallel structured matrix multiplication algorithm framework can simultaneously reduce both computational and communication costs, making it suitable for matrix multiplication algorithms like Cannon, Fox, and PUMMA. Extensive numerical tests were conducted on the Tianhe-2 supercomputer, and the results demonstrate that the proposed algorithm achieves an 8.96× speedup compared to the PDGEMM function in ScaLAPACK.

Key words: structured matrix, matrix multiplication, FFT, Cauchy matrix, Toeplitz matrix, distributed parallel

李胜国, 廖霞, 于恒彪, 黄春, 姜浩, 逯喜燕, 王华林, 成礼智. 面向结构矩阵的可扩展并行矩阵乘算法框架[J]. 计算机工程与科学, 2024, 46(9): 1529-1538.

LI Sheng-guo, LIAO Xia, YU Heng-biao, HUANG Chun, JIANG Hao, LU Xi-yan, WANG Hua-lin, CHENG Li-zhi. A scalable parallel structured matrix multiplication algorithm framework[J]. Computer Engineering & Science, 2024, 46(9): 1529-1538.

[1]	庄鹤林, 杨火根, 夏小云, 廖伟志. 关于矩阵乘法问题的人工蜂群优化算法研究[J]. 计算机工程与科学, 2021, 43(12): 2131-2138.
[2]	赵小强, 姜晶菲, 许金伟, 窦勇. 基于FPGA的卷积神经网络加速器动态余数处理映射模型[J]. 计算机工程与科学, 2021, 43(09): 1521-1528.
[3]	沈俊忠，肖涛，乔寓然，杨乾明，文梅. 一种支持优化分块策略的矩阵乘加速器设计[J]. 计算机工程与科学, 2016, 38(09): 1748-1754.
[4]	王民，曹清菁，贠卫国，周军妮. 改进MFCC算法在朱鹮鸣声个体识别中的应用[J]. J4, 2016, 38(05): 1052-1056.
[5]	周磊涛1,2，陶耀东2，刘生1,2，李锁3. 基于FPGA的Systolic乘法技术研究[J]. J4, 2015, 37(09): 1632-1636.
[6]	牟胜梅杨晓东. 高吞吐率浮点FFT处理器的FPGA实现研究[J]. J4, 2008, 30(7): 98-99.
[7]	陈勤刘茵王涛. 一种基于遗传算法的FFT Snake模型图像分割方法[J]. J4, 2007, 29(5): 66-68.