Design of BLAS level3 computation on a matrix multiplication coprocessor

Computer Engineering & Science ›› 2020, Vol. 42 ›› Issue (11): 1913-1921.

JIA Xun，QIAN Lei，YUAN Hao，ZHANG Kun，WU Dong

（State Key Laboratory of Mathematical Engineering and Advanced Computing,Wuxi 214125,China）

Received:2020-06-08 Revised:2020-08-11 Online:2020-11-25 Published:2020-11-26

Abstract

Abstract: BLAS level3 subprograms have high computation complexity, which usually become applications' performance bottleneck. By organizing largescale floatingpoint units into a linear array architecture, the matrix multiplication coprocessor can perform highperformance and efficient matrix multiplication. Achieving efficient BLAS level3 computation on the matrix multiplication coprocessor is essential for the acceleration of largescale science and engineering applications.
By taking matrix multiplication as the kernel and combining the characteristics of the underlying linear array architecture, this paper proposes the design of BLAS level3 computation on a matrix multiplication coprocessor, and construct a corresponding performance model. Experimental results show that SYMM, SYRK and TRMM subprograms on the matrix multiplication coprocessor achieves the computation efficiency of 99%, 98% and 80% respectively, at most 31% higher than those on the SW26010 and NVIDIA V100 GPU.

Key words: linear array, matrix multiplication, coprocessor, BLAS level-3

National Natural Science Foundation of China (61732018)

JIA Xun, QIAN Lei, YUAN Hao, ZHANG Kun, WU Dong. Design of BLAS level-3 computation on a matrix multiplication coprocessor [J]. Computer Engineering & Science, 2020, 42(11): 1913-1921.

Design of BLAS level3 computation on a matrix multiplication coprocessor

PDF

Knowledge

Abstract

Cite this article

share this article

Related Articles 15

Recommended Articles

Metrics

Comments