Computer Engineering & Science ›› 2020, Vol. 42 ›› Issue (11): 1913-1921.
JIA Xun,QIAN Lei,YUAN Hao,ZHANG Kun,WU Dong
Abstract: BLAS level-3 subprograms have high computational complexity and often become the performance bottleneck of applications. By organizing large-scale floating-point units into a linear array architecture, a matrix multiplication coprocessor can perform matrix multiplication with high performance and efficiency. Achieving efficient BLAS level-3 computation on the matrix multiplication coprocessor is therefore essential for accelerating large-scale science and engineering applications. Taking matrix multiplication as the kernel and exploiting the characteristics of the underlying linear array architecture, this paper proposes a design for BLAS level-3 computation on a matrix multiplication coprocessor and constructs a corresponding performance model. Experimental results show that the SYMM, SYRK and TRMM subprograms on the matrix multiplication coprocessor achieve computation efficiencies of 99%, 98% and 80%, respectively, up to 31% higher than those on the SW26010 and the NVIDIA V100 GPU.
Key words: linear array, matrix multiplication, coprocessor, BLAS level-3
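The idea of taking matrix multiplication as the kernel can be illustrated with a minimal sketch in pure Python. It is not the paper's design: the function names (`gemm`, `symm_lower`, `syrk_lower`, `trmm_lower`) are hypothetical, the scaling factors and side/transpose options of the real BLAS interfaces are omitted, and nothing here models the linear array architecture. It only shows how SYMM, SYRK and TRMM can each be reduced to a general matrix product.

```python
def gemm(a, b):
    """Plain matrix product A * B on lists of lists (the assumed kernel)."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]


def symm_lower(a, b):
    """SYMM sketch: A * B, reading only the lower triangle of symmetric A."""
    n = len(a)
    # Mirror the lower triangle so the kernel sees a full symmetric matrix.
    full = [[a[i][j] if j <= i else a[j][i] for j in range(n)]
            for i in range(n)]
    return gemm(full, b)


def syrk_lower(a):
    """SYRK sketch: lower triangle of A * A^T; upper entries are set to 0."""
    c = gemm(a, transpose(a))
    n = len(c)
    return [[c[i][j] if j <= i else 0 for j in range(n)] for i in range(n)]


def trmm_lower(a, b):
    """TRMM sketch: A * B with lower-triangular A; the upper part is ignored."""
    n = len(a)
    # Zero out the upper triangle before handing the matrix to the kernel.
    tri = [[a[i][j] if j <= i else 0 for j in range(n)] for i in range(n)]
    return gemm(tri, b)
```

In this sketch the triangular and symmetric cases are padded out to full matrices before the kernel is called, so the kernel also performs work on mirrored or zero entries. That is a simplification chosen for clarity, not a statement about how the coprocessor handles these cases.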
National Natural Science Foundation of China (61732018)
JIA Xun, QIAN Lei, YUAN Hao, ZHANG Kun, WU Dong. Design of BLAS level-3 computation on a matrix multiplication coprocessor[J]. Computer Engineering & Science, 2020, 42(11): 1913-1921.
URL: http://joces.nudt.edu.cn/EN/
http://joces.nudt.edu.cn/EN/Y2020/V42/I11/1913