J4 ›› 2015, Vol. 37 ›› Issue (05): 847-856.
• Paper •
ZHANG Shuai, LI Tao, WANG Yifeng, JIAO Xiaofan, YANG Yulu
Abstract:
Dense linear algebra (DLA), which is critical to many applications such as pattern recognition and bioinformatics, depends heavily on the general matrix-matrix multiplication (GEMM) routine. In the current cuBLAS and MAGMA libraries, GEMM is implemented with kernel functions that achieve high performance for large GEMMs. However, these kernels are inefficient for multiple independent small matrices, even though cuBLAS provides interfaces for batched small GEMMs. Moreover, they cannot automatically scale across multiple different GPUs with good load balancing. In this paper, we propose a task-parallel GEMM (TPGEMM) that exploits fine-grained task parallelism for batched and multi-GPU GEMMs. The workloads of one or more GEMMs are decomposed into tasks that are scheduled to persistent GPU kernels at runtime. TPGEMM avoids the overhead of launching multiple kernels and achieves better performance for batched small GEMMs than the cuBLAS and MAGMA libraries. Based on this low-overhead, fine-grained task scheduling, TPGEMM supports auto-parallelization across multiple GPUs and achieves an efficiency close to 100% on a workstation with four different GPUs.
Key words: GEMM; persistent kernel; task parallelism; load balancing
ZHANG Shuai, LI Tao, WANG Yifeng, JIAO Xiaofan, YANG Yulu. Exploring fine grained task parallel GEMM on single- and multi-GPU systems [J]. J4, 2015, 37(05): 847-856.
URL: http://joces.nudt.edu.cn/EN/Y2015/V37/I05/847