• Sponsored journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

2015, Vol. 37, Issue (05): 847-856.

• Paper •

Exploring fine-grained task-parallel GEMM on
single- and multi-GPU systems

ZHANG Shuai, LI Tao, WANG Yifeng, JIAO Xiaofan, YANG Yulu

  1. (College of Computer and Control Engineering, Nankai University, Tianjin 300071, China)
  • Received: 2014-10-15  Revised: 2014-12-20  Online: 2015-05-25  Published: 2015-05-25

Abstract:

Dense linear algebra (DLA), which is critical to many applications such as pattern recognition and bioinformatics, depends heavily on the general matrix-matrix multiplication (GEMM) routine. In the current cuBLAS and MAGMA libraries, GEMM is implemented with kernel functions that achieve high performance for large matrices. However, these kernels are inefficient for multiple independent small matrices, even though cuBLAS provides interfaces for batched small GEMMs. Moreover, they cannot automatically scale across multiple different GPUs with good load balancing. In this paper, we propose task-parallel GEMM (TPGEMM), which exploits fine-grained task parallelism for batched and multi-GPU GEMMs. The workloads of one or more GEMMs are decomposed into tasks that are scheduled to persistent GPU kernels at runtime. TPGEMM avoids the overhead of launching multiple kernels and achieves better performance for batched small GEMMs than the cuBLAS and MAGMA libraries. Based on its low-overhead fine-grained task scheduling, TPGEMM supports auto-parallelization across multiple GPUs and achieves an efficiency close to 100% on a workstation with four different GPUs.
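The scheduling idea described above (decomposing GEMMs into fine-grained tasks that persistent kernels pull at runtime) can be illustrated with a minimal CPU-side sketch. This is not the paper's implementation: the function names are hypothetical, and ordinary worker threads pulling from a shared queue stand in for persistent GPU kernels. The key property it demonstrates is that faster workers naturally consume more tasks, which is how dynamic scheduling yields load balancing across heterogeneous devices.

```python
# Hypothetical sketch of TPGEMM-style scheduling (names are illustrative).
# Each GEMM in a batch is decomposed into independent row-block tasks placed
# in one shared queue; "persistent" workers are launched once and loop,
# pulling tasks until the queue is empty.
import queue
import threading


def batched_tpgemm(problems, num_workers=4, tile=2):
    """problems: list of (A, B) pairs (row-major lists); returns C = A @ B for each."""
    tasks = queue.Queue()
    results = []
    for idx, (A, B) in enumerate(problems):
        n, m = len(A), len(B[0])
        results.append([[0.0] * m for _ in range(n)])
        # Fine-grained decomposition: one task per block of `tile` rows.
        for r0 in range(0, n, tile):
            tasks.put((idx, r0, min(r0 + tile, n)))

    def worker():
        # Stand-in for a persistent kernel: launched once, then loops over tasks.
        while True:
            try:
                idx, r0, r1 = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained: the "kernel" exits
            A, B = problems[idx]
            C = results[idx]
            k = len(B)
            for i in range(r0, r1):
                for j in range(len(B[0])):
                    C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because every task goes through one queue, the batch size and per-GEMM dimensions never need to match the worker count, which mirrors how TPGEMM can serve many independent small GEMMs without launching one kernel per matrix.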

Key words: GEMM; persistent kernel; task parallelism; load balancing