• A journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (05): 847-856.



Exploring fine-grained task-parallel GEMM on single- and multi-GPU systems

ZHANG Shuai,LI Tao,WANG Yifeng,JIAO Xiaofan,YANG Yulu   

  1. (College of Computer and Control Engineering,Nankai University,Tianjin 300071,China)
  • Received:2014-10-15 Revised:2014-12-20 Online:2015-05-25 Published:2015-05-25


Abstract:

Dense linear algebra (DLA) is critical to many applications, such as pattern recognition and bioinformatics, and depends heavily on the general matrix-matrix multiplication (GEMM) routine. In the current cuBLAS and MAGMA libraries, GEMM is implemented as a set of kernel functions and achieves high performance for large GEMMs. However, these implementations are not efficient for batches of many independent small matrices, even though cuBLAS provides interfaces for batched small GEMMs. Moreover, they cannot automatically scale across multiple GPUs of different performance with good load balancing. In this paper, we propose task-parallel GEMM (TPGEMM), which exploits fine-grained task parallelism for batched and multi-GPU GEMMs. The workload of one or more GEMMs is decomposed into tasks that are dynamically scheduled to persistent GPU kernels at runtime. TPGEMM avoids the overhead of launching multiple kernels and achieves significantly better performance for batched small GEMMs than cuBLAS and MAGMA. Building on this low-overhead fine-grained task scheduling, TPGEMM supports automatic parallelization of a single GEMM across multiple GPUs and achieves close to 100% scaling efficiency on a workstation with four GPUs of different performance.

Key words: GEMM; persistent kernel; task parallelism; load balancing
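The fine-grained scheme described in the abstract can be illustrated on the host side: each GEMM in a batch is decomposed into output-tile tasks, and a pool of persistent workers (standing in for the paper's persistent GPU kernels) repeatedly claims the next task from a shared counter until the queue is drained, which naturally balances load across workers of different speed. This is only a minimal Python sketch of the scheduling idea; the task layout, names, tile size, and worker count are illustrative assumptions, not the paper's CUDA implementation.

```python
# Host-side sketch of TPGEMM-style fine-grained task scheduling (illustrative,
# not the paper's CUDA code): each GEMM in a batch is split into output-tile
# tasks, and persistent workers pull tasks from a shared counter until done.
import threading

TILE = 2  # output-tile edge; real GPU kernels would use much larger tiles

def make_tasks(batch):
    """One task per TILE x TILE output tile of each GEMM in the batch."""
    tasks = []
    for idx, (A, B) in enumerate(batch):
        for i in range(0, len(A), TILE):
            for j in range(0, len(B[0]), TILE):
                tasks.append((idx, i, j))
    return tasks

def batched_gemm(batch, num_workers=4):
    tasks = make_tasks(batch)
    outs = [[[0] * len(B[0]) for _ in A] for A, B in batch]
    next_task = [0]
    lock = threading.Lock()  # stands in for the GPU's atomic task counter

    def worker():
        while True:
            with lock:  # atomically claim the next task index
                t = next_task[0]
                next_task[0] += 1
            if t >= len(tasks):
                return  # queue drained: the persistent worker exits
            idx, i, j = tasks[t]
            A, B, C = batch[idx][0], batch[idx][1], outs[idx]
            # Compute one output tile; tiles are disjoint, so no write races.
            for ii in range(i, min(i + TILE, len(A))):
                for jj in range(j, min(j + TILE, len(B[0]))):
                    C[ii][jj] = sum(A[ii][k] * B[k][jj] for k in range(len(B)))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return outs

# Usage: a batch of two small GEMMs, computed by the worker pool.
batch = [
    ([[1, 2], [3, 4]], [[5, 6], [7, 8]]),
    ([[2, 0], [0, 2]], [[1, 1], [1, 1]]),
]
print(batched_gemm(batch))  # → [[[19, 22], [43, 50]], [[2, 2], [2, 2]]]
```

Because slower workers simply claim fewer tasks, the same pull-based loop also models how a single GEMM's tiles could spread across GPUs of different performance without static partitioning.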