• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

A matrix multiplication accelerator design supporting an optimized blocking strategy

SHEN Jun-zhong, XIAO Tao, QIAO Yu-ran, YANG Qian-ming, WEN Mei

  1. (College of Computer, National University of Defense Technology, Changsha 410073, China)
  • Received: 2015-12-10 Revised: 2016-03-16 Online: 2016-09-25 Published: 2016-09-25
  • Supported by:

    National 863 Program of China (2012AA012706); National Natural Science Foundation of China (61272145)

Abstract:

Large-scale floating-point matrix multiplication is often one of the most time-consuming computational kernels in many application domains. Emerging applications frequently involve large matrices with at least one very small dimension; we call matrices with this property non-uniform matrices. Because the on-chip memory available on an FPGA for storing intermediate results is very limited, a large-scale matrix multiplication usually has to be partitioned into fine-grained sub-block computational tasks. When accelerating non-uniform matrix multiplication, most existing hardware matrix multipliers with a linear array architecture suffer a large performance loss because they support only a fixed sub-block size. To solve this problem, we propose an efficient optimized blocking strategy. Based on it, we implement a matrix multiplier that supports variable sub-block sizes on a Xilinx Zynq XC7Z045 FPGA. By integrating 224 processing elements (PEs), the multiplier achieves a measured performance of up to 48 GFLOPS at 150 MHz for non-uniform matrix multiplication from real applications, while requiring only 4.8 GB/s of memory bandwidth. Experimental results show that the proposed blocking strategy improves performance by up to 12% compared with traditional blocking algorithms.

Key words: FPGA, non-uniform matrix, matrix multiplication, blocking strategy
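
To make the effect of a fixed sub-block size concrete, the following is a minimal Python sketch, not the authors' design. The abstract does not spell out why a fixed block size hurts, so the sketch assumes the loss comes from padding a dimension up to a multiple of the block size. It also assumes each PE completes one multiply-add (2 floating-point operations) per cycle. All function names (`peak_gflops`, `padded_efficiency`, `best_block`, `blocked_matmul`) are hypothetical.

```python
import math


def peak_gflops(num_pes, clock_mhz, flops_per_pe_per_cycle=2):
    """Theoretical peak in GFLOPS (assumption: one multiply-add per PE per cycle)."""
    return num_pes * flops_per_pe_per_cycle * clock_mhz / 1000.0


def padded_efficiency(dim, block):
    """Useful fraction of work along one dimension when dim is padded
    up to the next multiple of block (assumed source of the loss)."""
    return dim / (math.ceil(dim / block) * block)


def best_block(dim, max_block):
    """Block size in 1..max_block with the least padding; ties go to the larger block."""
    return max(range(1, max_block + 1),
               key=lambda b: (padded_efficiency(dim, b), b))


def blocked_matmul(a, b, bm, bn):
    """Compute C = A x B one bm-by-bn output sub-block at a time.
    Edge sub-blocks are simply smaller, so any block size is accepted."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, bm):              # sub-block row origin
        for j0 in range(0, n, bn):          # sub-block column origin
            for i in range(i0, min(i0 + bm, m)):
                for j in range(j0, min(j0 + bn, n)):
                    acc = 0.0
                    for p in range(k):
                        acc += a[i][p] * b[p][j]
                    c[i][j] = acc
    return c
```

Under these assumptions, `peak_gflops(224, 150)` gives 67.2, so the reported 48 GFLOPS would correspond to roughly 71% of peak. For a dimension of 100 and a block limit of 64, `padded_efficiency(100, 64)` is 0.78125, while `best_block(100, 64)` selects 50, which needs no padding. The actual strategy in the paper may weigh other factors, such as on-chip buffer capacity and memory bandwidth, that this sketch ignores.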