  • Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

A matrix multiplication accelerator design for an optimized blocking strategy

SHEN Jun-zhong, XIAO Tao, QIAO Yu-ran, YANG Qian-ming, WEN Mei

  1. (College of Computer, National University of Defense Technology, Changsha 410073, China)
  • Received: 2015-12-10  Revised: 2016-03-16  Online: 2016-09-25  Published: 2016-09-25

Abstract:

Large-scale floating-point matrix multiplication is one of the most time-consuming computational kernels in many applications. In emerging applications, the matrices often have at least one small dimension; this case is called non-uniform large-scale matrix multiplication. Because the on-chip memory available on an FPGA for storing intermediate results is limited, a large-scale matrix multiplication must be partitioned into fine-grained sub-block computational tasks. When accelerating non-uniform matrix multiplications, most existing hardware matrix multipliers with a linear array architecture can suffer a large performance loss because they support only a fixed sub-block size. To solve this problem, we propose an efficient optimized blocking strategy. Based on it, we implement a novel matrix multiplier that supports variable sub-block operations on a Xilinx Zynq XC7Z045 FPGA. Integrating 224 processing elements (PEs), the multiplier achieves up to 48 GFLOPS for non-uniform matrix multiplication in a real application at 150 MHz, while requiring 4.8 GB/s of memory bandwidth. Results show that the proposed blocking strategy improves performance by up to 12% compared with traditional blocking algorithms.

Key words: FPGA, non-uniform matrix, matrix multiplication, blocking strategy
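
The blocking idea in the abstract can be illustrated with a minimal software sketch. It is not the authors' hardware design or their blocking strategy: the function names `blocked_matmul` and `block_utilization`, the parameters `si` and `sj` (sub-block height and width), and the padding model of a fixed-size multiplier are all assumptions made for illustration. The sketch computes the result matrix one sub-block at a time and estimates how much work is useful when every sub-block must be padded to a fixed size.

```python
from math import ceil


def blocked_matmul(a, b, si, sj):
    """Multiply a (M x K) by b (K x N), one si x sj sub-block of C at a time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, si):        # row offset of the current C sub-block
        for j0 in range(0, n, sj):    # column offset of the current C sub-block
            # Edge sub-blocks are clipped to the matrix bounds, not padded.
            for i in range(i0, min(i0 + si, m)):
                for j in range(j0, min(j0 + sj, n)):
                    acc = 0.0
                    for p in range(k):
                        acc += a[i][p] * b[p][j]
                    c[i][j] = acc
    return c


def block_utilization(m, n, si, sj):
    """Fraction of computed results that are useful if every sub-block
    of the M x N result is padded up to a full si x sj block (assumed model)."""
    padded = ceil(m / si) * si * ceil(n / sj) * sj
    return (m * n) / padded


if __name__ == "__main__":
    print(blocked_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], 1, 2))
    # Hypothetical 32 x 1024 result matrix (one small dimension).
    print(block_utilization(32, 1024, 128, 128))  # fixed 128 x 128 sub-blocks
    print(block_utilization(32, 1024, 32, 128))   # sub-block height matched to M
```

Under this assumed padding model, a hypothetical 32 x 1024 result computed with fixed 128 x 128 sub-blocks has a utilization of 0.25, while a 32 x 128 sub-block gives 1.0. These figures are illustrative only and are not results from the paper.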