• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2013, Vol. 35 ›› Issue (11): 34-41.

Design and Implementation of a Large-Point 1D FFT on GPU

HE Tao (何涛)1,2, ZHU Daiyin (朱岱寅)1

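  1. College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu 210016;
  2. Institute of Radar and Electronic Equipment, Aviation Industry Corporation of China, Wuxi, Jiangsu 214063
  • Received: 2013-06-08  Revised: 2013-09-02  Online: 2013-11-25  Published: 2013-11-25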

Design and implementation of largepoint 1D FFT on GPU 

HE Tao1,2, ZHU Daiyin1

  1. College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016;
  2. Institute of Radar and Electronic Equipment, Aviation Industry Corporation of China, Wuxi 214063, China
  • Received: 2013-06-08  Revised: 2013-09-02  Online: 2013-11-25  Published: 2013-11-25

Abstract:

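Given the GPU's powerful computing performance and advanced parallel processor architecture, this paper studies a parallel design method that maps a parallel FFT algorithm onto the CUDA model. The design is optimized according to the main design guidelines for GPU platforms, such as reducing global memory accesses in kernel functions, coalescing global memory accesses, using shared memory efficiently, and keeping computational intensity high. A large-point 1D FFT is designed and implemented on a Tesla C2075 GPU based on the NVIDIA Fermi architecture. Experimental results demonstrate the feasibility and efficiency of the method: for sizes up to 256K points it outperforms the CUFFT library, with a maximum speedup of 2.1 times over the CUFFT 4.0 library.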

Keywords: CUDA 4.0, fast Fourier transform, GPU, high-performance computing

Abstract:

Considering the GPU's powerful computing performance and advanced parallel processor architecture, a parallel design method is studied that maps the parallel FFT algorithm onto the CUDA architecture. The method follows the main optimization principles for GPU platforms, such as reducing global memory accesses, coalescing global memory accesses, using shared memory efficiently, and performing compute-intensive work. A large-point 1D FFT is then implemented on an NVIDIA Tesla C2075 GPU based on the NVIDIA Fermi architecture. Experimental results show that the method is feasible and effective: it outperforms the CUFFT library when the number of points is not larger than 256K, reaching a speedup of up to 2.1 times over the CUFFT 4.0 library.

Key words: CUDA 4.0; fast Fourier transform; GPU; high performance computing
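
The abstract does not say how the large transform is partitioned. As an illustration only, the sketch below assumes the common Cooley-Tukey split of a length n1*n2 transform into batches of short transforms, which is one way a large-point FFT can be broken into pieces small enough to process in fast on-chip memory. It is written in plain Python rather than CUDA, and the names `dft` and `split_fft` are hypothetical, not taken from the paper.

```python
import cmath


def dft(x):
    """Direct O(n^2) DFT; stands in for a small sub-transform and serves as a reference."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def split_fft(x, n1, n2):
    """Length n1*n2 DFT built from n2 transforms of length n1, a twiddle
    multiplication, and n1 transforms of length n2 (assumed decomposition)."""
    n = n1 * n2
    assert len(x) == n

    # Step 1: for each residue j2, transform the strided subsequence
    # x[j2], x[j2 + n2], ... (length n1). cols[j2][k1] holds the result.
    cols = [dft(x[j2::n2]) for j2 in range(n2)]

    # Step 2: multiply by the twiddle factors exp(-2*pi*i * j2 * k1 / n).
    for j2 in range(n2):
        for k1 in range(n1):
            cols[j2][k1] *= cmath.exp(-2j * cmath.pi * j2 * k1 / n)

    # Step 3: for each k1, transform across j2 (length n2) and scatter
    # the results to output index k1 + n1 * k2.
    out = [0j] * n
    for k1 in range(n1):
        row = dft([cols[j2][k1] for j2 in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out
```

In a GPU setting, each short transform in steps 1 and 3 would be a natural unit of work for one thread block, with the strided reads and writes being the places where coalesced global memory access matters; that mapping is an assumption here, not a description of the authors' kernels.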