• Official journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学) ›› 2021, Vol. 43 ›› Issue (05): 799-806.

• High Performance Computing •

Optimization of Gaussian filtering algorithm on FT-M7002

CHEN Yun (陈云)1,2, WANG Meng-yuan (王梦园)1,2, CHAI Xiao-nan (柴晓楠)1,2, SHANG Jian-dong (商建东)1,2

  1. School of Information Engineering, Zhengzhou University, Zhengzhou 450001, Henan, China
  2. Supercomputing Center of Henan Province (Zhengzhou University), Zhengzhou 450052, Henan, China

  • Received: 2020-12-17 Revised: 2021-03-04 Accepted: 2021-05-25 Online: 2021-05-25 Published: 2021-05-19

Abstract: The adoption of the domestically developed Feiteng series of high-performance DSP processors in image processing has created strong demand for high-performance image processing algorithms on this platform. Gaussian filtering, a basic image processing algorithm, effectively removes Gaussian noise from images and is widely used in the field. Based on the architectural characteristics of the Feiteng high-performance DSP and the properties of the Gaussian filtering algorithm, this paper implements an optimized Gaussian filter for the Feiteng high-performance DSP. Manual vectorization, control-flow elimination, and loop unrolling are used to exploit data-level and instruction-level parallelism, reducing the number of memory accesses and improving instruction execution efficiency. To suit the DMA hardware and vector memory structure of the FT-MT2 core, optimizations such as ping-pong buffering and DMA array transposition are applied to reduce data transfer time and improve data locality. Tests across multiple filter kernel sizes and image matrix sizes show that the optimized parallel implementation achieves a speedup of 1.3 to 1.41 times over the serial implementation of the Gaussian filtering algorithm. With the cache enabled, it runs 1.15 to 1.71 times faster than the Gaussian filtering algorithm from the dsplib library on the TMS320C6678 platform.




Keywords: high-performance DSP, Gaussian filtering, vector parallel optimization, DMA transfer optimization
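
The control-flow elimination and loop unrolling named in the abstract can be illustrated on a scalar model of the filter. The Python sketch below is an illustration only and is not the authors' FT-M7002 code: the function names, the replicate-border padding, the separable two-pass structure, and the unroll factor of 4 are all assumptions made for this example. Padding each row once removes the per-pixel boundary test from the inner loop, and the unrolled loop keeps four independent accumulators, which is roughly the shape a vector unit would process in lockstep. The transpose between passes only hints at why a transposed layout helps the column pass; the DMA mechanism itself is not modeled.

```python
import math


def gaussian_kernel(size, sigma):
    """Return a normalized 1-D Gaussian kernel with an odd number of taps."""
    if size % 2 == 0:
        raise ValueError("kernel size must be odd")
    r = size // 2
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-r, r + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def filter_row(row, kernel):
    """Filter one non-empty row; borders are replicated (an assumption)."""
    r = len(kernel) // 2
    n = len(row)
    # Pad once up front so the inner loops never test for the image boundary.
    padded = [row[0]] * r + list(row) + [row[-1]] * r
    out = [0.0] * n
    x = 0
    # Main loop, unrolled by 4 (hypothetical factor): four outputs per pass.
    while x + 4 <= n:
        a0 = a1 = a2 = a3 = 0.0
        for k, w in enumerate(kernel):
            base = x + k
            a0 += w * padded[base]
            a1 += w * padded[base + 1]
            a2 += w * padded[base + 2]
            a3 += w * padded[base + 3]
        out[x], out[x + 1], out[x + 2], out[x + 3] = a0, a1, a2, a3
        x += 4
    # Remainder loop for widths that are not a multiple of 4.
    while x < n:
        acc = 0.0
        for k, w in enumerate(kernel):
            acc += w * padded[x + k]
        out[x] = acc
        x += 1
    return out


def transpose(image):
    """Swap rows and columns of a rectangular list-of-lists image."""
    return [list(col) for col in zip(*image)]


def gaussian_filter(image, size=3, sigma=1.0):
    """Separable Gaussian filter: a row pass, then a row pass on the transpose."""
    kernel = gaussian_kernel(size, sigma)
    rows_done = [filter_row(row, kernel) for row in image]
    cols_done = [filter_row(col, kernel) for col in transpose(rows_done)]
    return transpose(cols_done)
```

Calling `gaussian_filter(image, size=5, sigma=1.2)` on a rectangular list of rows returns a filtered image of the same shape. The two 1-D passes rely on the Gaussian kernel being separable, so the column pass can reuse the same contiguous row routine after a transpose.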