• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science, 2021, Vol. 43, Issue (04): 662-669.

• High Performance Computing •

High-performance implementation and optimization of Square Root function based on SIMD

(Original Chinese title: 基于SIMD的Square Root函数高性能实现与优化)

ZHAO Yong-hao (赵永浩)1,2, JIA Hai-peng (贾海鹏)2, ZHANG Yun-quan (张云泉)2, ZHANG Si-jia (张思佳)1

  1. College of Information Engineering, Dalian Ocean University, Dalian 116023, Liaoning, China

  2. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China

  • Received: 2020-12-22 Revised: 2021-01-26 Accepted: 2021-04-25 Online: 2021-04-25 Published: 2021-04-21
  • Funding:
    National Key R&D Program of China (2017YFB0202105, 2018YFC0809306, 2016YFB0200803, 2017YFB0202302); National Natural Science Foundation of China (61972376); Beijing Natural Science Foundation (L182053)

Abstract: In application scenarios such as computer graphics, integral computation, and neural networks, a high-performance implementation of the square root function plays an important role in building the basic software ecosystem of a processor. As ARM processors come into widespread use, fast implementations of mathematical functions on the ARM architecture become more critical. A large number of current processors also adopt a SIMD architecture, so SIMD-based high-performance function computation is a significant and promising research topic. This paper therefore implements and optimizes the square root function for high performance. First, an efficient square root algorithm is designed by analyzing how IEEE 754 floating-point numbers are stored in memory. Next, the accuracy of the algorithm is improved by combining the reciprocal square root with a Taylor-formula algorithm. Finally, SIMD optimization further improves the performance of the algorithm. Experimental results show that, while meeting the accuracy requirement, the implemented square root function improves performance by about 7 times over the libm library and by about 3 times over the square root instruction provided by ARM V8.

Key words: square root function, SIMD, high performance, numerical analysis, ARM V8 architecture
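
The three steps named in the abstract (a bit-level estimate from the IEEE 754 layout, a refinement that combines the reciprocal square root with a Taylor expansion, and a data-parallel loop) can be illustrated with a minimal C sketch. This is not the authors' implementation, which the abstract does not spell out. The magic constant 0x5f3759df (the well-known fast reciprocal square root estimate), the second-order Taylor correction, and the names approx_sqrtf and approx_sqrtf_batch are all assumptions chosen for illustration, and the code handles only positive, normal, finite single-precision inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the paper): approximates sqrt(x)
 * for positive, normal, finite single-precision x. */
static float approx_sqrtf(float x)
{
    assert(x > 0.0f);   /* rejects zero, negatives, and NaN */

    /* Step 1: reinterpret the IEEE 754 bit pattern as an integer.
     * Shifting right halves the biased exponent, and subtracting from
     * the magic constant negates it, giving a rough 1/sqrt(x). */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);
    float y;
    memcpy(&y, &bits, sizeof y);

    /* Step 2: refine with a Taylor expansion. With e = 1 - x*y*y,
     * 1/sqrt(x) = y * (1 - e)^(-1/2) ~= y * (1 + e/2 + 3*e*e/8).
     * Two rounds are an assumed choice for single precision. */
    for (int i = 0; i < 2; ++i) {
        float e = 1.0f - x * y * y;
        y *= 1.0f + e * (0.5f + 0.375f * e);
    }

    /* Step 3: sqrt(x) = x * (1/sqrt(x)). */
    return x * y;
}

/* Hypothetical batch wrapper: applies the same arithmetic to every
 * element, which is the per-lane pattern a SIMD version would follow. */
static void approx_sqrtf_batch(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = approx_sqrtf(in[i]);
}
```

The refinement uses only multiplications and additions, with no division and no data-dependent branch, so in principle the same sequence can be applied to several values at once in SIMD registers. The speedups reported in the abstract refer to the authors' own implementation, not to this sketch.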