• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• Papers •

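Modeling and evaluating the Intel IMCI vgather instruction using stencils

James Lin1,2, WANG Yi-chao1, QIN Qiang1, LI Shuo3, WEN Min-hua1, Satoshi Matsuoka2

  1. (1. Center for High Performance Computing, Shanghai Jiao Tong University, Shanghai 200240, China;
    2. Tokyo Institute of Technology, Tokyo 152-8550, Japan; 3. Intel Corporation, Portland, OR 97124, USA)
  • Received: 2015-12-11 Revised: 2016-03-21 Online: 2016-09-25 Published: 2016-09-25
  • Funding:

    National 863 Program of China (2014AA01A302); RONPAKU Fellowship of the Japan Society for the Promotion of Science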

Modeling and evaluating Intel IMCI vgather instruction using stencils

James Lin1,2, WANG Yi-chao1, QIN Qiang1, LI Shuo3, WEN Min-hua1, Satoshi Matsuoka2

  1. (1. Center for High Performance Computing, Shanghai Jiao Tong University, Shanghai 200240, China;
    2. Tokyo Institute of Technology, Tokyo 152-8550, Japan;
    3. Intel Corporation, Portland, OR 97124, USA)
  • Received: 2015-12-11 Revised: 2016-03-21 Online: 2016-09-25 Published: 2016-09-25

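Abstract (translated from the Chinese):

IMCI, the instruction set of the Intel Xeon Phi coprocessor, introduces a hardware-implemented vgather instruction intended to help 512-bit SIMD registers access data at non-contiguous memory addresses. Experimental results show, however, that vgather is likely to become one of the key performance bottlenecks for applications running on the Xeon Phi coprocessor. Performance modeling of vgather can therefore help users understand the performance characteristics of the coprocessor in depth. Existing methods gather statistics by embedding assembly code in program segments. Our method instead uses performance analysis tools such as PAPI to collect hardware counter statistics directly and uses them as the experimental data for the model. The performance model is built from the number of AGI (Address Generation Interlock) events and the average latency caused by vgather, which is estimated from the number of VPU_DATA_READ events. The model predicts the total latency caused by vgather in Xeon Phi application code. To verify the accuracy of the predictions, we apply the model to a 3D 7-point stencil code. The prediction shows that vgather accounts for about 40% of total computation time. We then compare this result with the computation time measured after removing vgather by means of intrinsics, and the comparison shows that the prediction is accurate. In summary, we build a performance model for vgather on the Xeon Phi coprocessor from hardware counter statistics. By comparison with vgather on other platforms, we also believe that the model can be applied to Intel CPU platforms that provide vgather.

Keywords: performance modeling, vgather, Xeon Phi, hardware counters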
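For readers unfamiliar with the benchmark kernel, the following is a minimal reference sketch of what one sweep of a 3D 7-point stencil computes: each interior point is updated from itself and its six face neighbors. The function name, the coefficients `c_center` and `c_neighbor`, and the nested-list layout are assumptions made for illustration. This is not the authors' code and says nothing about how their kernel is vectorized on Xeon Phi.

```python
def stencil_3d7p(grid, c_center=0.4, c_neighbor=0.1):
    """Apply one sweep of a 3D 7-point stencil to a nested-list grid.

    grid is indexed as grid[z][y][x]. Boundary points are copied unchanged.
    The coefficients are illustrative assumptions, not values from the paper.
    """
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    # Start from a copy so that boundary values carry over untouched.
    out = [[row[:] for row in plane] for plane in grid]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                neighbors = (
                    grid[z][y][x - 1] + grid[z][y][x + 1]    # x direction
                    + grid[z][y - 1][x] + grid[z][y + 1][x]  # y direction
                    + grid[z - 1][y][x] + grid[z + 1][y][x]  # z direction
                )
                out[z][y][x] = c_center * grid[z][y][x] + c_neighbor * neighbors
    return out


# A 3x3x3 grid of ones has exactly one interior point, at index [1][1][1].
ones = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
swept = stencil_3d7p(ones)
```

Each interior update reads seven values, which is where the "7-point" name comes from.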

Abstract:

Vgather is a hardwareimplemented vector instruction introduced by Intel Initial ManyCore Instructions (IMCI) for Xeon Phi. Its target is to help SIMD registers access data from noncontiguous memory locations. However, experimental results show that it can also be one of the key performance bottlenecks on Xeon Phi. We model the performance of Vgather by using the profiling tool PAPI to directly collect the results of hardware performance counters. Address Generation Interlock (AGI) events are profiled as the number of Vgather and the average latency of Vgather are estimated with VPU_DATA_READ events based on which we model the total latencies of Vgather instructions. 3D7P stencils are used to evaluate our model and the results show that Vgather spents nearly 40% of total kernel time. We implement a Vgatherfree version with intrinsic instruction to validate this model. Our contribution includes modeling Intel IMCI vgather instruction with hardware counters and validating it by stencils. The model can also be applicable on CPUs.

Key words: performance modeling, vgather, Xeon Phi, hardware performance counters
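The arithmetic implied by the abstract can be sketched as follows: the total vgather latency is taken as the number of vgather instructions (counted through AGI events) multiplied by an average per-instruction latency. The abstract does not say how that average is derived from VPU_DATA_READ, so the sketch simply takes it as an input. The function names and all numbers below are hypothetical placeholders, not measurements from the paper.

```python
def vgather_total_latency(agi_events, avg_latency_cycles):
    """Estimate cycles spent in vgather as count times average latency.

    agi_events: AGI event count, used as a proxy for the vgather count.
    avg_latency_cycles: assumed average latency of one vgather, in cycles.
    """
    if agi_events < 0 or avg_latency_cycles < 0:
        raise ValueError("counts and latencies must be non-negative")
    return agi_events * avg_latency_cycles


def vgather_share(agi_events, avg_latency_cycles, kernel_cycles):
    """Fraction of total kernel cycles attributed to vgather."""
    if kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return vgather_total_latency(agi_events, avg_latency_cycles) / kernel_cycles


# Placeholder inputs chosen only to show the calculation (not measured data).
share = vgather_share(
    agi_events=2_000_000,
    avg_latency_cycles=10.0,
    kernel_cycles=80_000_000,
)
```

With real counter readings in place of the placeholders, the resulting share could be compared against the time saved by a vgather-free kernel, which is the validation step the abstract describes.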