[1] Xue W, Yang C, Fu H, et al. Enabling and scaling a global shallow-water atmospheric model on Tianhe-2[C]∥Proc of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014:745-754.
[2] Pennycook S J, Hughes C J, Smelyanskiy M, et al. Exploring SIMD for molecular dynamics, using Intel Xeon processors and Intel Xeon Phi coprocessors[C]∥Proc of IPDPS ’13, 2013:1085-1097.
[3] Hofmann J. Performance evaluation of the Intel many integrated core architecture for 3D image reconstruction in computed tomography[D]. Erlangen-Nuremberg: Friedrich-Alexander-University Erlangen-Nuremberg, 2013.
[4] Hofmann J, Treibig J, Hager G, et al. Performance engineering for a medical imaging application on the Intel Xeon Phi accelerator[C]∥Proc of the 27th International Conference on Architecture of Computing Systems (ARCS 2014), 2014:1-8.
[5] Krishnaiyer R, Kultursay E, Chawla P, et al. Compiler-based data prefetching and streaming non-temporal store generation for the Intel(R) Xeon Phi(TM) Coprocessor[C]∥Proc of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, 2013:1575-1586.
[6] Peraza J, Tiwari A, Laurenzano M, et al. Understanding the performance of stencil computations on Intel’s Xeon Phi[C]∥Proc of 2013 IEEE International Conference on Cluster Computing (CLUSTER13), 2013:1-5.
[7] Lin J, Nukada A, Matsuoka S. Modeling gather and scatter with hardware performance counters for Xeon Phi[J]. ACM/IEEE International Symposium on Cluster, Cloud & Grid Computing, 2015:713-716.
[8] Henretty T, Stock K, Pouchet L-N, et al. Data layout transformation for stencil computations on short-vector SIMD architectures[C]∥Proc of International Conference on Compiler Construction, 2011:225-245.
[9] Lin Xinhua, Li Shuo, Zhao Jiaming, et al. Node-level memory access optimization on Intel Knights Corner[J]. Computer Science, 2015, 42(11):37-42. (in Chinese)
[10] Treibig J, Hager G. Introducing a performance model for bandwidth-limited loop kernels[C]∥Proc of the 18th International Conference on Parallel Processing and Applied Mathematics, 2010:615-624.
[11] Ramos S, Hoefler T. Modeling communication in cache-coherent SMP systems: A case study with Xeon Phi[C]∥Proc of the 22nd International Symposium on High-performance Parallel and Distributed Computing, 2013:97.
[12] Zhang Y, Owens J D. A quantitative performance analysis model for GPU architectures[C]∥Proc of the 17th IEEE International Symposium on High Performance Computer Architecture (HPCA 2011), 2011:382-393.
[13] Hong S, Kim H. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness[C]∥Proc of the 36th Annual International Symposium on Computer Architecture, 2009:152-163.
[14] Browne S, Dongarra J, Garner N, et al. A scalable cross-platform infrastructure for application performance tuning using hardware counters[C]∥Proc of ACM/IEEE 2000 Supercomputing Conference (SC00), 2000:42.