[1] NVIDIA. GPUs are only up to 14 times faster than CPUs says Intel [EB/OL]. (2010-06-01). http://blogs.nvidia.com/ntersect/2010/06/gpus-are-only-up-to-14-times-faster-than-cpus-says-intel.html.
[2] http://www.top500.org.
[3] Luk C K, Hong S, Kim H. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping [C]//Proc of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009: 45-55.
[4] Hong S, Kim H. An integrated GPU power and performance model [C]//Proc of the 37th Annual International Symposium on Computer Architecture, 2010: 280-289.
[5] Sim J, Dasgupta A, Kim H, et al. A performance analysis framework for identifying potential benefits in GPGPU applications [C]//Proc of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012: 11-22.
[6] Samadi M, Hormati A, Mehrara M, et al. Adaptive input-aware compilation for graphics engines [C]//Proc of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, 2012: 13-22.
[7] Peng Di, Xue Jingling. Model-driven tile size selection for DOACROSS loops on GPUs [C]//Proc of the 17th International Conference on Parallel Processing, 2011: 401-412.
[8] Choi J W, Singh A, Vuduc R W. Model-driven autotuning of sparse matrix-vector multiply on GPUs [C]//Proc of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010: 115-126.
[9] http://en.wikipedia.org/wiki/Graphics_processing_unit.
[10] Owens J D, Luebke D, Govindaraju N, et al. A survey of general-purpose computation on graphics hardware [C]//Proc of Eurographics 2005, 2005: 21-51.
[11] Wang Feng, Yang Canqun, Du Yunfei, et al. Optimizing Linpack benchmark on GPU-accelerated petascale supercomputer [J]. Journal of Computer Science and Technology, 2011, 26(5): 854-865.
[12] Yang Canqun, Wang Feng, Du Yunfei, et al. Adaptive optimization for petascale heterogeneous CPU/GPU computing [C]//Proc of the IEEE International Conference on Cluster Computing, 2010: 19-28.
[13] Quintana-Ortí G, Igual F D, Quintana-Ortí E S, et al. Solving dense linear systems on platforms with multiple hardware accelerators [C]//Proc of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009: 121-130.
[14] Linderman M D, Collins J D, Wang Hong, et al. Merge: A programming model for heterogeneous multi-core systems [C]//Proc of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, 2008: 287-296.
[15] http://www.khronos.org/opencl.
[16] http://developer.nvidia.com/cuda/cuda-toolkit.
[17] http://www.openacc-standard.org.
[18] Kim Jungwon, Seo Sangmin, Lee Jun, et al. SnuCL: An OpenCL framework for heterogeneous CPU/GPU clusters [C]//Proc of the 26th ACM International Conference on Supercomputing, 2012: 341-352.
[19] Ryoo S, Rodrigues C I, Stone S S, et al. Program optimization space pruning for a multithreaded GPU [C]//Proc of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2008: 195-204.
[20] Hong S, Kim H. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness [C]//Proc of the 36th Annual International Symposium on Computer Architecture, 2009: 152-163.
[21] Yuan Liang, Zhang Yunquan, Long Guoping, et al. A GPU computational model based on latency hidden factor [J]. Journal of Software, 2010, 21(Suppl.): 251-262. (in Chinese)
[22] Zhang Y, Owens J D. A quantitative performance analysis model for GPU architectures [C]//Proc of the IEEE 17th International Symposium on High Performance Computer Architecture, 2011: 382-393.
Appendix: Chinese-language reference:
[21] 袁良, 张云泉, 龙国平, 等. 基于延迟隐藏因子的GPU计算模型 [J]. 软件学报, 2010, 21(增刊): 251-262.