Efficient and transparent CPU-GPU data communication through partial page migration

Computer Engineering & Science

ZHANG Shiqing, YANG Yaohua, SHEN Li, WANG Zhiying

(School of Computer, National University of Defense Technology, Changsha, Hunan 410073, China)

Received: 2018-10-19 Revised: 2018-12-11 Online: 2019-07-25 Published: 2019-07-25

Abstract:

Despite growing investment in research on integrated GPUs and next-generation interconnects, discrete GPUs connected via PCI Express still dominate the market, and the management of data communication between CPUs and GPUs continues to evolve. Initially, programmers explicitly controlled data transfers between the CPU and the GPU. To simplify programming, GPU vendors developed a programming model that provides a single virtual address space for “CPU + GPU” heterogeneous systems. The page migration mechanism in this model automatically migrates pages between the CPU and the GPU on demand. To meet the needs of high-performance workloads, page sizes are trending larger. Because the interconnect has low bandwidth and high latency, migrating a larger page takes longer, which can reduce the overlap of computation and transfer and cause severe performance degradation. We propose a partial page migration mechanism that transfers only the requested part of a page, shortening migration latency and avoiding the performance degradation that whole page migration suffers as pages grow. Experiments show that, with a page size of 2 MB and a PCI Express bandwidth of 16 GB/s, partial page migration largely hides the performance overhead of whole page migration: compared with programmer-controlled data transfer, whole page migration incurs an average slowdown of 98.62×, whereas partial page migration achieves an average speedup of 1.29×. Additionally, we examine the impact of page size on the TLB miss rate and the impact of migration unit size on execution time, enabling designers to make informed decisions based on this information.

Key words: heterogeneous “CPU + GPU” system; data communication; page migration
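
As a rough illustration of why migration latency grows with page size, the following Python sketch estimates the pure link transfer time for a whole 2 MB page versus a smaller migration unit at the 16 GB/s peak bandwidth quoted in the abstract. It is not the authors' model or simulator. The function name `transfer_time_us`, the 4 KB migration unit, and the reading of 2 MB as 2 × 1024 × 1024 bytes are assumptions made for illustration, and the estimate ignores page-fault handling and any per-transfer setup cost unless one is passed in explicitly.

```python
# Back-of-the-envelope estimate of migration transfer time.
# Assumptions (not from the paper): the helper name, the 4 KB unit,
# and 2 MB interpreted as 2 * 1024 * 1024 bytes.


def transfer_time_us(num_bytes: int, bandwidth_gb_per_s: float,
                     fixed_latency_us: float = 0.0) -> float:
    """Return the time in microseconds to move num_bytes over a link.

    bandwidth_gb_per_s is the link bandwidth in 10^9 bytes per second.
    fixed_latency_us is an optional per-transfer setup cost (default 0).
    """
    bytes_per_us = bandwidth_gb_per_s * 1e9 / 1e6  # GB/s -> bytes per microsecond
    return fixed_latency_us + num_bytes / bytes_per_us


PAGE_SIZE = 2 * 1024 * 1024   # 2 MB page, as in the abstract
MIGRATION_UNIT = 4 * 1024     # hypothetical partial-migration unit
PCIE_BANDWIDTH_GB_S = 16.0    # peak PCI Express bandwidth from the abstract

whole_page_us = transfer_time_us(PAGE_SIZE, PCIE_BANDWIDTH_GB_S)
partial_us = transfer_time_us(MIGRATION_UNIT, PCIE_BANDWIDTH_GB_S)

print(f"whole page transfer : {whole_page_us:.2f} us")
print(f"partial transfer    : {partial_us:.3f} us")
print(f"ratio               : {whole_page_us / partial_us:.0f}x")
```

Under these assumptions, a whole 2 MB page occupies the link for about 131 µs, while a 4 KB unit needs about 0.26 µs, a 512-fold difference in the time a faulting access may have to wait. Real systems add fault-handling and setup overheads on top of this, so the figures indicate scale only.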

ZHANG Shiqing, YANG Yaohua, SHEN Li, WANG Zhiying. Efficient and transparent CPU-GPU data communication through partial page migration[J]. Computer Engineering & Science.

