• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2014, Vol. 36 ›› Issue (02): 216-221.

• Articles •

Design and implementation of a NIC based RDMA reliable communication protocol

XIA Jun, PANG Zhengbin, LIU Lu, ZHANG Jun, CHANG Junsheng

  1. (College of Computer, National University of Defense Technology, Changsha 410073, Hunan, China)
  • Received: 2013-07-10 Revised: 2013-10-06 Online: 2014-02-25 Published: 2014-02-25
  • Supported by:

    National Natural Science Foundation of China (61103083, 61133007); National 863 Program of China (2012AA01A301)

Abstract:

As high performance computers grow in scale and complexity, reliability has become a key factor affecting system availability. The interconnection network is an essential component of a high performance computer, and its reliability must be addressed in system design. To handle the faults that may occur in the interconnection network, this paper proposes an RDMA reliable communication protocol implemented in the NIC, presents a general design and implementation scheme, and discusses several specific optimizations of that scheme. The protocol and its implementation tolerate multiple kinds of network faults while keeping the extra overhead of reliable communication as low as possible. Experimental results show that the measured performance of the reliable RDMA communication is comparable to that of connectionless RDMA communication.

Key words: RDMA; reliability; network interface; reliable communication protocol