• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2026, Vol. 48 ›› Issue (2): 228-237.

• High Performance Computing •

An RDMA QP communication mechanism of next-generation intelligent computing center

WANG Junliang, LIN Baohong, ZHANG Jiao, SUN Mengyu, PAN Yongchen

  (1. China Telecom Guangdong Research Institute, Guangzhou 510660;
    2. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876;
    3. China Telecom Beijing Research Institute, Beijing 100045, China)
  • Received: 2024-03-07  Revised: 2024-10-23  Online: 2026-02-25  Published: 2026-03-10

Abstract: Intelligent computing centers currently rely mainly on the RDMA (remote direct memory access) protocol for ultra-high-performance communication within clusters, where each pair of communicating processes must establish a queue pair (QP) of the reliable connection (RC) type. In AI large-model scenarios in next-generation large-scale intelligent computing centers, distributed collective communication operations such as All-to-All and AllReduce trigger fully connected communication between processes. Under the RC-based mechanism, the number of QPs that must be maintained exceeds one million, which severely strains the limited memory and performance of RDMA network interface cards (NICs). To address this issue, an RDMA QP communication mechanism named ERD (efficient reliable datagram) is proposed. On the one hand, it replaces traditional RC with RD (reliable datagram) to improve the scalability of QPs on NICs; on the other hand, it designs an RD-based reliable reception mechanism that incorporates packet loss handling and rapid in-order processing in the network stack, ensuring network reliability while improving transmission performance. Experiments and NS3 simulations show that ERD reduces the number of QPs by 99.96% and improves transmission performance by more than 15% under network congestion.
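
The scaling argument in the abstract can be illustrated with a rough count. The sketch below is not the paper's model: it assumes that under RC every local process keeps one QP per peer process in a full mesh, and that a datagram-style scheme needs a fixed number of QPs per local process (one here). The function names and the cluster sizes in the usage lines are hypothetical and chosen only for illustration.

```python
def rc_qps_per_nic(total_procs: int, procs_per_nic: int) -> int:
    """QPs on one NIC under RC, assuming a full mesh:
    each local process holds one QP per other process in the job."""
    return procs_per_nic * (total_procs - 1)


def datagram_qps_per_nic(procs_per_nic: int, qps_per_proc: int = 1) -> int:
    """QPs on one NIC under a datagram-style scheme, assuming each
    local process reuses a fixed number of QPs for all of its peers."""
    return procs_per_nic * qps_per_proc


def reduction_pct(before: int, after: int) -> float:
    """Relative reduction in QP count, in percent."""
    return 100.0 * (before - after) / before


# Hypothetical sizes: 16384 processes in the job, 64 of them sharing one NIC.
rc = rc_qps_per_nic(total_procs=16384, procs_per_nic=64)
dg = datagram_qps_per_nic(procs_per_nic=64)
print(f"RC QPs per NIC:       {rc}")
print(f"Datagram QPs per NIC: {dg}")
print(f"Reduction:            {reduction_pct(rc, dg):.4f}%")
```

Under these assumptions the RC count grows with the total number of processes while the datagram count depends only on the number of local processes, which is why the relative saving approaches 100% as the job grows. The 99.96% figure reported in the abstract comes from the authors' own setup and is not reproduced by this sketch.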

Key words: intelligent computing center network, AI large-model communication, RDMA, QP communication