Computer Engineering & Science, 2025, Vol. 47, Issue 5: 775-786.

• High Performance Computing •

面向算力网络的跨集群数据迁移系统的设计和实现

李俊哲(1,2), 付振新(2,3), 杨宏辉(2), 马银萍(2,3), 李若淼(2,3), 樊春(2,3)

  • Funding: Top Ten Technical Research Projects of Hunan Province, 2023 (2023GK1010)

Design and implementation of a cross-cluster data migration system for computational networks

LI Junzhe(1,2), FU Zhenxin(2,3), YANG Honghui(2), MA Yinping(2,3), LI Ruomiao(2,3), FAN Chun(2,3)

  (1. School of Computer Science, Peking University, Beijing 100871;
    2. Computer Center, Peking University, Beijing 100871;
    3. Changsha Institute for Computing and Digital Economy, Peking University, Changsha 410205, China)
  • Received: 2023-12-29; Revised: 2024-05-26; Online: 2025-05-25; Published: 2025-05-27

Abstract: In building computational networks, efficient and reliable data migration between clusters at computing centers in different regions is a key research problem that bears directly on whether such networks succeed. To address it, this paper designs and implements SCOW-SYNC, a high-performance transmission software based on RSYNC. First, SCOW-SYNC optimizes traditional RSYNC with a queue and thread-pool architecture: it establishes multiple TCP connections in parallel and transmits data over them concurrently, which improves bandwidth utilization. SCOW-SYNC also supports automatic splitting of large files, dynamic compression, background operation, real-time progress queries, and SSH connection pool management. In testing, SCOW-SYNC achieved a speedup ratio of 125% to 130% compared with RSYNC. Second, to improve transmission security, the paper proposes a reliable cross-cluster transmission system architecture for computing centers: data transmission is initiated only between "transmission nodes" and is encrypted with "transmission keys", which a "management node" dynamically checks, generates, and distributes. Finally, SCOW-SYNC is integrated into SCOW, a high-performance computing portal and management platform, as its cross-cluster transmission module, so that users can perform high-performance data migration between clusters from a web browser. The system has been deployed through containerization in the cross-cluster environment of Peking University, where it has improved production efficiency.
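
The queue and thread-pool idea, together with large-file splitting, can be illustrated with a minimal Python sketch. It is not the SCOW-SYNC implementation: the names `Task`, `plan_tasks`, and `run_parallel`, the chunking rule, and the `transfer` callback are all hypothetical, and the callback stands in for whatever a real worker would do (for example, launching one RSYNC process over its own connection).

```python
import queue
import threading
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Task:
    """One unit of work: a byte range of a file (hypothetical structure)."""
    path: str
    offset: int
    length: int


def plan_tasks(files: Iterable[Tuple[str, int]], chunk_size: int) -> List[Task]:
    """Turn (path, size) pairs into tasks, splitting files larger than chunk_size."""
    tasks: List[Task] = []
    for path, size in files:
        if size == 0:
            tasks.append(Task(path, 0, 0))  # empty files still need to be created
            continue
        for offset in range(0, size, chunk_size):
            tasks.append(Task(path, offset, min(chunk_size, size - offset)))
    return tasks


def run_parallel(tasks: List[Task],
                 transfer: Callable[[Task], None],
                 workers: int = 4) -> List[Task]:
    """Drain a shared queue with a fixed pool of threads; return the failed tasks."""
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)

    failed: List[Task] = []
    failed_lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                task = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                transfer(task)  # assumption: one connection per call
            except Exception:
                with failed_lock:
                    failed.append(task)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return failed
```

Because every worker pulls from the same queue, a few large files split into chunks keep all connections busy instead of leaving most of them idle behind one long transfer. Returning the failed tasks lets a caller retry only those pieces.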

Key words: high performance computing system software, computational network, parallel transmission, RSYNC, cluster security
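
The check, generate, and distribute cycle that the abstract attributes to the management node can be sketched as follows. This is an illustration under stated assumptions rather than the paper's design: the classes `TransferNode` and `ManagementNode`, the method `ensure_key`, the expiry time `ttl_seconds`, and the use of a random token as key material are all hypothetical.

```python
import secrets
import time
from typing import Dict, Optional, Tuple


class TransferNode:
    """A node permitted to send or receive data; it only stores keys it is given."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.keys: Dict[str, str] = {}  # peer name -> transmission key


class ManagementNode:
    """Issues transmission keys for pairs of transfer nodes (hypothetical model)."""

    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        self.ttl = ttl_seconds  # assumption: keys expire after a fixed lifetime
        self._issued: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def ensure_key(self, src: TransferNode, dst: TransferNode,
                   now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        pair = (src.name, dst.name)

        # Check: reuse the key only if it is unexpired and both nodes still hold it.
        entry = self._issued.get(pair)
        if entry is not None:
            key, issued_at = entry
            still_valid = now - issued_at < self.ttl
            both_hold_it = (src.keys.get(dst.name) == key
                            and dst.keys.get(src.name) == key)
            if still_valid and both_hold_it:
                return key

        # Generate: a random token stands in for the real key material.
        key = secrets.token_hex(32)
        self._issued[pair] = (key, now)

        # Distribute: in this sketch, distribution is a plain assignment.
        src.keys[dst.name] = key
        dst.keys[src.name] = key
        return key
```

Calling `ensure_key` before each transfer means a missing or expired key is replaced on demand, and the transfer nodes never generate key material themselves.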