
Computer Engineering & Science, 2023, Vol. 45, Issue 12: 2135-2145.

• High Performance Computing •

Key techniques and practice on managing multi-site HPC clusters for university campus

ZHANG Tian-yang, CHI Cheng-yue, GUO Wu, GAO Yi-qin, WEN Min-hua, WEI Jian-wen

  1. (Network and Information Center, Shanghai Jiao Tong University, Shanghai 200240, China)
  • Received: 2023-03-29 Revised: 2023-06-25 Accepted: 2023-12-25 Online: 2023-12-25 Published: 2023-12-14

Abstract: As demand for high-performance computing (HPC) services grows, external factors such as data center floor space and power supply capacity often constrain the expansion and upgrading of a cluster, which creates the need to build HPC clusters across multiple sites. A multi-site HPC cluster overcomes the geographical limits of a single cluster and provides more computing resources. Drawing on the practice of the SJTU-computing platform, this paper summarizes methods for the unified management of infrastructure and system software, together with a high-availability design for remote disaster tolerance of the cluster. These include adapting the Slurm job scheduling system and the Open OnDemand visual portal, extending the high-availability capabilities of LDAP and other basic services, and building a hierarchical aggregation monitoring system. Finally, the paper demonstrates the effectiveness of the multi-site cluster solution along three dimensions: data transmission, user experience, and platform high availability.


Key words: high performance computing, multi-site cluster, remote disaster recovery, multi-level federated monitoring