Key techniques and practice on managing multi-site HPC clusters for university campus

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (12): 2135-2145.

ZHANG Tian-yang, CHI Cheng-yue, GUO Wu, GAO Yi-qin, WEN Min-hua, WEI Jian-wen

(Network and Information Center, Shanghai Jiao Tong University, Shanghai 200240, China)

Received: 2023-03-29 Revised: 2023-06-25 Online: 2023-12-25 Published: 2023-12-14
Supported by:
National Key Basic Research Development Program (2018YFA0404600, 2018YFA0404603)

Abstract

As high-performance computing (HPC) workloads grow in volume and scale, external factors such as machine-room space and power supply capacity often constrain cluster expansion and upgrades, which creates demand for multi-site HPC clusters. A multi-site HPC cluster overcomes the geographical limits of a single cluster and provides more computing resources. Based on the practice of building a multi-site joint HPC cluster on the "交我算" (SJTU-computing) platform at Shanghai Jiao Tong University, this paper summarizes methods for the unified management of infrastructure and system software, together with a high-availability design for cross-site disaster recovery. Specifically, this covers adapting the Slurm job scheduling system and the Open OnDemand visual portal, extending the high availability of LDAP and other basic services, and building a hierarchical aggregation monitoring system. Finally, the paper demonstrates the effectiveness of the multi-site HPC cluster solution along three dimensions: data transmission, user experience, and platform availability.

Key words: high performance computing, multi-site cluster, remote disaster recovery, multi-level federated monitoring

ZHANG Tian-yang, CHI Cheng-yue, GUO Wu, GAO Yi-qin, WEN Min-hua, WEI Jian-wen. Key techniques and practice on managing multi-site HPC clusters for university campus[J]. Computer Engineering & Science, 2023, 45(12): 2135-2145.