• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学) ›› 2020, Vol. 42 ›› Issue (10, High-Performance Computing Special Issue): 1807-1814.

• High-Performance Computer System Software •

Design and implementation of the maintenance and management platform powered by Magic Cube-3 high-performance computer (“魔方-3”高性能计算机运维管理平台设计与实现)

ZHAO Qi-qi (赵奇奇)

  1. (Shanghai Supercomputer Center, Shanghai 201203, China)
  • Received: 2020-06-04 Revised: 2020-07-15 Accepted: 2020-10-25 Online: 2020-10-25 Published: 2020-10-23

Abstract: High-performance computers are important research infrastructure and provide strong support for development across many industries. Keeping them running stably and efficiently is both the aspiration and the responsibility of system administrators. This paper introduces a maintenance and management platform developed for the “Magic Cube-3” high-performance computer, covering the platform architecture design, the underlying data collection interfaces and methods, and the functions the platform implements, including system monitoring, automatic inspection, and data analysis. With this platform, system administrators can see the operating status of the machine clearly and intuitively, and can detect and handle faults promptly. By mining the collected data from multiple angles, they can identify the bottlenecks that limit current operating efficiency, which provides a sound basis for decisions on subsequent software and hardware optimization and upgrades.

Key words: high-performance computer, maintenance and management, system monitoring, data analysis