• Journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2022, Vol. 44 ›› Issue (09): 1521-1531.

• High Performance Computing •

Beacon+: A scalable lightweight end-to-end I/O performance monitoring, analysis and diagnosis system for exascale supercomputers

YANG Bin1,2, WANG Jing-yu3, LIU Shi-chao1,2, SHAO Ming-shan1,2, XIAO Wei3, CHEN Qi3,4, HE Xiao-bin3, LIU Wei-guo1,2, XUE Wei2,4

  • (1. School of Software, Shandong University, Jinan 250101;
    2. National Supercomputing Center in Wuxi, Wuxi 214072;
    3. National Research Center of Parallel Computer Engineering & Technology, Beijing 100080;
    4. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)
  • Received: 2022-01-15 Revised: 2022-05-18 Accepted: 2022-09-25 Online: 2022-09-25 Published: 2022-09-25
  • Funding:
    National Key Research and Development Program of China (2020YFA0607900)

Abstract: With the exascale computing barrier broken, high performance computing has entered a new era. To meet the growing demand for data access, emerging technologies and storage media have been adopted in supercomputers, which makes their architectures increasingly complex and makes it very difficult to locate performance anomalies and system hotspots. To this end, Beacon+, a scalable lightweight end-to-end I/O performance monitoring, analysis and diagnosis system for exascale supercomputers, is designed and implemented. It monitors and analyzes the data access process of each application along the full I/O path in real time, without modifying the application code or scripts. Through online+offline compression methods and distributed caching/storage mechanisms, Beacon+ remains highly scalable and low-overhead while continuously and stably providing I/O diagnostic services. With the new-generation Sunway supercomputer as the deployment platform, standard I/O benchmark applications and real-world applications demonstrate Beacon+'s low overhead and high accuracy, as well as the efficiency of its I/O diagnosis.

Key words: I/O monitoring, data compression, I/O diagnosis, anomaly detection, performance bottleneck optimization