• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2022, Vol. 44 ›› Issue (04): 594-604.

• High Performance Computing •

An application log analysis and system automation optimization framework for Lustre cluster storage

CHENG Wen1, LI Yan2, ZENG Ling-fang3, WANG Fang1, TANG Shi-cheng2, YANG Li-ping2, FENG Dan1, ZENG Wen-jun2

  (1. Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology,
    Key Laboratory of Information Storage System and Engineering Research Center of Data Storage Systems and Technology,
    Ministry of Education of China, Wuhan 430074, China;
    2. China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China;
    3. Zhejiang Lab, Hangzhou 311121, China)
  • Received: 2021-08-13 Revised: 2021-11-11 Accepted: 2022-04-25 Online: 2022-04-25 Published: 2022-04-20
  • Funding:
    National Natural Science Foundation of China (61832020); Innovative Research Groups program (61821003); Zhejiang Province "Ten Thousand Talents Program" (2021R2007); Zhejiang Lab self-initiated research project (2021DA0AM01)

Abstract: In scientific computing, big data processing, and artificial intelligence, studying application workloads, analyzing their I/O patterns, and revealing how workloads change over time are important for guiding the performance optimization of cluster storage systems. Applications today are numerous and iterate rapidly, and this complex environment makes it challenging to mine workload characteristics. To address these problems, we collected application logs from five Lustre cluster storage systems in a production environment, covering 326 days in total, analyzed the access and workload characteristics in depth, and verified and supplemented existing observations. Through horizontal, vertical, and multi-dimensional comparative analysis and information mining of the logs, we summarize four findings and examine how they relate to previous work. Based on the actual production environment, we give corresponding system optimization strategies and practical implementation schemes, which provide references and suggestions for users, maintainers, upper-layer application developers, and designers of multi-tier storage systems. In addition, because real application environments are complex and system optimization is time-consuming and labor-intensive, we design and implement a system automation optimization framework (SAOF). SAOF provides functions such as resource reservation and bandwidth limitation for specified application workloads. Preliminary tests show that SAOF can provide automated QoS guarantees for different tasks according to system resources and task workload requirements.

Key words: Lustre file system, log analysis, system optimization, quality of service (QoS), resource management