[1] Qian Y J, Barton E, Wang T, et al. A novel network request scheduler for a large scale storage system[J]. Computer Science-Research and Development, 2009, 23(3-4): 143-148.

[2] Stefanovici I, Schroeder B, O'Shea G, et al. sRoute: Treating the storage stack like a network[C]//Proc of the 14th USENIX Conference on File and Storage Technologies, 2016: 197-212.

[3] Patel T, Byna S, Lockwood G K, et al. Revisiting I/O behavior in large-scale storage systems: The expected and the unexpected[C]//Proc of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019: 1-13.

[4] Patel T, Byna S, Lockwood G K, et al. Uncovering access, reuse, and sharing characteristics of I/O-intensive files on large-scale production HPC systems[C]//Proc of the 18th USENIX Conference on File and Storage Technologies, 2020: 91-102.

[5] Turner A, Sloan-Murphy D, Sivalingam K, et al. Analysis of parallel I/O use on the UK national supercomputing service, ARCHER using Cray LASSi and EPCC SAFE[J]. arXiv: 1906.03891, 2019.

[6] Brim M J, Lothian J K. Monitoring extreme-scale Lustre toolkit[J]. arXiv: 1504.06836, 2015.

[7] Behzad B, Byna S, Snir M. Pattern-driven parallel I/O tuning[C]//Proc of the 10th Parallel Data Storage Workshop, 2015: 43-48.

[8] Carns P, Harms K, Allcock W, et al. Understanding and improving computational science storage access through continuous characterization[J]. ACM Transactions on Storage, 2011, 7(3): 1-26.

[9] Xu C, Byna S, Venkatesan V, et al. LIOProf: Exposing Lustre file system behavior for I/O middleware[C]//Proc of the 2016 Cray User Group Meeting, 2016: 1-9.

[10] Patel T, Garg R, Tiwari D. GIFT: A coupon based throttle-and-reward mechanism for fair and efficient I/O bandwidth management on parallel storage systems[C]//Proc of the 18th USENIX Conference on File and Storage Technologies, 2020: 103-119.

[11] Agelastos A, Allan B, Brandt J, et al. The lightweight distributed metric service: A scalable infrastructure for continuous monitoring of large scale computing systems and applications[C]//Proc of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2014: 154-165.

[12] Park B H, Hukerikar S, Adamson R, et al. Big data meets HPC log analytics: Scalable approach to understanding systems at extreme scale[C]//Proc of the International Conference on Cluster Computing, 2017: 758-765.

[13] Huang D, Liu Q, Choi J, et al. Can I/O variability be reduced on QoS-less HPC storage systems?[J]. IEEE Transactions on Computers, 2018, 68(5): 631-645.

[14] Madireddy S, Balaprakash P, Carns P, et al. Analysis and correlation of application I/O performance and system-wide I/O activity[C]//Proc of the International Conference on Networking, Architecture, and Storage, 2017: 1-10.

[15] Gunasekaran R, Oral S, Hill J, et al. Comparative I/O workload characterization of two leadership class storage clusters[C]//Proc of the 10th Parallel Data Storage Workshop, 2015: 31-36.

[16] Morrone C J. Chaos/LMT-Lustre monitoring tool[EB/OL]. [2021-05-01]. https://github.com/chaos/lmt/wiki.

[17] Sivalingam K, Richardson H, Tate A, et al. LASSi: Metric based I/O analytics for HPC[C]//Proc of the High Performance Computer Symposium, 2019: 1-12.

[18] Booth S. Analysis and reporting of service data using the SAFE[Z]. Cray User Group, 2014.

[19] Eslami H, Kougkas A, Kotsifakou M, et al. Efficient disk-to-disk sorting: A case study in the decoupled execution paradigm[C]//Proc of the 2015 International Workshop on Data-Intensive Scalable Computing Systems, 2015: 1-8.

[20] Cloud Native Computing Foundation Project. Prometheus[EB/OL]. [2021-05-01]. https://github.com/prometheus/prometheus.

[21] Qian Y J, Li X, Ihara S, et al. LPCC: Hierarchical persistent client caching for Lustre[C]//Proc of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019: 1-14.

[22] Cheng W, Li C Y, Zeng L F, et al. NVMM-oriented hierarchical persistent client caching for Lustre[J]. ACM Transactions on Storage, 2021, 17(1): 1-22.

[23] Ihara S. A new quality of service (QoS) policy for Lustre utilizing the Lustre network request scheduler (NRS) framework[C]//Proc of Lustre Administrator and Developers Workshop (LAD 2013), 2013: 1.

[24] Qian Y J, Li X, Ihara S, et al. A configurable rule based classful token bucket filter network request scheduler for the Lustre file system[C]//Proc of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2017: 1-12.

[25] Li X. Lustre QoS solutions based on NRS TBF and client side performance balancing[EB/OL]. [2021-10-08]. http://lustrefs.cn/wp-content/uploads/2018/08/Lustre-QoS-solutions_Lixi.pdf.

[26] Li X, Zeng L F. LIME: A framework for Lustre global QoS management[C]//Proc of Lustre Administrator and Developer Workshop, 2018: 1.

[27] Cheng W, Deng S J, Zeng L F, et al. AIOC2: A deep Q-learning approach to autonomic I/O congestion control in Lustre[J]. Parallel Computing, 2021, 108: 102855.

[28] SLURM[EB/OL]. [2021-08-20]. https://slurm.schedmd.com/documentation.html.

[29] FIO. Fio benchmark tool[EB/OL]. [2021-08-20]. https://github.com/axboe/fio.