
Computer Engineering & Science ›› 2024, Vol. 46 ›› Issue (12): 2128-2137.

• High Performance Computing •


Constructing and analyzing a deep learning task dataset for R&D GPU clusters

LUO Jing1,2, YE Zhi-sheng2,3, YANG Ze-hua2,3, FU Tian-hao2,3, WEI Xiong1, WANG Xiao-lin2,3, LUO Ying-wei2,3

  (1. School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan 430200;
   2. Pengcheng Laboratory, Shenzhen 518000;
   3. School of Computer Science, Peking University, Beijing 100871, China)
  • Received: 2023-05-18  Revised: 2023-10-26  Accepted: 2024-12-25  Online: 2024-12-25  Published: 2024-12-23
  • Funding: National Natural Science Foundation of China (62032001, 62032008)


Abstract: In recent years, with the growing demand for deep learning model training, research institutions and enterprises have built shared GPU clusters to reduce costs and improve efficiency. Existing research focuses mainly on task scheduling and resource allocation in enterprise production GPU clusters. This paper instead targets Pengcheng Cloud Brain I, a research and development (R&D) GPU cluster: by monitoring and collecting key metrics during task runtime, it constructs the Pengcheng Cloud Brain I Task Dataset, a deep learning training task dataset that includes fine-grained time-series resource usage information for each task. This is the first publicly available dataset tailored to R&D GPU clusters. It reveals the low resource utilization prevalent in R&D GPU clusters, provides a basis and reference for designing high-utilization schedulers for such clusters, and thereby promotes research on task scheduling and resource allocation mechanisms.
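The abstract describes the collection methodology only at a high level; the paper itself details the actual pipeline. As a rough, non-authoritative illustration of fine-grained time-series GPU monitoring of the kind the abstract mentions, the sketch below samples per-device utilization and memory usage via NVML. The pynvml bindings, the sampling interval, the CSV schema, and the sample_gpu_metrics function are all illustrative assumptions, not the authors' tooling.

```python
# A minimal monitoring sketch, NOT the authors' collector: samples
# per-GPU SM utilization and memory usage via NVML and appends them
# to a CSV time series. Assumes the pynvml package (nvidia-ml-py)
# and an NVIDIA driver; interval, schema, and names are illustrative.
import csv
import time

import pynvml


def sample_gpu_metrics(interval_s: float = 15.0,
                       duration_s: float = 3600.0,
                       out_path: str = "gpu_trace.csv") -> None:
    """Record per-GPU utilization/memory every interval_s seconds."""
    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        deadline = time.time() + duration_s
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["timestamp", "gpu_index", "sm_util_pct",
                             "mem_used_mib", "mem_total_mib"])
            while time.time() < deadline:
                now = time.time()
                for i in range(count):
                    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                    util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu = SM %
                    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
                    writer.writerow([now, i, util.gpu,
                                     mem.used // 2**20, mem.total // 2**20])
                f.flush()
                time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    sample_gpu_metrics()
```

In a real cluster trace such samples would additionally be joined with scheduler-side task metadata (job identifiers, requested resources, start and end times), which this sketch omits.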

Key words: GPU cluster, deep learning, cluster workload, task dataset, resource utilization
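To illustrate how such a time-series trace could surface the low-utilization finding the abstract reports, a minimal pandas sketch follows; the file name and column names are hypothetical, not the dataset's published schema.

```python
# A toy analysis sketch (hypothetical file and column names): computes
# each task's mean GPU (SM) utilization from a fine-grained trace,
# assuming the pandas library. Low values across tasks would indicate
# the cluster-wide under-utilization the paper reports.
import pandas as pd

# Hypothetical schema: one row per (timestamp, task) sample.
trace = pd.read_csv("task_gpu_trace.csv")  # columns: task_id, timestamp, sm_util_pct

# Mean SM utilization per task, then summary statistics over tasks.
per_task_util = trace.groupby("task_id")["sm_util_pct"].mean()
print(per_task_util.describe())
```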