• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2024, Vol. 46 ›› Issue (12): 2128-2137.

• High Performance Computing •

Constructing and analyzing deep learning task dataset for R&D GPU clusters

LUO Jing1,2,YE Zhi-sheng2,3,YANG Ze-hua2,3,FU Tian-hao2,3,WEI Xiong1,WANG Xiao-lin2,3,LUO Ying-wei2,3   

  (1.School of Computer Science and Artificial Intelligence,Wuhan Textile University,Wuhan 430200;
    2.Pengcheng Laboratory,Shenzhen 518000;
    3.School of Computer Science,Peking University,Beijing 100871,China)
  • Received: 2023-05-18  Revised: 2023-10-26  Accepted: 2024-12-25  Online: 2024-12-25  Published: 2024-12-23

Abstract: In recent years, with the growing demand for training deep learning models, research institutions and enterprises have built shared GPU clusters to reduce costs and improve efficiency. Existing research mainly focuses on task scheduling and resource allocation in enterprise-level GPU clusters. In contrast, this paper focuses on Pengcheng Cloud Brain I, a research and development (R&D) GPU cluster: by monitoring and collecting key metrics during task runtime, it constructs a dataset of deep learning training tasks, named the Pengcheng Cloud Brain I Task Dataset, which includes fine-grained time-series resource usage information for each task. This is the first publicly available dataset tailored for R&D GPU clusters. It reveals the low resource utilization prevalent in R&D GPU clusters and provides a basis and reference for designing high-utilization schedulers for such clusters, thereby promoting research on task scheduling and resource allocation mechanisms.

Key words: GPU cluster, deep learning, cluster workload, workload dataset, resource utilization