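• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• High Performance Computing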

Workload characterization and task scheduling optimization of co-located Internet data centers

WANG Ji-wei1, GE Zhe-feng1, JIANG Cong-feng1, ZHANG Ji-lin1, YU Jun1, LIN Jiang-bin2, YAN Long-chuan3, REN Zu-jie4, WAN Jian5

  1. School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China
  2. Alibaba Cloud Computing Co., Ltd., Hangzhou 311121, China
  3. State Grid Electrical Information Communication Co., Ltd., Beijing 100053, China
  4. Zhijiang Laboratory, Hangzhou 311121, China
  5. School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
  • Received: 2019-08-18  Revised: 2019-10-21  Online: 2020-01-25  Published: 2020-01-25
  • Supported by:

    National Natural Science Foundation of China (61972118, 61572163); National Key Research and Development Program of China (2017YFB1010001); Key Research and Development Program of Zhejiang Province (2017C03024, 2018C01098)

Abstract:

As modern Internet Data Centers (IDCs) grow in scale, they face challenges in energy consumption, reliability, manageability, and scalability. At the same time, IDCs host a variety of services, including online web services and offline batch processing jobs. Online services require low latency, whereas offline jobs require high throughput. To improve server utilization and reduce energy consumption, IDCs often deploy online and offline jobs in the same computing cluster. In this co-located scenario, meeting the different requirements of online and offline jobs at the same time is the key challenge. This paper analyzes the co-located cluster trace released by Alibaba in 2018 (cluster-trace-v2018), which contains log data from 4,034 servers over 8 days. Starting from the static configuration information, the dynamic co-located run-time status, and the DAG (Directed Acyclic Graph) dependency structure of offline batch jobs, it characterizes the co-located workloads, including the correlation between task skew and container deployment. Based on the task dependencies and critical paths, corresponding task scheduling optimization strategies are proposed.

Key words: co-located data center, workload characterization, online service, batch job, scheduling
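
The abstract refers to the DAG dependency structure of batch jobs and to their critical paths. As a rough illustration of what a critical path is, and not the authors' scheduling strategy, the sketch below computes the longest duration-weighted dependency chain of one job. The function name `critical_path`, the task names, and the durations are all hypothetical; the sketch assumes that every task named as a dependency also has an entry in `durations`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(durations, deps):
    """Return (length, path) of the longest duration-weighted path in a task DAG.

    durations: task name -> run time (any consistent unit)
    deps:      task name -> tasks that must finish before it starts
    Both inputs are hypothetical illustrations, not fields of the trace.
    """
    # Map every task to the set of its predecessors.
    graph = {task: set(deps.get(task, ())) for task in durations}

    finish = {}    # earliest possible finish time of each task
    previous = {}  # predecessor that determines each task's start time

    # static_order() yields each task only after all of its predecessors.
    for task in TopologicalSorter(graph).static_order():
        start, pred = 0, None
        for dep in graph[task]:
            if finish[dep] > start:
                start, pred = finish[dep], dep
        finish[task] = start + durations[task]
        previous[task] = pred

    # The critical path ends at the task that finishes last; walk it backwards.
    node = max(finish, key=finish.get)
    length = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = previous[node]
    path.reverse()
    return length, path


# Hypothetical job: two map-like tasks feed a reduce-like task, then a join.
durations = {"M1": 30, "M2": 10, "R3": 20, "J4": 5}
deps = {"R3": ["M1", "M2"], "J4": ["R3"]}
print(critical_path(durations, deps))
```

In this toy job the chain M1, R3, J4 bounds the completion time, so a scheduler that takes dependencies into account could reasonably give those tasks priority over M2, which has slack.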