• A journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2013, Vol. 35 ›› Issue (11): 14-21.


Latency-Aware Thread Scheduling Scheme for Thread-Level Speculation

LI Yanhua, ZHANG Youhui, WANG Wei, ZHENG Weimin

  1. (Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)
  • Received: 2013-07-10  Revised: 2013-09-25  Online: 2013-11-25  Published: 2013-11-25
  • Supported by:

    National High-Tech R&D (863) Program of China (2013AA01A215); MOE-Intel Special Research Fund for Information Technology (MOE-INTEL-11-04)

Latencyaware thread scheduling
scheme for threadlevel speculation  

LI Yanhua,ZHANG Youhui,WANG Wei,ZHENG Weimin   

  1. (Department of Computer Science and Technology,Tsinghua University,Beijing 100084,China)
  • Received:2013-07-10 Revised:2013-09-25 Online:2013-11-25 Published:2013-11-25

Abstract:

With the development of large-scale chip multiprocessors (CMPs), more and more cores are integrated on a single chip. On the one hand, some cores are always idle; on the other hand, power constraints keep each on-chip core relatively simple, so single-thread performance is weak. By supporting thread-level speculation (TLS) on a CMP, the idle on-chip resources can be used to accelerate sequential program execution and improve single-thread performance. The overheads that determine TLS performance, such as increased cache miss rates, conflict detection, thread commit, and re-execution of squashed speculative threads, are highly sensitive to the CMP's memory access latency and inter-core communication latency. Conventional multithread scheduling algorithms perform poorly when applied to TLS because they ignore these TLS-specific characteristics. The proposed latency-aware speculative thread scheduling algorithm uses the memory access statistics produced in the profiling and compilation stages, together with runtime memory access records, to compute the program's center of data gravity, and gradually schedules the speculative threads onto the few adjacent cores around that center. Meanwhile, during scheduling it makes full use of the data left in the caches by committed and squashed threads, improving cache utilization. Experimental results show that, in TLS execution, the latency-aware scheduling policy achieves an average performance improvement of 16.8% over the widely used priority scheduling policy, and of 10.1% over the recently proposed clustered thread scheduling policy based on non-uniform data access optimization.

Key words: latency; chip multiprocessors; thread-level speculation; thread scheduling

Abstract:

With the advent of largescale chipmultiprocessors (CMPs), more and more cores are integrated on a single chip. On the first hand, there always will be some idle cores. And on the other hand, with the energy consumption limit, cores integrated on the chip are relatively simple. ThreadLevel Speculation (TLS) remains a promising technique for exploiting the idle hardware resources to improve the performance of a sequential program. However, the usual distributed design of largescale CMPs, like the nonuniform cache architecture (NUCA), introduces some nonuniform architectureproperties which significantly increase the overhead of TLS execution (L2 cache access overhead, task squashing overhead and reexecution overhead). Some stateoftheart multithread scheduling algorithms work poorly for TLS because of ignoring these TLSrelative characteristics. The proposed latencyaware thread scheduling algorithm for threadlevel speculation, uses the memory access statistics gained in the profiling, compiling and realtime executing stages, to calculate the CDG (Center of Data Gravity) of the program, and then schedules the speculative threads to the cores around the CDG. At the same time, the proposed thread scheduling algorithm makes good use of the data remained in the cache by the committed and squashed threads. Evaluation results show that latencyaware thread scheduling algorithm observed 16.8% performance speedup over priority scheduling, and 10.1% performance speedup over clusteredthread scheduling.

Key words: latency; chip multiprocessors; thread-level speculation; thread scheduling
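
The abstract's core idea, placing speculative threads on the cores nearest the weighted centroid of the program's memory accesses, can be illustrated with a minimal sketch. This is not the paper's implementation: the mesh layout, the function names, and the example access counts are all assumptions made for illustration, standing in for the profiling/runtime statistics the paper describes.

```python
# Hypothetical sketch of center-of-data-gravity (CDG) scheduling on a 2D mesh
# CMP. access_counts[i] approximates how often the speculative threads hit
# data held in core i's cache slice (the paper derives this from profiling,
# compilation, and runtime records; here it is just example input).
from math import dist

def center_of_data_gravity(access_counts, mesh_width):
    """Weighted centroid of per-core access counts, in mesh coordinates."""
    total = sum(access_counts)
    x = sum((i % mesh_width) * w for i, w in enumerate(access_counts)) / total
    y = sum((i // mesh_width) * w for i, w in enumerate(access_counts)) / total
    return (x, y)

def pick_cores(access_counts, mesh_width, n_threads):
    """Return the n_threads core ids closest to the CDG."""
    cdg = center_of_data_gravity(access_counts, mesh_width)
    cores = range(len(access_counts))
    return sorted(
        cores,
        key=lambda i: dist((i % mesh_width, i // mesh_width), cdg),
    )[:n_threads]

# Example: a 4x4 mesh where most accesses hit the slices of cores 5 and 6;
# the scheduler clusters four speculative threads around that hotspot.
counts = [1, 1, 1, 1,
          1, 40, 30, 1,
          1, 2, 2, 1,
          1, 1, 1, 1]
print(pick_cores(counts, 4, 4))  # -> [5, 6, 9, 10]
```

Clustering the threads this way shortens the average distance between each speculative thread and the data it touches, which is exactly the access and communication latency the abstract identifies as the dominant TLS overhead.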