A Survey of the Non-Uniform Cache Architecture

Wu Junjie, Yang Xuejun

doi:10.3969/j.issn.1007130X.2011.

Computer Engineering & Science, 2011, Vol. 33, Issue 2: 51-60

(National Laboratory for Parallel and Distributed Processing, Changsha, Hunan 410073, China)

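Wu Junjie (born 1981), male, from Bengbu, Anhui; PhD candidate; CCF member (E200012744G); research interests: computer architecture and compilation techniques. Yang Xuejun (born 1963), male, from Shandong; professor and PhD supervisor; research interests: parallel architectures, parallel operating systems, and parallel compilation.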

Received: 2009-06-02

Revised: 2009-12-15

Published online: 2011-02-25

Funding: National Natural Science Foundation of China (60621003, 60873014, 60633050)

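Abstract

Because of the memory wall, cache research has remained consistently important. As on-chip cache capacity keeps growing, wire delay is becoming a major constraint on cache design. To offer a single, uniform access latency, a conventional cache must be designed around the access time of the bank farthest from the processor. Researchers have therefore proposed the Non-Uniform Cache Architecture (NUCA), which has become an almost universal trend in the design of large caches for future processors. When a processor accesses a NUCA, a hit in a bank close to the processor incurs a short wait, while a hit in a distant bank incurs a longer one. This paper reviews why NUCA emerged, how it has developed, and the most representative NUCA systems to date. It then introduces two multiprocessor memory-system technologies from which NUCA research can borrow ideas: NUMA (Non-Uniform Memory Access) and COMA (Cache-Only Memory Architecture). Finally, it identifies the key problems in NUCA research and outlines approaches to solving them.

Keywords: non-uniform cache; wire delay; locality; multicore; non-uniform memory access (NUMA); cache-only memory architecture (COMA)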
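The latency argument in the abstract (a uniform cache pays for its farthest bank on every hit, while a NUCA pays only for the bank that actually holds the data) can be made concrete with a toy model. The sketch below is a minimal illustration under assumptions that do not come from the paper: banks lie in a single row, bank i is i hops from the processor, and each hop adds a fixed wire delay. The class name `BankArray`, its fields, and all cycle counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BankArray:
    """Toy model (hypothetical): a single row of cache banks.

    Bank i is assumed to sit i hops away from the processor, and every hop
    is assumed to add a fixed round-trip wire delay.
    """

    num_banks: int
    bank_cycles: int  # assumed access time of one bank, in cycles
    hop_cycles: int   # assumed round-trip wire delay per hop, in cycles

    def bank_latency(self, i: int) -> int:
        """Hit latency of bank i: bank access plus wire traversal."""
        if not 0 <= i < self.num_banks:
            raise IndexError("bank index out of range")
        return self.bank_cycles + self.hop_cycles * i

    def uca_latency(self) -> int:
        """Uniform cache: every hit is charged the farthest bank's latency."""
        return max(self.bank_latency(i) for i in range(self.num_banks))

    def nuca_latency(self, hit_fraction: Sequence[float]) -> float:
        """NUCA: average hit latency weighted by where hits actually land."""
        if len(hit_fraction) != self.num_banks:
            raise ValueError("need one hit fraction per bank")
        if abs(sum(hit_fraction) - 1.0) > 1e-9:
            raise ValueError("hit fractions must sum to 1")
        return sum(f * self.bank_latency(i) for i, f in enumerate(hit_fraction))


if __name__ == "__main__":
    cache = BankArray(num_banks=8, bank_cycles=3, hop_cycles=2)
    print("UCA hit latency:", cache.uca_latency())
    print("NUCA, hits spread evenly:", cache.nuca_latency([0.125] * 8))
    print("NUCA, hits mostly nearby:", cache.nuca_latency([0.5, 0.25, 0.25, 0, 0, 0, 0, 0]))
```

With these made-up parameters (8 banks, 3-cycle bank access, 2 cycles per hop), the uniform design charges 17 cycles on every hit, whereas the NUCA averages 10 cycles when hits are spread evenly and 4.5 cycles when half of them land in the nearest bank. In this model the NUCA benefit depends entirely on keeping frequently used lines in nearby banks, that is, on locality.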

Cite this article

Wu Junjie, Yang Xuejun. A Survey of the Non-Uniform Cache Architecture[J]. Computer Engineering & Science, 2011, 33(2): 51-60. DOI: 10.3969/j.issn.1007130X.2011.
