• Journal of the China Computer Federation (CCF)
• Chinese Science and Technology Core Journal
• Chinese Core Journal

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (11): 1929-1940.

• High Performance Computing •

Design and optimization of scalar memory access unit in VLIW DSPs

ZHENG Kang, LI Chen, CHEN Hai-yan, LIU Sheng, FANG Liang

  1. (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
  • Received: 2022-10-19 Revised: 2023-02-23 Accepted: 2023-11-25 Online: 2023-11-25 Published: 2023-11-16
  • Funding:
    National Natural Science Foundation of China (62202478); Research Project of National University of Defense Technology (ZK20-04)

Abstract: In recent years, as integrated circuit technology has advanced, the speed gap between processors and memory has kept widening, and memory has increasingly become the bottleneck that limits the performance of computing systems. DSPs for embedded, low-power applications differ from general-purpose CPUs in both architecture and application scenarios, so CPU memory access designs cannot meet their memory access requirements. To address the requirements of very long instruction word (VLIW) DSPs for real-time memory access, in-order data return with fixed latency, and efficient data consistency, a scalar memory access unit suitable for DSPs is designed. Its configurable design meets the real-time memory access requirements of DSPs. An ID-based ordering mechanism guarantees the ordering and fixed-latency requirements that the VLIW architecture places on data returned by Load instructions, at a storage overhead of 87.5 B. The write-back operation required for data consistency is accelerated by searching for the leading one in hardware. When 25%, 50%, and 75% of the lines in the cache need to be written back, the optimized consistency write-back overhead is 26.4%, 51.3%, and 76.2%, respectively, of that of the line-by-line scan method. The overhead is proportional only to the number of valid dirty lines and is independent of the cache capacity.
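
The idea behind the write-back acceleration can be illustrated with a small software model. The sketch below is not the authors' hardware design: it assumes that dirty state is kept as one bit per cache line in a bitmap, that the "leading one" is the lowest-numbered set bit, and that each loop iteration costs one step. The function names `writeback_by_scan` and `writeback_by_first_one` are hypothetical. Because it counts only loop iterations, it shows why the cost tracks the number of dirty lines, but it does not reproduce the measured percentages reported above.

```python
def writeback_by_scan(dirty_bitmap: int, num_lines: int):
    """Baseline: visit every cache line and write back the dirty ones."""
    written, steps = [], 0
    for line in range(num_lines):
        steps += 1                          # one check per cache line
        if (dirty_bitmap >> line) & 1:
            written.append(line)            # stand-in for the actual write-back
    return written, steps


def writeback_by_first_one(dirty_bitmap: int):
    """Jump directly to each dirty line with a find-first-one step."""
    written, steps = [], 0
    while dirty_bitmap:
        steps += 1                          # one find-first-one per dirty line
        lowest = dirty_bitmap & -dirty_bitmap       # isolate the lowest set bit
        written.append(lowest.bit_length() - 1)     # bit position = line index
        dirty_bitmap &= dirty_bitmap - 1            # clear that bit
    return written, steps


if __name__ == "__main__":
    NUM_LINES = 64                          # assumed cache size, for illustration
    # Mark every fourth line dirty, i.e. 25% of the cache.
    bitmap = sum(1 << i for i in range(0, NUM_LINES, 4))
    scan_lines, scan_steps = writeback_by_scan(bitmap, NUM_LINES)
    ffo_lines, ffo_steps = writeback_by_first_one(bitmap)
    print(scan_lines == ffo_lines, scan_steps, ffo_steps)
```

In this model both functions write back the same lines, but the scan always takes `num_lines` steps while the find-first-one loop takes one step per dirty line, so its cost does not grow with the cache size.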

Key words: scalar memory access unit, digital signal processor (DSP), very long instruction word (VLIW)