• Official journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2012, Vol. 34 ›› Issue (7): 60-64.

• Articles •

一种基于块匹配算法的SAD运算加速器

谷会涛,陈书明   

  1. (国防科学技术大学计算机学院,湖南 长沙 410073)
  • 收稿日期:2010-06-28 修回日期:2010-10-26 出版日期:2012-07-25 发布日期:2012-07-25
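  • Supported by:

    National 863 Program of China (2009AA011704); Ministry of Education Innovative Research Team Program on "High-Performance Microprocessor Technology"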

A BlockMatching Algorithm Based Accelerator for SAD Computation

GU Huitao,CHEN Shuming   

  1. (School of Computer Science, National University of Defense Technology, Changsha 410073, China)
  • Received: 2010-06-28 Revised: 2010-10-26 Online: 2012-07-25 Published: 2012-07-25

摘要:

基于块匹配算法的运动估计是图像和视频应用中的关键技术。SAD运算是运动估计中最主要的运算形式,具有极高的计算复杂度和传输带宽需求。本文提出了一种可配置的SAD运算加速器结构,采用一个16×1规模的PE阵列和一个加法树结构加速SAD运算的执行。本文将PE阵列和加法树结构的流水线进行细致划分,有效提高了工作频率。加速器采用DMA事件机制,大部分的数据传输可以与SAD计算并行进行,减少了数据传输延迟引起的性能下降。实验结果显示,搜索16×16大小的搜索窗口,本文结构只需要4 102个周期。基于SMIC 0.13μm的CMOS标准单元工艺对本文结构进行综合,最高工作频率可达到750MHz,面积约为16.8k门和3.5 KB的片上存储器。

关键词: 运动估计, 块匹配算法, 处理单元阵列, 视频编码

Abstract:

Blockmatching based motion estimation is one of the most important techniques in image and video applications. The sum of absolute difference (SAD) is the major computation in motion estimation and requires huge computation complexity and transmission bandwidth. This paper proposes a reconfigurable SAD accelerator, in which a 16×1 processing elements (PE) array and an adder tree structure are used to improve the execution speed of SAD computation. The pipeline partition of PE array and adder tree is performed carefully in order to increase the work frequency. In order to reduce the performance loss caused by data transfer delay, a DMA event mechanism is employed to transmit data when the SAD accelerator is working. The experimental results show that, the proposed architecture needs 4102 cycles for searching a 16×16 search window. With a 0.13μm CMOS standard cell technology, the proposed accelerator requires only 16.8 k gates and 3.5 KB of memory at the 750MHz operation frequency.

Key words: motion estimation; block-matching algorithm (BMA); processing element array; video coding
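
For readers unfamiliar with the computation the abstract refers to, the following is a minimal software sketch of SAD-based full-search block matching. It is not the authors' hardware design: the function names (`row_abs_diffs`, `adder_tree`, `sad`, `full_search`), the row-by-row processing order, and the raster-scan candidate order are all assumptions made for illustration. The sketch only mirrors the general idea of computing 16 absolute differences at once (as a 16×1 PE array could) and summing them by pairwise reduction (as an adder tree could).

```python
def row_abs_diffs(cur_row, ref_row):
    """Absolute differences for one row of pixels.

    Each list element stands in for one hypothetical PE of a 16x1 array.
    """
    return [abs(c - r) for c, r in zip(cur_row, ref_row)]


def adder_tree(values):
    """Sum values by pairwise reduction, level by level, like an adder tree."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:          # pad odd-sized levels with a zero operand
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def sad(cur, ref, top, left, n=16):
    """SAD between the n x n block `cur` and the n x n region of `ref`
    whose top-left corner is at (top, left)."""
    total = 0
    for y in range(n):
        ref_row = ref[top + y][left:left + n]
        total += adder_tree(row_abs_diffs(cur[y], ref_row))
    return total


def full_search(cur, ref, n=16):
    """Test every position where an n x n block fits inside `ref`.

    Returns (best_sad, top, left); ties keep the first candidate found
    in raster-scan order (an assumption, not taken from the paper).
    """
    best = None
    for top in range(len(ref) - n + 1):
        for left in range(len(ref[0]) - n + 1):
            cost = sad(cur, ref, top, left, n)
            if best is None or cost < best[0]:
                best = (cost, top, left)
    return best
```

Calling `full_search(cur, ref)` with a 16×16 block `cur` and a larger 2-D list `ref` returns the smallest SAD and its position. The cycle count and area figures reported in the abstract depend on the pipelining and DMA overlap of the actual hardware, which this sketch does not model.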