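• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)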

• High Performance Computing

A combined storage structure suitable for GPU image processing algorithms

左宪禹1,2,张哲1,5,黄祥志4,5,葛强1,2,张理涛3,臧文乾4,5   

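  (1. Henan Key Laboratory of Big Data Analysis and Processing, Kaifeng, Henan 475004;
    2. Institute of Data and Knowledge Engineering, College of Computer and Information Engineering, Henan University, Kaifeng, Henan 475004;
    3. School of Mathematics, Zhengzhou University of Aeronautics, Zhengzhou, Henan 450015;
    4. Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094;
    5. Zhongke Space Information (Langfang) Research Institute, Langfang, Hebei 065000)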
     
  • Received: 2019-08-13  Revised: 2019-11-05  Published: 2020-02-25  Released: 2020-02-25
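  • Supported by:

    National Key Research and Development Program of China (2017YFD0301105); National Natural Science Foundation of China (U1704122, U1604145); Science and Technology Program of Henan Province (182102210242, 182102110065, 192102210096); Key Scientific Research Projects of Higher Education Institutions of Henan Province, Basic Research Special Program (20zx003); Henan Province Science and Technology Innovation Outstanding Youth Fund (184100510004); Aeronautical Science Foundation of China (2017ZD55014)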

A combined storage structure for image processing algorithms on GPU

ZUO Xian-yu1,2,ZHANG Zhe1,5,HUANG Xiang-zhi4,5,GE Qiang1,2,ZHANG Li-tao3,ZANG Wen-qian4,5   

  (1. Henan Key Laboratory of Big Data Analysis and Processing, Kaifeng 475004;
    2. Institute of Data and Knowledge Engineering, College of Computer and Information Engineering, Henan University, Kaifeng 475004;
    3. College of Science, Zhengzhou University of Aeronautics, Zhengzhou 450015;
    4. Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094;
    5. Zhongke Langfang Institute of Spacial Information Application, Langfang 065000, China)

     
  • Received: 2019-08-13  Revised: 2019-11-05  Online: 2020-02-25  Published: 2020-02-25

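Abstract:

Most image processing algorithms can be accelerated on a GPU to achieve better execution performance, but the scheduling of data transfers relative to kernel execution remains the main bottleneck limiting further speedup. To address this, GPU task streams are commonly used to overlap kernel execution with data transfers, hiding part of the transfer and kernel execution time. However, owing to the characteristics of the CUDA programming model and the limits of GPU hardware resources, in some cases the tasks on each stream still execute serially even when many streams are created for overlapping, so the speedup cannot be improved further. We therefore use the Combined Storage Structure (CSS) to merge the images to be processed, which in turn merges the operator kernels and data transfer operations within a single stream and reduces the fixed cost and call gaps of data transfers and kernel executions. Experimental results show that the proposed CSS improves the execution performance of GPU image processing algorithms in the single-stream case and further improves acceleration in the multi-stream case. It is practical and scalable, and it suits workloads that contain many operator operations or batches of small images. In addition, the proposed method offers a new research direction for GPU acceleration of image processing algorithms.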
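
The saving from merging can be illustrated with a simple latency-bandwidth cost model. This model is an assumption made here for illustration and is not taken from the paper: each transfer or kernel launch pays a fixed cost \alpha regardless of size, plus a size-dependent cost s/\beta for s bytes at effective bandwidth \beta. For n images of s bytes each:

```latex
\begin{align*}
T_{\text{separate}} &= n\left(\alpha + \frac{s}{\beta}\right) \\
T_{\text{merged}}   &= \alpha + \frac{n s}{\beta} \\
T_{\text{separate}} - T_{\text{merged}} &= (n - 1)\,\alpha
\end{align*}
```

Under this assumed model the absolute saving grows with the number of images and does not depend on their size, so its share of the total time is largest when the images are small. That is consistent with the abstract's statement that the approach suits batches of small images.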
 
 

Keywords: image processing, GPU, CUDA stream, combined storage structure, overlap

Abstract:

Most image processing algorithms can be accelerated on a GPU to achieve better performance, but the scheduling of data transfers relative to kernel execution remains the main bottleneck to further gains in efficiency. To address this, streams are commonly used to overlap data transfers with kernel execution and thereby hide part of the transfer and kernel execution time. However, owing to the characteristics of the CUDA programming model and the limits of GPU hardware resources, operations can still be serialized on each stream when there are many operations to execute, even if numerous streams are created. This paper proposes a new data storage structure, the Combined Storage Structure (CSS), which merges the small data transfers on a single stream into one large transfer, reducing the fixed cost and call gaps of data transfer and kernel execution operations. Experimental results show that CSS improves the performance of GPU-based image processing algorithms in the single-stream case and further improves acceleration in the multi-stream case. CSS is practical and scalable, and it suits image processing workloads that involve many operators or large numbers of small images. In addition, the proposed method offers a new research direction for GPU acceleration of image processing algorithms.
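
The merging idea can be sketched on the host side as follows. This is a minimal illustration, assuming that CSS amounts to packing a batch of images into one contiguous buffer with a per-image offset table. The abstract does not describe the actual layout, so the `Image` and `PackedBatch` types and the `pack` function are hypothetical names introduced here, not the authors' implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical types for illustration; not taken from the paper.
struct Image {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;  // width * height bytes, 8-bit grayscale
};

struct PackedBatch {
    std::vector<std::uint8_t> data;    // all images stored back to back
    std::vector<std::size_t> offsets;  // offsets[i] = start of image i; last entry = total size
};

// Pack many small images into one contiguous buffer so that a single
// host-to-device copy can replace one copy per image.
PackedBatch pack(const std::vector<Image>& images) {
    PackedBatch batch;
    batch.offsets.reserve(images.size() + 1);

    // First pass: record where each image will start.
    std::size_t total = 0;
    for (const Image& img : images) {
        batch.offsets.push_back(total);
        total += img.pixels.size();
    }
    batch.offsets.push_back(total);

    // Second pass: copy the pixel data into place.
    batch.data.resize(total);
    for (std::size_t i = 0; i < images.size(); ++i) {
        const std::vector<std::uint8_t>& src = images[i].pixels;
        if (!src.empty()) {
            std::memcpy(batch.data.data() + batch.offsets[i], src.data(), src.size());
        }
    }
    return batch;
}
```

With a buffer like this, one `cudaMemcpyAsync` call on `batch.data` could stand in for a separate copy per image, and a single kernel launch could use `offsets` to locate each image. Whether the paper's CSS uses this exact layout is not stated in the abstract.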

 

Key words: image processing, GPU, CUDA stream, Combined Storage Structure (CSS), overlap