Computer Engineering & Science

Research on real-time machine model and instruction set with time semantics

CHEN Xiang-lan, LI Xi, WANG Chao, ZHOU Xue-hai

2021, 43(04): 571-578. doi:

In mixed-criticality systems, applications with different levels of security and timing criticality share computing resources. Because such systems are subject to many kinds of uncertainty, designers need a tight-timing design method that simultaneously satisfies multiple design constraints, such as deterministic functional behavior, predictable timing behavior, and high computing performance. This challenges the theories and methods of existing computer architectures and programming languages. A real-time machine model, RTM, and a time-triggered instruction set, TTI, both of which support time semantics, are proposed as the foundation for constructing a Multi-tier Tight Timing design method, MTTT. A helicopter flight control program is used as an example to illustrate the effectiveness of RTM and the TTI instruction set.

Proto-Perf: Fast and accurate processor prototype performance evaluation

GUO Hui, HUANG Li-bo, ZHENG Zhong, SUI Bing-cai, WANG Yong-wen

2021, 43(04): 579-585. doi:

Performance verification and evaluation is a key step in the design and implementation of general-purpose processors. An efficient method for evaluating the performance of a general-purpose processor prototype system not only helps designers locate performance defects as early as possible in the design stage, but also verifies whether the processor meets its performance targets before tape-out. However, a complete performance test of a processor prototype system takes a long time. This time overhead prevents designers from analyzing performance promptly and makes performance evaluation of the prototype system the bottleneck of the entire project. This paper proposes Proto-Perf, a fast and accurate performance evaluation method for general-purpose processor prototype systems. Proto-Perf uses dynamic program analysis and basic block aggregation to extract characteristic program fragments from the test program and tests only those fragments, which significantly reduces performance test time. Experimental results show that, compared with the performance data obtained by running the SPEC CPU2006 programs to completion at the REF data scale, the absolute error of the performance data obtained with Proto-Perf is 1.53% on average and 7.86% at most. At the same time, for every program in the experiment, the test time with Proto-Perf is significantly reduced.

A redundancy-reduced candidate box accelerator based on soft-non-maximum suppression

LI Jing-lin, JIANG Jing-fei, DOU Yong, XU Jin-wei, WEN Dong

2021, 43(04): 586-593. doi:

Object detection tasks usually use the non-maximum suppression (NMS) algorithm to remove redundant candidate boxes from the outputs of convolutional neural networks. Hard-NMS directly deletes candidate boxes whose overlap exceeds a predefined threshold; Soft-NMS instead gradually attenuates the scores of those boxes, which avoids mistakenly deleting overlapping objects in the picture and improves the accuracy of object detection. However, the frequent changes to candidate box scores make Soft-NMS more complex than Hard-NMS. To achieve high-accuracy, low-latency, and low-power removal of redundant candidate boxes, this paper proposes a Soft-NMS-based architecture that uses logarithmic functions to simplify complex floating-point calculations and a two-level optimization structure, with fine-grained pipelining and coarse-grained parallelism, to improve the throughput of the algorithm. Experiments on a Xilinx KU-115 FPGA show a power consumption of 6.107 W and a latency of 168.95 μs for processing 1000 boxes. Compared with a CPU implementation of Soft-NMS, the architecture achieves a 36-fold performance improvement and a performance-per-watt ratio 264 times that of the CPU implementation.
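
The score attenuation described in this abstract can be illustrated in software. The following is a minimal sketch of Soft-NMS with Gaussian decay, which is one commonly used decay function; the abstract does not say which variant the accelerator implements, and the function names, the `sigma` value, and the `score_floor` cutoff are illustrative assumptions rather than details from the paper.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_floor=0.001):
    """Gaussian Soft-NMS (assumed variant).

    Returns (box_index, final_score) pairs in selection order.
    """
    scores = list(scores)                  # work on a copy
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        # Select the highest-scoring remaining box.
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        kept.append((best, scores[best]))
        # Attenuate, rather than delete, boxes that overlap the selection.
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
        # Drop only boxes whose score has decayed below the floor.
        remaining = [i for i in remaining if scores[i] >= score_floor]
    return kept
```

With this sketch, a box that overlaps a higher-scoring box heavily is kept with a reduced score instead of being discarded, which is the behavior that distinguishes Soft-NMS from Hard-NMS.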

Design and implementation of a high-precision reduction function based on MPI

HE Kang, HUANG Chun, JIANG Hao, GU Tong-xiang, QI Jin, LIU Jie

2021, 43(04): 594-602. doi:

As scientific and engineering computations grow larger in scale, higher in dimension, and longer in duration, the cumulative effect of floating-point rounding errors often makes calculation results unreliable. Improving computing accuracy has become a hot topic in parallel computing. Based on the MPICH3 framework, this paper uses error-free transformations to construct a new data format and corresponding operators, and designs a high-precision reduction function, MPI_ACCU_REDUCE, which implements three high-precision MPI reduction operations: summation, product, and L2 norm. Numerical experiments show that MPI_ACCU_REDUCE with these three operations effectively improves the accuracy of numerical calculations.
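
To make the idea of an error-free transformation concrete, here is a minimal sketch of a compensated summation built on Knuth's TwoSum. It is not the paper's MPI_ACCU_REDUCE and uses no MPI at all; the (sum, error) pair format and the names `two_sum`, `combine`, and `compensated_sum` are assumptions chosen for illustration of how a reduction operator can carry rounding errors alongside the running sum.

```python
def two_sum(a, b):
    """Knuth's TwoSum: return (s, e) with s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def combine(x, y):
    """Reduction operator on (sum, error) pairs (hypothetical data format)."""
    s, e = two_sum(x[0], y[0])
    # Accumulate the new rounding error together with both carried errors.
    return s, e + x[1] + y[1]


def compensated_sum(values):
    """Sum floats while carrying the accumulated rounding error separately."""
    acc = (0.0, 0.0)
    for v in values:
        acc = combine(acc, (v, 0.0))
    return acc[0] + acc[1]


def naive_sum(values):
    """Plain left-to-right floating-point summation, for comparison."""
    total = 0.0
    for v in values:
        total += v
    return total
```

For the input `[1e16, 1.0, -1e16]`, plain left-to-right summation loses the `1.0` to rounding, while the compensated version recovers it from the error term. In a distributed setting, an operator like `combine` would be applied to partial results from different processes.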

Video processing acceleration based on frame buffer queue

CHENG Xiao-lan, JIANG Cong-feng, OU Dong-yang, REN Yong-jian, ZHANG Ji-lin, WAN Jian

2021, 43(04): 603-613. doi:

Because edge devices have limited computing power, frames easily accumulate when high-resolution, high-frame-rate video is processed. In addition, the diversity of video parameters also affects video processing, so system parameters need to be adjusted adaptively to ensure video processing performance. To address frame accumulation in video processing, this paper proposes adding a frame buffer queue between frame receiving and frame processing so that buffered frames can be processed in parallel, which resolves frame receiving delay and speeds up video processing. Experimental results show that the frame buffer queue solves the problem of frame loss in the edge video processing system. It reduces system power consumption and improves the edge processing capability for real-time video data while still meeting real-time frame processing requirements.
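
The buffering scheme described in this abstract follows the general producer-consumer pattern. Below is a minimal sketch of that pattern using Python's standard `queue` and `threading` modules; the function name `run_pipeline`, the worker count, and the buffer capacity are illustrative assumptions, and the paper's actual implementation and parameters are not given in the abstract.

```python
import queue
import threading


def run_pipeline(frames, process, workers=4, capacity=8):
    """Receive frames into a bounded buffer and process them with a worker pool.

    Returns the processed frames in their original order.
    """
    buffer = queue.Queue(maxsize=capacity)   # the frame buffer queue
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = buffer.get()
            if item is None:                 # sentinel: no more frames
                break
            index, frame = item
            out = process(frame)
            with lock:
                results[index] = out

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    # The receiver: put() blocks when the buffer is full, so frames are
    # held back instead of being dropped.
    for index, frame in enumerate(frames):
        buffer.put((index, frame))
    for _ in threads:
        buffer.put(None)                     # one sentinel per worker
    for t in threads:
        t.join()
    return [results[i] for i in range(len(results))]
```

The bounded queue decouples the receiving rate from the processing rate. Note that in CPython, threads only give real parallel speedup when `process` releases the global interpreter lock, as many native image-processing routines do.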

A GPU-based elevation parallel interpolation algorithm for massive discrete points

WANG Zhi-guang, ZHANG Teng-chang, WU Xiang-jin, LU Qiang

2021, 43(04): 614-619. doi:

A GPU-based parallel elevation interpolation algorithm is proposed, which accelerates the rendering of discrete points on a three-dimensional surface through parallelism. The algorithm organizes the elevation data of the three-dimensional surface grid into an elevation texture that serves as the basis for discrete point rendering, and uses GPU shader programs written in GLSL to dynamically control the graphics rendering pipeline, thereby realizing a viewpoint-dependent parallel elevation interpolation algorithm. Experimental results indicate that, compared with the traditional in-memory interpolation algorithm, the GPU-based algorithm raises the number of discrete points that can be rendered on the three-dimensional surface from the order of one million to the order of ten million.

QoS design and verification of direct connection interface for multi-core processors

LUO Li, ZHOU Hong-wei, ZHOU Li, PAN Guo-teng, ZHOU Hai-liang, LIU Bin

2021, 43(04): 620-627. doi:

Directly connecting multi-core processors to build multi-way parallel systems has always been the main way to improve the parallelism of high-performance computers. This paper studies the QoS design and verification of the direct connection interface of multi-core processors. Through this interface, cache coherence messages can be transmitted across chips effectively and reliably, and a symmetric multiprocessing (SMP) system with shared main memory can be realized. The key technologies of QoS design for each protocol layer of the direct connection interface are described in detail. After the QoS design was verified on a reusable verification platform based on the UVM methodology, it was ported to an FPGA prototype verification platform and passed the tests successfully. To improve the performance of multi-way servers, the direct connection technology of multi-core processors needs further study and has good application and research prospects.

An efficient and scalable MobileNet accelerator based on FPGA

XIAO Jia-le, LIANG Dong-bao, CHEN Di-hu, SU Tao

2021, 43(04): 628-633. doi:

MobileNet is a deep neural network model widely used in the embedded field. To address low hardware implementation efficiency and to achieve scalability across different hardware resources, an FPGA-based MobileNet accelerator architecture is proposed. Based on the stacked structure of the network, a three-level pipelined acceleration array is designed, and the computing efficiency exceeds 70% with up to 4000 multipliers. A fully working 150 MHz demo on the Xilinx Zynq-7000 ZC706 development board achieves 156 Gop/s and 61% computing efficiency, which is higher than other MobileNet accelerators.

RMC based performance optimization of Monte Carlo program

XU Hai-kun, KUANG Deng-hui, LIU Jie, GONG Chun-ye

2021, 43(04): 634-640. doi:

The Monte Carlo (MC) method is an important particle transport simulation method in nuclear reactor design and analysis. It can simulate complex geometries and produces highly accurate results. Its disadvantage is that simulating hundreds of millions of particles to obtain accurate results takes a lot of time. Improving the performance of Monte Carlo programs has therefore become a challenge for large-scale Monte Carlo numerical simulation. Based on the reactor MC analysis program RMC, this paper applies a series of optimizations in turn: dynamic memory allocation optimization based on TCMalloc, OpenMP thread scheduling strategy optimization, vector memory alignment optimization, and parallel I/O optimization based on HDF5. For a test case with 2 million particles, overall program performance improves by more than 26.45%.

A high performance FPGA-GPU-CPU heterogeneous programming architecture based on PCIe

SUN Zhao-peng, ZHOU Kuan-jiu

2021, 43(04): 641-651. doi:

As a special form of parallel computing, heterogeneous computing can make full use of the capabilities of different computing units according to the characteristics of computing tasks. It has great advantages in improving computing performance and real-time performance and in reducing processor energy consumption. However, heterogeneous computing environments currently suffer from problems such as complex programming and unreliability. To solve these problems, this paper proposes a programming framework based on a state transition matrix (STM) that integrates GPU and FPGA resources. The application programming interfaces (APIs) of CUDA and Vivado are integrated through the STM, and standard C code for heterogeneous computing is generated automatically. By connecting the GPU and FPGA devices through the PCI Express bus, data can be transferred between these heterogeneous computing units without passing through system CPU memory. In addition, GPUDirect RDMA is used to implement PCIe communication with the FPGA as the main controller, which overcomes the read-operation bottleneck of PCIe communication with the GPU as the main controller. Experimental results show that the communication efficiency is 1.9 times higher than that of shared memory, and the achieved data rate is close to the theoretical maximum bandwidth.

An automatic verification method for GPDSP instruction flow control based on reference model

WANG Hui-li, GUO Yang

2021, 43(04): 652-661. doi:

With the increasing complexity of scientific computing and artificial intelligence algorithms, the instruction flow control components, which act as the control center of a hardware design, face growing demands on complexity and accuracy. FT-xDSP is a 64-bit GPDSP processor independently developed by our company. The design scale and complexity of its instruction flow control components have increased greatly, which makes their verification a prominent problem. This paper proposes an automatic verification method for instruction flow control based on an instruction rearrangement reference model. Firstly, an abstract model of the flow control components is established with the instruction input-output relationship as its main feature; this hides the complex internal logic and reduces analysis complexity while preserving the accuracy of the analysis results. Secondly, constrained random stimuli are generated automatically, and the outputs of the reference model and of the design under test are compared and analyzed automatically, which improves code coverage and functional coverage at an equivalent verification cost. Experimental and practical results show that the method can cover the weak points of instruction flow control verification and greatly improves the verification efficiency and completeness for instruction flow control components.

High-performance implementation and optimization of Square Root function based on SIMD

ZHAO Yong-hao, JIA Hai-peng, ZHANG Yun-quan, ZHANG Si-jia

2021, 43(04): 662-669. doi:

In application scenarios such as computer graphics, integral calculation, and neural networks, a high-performance implementation of the square root function plays a very important role in building the basic software ecosystem of a processor. With the widespread use of ARM architecture processors, studying fast implementations of mathematical functions on the ARM architecture has become more critical. At present, a large number of processors adopt SIMD architectures, so studying high-performance function calculation methods based on SIMD is of great significance and has good development prospects. To this end, this paper implements and optimizes a high-performance square root function. By analyzing the in-memory storage format of IEEE 754 floating-point numbers, an efficient square root algorithm is designed, and its precision is then further improved by combining the inverse square root with a Taylor formula algorithm. Finally, performance is further improved through SIMD optimization. According to the experimental results, while satisfying the accuracy requirement, the implemented square root function is more than 7 times faster than the libm library implementation and more than 3 times faster than the square root instruction provided by ARMv8.
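
As an illustration of how the IEEE 754 bit layout can seed a square root computation, here is a minimal scalar sketch. It uses the widely known inverse-square-root bit trick with the constant `0x5F3759DF`, refined by Newton-Raphson steps; this is a swapped-in, well-known technique and not the paper's algorithm, which combines the inverse square root with a Taylor formula and SIMD instructions. The function names and iteration count are assumptions.

```python
import struct


def fast_rsqrt(x, iterations=2):
    """Approximate 1/sqrt(x) for positive x within the float32 range."""
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Halving the bit pattern roughly halves the exponent; subtracting it
    # from the constant negates it, giving a first guess for x ** -0.5.
    bits = 0x5F3759DF - (bits >> 1)
    y = struct.unpack("<f", struct.pack("<I", bits))[0]
    # Newton-Raphson refinement for f(y) = 1 / y**2 - x.
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * x * y * y)
    return y


def fast_sqrt(x, iterations=2):
    """Approximate sqrt(x) as x * (1 / sqrt(x)); handles x == 0 separately."""
    if x == 0.0:
        return 0.0
    return x * fast_rsqrt(x, iterations)
```

Each refinement step roughly squares the relative error, so a small, fixed number of steps reaches a chosen accuracy. Because the steps use only multiplications and subtractions, the same sequence maps naturally onto SIMD lanes.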

Duplicate bug report detection by combining distributed representations of documents

ZENG Jie, BEN Ke-rong, ZHANG Xian, XU Yong-shi

2021, 43(04): 670-680. doi:

Duplicate bug report detection avoids repeated assignment and repair for multiple bug reports that describe the same bug, and thus greatly reduces the cost of software maintenance. To improve detection accuracy, this paper proposes a duplicate bug report detection method that incorporates distributed representations of documents. Firstly, a Doc2Vec model is trained on a large-scale bug report database, distributed representations of bug reports are extracted, and the variable-sized bug reports are encoded into fixed-sized dense vectors. Secondly, the similarity between two bug reports is calculated by comparing their dense vectors; this similarity is used as a new feature and combined with the traditional features commonly used in duplicate bug report detection, and a machine learning algorithm is used to train a binary classification model. Experimental results on public duplicate bug report datasets from Bugzilla show that, compared with the state-of-the-art method D_TS, the proposed method improves the F1 value by 2% on average, which indicates the effectiveness of the new feature.
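
The step of turning two dense vectors into a similarity feature can be sketched as follows. Cosine similarity is assumed here because it is a common choice for comparing document embeddings; the abstract does not name the similarity measure, and the helper `build_features` and its argument layout are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_features(vec_a, vec_b, traditional):
    """Prepend the embedding similarity to a list of traditional features.

    The resulting row would be one training example for a binary
    classifier that decides whether two reports are duplicates.
    """
    return [cosine_similarity(vec_a, vec_b)] + list(traditional)
```

Here `vec_a` and `vec_b` stand for the fixed-sized vectors of two bug reports, and `traditional` stands for whatever pairwise features a detection pipeline already computes.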

Porting of reactor core programs in ARM environment

MING Ping-zhou, LI Zhi-gang, LIU Ting, LU Wei, LIU Dong, ZENG Hui, YU Hong-xing

2021, 43(04): 681-688. doi:

To demonstrate the feasibility of domestic chips in the field of reactor core calculation, several reactor core programs are ported to the ARM computing environment of the Phytium processor: NACK-R, the diffusion prototype program of the core fuel management software; the sub-channel analysis program CORTH; the method of characteristics (MOC) transport program OpenMOC; and the core assembly program KYLIN2. Through reasonable revision of the program code, the dependence on commercial function libraries is removed, and OpenMP parallelism is introduced into the MOC ray tracing process in the ARM environment so that the parallel capability of multiple Phytium processor cores within one cluster node can be investigated. The clock frequency of the reference Intel commercial processor is about twice that of the Phytium processor, and the serial running efficiency of the ported programs differs by a factor of 3 to 4. Because of the cache size of the Phytium processor, the performance gap may be larger for some cases with large input data. After OpenMP parallelization, the running efficiency of KYLIN2 is close to the serial efficiency in the Intel computing environment, which shows that a single node with multiple Phytium processors can replace some nuclear engineering calculation schemes. The porting results also show that a hybrid cluster system with different types of processors can make full use of domestic hardware when computing resources are scarce and improve overall utilization.

Improvement of code package level refactoring based on DBSCAN algorithm

LI Wen-hao, LI Ying-mei, BIAN Yi-xin

2021, 43(04): 689-696. doi:

In research on package-level code refactoring, where the goal is a software structure with high cohesion and low coupling, the hierarchical clustering algorithm is considered a good software clustering algorithm because it is simple, effective, and highly accurate. However, its time complexity is high, which makes it unsuitable for processing large-scale software. The DBSCAN algorithm, on the other hand, clusters faster but less accurately. Therefore, a software hierarchical clustering algorithm based on DBSCAN is proposed, which uses the clusters generated by DBSCAN to constrain the clustering space of the hierarchical clustering algorithm. The proposed algorithm preserves the accuracy of hierarchical clustering, and its time complexity lies between those of DBSCAN and hierarchical clustering. Experimental results show that the algorithm partitions software reasonably, and comparisons based on expert evaluation, module division metrics, and algorithm running time show that it performs better than other common clustering algorithms.

Prostate MR image segmentation based on adversarial learning and multi-scale feature fusion

CHEN Ai-lian, DING Zheng-long, ZHAN Shu

2021, 43(04): 697-703. doi:

The automatic segmentation of prostate MR images has been widely used in the diagnosis and treatment of prostate cancer. However, because the shape of the prostate varies significantly and its contrast with adjacent tissues is low, traditional segmentation methods still suffer from low accuracy and slow speed. Generative adversarial networks (GAN) have shown superior performance in computer vision tasks, so this paper proposes training a segmentation network with adversarial learning to achieve end-to-end automatic segmentation of prostate MR images. The model framework consists mainly of a segmentation network and a discriminator network. The segmentation network generates a segmentation prediction map, and the discriminator network judges whether its input comes from a real label or from a segmentation prediction. In addition, a receptive field block (RFB) is integrated into the segmentation network to acquire and fuse multi-scale information from deep features, which improves the recognizability and robustness of the features and the segmentation performance of the network. On the PROMISE12 data set, the DSC and HD of the model are 89.56% and 7.65 mm, respectively.
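
For readers unfamiliar with the DSC metric reported in this abstract, the following minimal sketch computes the Dice similarity coefficient for two binary masks, using the standard definition 2|A ∩ B| / (|A| + |B|). The flat 0/1 list representation and the convention of returning 1.0 for two empty masks are assumptions made for this illustration.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient of two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground pixels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0          # assumed convention: two empty masks agree fully
    return 2.0 * intersection / total
```

A DSC of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no foreground pixel.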

An expression recognition model based on deep learning and evidence theory

XU Qi-hua, SUN Bo

2021, 43(04): 704-711. doi:

Facial expression recognition builds on face detection and is an important research direction in computer vision. The goal of this research is to recognize facial expressions automatically from micro videos and to study how deep learning can assist and advance facial expression recognition in a big data environment. A fully automated expression recognition model is designed to address some of the key technical challenges in intelligent expression recognition. The model combines a deep auto-encoding network with a self-attention mechanism to construct a sub-model for automatic extraction of facial expression features, and then uses evidence theory to fuse the results of multi-feature classification. Experimental results show that the model significantly improves the accuracy of expression recognition, which has important theoretical significance and research value.

An image semantic segmentation method based on path aggregation Atrous convolutional network

LI Shu-ao, XIE Qing, MA Yan-chun, LIU Yong-jian

2021, 43(04): 712-720. doi:

Deep fully convolutional neural networks based on the encoder-decoder structure have made significant progress in image semantic segmentation. However, the path that carries low-level positioning information to the high-level layers of a deep network is too long, which makes it difficult to use that information in the decoder stage to restore object boundaries. To address this problem, a path aggregation structure for the decoder part of the segmentation network is proposed. This structure shortens the propagation path from low-level to high-level information in the segmentation network and provides multi-scale contextual semantic information, so that the network can produce more refined boundary segmentation results. To address the problem that the softmax cross-entropy loss function commonly used in semantic segmentation cannot adequately distinguish samples with similar appearance, this paper modifies the softmax cross-entropy loss function and proposes a bidirectional cross-entropy loss function. Combining the proposed path aggregation Atrous convolutional network with the new loss function yields better results on the PASCAL VOC2012Aug data set, increasing the mIoU value from 78.77% to 80.44%.
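
The mIoU metric reported in this abstract can be computed as in the minimal sketch below, which averages per-class intersection over union across classes. Skipping classes that appear in neither the prediction nor the ground truth is an assumed convention, and the flat label-list representation is chosen only for illustration.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union for flat sequences of class labels."""
    if len(pred) != len(target):
        raise ValueError("label sequences must have the same length")
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:       # skip classes absent from both sequences
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For example, with predictions `[0, 0, 1, 1]` and ground truth `[0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.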

A deep encoder-decoder network for human sperm head segmentation based on residual hybrid dilated convolution

LV Qi-xian, FAN Chao-gang, ZHAN Shu

2021, 43(04): 721-728. doi:

Sperm head shape is an important indicator in the analysis of sperm morphology, which in turn is essential for diagnosing male infertility. Accurate and efficient segmentation of the sperm head is therefore needed. To this end, this paper builds a new encoder-decoder segmentation network that combines stacked residual blocks with residual hybrid dilated convolution. We first build a dataset for human sperm head segmentation, which contains 1207 images, and then use it to train and test the network. The proposed network achieves excellent segmentation results on low-quality images that are unstained and contain multiple sperm, and obtains a Dice coefficient of 96.06% on the validation set. The experimental results show that the stacked residual module and the residual hybrid dilated convolution module significantly improve segmentation performance. In addition, the proposed network processes images that show the original, true state of the sperm, and the accurate segmentation results are very helpful for clinical diagnosis.

Saliency detection based on multi-feature fusion convolutional neural network

ZHAO Ying-ding, YUE Xing-yu, YANG Wen-ji, ZHANG Ji-hao, YANG Hong-yun

2021, 43(04): 729-737. doi:

With the development of deep learning and the strong performance of convolutional neural networks in many computer vision tasks, deep saliency detection methods based on convolutional neural networks have become the mainstream in saliency detection. However, a convolutional neural network is limited by the size of its convolution kernels: in the lower layers of the network it can only extract features from a small region, and it cannot detect objects that are not salient within that region but are salient globally. On the other hand, a convolutional neural network can obtain global information about the image by stacking convolutional layers, but information is lost as it passes from shallow layers to deep layers, and stacking too many layers also makes the network difficult to optimize. For these reasons, a saliency detection method based on a multi-feature fusion convolutional neural network is proposed. In this method, the convolutional neural network is enhanced with several local feature enhancement modules and global context modeling modules. Specifically, the local feature enhancement module enlarges the feature extraction range, and global context modeling captures the global information of the feature map, which effectively suppresses interference from objects that are salient within a region but not in the whole image. The method can also extract multi-scale local features and global features simultaneously for saliency detection, which effectively improves the accuracy of the detection results. Finally, experiments verify the effectiveness of the proposed method and compare it with 11 other saliency detection methods. The results show that the proposed method improves the accuracy of saliency detection and outperforms the 11 other methods in the comparison.
