  • Official journal of the China Computer Federation (CCF)
  • Chinese science and technology core journal
  • Chinese core journal

Current Issue

    • Design of BLAS level-3 computation on a matrix multiplication coprocessor
      JIA Xun, QIAN Lei, YUAN Hao, ZHANG Kun, WU Dong
      2020, 42(11): 1913-1921. doi:
      BLAS level3 subprograms have high computation complexity, which usually become applications' performance bottleneck. By organizing largescale floatingpoint units into a linear array architecture, the matrix multiplication coprocessor can perform highperformance and efficient matrix multiplication. Achieving efficient BLAS level3 computation on the matrix multiplication coprocessor is essential for the acceleration of largescale science and engineering applications. 
      By taking matrix multiplication as the kernel and combining the characteristics of the underlying linear array architecture, this paper proposes the design of BLAS level3 computation on a matrix multiplication coprocessor, and construct a corresponding performance model. Experimental results show that SYMM, SYRK and TRMM subprograms on the matrix multiplication coprocessor achieves the computation efficiency of 99%, 98% and 80% respectively, at most 31% higher than those on the SW26010 and NVIDIA V100 GPU.



      Heterogeneous cooperative computing of particle transport based on the Monte Carlo method on the Tianhe-2A system
      LI Biao, LIU Jie,
      2020, 42(11): 1922-1928. doi:
      Particle transport simulation plays an important role in nuclear science and medical radiation therapy. Based on the Monte Carlo method, this paper proposes a heterogeneous cooperative algorithm for particle transport on the Tianhe-2A system. Building on the asynchronous communication modes (BCL and ACL) of the Tianhe-2A system, a simple and efficient symmetric communication mode between the CPU and the Matrix2000 accelerator is proposed. On the Matrix2000 accelerator, thread-level parallelism is exploited through OpenMP directives. The original serial data collection communication mode is optimized, and a new communication mode based on a binary tree structure is proposed, which greatly reduces the communication time. On the Tianhe-2A system, the parallel program based on CPU/Matrix2000 heterogeneous collaborative computing scales up to 450k cores, and the parallel efficiency relative to 50k cores stabilizes at 22.54%.
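      To illustrate the binary-tree collection idea (a minimal sketch, not the Tianhe communication interface), the following combines P partial results in about log2(P) rounds instead of gathering them serially at one root:
```python
def tree_reduce(values):
    """Binary-tree reduction: combine partial results pairwise in log2(P)
    rounds rather than collecting them one by one at a single root."""
    data = list(values)                    # one partial result per process
    step = 1
    while step < len(data):
        for i in range(0, len(data), 2 * step):
            if i + step < len(data):
                data[i] += data[i + step]  # "child" i+step sends to "parent" i
        step *= 2
    return data[0]                         # the root holds the combined result

print(tree_reduce([1, 2, 3, 4, 5]))        # -> 15
```
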
      Advance in memristorbased computing storage fusion architecture
      FANG Xudong, WU Junjie
      2020, 42(11): 1929-1940. doi:
      The memristor is an emerging device with nonvolatile resistance, low power consumption, high endurance, ease of integration, and CMOS compatibility. The stateful logic of memristors can realize a true fusion of computing and storage and is logically complete, which is expected to break the limitation of the von Neumann architecture and effectively alleviate the memory-wall bottleneck. These excellent properties have earned memristors great interest from academia and industry. In light of this, this paper summarizes the research progress of application-oriented computing-storage fusion architectures based on stateful logic. Firstly, the implementation principles and improvement methods of stateful logic are analyzed in detail. Secondly, stateful logic designs based on the memristor crossbar are reviewed, including parallel implementations of the basic logic operations, the copy operation and the comparison operation, and the design principles and implementation structures of memristor-based data storage structures are summarized. The paper then revisits an application-oriented computing-storage fusion architecture in detail, and finally summarizes the open problems in this research direction and looks forward to future work.
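      To make the logical completeness claim concrete, here is a minimal Boolean model of memristive stateful logic: material implication (IMPLY) plus a reset-to-FALSE operation suffices to build NAND, and hence any Boolean function. This is a conceptual sketch only; real stateful logic operates on memristor resistance states.
```python
def imply(p, q):
    """Material implication p -> q; in stateful logic the result overwrites
    the memristor that stored q."""
    return (not p) or q

def nand(p, q):
    """NAND built from IMPLY plus a reset-to-FALSE, demonstrating the
    logical completeness of the stateful primitive."""
    s = False            # auxiliary memristor reset to 0
    s = imply(p, s)      # s = NOT p
    return imply(q, s)   # q -> NOT p  ==  NOT (p AND q)

assert [nand(p, q) for p in (0, 1) for q in (0, 1)] == [True, True, True, False]
```
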
      Design and implementation of an event extraction model and accelerator based on FPGA
      HAN Zhe, JIANG Jingfei, QIAO Linbo, DOU Yong, XU Jinwei, KAN Zhigang
      2020, 42(11): 1941-1948. doi:
      Event extraction technology is important for quickly extracting specific information and can be widely used in information retrieval, sentiment analysis and other scenarios. Chinese event extraction is more difficult than English event extraction due to the characteristics of the Chinese language. Based on state-of-the-art English event extraction neural network models, CEE-DGCNN (Chinese Event Extraction based on a multi-layer Dilated Gated Convolutional Neural Network), which is suitable for hardware implementation, is proposed. CEE-DGCNN achieves a 71.71% F1-score for trigger classification on the ACE2005 Chinese corpus. An accelerator for CEE-DGCNN is designed and implemented, and the model size is further optimized by quantization. The accelerator achieves 97 GOP/s on the Xilinx XCKU115 FPGA, which is 67 times faster than the CPU.
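      As a hedged sketch of the dilated gated convolution building block (the exact gating used in the paper may differ), one common formulation applies a dilated 1-D convolution, gates it with a sigmoid, and keeps a residual path:
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dilated_gated_conv(x, w_c, w_g, dilation=1):
    """One dilated gated convolution block: a dilated 1-D conv produces
    candidate features, a second conv produces a sigmoid gate, and a
    residual path preserves the input.
    x: (seq_len, dim) float array; w_c, w_g: (kernel, dim, dim)."""
    seq_len, dim = x.shape
    k = w_c.shape[0]
    total = dilation * (k - 1)                       # receptive-field padding
    xp = np.pad(x, ((total // 2, total - total // 2), (0, 0)))
    conv = np.zeros((seq_len, dim))
    gate = np.zeros((seq_len, dim))
    for t in range(seq_len):
        taps = xp[t : t + dilation * k : dilation]   # dilated receptive field
        conv[t] = np.einsum("kd,kde->e", taps, w_c)
        gate[t] = sigmoid(np.einsum("kd,kde->e", taps, w_g))
    return x * (1 - gate) + conv * gate              # gated residual update
```
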
      Research and implementation of a multi-precision algorithm based on SCILAB
      LAN Jing, LIU Wenchao, JIANG Hao, LIN Wenqiang
      2020, 42(11): 1949-1955. doi:
      Currently, generalpurpose processors generally support 64bit floating point operations. In largescale and longtime scientific numerical calculation, the cumulative effect of rounding errors in floatingpoint operations may lead to unreliable numerical results. Therefore,  to effectively control errors, designing highprecision, efficient and reliable floatingpoint numerical algorithms is very important. By using errorfree transform and doubledouble format, this paper realizes a highprecision mathematics library based on SCILAB software platform. The evaluation of the polynomials in power basis, Bernstein form and Chebyshev basis is carried out on the Intel platform and the domestic FT processor platform. The results prove the validity of our proposed highprecision mathematics library. This library has independent intellectual property right and can run on the selfdependent and manageable domestic processor, which will support the national high technology research.
      Configuration and scheduling mechanism of spot instances meeting the execution time limit of workflow
      LIAO Jianjin, SUN Qingxiao, YANG Hailong, LUAN Zhongzhi, QIAN Depei
      2020, 42(11): 1956-1964. doi:
      With the development of cloud computing, deploying workflows onto cloud computing platforms has become a popular choice. Compared with traditional local workflows, a cloud workflow must consider not only requirements such as execution time but also economic cost. To improve resource utilization, cloud service providers offer spot instances, which are very cheap but unstable. Aiming at the problem of workflow scheduling and execution in cloud computing, this paper proposes a spot instance configuration and scheduling method that meets the workflow execution time budget. The method uses Markov models and dynamic programming to predict the price of spot instances and to obtain the lowest-cost bidding strategy. At the same time, to satisfy the execution time budget of the workflow, the instances used in the workflow are configured under the estimated bidding strategy. Experimental results show that, compared with using on-demand instances, the method can save up to 89.9% of the computation cost while meeting the workflow execution time budget.
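      As a hedged sketch of the pricing idea only (the paper's actual model and constraints are not reproduced here), the following dynamic program runs over a Markov chain of discretized spot prices and chooses, for each hour, whether to run or wait so that W hours of work finish within T hours at minimum expected cost; the names and the simple run/wait action space are assumptions:
```python
import numpy as np

LARGE = 1e12   # sentinel cost for "cannot finish before the deadline"

def spot_bid_dp(prices, P, T, W):
    """DP over (hours left, work left, price state).  prices[p] is the hourly
    price of Markov state p, P its transition matrix, T the deadline in hours,
    W the required hours of work.  Returns the expected minimum cost per
    starting state and a run/wait policy."""
    prices = np.asarray(prices, dtype=float)
    S = len(prices)
    cost = np.full((T + 1, W + 1, S), LARGE)
    cost[:, 0, :] = 0.0                              # no work left: no cost
    run = np.zeros((T + 1, W + 1, S), dtype=bool)
    for t in range(1, T + 1):
        for w in range(1, min(W, t) + 1):            # w > t is infeasible
            exp_run = P @ cost[t - 1, w - 1]         # pay now, one hour of progress
            exp_wait = P @ cost[t - 1, w]            # wait for a cheaper state
            c_run = prices + exp_run
            run[t, w] = c_run <= exp_wait
            cost[t, w] = np.where(run[t, w], c_run, exp_wait)
    return cost[T, W], run
```
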
      Research and implementation of low-latency forward error correction coding for HPC interconnection networks
      WANG Chao, CAO Jijun, LUO Zhang, LAI Mingche, XU Weixia
      2020, 42(11): 1965-1972. doi:
      At present, the port rate of mainstream high-performance interconnection networks reaches 100~400 Gbps, and the single-channel rate reaches 25~50 Gbps. For data transmission at such rates, Forward Error Correction (FEC) coding is a necessary technology to improve reliability. The Ethernet international standard IEEE 802.3 uses the forward error correction codes RS(528,514) and RS(544,514), but these two code types struggle to meet the low-latency requirements of high-performance interconnection networks. Firstly, this paper analyzes the encoding and decoding structures of RS codes and quantitatively studies the relationship between RS code parameters and encoding/decoding delays. Secondly, a new code type, RS(271,257), for low-latency high-performance interconnection networks is proposed, and its advantages and disadvantages in bandwidth consumption and error correction capability are compared. Finally, based on RS(271,257), this paper implements the network coding sublayer and performs resource consumption evaluation and delay performance simulation. Considering resource consumption, error correction capability and delay performance, RS(271,257) is an ideal low-latency forward error correction code type that can meet the design requirements of the coding sublayer of current HPC-oriented low-latency high-performance interconnection networks.
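      A quick back-of-the-envelope comparison of the three code types clarifies the trade-off. The symbol widths below assume the smallest Galois field GF(2^m) with n ≤ 2^m − 1; 10-bit symbols for the two 802.3 codes are standard, while 9-bit symbols for RS(271,257) are an assumption:
```python
# n, k, symbol width m (bits); t = (n - k) / 2 correctable symbol errors
codes = {"RS(528,514)": (528, 514, 10),   # IEEE 802.3 "KR4" FEC
         "RS(544,514)": (544, 514, 10),   # IEEE 802.3 "KP4" FEC
         "RS(271,257)": (271, 257, 9)}    # code type proposed in the paper

for name, (n, k, m) in codes.items():
    t = (n - k) // 2                      # correctable symbol errors per codeword
    overhead = (n - k) / k                # parity bandwidth cost
    bits = n * m                          # codeword length, which drives latency
    print(f"{name}: t = {t}, overhead = {overhead:.1%}, codeword = {bits} bits")
```
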

      Cold start optimization of function computing for high performance computing
      LI Zhe, TAN Yusong, LI Bao, YU Jie
      2020, 42(11): 1973-1980. doi:
      High performance computing problems are usually characterized by parallelizable subtasks and consume substantial computing resources during execution. Traditional virtual-machine-based cloud computing has been shown to handle such problems, but managing the distributed environment and designing distributed solutions make the processing more complex. Function computing is a new serverless cloud computing paradigm whose automatic scaling and considerable computing resources combine well with HPC problems. However, cold start delay is an unavoidable problem on public cloud function computing platforms, and for HPC tasks with highly concurrent jobs this delay is further magnified. In this paper, we first analyze the completion time of a simple HPC task under cold start and warm start conditions, and analyze the causes of the additional delay. Based on these analyses, we combine time series analysis tools with the platform's automatic scaling mechanism to propose an effective pre-warming method, which can effectively reduce the cold start delay of HPC tasks on the function computing platform.
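      A minimal sketch of the pre-warming idea (the paper's actual time-series model and platform API are not specified here, so the smoothing forecast and the function name are assumptions): forecast the next window's request count from recent history and keep that many function instances warm.
```python
def prewarm_count(history, alpha=0.5, per_instance=1):
    """Exponentially smooth recent request counts and return how many
    function instances to pre-warm so that burst HPC jobs hit warm
    containers instead of cold starts."""
    forecast = float(history[0])
    for x in history[1:]:
        forecast = alpha * x + (1 - alpha) * forecast   # exponential smoothing
    # round up: each warm instance absorbs `per_instance` concurrent jobs
    return -(-int(round(forecast)) // per_instance)

print(prewarm_count([2, 4, 8, 16]))    # forecast skewed toward the recent burst
```
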

      Offloading optimization of MPI collective reduction operations based on the Tianhe interconnect
      WANG Hao, ZHANG Wei, XIE Min, DONG Yong
      2020, 42(11): 1981-1987. doi:
      MPI collective communication operations are widely used in parallel scientific applications and have an important impact on program scalability. The Tianhe interconnect network supports triggered communication operations, which can offload message passing and processing work and improve inter-node performance. Allreduce and Reduce algorithms under different tree topologies are designed using the triggered operations to lower the latency of inter-node reduction communication. Tests on the actual system platform show that, compared with the point-to-point implementations of these two operations in MPICH, the trigger-based offload algorithm can reduce the running time by up to 59.6% at different node scales.
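      To illustrate the offloading idea (a toy model only, not the Tianhe triggered-operation API), each node below arms a counter-based trigger that fires once its own value and all child contributions have arrived, forwarding the partial sum up the tree without host involvement:
```python
class TriggeredReduce:
    """Toy model of an offloaded tree Reduce: each node arms a counter-based
    trigger; when its own value and all child contributions have arrived,
    the handler fires and forwards the partial result to the parent."""
    def __init__(self, parent, n_children, nodes):
        self.parent, self.nodes = parent, nodes
        self.remaining = n_children + 1      # children plus the local value
        self.partial = None

    def contribute(self, value):
        self.partial = value if self.partial is None else self.partial + value
        self.remaining -= 1
        if self.remaining == 0 and self.parent is not None:
            self.nodes[self.parent].contribute(self.partial)   # trigger fires

# Three ranks reducing into rank 0: result 5 + 10 + 20 = 35
nodes = {}
nodes[0] = TriggeredReduce(None, 2, nodes)
nodes[1] = TriggeredReduce(0, 0, nodes)
nodes[2] = TriggeredReduce(0, 0, nodes)
for rank, local in [(1, 10), (2, 20), (0, 5)]:
    nodes[rank].contribute(local)
print(nodes[0].partial)                      # -> 35
```
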
      A cloud cipher job stream scheduling algorithm based on associated data localization
      GUAN Chuanjiang, LI Jianpeng, SHI Guozhen, MAO Ming
      2020, 42(11): 1988-1995. doi:
      In cloud cipher service systems, service requests are diverse, and data-dependent and non-data-dependent job streams arrive concurrently in random interleavings. To avoid the communication overhead and data security threats caused by exchanging associated data between computing nodes, a cloud cipher job stream scheduling algorithm based on associated data localization is designed. Firstly, the cryptographic function mapping of task requests is used to ensure that multi-job-stream requests are served correctly. Secondly, for tasks that request the same cryptographic function but use different working modes, a task priority calculation method is proposed to improve the fairness of multi-job-stream scheduling, and a classified scheduling method is adopted to localize the associated data and guarantee the overall performance of the scheduling system. The simulation results show that the algorithm not only effectively reduces the task completion time and improves resource utilization and fairness, but also has good stability.

      Link scheduling in energy-harvesting sensor networks with non-ideal batteries
      WANG Ningbo, WANG Luyao, XU Xiaobin
      2020, 42(11): 1996-2004. doi:
      In recent years, to solve the problem of the limited energy of sensor nodes, energy-harvesting wireless sensor networks have become a research hotspot. Aiming at shortcomings such as the limited capacity, charging/discharging loss, and energy leakage of batteries in sensor nodes, a harvest-use-store structure with non-ideal batteries is proposed. A mathematical model is established by jointly considering routing, link scheduling and energy allocation, and the shortest frame is obtained by solving a mixed integer linear program. The simulation results show that the frame length decreases by up to 48% when the charging/discharging efficiency increases from 0.6 to 0.9, and by up to 33% when the energy leakage ratio is reduced from 0.04 to 0.01. Expanding the battery capacity has little effect on the frame length. Compared with the harvest-store-use structure, the frame length of the harvest-use-store structure decreases by up to 11%. This verifies that the proposed method greatly improves the network throughput by improving the charging/discharging efficiency and reducing the energy leakage rate.
      A searchable encryption scheme supporting multi-keyword retrieval on blockchain
      NIU Shufen, WANG Jinfeng, WANG Bobin, CHEN Jingmin, DU Xiaoni
      2020, 42(11): 2005-2012. doi:
      In cloudbased singlekeyword searchable encryption schemes, cloud servers are not completely trusted, and the existing singlekeyword retrieval cannot accurately return search results. Therefore, a multikeyword searchable encryption scheme is constructed by using blockchain technology. Our scheme uses the symmetrical encryption algorithm to improve the encryption efficiency, takes advantage of blockchain technology to solve the problem of dishonest search in cloud server, and also improves the accuracy of search results based on multikeyword index structure. The scheme is proved secure against indistinguishably chosen keyword attack (INDCKA) under the random oracle model. Furthermore, the performance analysis shows that our proposals are secure and efficient.

      Reversible data hiding in JPEG images by selective sorting of DCT coefficients
      WANG Ruofei, LIU Feng
      2020, 42(11): 2013-2019. doi:
      The JPEG image compression algorithm provides good compression performance and reconstruction quality, and has wide practical value in image and video processing. This paper proposes a feasible and effective reversible information hiding method for JPEG images. In this scheme, the quantized DCT coefficients of all 8×8 sub-blocks in the JPEG image are rearranged into a new matrix, with the coefficients of each block listed vertically and the coefficients of the same frequency listed horizontally. Embedding is first simulated on the coefficients of each frequency, and the frequencies with small distortion are preferentially selected to embed information until the whole secret bitstream is embedded; invalid bitstream expansion is reduced according to the decoding matrix during embedding. Experimental results show that the method achieves better visual quality and less image bitstream expansion under the same embedding payload.
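      The coefficient rearrangement described above is easy to picture; the sketch below (a minimal illustration using row-major rather than zig-zag frequency ordering) reshapes the quantized DCT coefficients so that each column is one 8×8 block and each row collects the coefficients of one frequency:
```python
import numpy as np

def rearrange_dct_blocks(coeffs):
    """Rearrange the quantized DCT coefficients of all 8x8 blocks into a
    64 x num_blocks matrix: one column per block, one row per frequency,
    so embedding can pick whole low-distortion frequency rows.
    coeffs: (H, W) array with H and W multiples of 8."""
    h, w = coeffs.shape
    blocks = (coeffs.reshape(h // 8, 8, w // 8, 8)
                    .transpose(0, 2, 1, 3)      # (block_row, block_col, 8, 8)
                    .reshape(-1, 64))           # one flattened block per row
    return blocks.T                             # frequencies as rows
```
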

      An automatic fine crack recognition algorithm for airport pavement under significant noise
      LI Haifeng, WU Zhilong, NIE Jingjing
      2020, 42(11): 2020-2029. doi:
      Cracks on airport pavement are extremely fine, and depth-camera-based crack detection faces interference from both the complex apparent structure of the pavement and severe platform vibration. To handle this problem, a main profile modeling algorithm combining L2 regularization and a dynamic-threshold greedy strategy is proposed to achieve accurate millimeter-level crack detection. Firstly, the main profile of the pavement is modeled under an L2 regularization constraint, which overcomes the overfitting caused by the complex apparent structure. Secondly, an improved greedy algorithm based on a dynamic threshold is proposed to suppress noise interference by iteratively removing abnormal points caused by platform vibration. Finally, based on the constructed main profile model, the multi-directional main profiles of the airport pavement are extracted and fused, and the crack depth and morphology information are used to extract the cracks. Experiments on real airport pavement data show that the proposed algorithm reconstructs the main profile of the pavement accurately, detects fine cracks successfully, and achieves better crack detection performance than existing techniques.
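      The two ingredients named in the abstract can be sketched together as follows (an illustrative sketch assuming a polynomial profile model; the paper's profile basis and threshold schedule may differ): a ridge-regularized fit of the profile, followed by greedy removal of points whose residual exceeds a dynamic threshold, then a refit.
```python
import numpy as np

def fit_main_profile(x, y, degree=3, lam=1e-2, k=3.0, iters=5):
    """Fit the pavement main profile with an L2-regularized (ridge) polynomial,
    then greedily drop points whose residual exceeds a dynamic threshold
    (k times the residual std) and refit, so vibration-induced outliers do
    not distort the profile.  x, y: 1-D numpy arrays."""
    keep = np.ones_like(x, dtype=bool)
    coef = None
    for _ in range(iters):
        V = np.vander(x[keep], degree + 1)              # design matrix
        A = V.T @ V + lam * np.eye(degree + 1)          # ridge normal equations
        coef = np.linalg.solve(A, V.T @ y[keep])
        resid = y - np.vander(x, degree + 1) @ coef
        thresh = k * resid[keep].std()                  # dynamic threshold
        keep = np.abs(resid) <= thresh                  # greedy outlier removal
    return coef, keep
```
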
      Traffic road sign recognition based on a SqueezeNet model with deep residual network and GRU
      HUO Aiqing, ZHANG Wenle, LI Haoping
      2020, 42(11): 2030-2036. doi:
      Existing traffic road sign recognition methods are all based on convolutional neural networks. As the number of network layers increases, the recognition accuracy also improves, but problems such as reduced efficiency and a growing number of parameters remain. Therefore, an improved SqueezeNet model combining a deep residual network with a GRU neural network (SqueezeNet-IR-GRU) is proposed. To enhance learning efficiency, the ELU function is used as the activation function. To avoid vanishing gradients when the network is very deep, a deep residual network is introduced to guarantee the stability of the model, and a GRU neural network that can memorize important past features is utilized. Experiments were performed on the CIFAR-10 and GTSRB datasets, where the recognition accuracies are above 99.13% and 88.25%, respectively. The experimental results show that the SqueezeNet-IR-GRU model not only greatly reduces the number of parameters, but also achieves much better convergence, stability and recall than the other models.
      Free parameter optimization of the cubic Cardinal spline function
      LI Juncheng, LIU Chengzhi
      2020, 42(11): 2037-2041. doi:
      In order to reasonably determine the free parameter of the cubic Cardinal spline function, the optimization of this free parameter in interpolation problems is discussed. Firstly, the influence of the free parameter on the curve shape of the cubic Cardinal spline function is analyzed. Secondly, schemes for computing the optimal free parameter in the two cases of data interpolation and function approximation are given, and the cubic Cardinal spline functions with minimal quadratic average oscillation and minimal approximation error are obtained, respectively. When it is necessary to construct a cubic Cardinal spline function with a good shape-preserving or approximation effect, the optimal free parameter can be selected by the proposed schemes.
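      For reference, one common parameterization of the cubic Cardinal spline makes the role of the free (tension) parameter explicit; the sketch below evaluates a single segment under this standard form (the paper's parameterization may differ), where c = 0 recovers the Catmull-Rom spline.
```python
def cardinal_segment(p0, p1, p2, p3, t, c=0.0):
    """Evaluate one cubic Cardinal spline segment between p1 and p2 at
    parameter t in [0, 1]; c is the free (tension) parameter."""
    s = (1.0 - c) / 2.0
    m1 = s * (p2 - p0)           # tangent at p1
    m2 = s * (p3 - p1)           # tangent at p2
    t2, t3 = t * t, t * t * t
    return ((2 * t3 - 3 * t2 + 1) * p1 + (t3 - 2 * t2 + t) * m1
            + (-2 * t3 + 3 * t2) * p2 + (t3 - t2) * m2)
```
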

      Multi-source image fusion with SPCNN and SR based on image features
      ZHANG Lixia, ZENG Guangping, XUAN Zhaocheng
      2020, 42(11): 2042-2049. doi:
      In order to highlight the different features of different input images, an SPCNN model whose parameters are set automatically from image features is proposed and combined with sparse representation to fuse multi-source images. The fusion process has four steps. Firstly, the source images are decomposed into high-frequency and low-frequency coefficients by NSST. Secondly, each high-frequency coefficient is fired by the SPCNN model with automatically set parameters based on its inherent characteristics, and the fused high-frequency coefficients are obtained according to the total number of firings and a weighted fusion strategy. Thirdly, the low-frequency coefficients are fused by sparse representation. Finally, the fused image is reconstructed by the inverse NSST. The experimental results show that the proposed method is superior to five other classical methods, and the fused image conforms to the human visual perception system, with a clear structure and distinct details.

      Research on a health management system of large-caliber artillery based on deep learning
      ZHANG Yuan, JIANG Huancheng
      2020, 42(11): 2050-2058. doi:
      Largecaliber artillery can limit the enemy's movement to the maximum range at the least cost. It is a very critical fire suppression weapon on the battlefield. However, due to its harsh working environment, largecaliber artillery performs very unstable in missions. Based on the research project of the health management system of largecaliber artillery, while monitoring and recording the working status of largecaliber artillery in real time, this paper proposes a design idea of failure prediction and analysis of largecaliber artillery based on deep learning by combining expert analysis and other health ma nagement methods. The unsupervised and efficient feature extraction capabilities of the deep belief network and the supervised data classification capabilities of the multilayer perceptron are adopted to establish a fault prediction deep learning model, in order to realize the prediction of the failure state of largecaliber artillery and provide technical support for the premaintenance of largecaliber artillery, thereby improving the reliability of largecaliber artillery.

      Entity relationship extraction fusing self-attention mechanism and CNN
      YAN Xiong, DUAN Yuexing, ZHANG Zehua
      2020, 42(11): 2059-2066. doi:
      At present, neural network models play an important role in entity relationship extraction tasks. A convolutional neural network can extract features automatically, but it is limited because its fixed-window convolution kernels are used to extract the contextual semantic information of words in a sentence. Therefore, this paper proposes a new relation extraction method fusing self-attention and a convolutional neural network. The original word vectors are processed by the self-attention mechanism to obtain the relationships between the words in the sequence, so the input word vectors express richer semantic information, which compensates for the deficiency of the convolutional neural network's automatic feature extraction. The experimental results on the SemEval-2010 Task 8 dataset show that adding the self-attention mechanism improves the entity relationship extraction performance.
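      The enrichment step described above is ordinary scaled dot-product self-attention; a minimal single-head sketch (the weight shapes are assumptions) is:
```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sentence's word vectors,
    applied before the convolutional encoder.
    X: (seq_len, d); Wq, Wk, Wv: (d, d_k)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise word affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                               # context-enriched vectors
```
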

      An improved wavelet threshold-CEEMDAN algorithm for ECG signal denoising
      ZHANG Peiling, LI Xiaozhen, CUI Shuaihua
      2020, 42(11): 2067-2072. doi:
      Electrocardiogram (ECG) signal denoising has always been a hot research issue. In order to eliminate the noise in ECG signals, a denoising method based on complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) and an improved wavelet threshold function is proposed. Firstly, the ECG signal is decomposed by CEEMDAN to obtain a set of intrinsic mode functions (IMFs) ordered from high frequency to low frequency. Secondly, improved-threshold wavelet denoising is performed on the high-frequency IMFs identified by the correlation coefficient method. For the low-frequency IMFs, a fixed threshold is set, and the IMFs below this threshold are regarded as the baseline drift signal and removed. Finally, the denoised IMFs and the retained IMFs are reconstructed. The experimental results show that the proposed method is more effective than the empirical mode decomposition (EMD) wavelet denoising method and the ensemble empirical mode decomposition (EEMD) wavelet denoising method.
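      The "improved threshold" family referred to above typically interpolates between hard and soft thresholding; the sketch below shows one widely used form (a stand-in, not necessarily the exact function used in the paper): it is continuous at the threshold like soft thresholding but approaches hard thresholding for large coefficients, reducing the constant bias of plain soft thresholding.
```python
import numpy as np

def improved_threshold(coeffs, lam, a=2.0):
    """One common 'improved' wavelet threshold function: zero below lam,
    continuous at |w| = lam, and tending to the identity (hard threshold)
    as |w| grows."""
    w = np.asarray(coeffs, dtype=float)
    mag = np.abs(w)
    # shrink equals lam at |w| = lam and decays to 0 for large |w|
    shrink = lam * 2.0 / (1.0 + np.exp(a * (mag - lam)))
    out = np.sign(w) * (mag - shrink)
    out[mag < lam] = 0.0
    return out
```
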