Official journal of the China Computer Federation
Chinese Science and Technology Core Journal
Chinese Core Journal
Current Issue
2020, No. 11 Published: 25 November 2020
Design of BLAS level-3 computation on a matrix multiplication coprocessor
JIA Xun, QIAN Lei, YUAN Hao, ZHANG Kun, WU Dong
2020, 42(11): 1913-1921. doi:
BLAS level-3 subprograms have high computational complexity and often become the performance bottleneck of applications. By organizing large-scale floating-point units into a linear array architecture, the matrix multiplication coprocessor can perform high-performance and efficient matrix multiplication. Achieving efficient BLAS level-3 computation on the matrix multiplication coprocessor is essential for accelerating large-scale science and engineering applications.
By taking matrix multiplication as the kernel and exploiting the characteristics of the underlying linear array architecture, this paper proposes a design of BLAS level-3 computation on a matrix multiplication coprocessor and constructs a corresponding performance model. Experimental results show that the SYMM, SYRK and TRMM subprograms on the matrix multiplication coprocessor achieve computation efficiencies of 99%, 98% and 80% respectively, at most 31% higher than those on the SW26010 and the NVIDIA V100 GPU.
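As a rough illustration of building a level-3 routine on top of a matrix multiplication kernel, the sketch below computes the lower triangle of C = A·Aᵀ (the SYRK operation) by calling a small multiplication kernel on blocks. The function names, the block size and the pure-Python kernel are assumptions made for illustration; this is not the coprocessor design described in the abstract.

```python
def gemm_block(a_rows, b_rows):
    """Kernel: multiply block A (m x k) by the transpose of block B (n x k)."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b_rows] for ra in a_rows]


def syrk_lower(a, block=2):
    """Lower triangle of C = A * A^T, computed block by block with gemm_block."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, block):
        # Only blocks on or below the diagonal are computed.
        for j0 in range(0, i0 + 1, block):
            tile = gemm_block(a[i0:i0 + block], a[j0:j0 + block])
            for di, row in enumerate(tile):
                for dj, value in enumerate(row):
                    if j0 + dj <= i0 + di:  # skip entries above the diagonal
                        c[i0 + di][j0 + dj] = value
    return c
```

Because only the blocks on or below the diagonal are visited, roughly half of the block multiplications of a full product are skipped, while all of the arithmetic still goes through the multiplication kernel.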
Heterogeneous cooperative computing of particle transport based on Monte Carlo method on the Tianhe-2A system
LI Biao, LIU Jie,
2020, 42(11): 1922-1928. doi:
Particle transport simulation plays an important role in nuclear science and medical radiation therapy. Based on the Monte Carlo method, this paper proposes a heterogeneous cooperative algorithm for particle transport on the Tianhe-2A system. Based on the asynchronous communication modes (BCL and ACL) of the Tianhe-2A system, a simple and efficient symmetric communication mode between the CPU and the Matrix-2000 accelerator is proposed. On the Matrix-2000 accelerator, the thread-level parallelism of the program is exploited through OpenMP directives. The original serial data collection communication mode is replaced by a new communication mode based on a binary tree structure, which greatly reduces the communication time. On the Tianhe-2A system, the parallel program based on CPU/Matrix-2000 heterogeneous collaborative computing scales up to 450k cores, and the parallel efficiency relative to 50k cores is stabilized at 22.54%.
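The benefit of tree-based data collection over serial collection can be sketched with a small simulation that counts communication steps. The functions below are hypothetical and only model the number of sequential steps; they do not use the BCL or ACL interfaces and are not the paper's implementation.

```python
def serial_collect(values):
    """The root receives from every other rank, one after another."""
    total, steps = values[0], 0
    for v in values[1:]:
        total += v
        steps += 1
    return total, steps


def tree_collect(values):
    """Pairwise combination along a binary tree; each round runs in parallel."""
    vals = list(values)
    n, stride, rounds = len(vals), 1, 0
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            vals[i] += vals[i + stride]  # rank i receives from rank i + stride
        stride *= 2
        rounds += 1
    return vals[0], rounds
```

For 8 ranks the serial scheme needs 7 sequential steps, while the tree needs 3 rounds. In general the tree needs about log2(n) rounds, which is why the gap widens as the number of ranks grows.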
Advance in memristor-based computing storage fusion architecture
FANG Xudong, WU Junjie
2020, 42(11): 1929-1940. doi:
The memristor is an enabling device with non-volatile resistance, low power consumption, high durability, ease of integration, and CMOS compatibility. The stateful logic of memristors is logically complete and can realize a true fusion of computing and storage, which is expected to break the limitation of the von Neumann architecture and effectively alleviate the memory wall bottleneck. These properties have attracted great interest in memristors from academia and industry. In light of this, this paper summarizes the research progress of application-oriented computing-storage fusion architectures based on stateful logic. Firstly, the implementation principle and improvement methods of stateful logic are analyzed in detail. Secondly, stateful logic design based on the memristor crossbar is reviewed, including the parallel implementation of basic logic operations, the copy operation and the comparison operation, and the design principle and implementation structure of memristor-based data storage structures are then summarized. The paper then examines an application-oriented computing-storage fusion architecture in detail, and finally summarizes the open problems in this research direction and looks ahead to future work.
Design and implementation of event extraction model and accelerator based on FPGA
HAN Zhe, JIANG Jingfei, QIAO Linbo, DOU Yong, XU Jinwei, KAN Zhigang
2020, 42(11): 1941-1948. doi:
Event extraction technology is important for rapidly extracting specific information, and it can be widely used in information retrieval, sentiment analysis and other scenarios. Chinese event extraction is more difficult than English event extraction because of the characteristics of the Chinese language. Based on a state-of-the-art English event extraction neural network model, CEEDGCNN (Chinese Event Extraction based on multilayer Dilate Gated Convolutional Neural Network), a model suitable for hardware implementation, is proposed. CEEDGCNN achieves a 71.71% F1-score for trigger classification on the ACE2005 Chinese corpus. An accelerator for CEEDGCNN is designed and implemented, and the model size is further reduced by quantization. The accelerator achieves 97 GOP/s on the Xilinx XCKU115 FPGA, which is 67 times faster than the CPU.
Research and implementation of multi-precision algorithm based on SCILAB
LAN Jing, LIU Wenchao, JIANG Hao, LIN Wenqiang
2020, 42(11): 1949-1955. doi:
Currently, general-purpose processors generally support 64-bit floating-point operations. In large-scale, long-running scientific numerical calculation, the cumulative effect of rounding errors in floating-point operations may lead to unreliable numerical results. Therefore, to control errors effectively, it is important to design high-precision, efficient and reliable floating-point numerical algorithms. Using error-free transformations and the double-double format, this paper implements a high-precision mathematics library on the SCILAB software platform. The evaluation of polynomials in the power basis, Bernstein form and Chebyshev basis is carried out on the Intel platform and the domestic FT processor platform. The results demonstrate the validity of the proposed high-precision mathematics library. The library has independent intellectual property rights and can run on independently developed and controllable domestic processors, which will support national high technology research.
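The idea of an error-free transformation can be sketched with the classical TwoSum algorithm, which returns both the rounded sum and the exact rounding error of a floating-point addition. The sketch below is written in Python rather than SCILAB, and the helper `compensated_sum` is a hypothetical name used for illustration; it is not a routine from the library described in the abstract.

```python
def two_sum(a, b):
    """TwoSum: returns (s, e) with s = fl(a + b) and s + e == a + b exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def compensated_sum(values):
    """Sum that carries the rounding errors in a separate low-order word."""
    hi, lo = 0.0, 0.0
    for v in values:
        hi, e = two_sum(hi, v)  # hi is the rounded sum, e the error of this step
        lo += e
    return hi + lo
```

With the input `[1e16, 1.0, -1e16]`, ordinary summation returns `0.0` because the `1.0` is lost to rounding, whereas `compensated_sum` recovers `1.0`. Keeping a high word and a low word in this way is the basic principle behind the double-double format.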
Configuration and scheduling mechanism of spot instances meeting the execution time limit of workflow
LIAO Jianjin, SUN Qingxiao, YANG Hailong, LUAN Zhongzhi, QIAN Depei
2020, 42(11): 1956-1964. doi:
With the development of cloud computing, deploying workflows onto cloud computing platforms has become a popular choice. Compared with a traditional local workflow, a cloud workflow must consider not only requirements such as execution time but also economic cost. To improve resource utilization, cloud computing service providers offer spot instances, which are very cheap but unstable. To address workflow scheduling and execution in cloud computing, this paper proposes a spot instance configuration and scheduling method that meets the workflow execution time budget. The method uses Markov models and dynamic programming to predict the price of spot instances and obtain the lowest-cost bid strategy. At the same time, to satisfy the execution time budget of the workflow, the instances used in the workflow are configured under the estimated bid strategy. Experimental results show that, compared with using on-demand instances, the method saves up to 89.9% of the computation cost while meeting the workflow execution time budget.
Research and implementation of low-latency forward error correction coding for HPC interconnection network
WANG Chao, CAO Jijun, LUO Zhang, LAI Mingche, XU Weixia
2020, 42(11): 1965-1972. doi:
At present, the port rate of mainstream high-performance interconnection networks reaches 100~400 Gbps, and the single-channel rate reaches 25~50 Gbps. For data transmission at these rates, Forward Error Correction (FEC) coding is necessary to improve reliability. The Ethernet international standard IEEE 802.3 uses the FEC codes RS(528,514) and RS(544,514), but these two codes struggle to meet the low-latency requirements of high-performance interconnection networks. Firstly, this paper analyzes the encoding and decoding structures of RS codes and quantitatively studies the relationship between RS code parameters and the encoding and decoding delays. Secondly, a new code, RS(271,257), is proposed for low-latency high-performance interconnection networks, and its advantages and disadvantages in bandwidth consumption and error correction capability are compared. Finally, based on RS(271,257), this paper implements the network coding sublayer, evaluates its resource consumption and simulates its delay performance. Considering resource consumption, error correction capability and delay performance, RS(271,257) is an ideal low-latency FEC code that meets the design requirements of the coding sublayer of current HPC-oriented low-latency high-performance interconnection networks.
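The trade-off between the three codes named above can be made concrete with the standard properties of an RS(n, k) code: it carries n − k parity symbols and corrects up to ⌊(n − k)/2⌋ symbol errors. The helper below is a hypothetical illustration of that arithmetic only; it says nothing about the latency or resource figures reported in the paper.

```python
def rs_properties(n, k):
    """Basic properties of an RS(n, k) code, counted in symbols."""
    parity = n - k
    return {
        "parity_symbols": parity,
        "correctable_symbol_errors": parity // 2,
        "code_rate": k / n,
        "overhead": parity / k,  # parity symbols per data symbol
    }


if __name__ == "__main__":
    for n, k in [(528, 514), (544, 514), (271, 257)]:
        print((n, k), rs_properties(n, k))
```

RS(271,257) and RS(528,514) both carry 14 parity symbols and therefore correct up to 7 symbol errors per codeword, but the shorter codeword spends about 5.4% of its data payload on parity compared with about 2.7% for RS(528,514). RS(544,514) corrects up to 15 symbol errors.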
Cold start optimization on function computing for high performance computing
LI Zhe, TAN Yusong, LI Bao, YU Jie
2020, 42(11): 1973-1980. doi:
High performance computing (HPC) problems usually consist of subtasks that can be parallelized, and they consume a large amount of computing resources during execution. Traditional cloud computing based on virtual machines has been shown to handle such problems, but managing the distributed environment and designing distributed solutions make the processing more complex. Function computing is a new serverless cloud computing paradigm whose automatic scaling and considerable computing resources fit HPC problems well. However, cold start delay is unavoidable on public cloud function computing platforms, and it is further magnified for HPC tasks with highly concurrent jobs. In this paper, we first analyze the completion time of a simple HPC task under cold start and hot start conditions, and analyze the causes of the additional delay. Based on these analyses, we combine time series analysis tools with the platform's automatic scaling mechanism to propose a preheating method that effectively reduces the cold start delay of HPC tasks on the function computing platform.
Reduction operation offloading optimization based on Tianhe interconnect MPI collective
WANG Hao, ZHANG Wei, XIE Min, DONG Yong
2020, 42(11): 1981-1987. doi:
MPI collective communication operations are widely used in parallel scientific applications and have an important impact on program scalability. The Tianhe interconnect network supports triggered communication operations, which can offload messaging and processing work and improve the performance between nodes. Allreduce and Reduce algorithms under different tree topologies are designed using the triggered operations to lower the latency of reduction communication between nodes. Tests on the actual system platform show that, compared with the point-to-point implementation of these two operations in MPICH, the trigger-based offload algorithm reduces the running time by up to 59.6% at different node scales.
A cloud cipher job stream scheduling algorithm based on associated data localization
GUAN Chuanjiang, LI Jianpeng, SHI Guozhen, MAO Ming
2020, 42(11): 1988-1995. doi:
In a cloud cipher service system, service requests are diverse, and data-dependent and non-data-dependent job streams arrive concurrently and randomly interleaved. To avoid the communication overhead and data security threats caused by exchanging associated data between computing nodes, a cloud cipher job stream scheduling algorithm based on associated data localization is designed. Firstly, the cryptographic function requested by each task is mapped to ensure that the functions requested by multiple job streams are implemented correctly. Secondly, to handle different working modes being interleaved across tasks that request the same cryptographic function, a task priority calculation method is proposed to promote fairness in multi-job-stream scheduling, and on that basis a classified scheduling method is adopted to localize associated data and guarantee the overall performance of the scheduling system. Simulation results show that the algorithm not only effectively reduces the task completion time and improves resource utilization and fairness, but also has good stability.
Link scheduling in energy-harvesting sensor networks with non-ideal batteries
WANG Ningbo, WANG Luyao, XU Xiaobin
2020, 42(11): 1996-2004. doi:
In recent years, energy-harvesting wireless sensor networks have become a research hotspot as a way to overcome the limited energy of sensor nodes. To address battery shortcomings in sensor nodes such as limited capacity, charging/discharging loss, and energy leakage, a harvest-use-store structure with non-ideal batteries is proposed. A mathematical model is established that jointly considers routing, link scheduling and energy allocation, and the shortest frame is obtained by solving a mixed-integer linear program. Simulation results show that the frame length decreases by up to 48% when the charging/discharging efficiency increases from 0.6 to 0.9, and by up to 33% when the energy leakage ratio is reduced from 0.04 to 0.01. Expanding the battery capacity has little effect on the frame length. Compared with the harvest-store-use structure, the harvest-use-store structure reduces the frame length by up to 11%. The results verify that the proposed method greatly improves network throughput by improving the charging/discharging efficiency and reducing the energy leakage rate.
A searchable encryption scheme supporting multi-keyword retrieval on blockchain
NIU Shufen, WANG Jinfeng, WANG Bobin, CHEN Jingmin, DU Xiaoni
2020, 42(11): 2005-2012. doi:
In cloud-based single-keyword searchable encryption schemes, cloud servers are not completely trusted, and single-keyword retrieval cannot return accurate search results. Therefore, a multi-keyword searchable encryption scheme is constructed using blockchain technology. The scheme uses a symmetric encryption algorithm to improve encryption efficiency, takes advantage of blockchain technology to solve the problem of dishonest search by the cloud server, and improves the accuracy of search results with a multi-keyword index structure. The scheme is proved to achieve indistinguishability under chosen keyword attack (IND-CKA) in the random oracle model. Furthermore, the performance analysis shows that the scheme is secure and efficient.
Reversible data hiding of JPEG image by DCT coefficient of selective sorting
WANG Ruofei, LIU Feng
2020, 42(11): 2013-2019. doi:
The JPEG image compression algorithm provides good compression performance and reconstruction quality, and it is widely used in image and video processing. This paper proposes a feasible and effective reversible data hiding method for JPEG images. In this scheme, the quantized DCT coefficients of all 8 × 8 sub-blocks in a JPEG image are rearranged into a new matrix, with the coefficient values of each block listed vertically and the coefficients at the same frequency listed horizontally. Bit embedding is simulated on the coefficients at each frequency, and the frequencies that cause small distortion are preferentially selected for embedding until the whole secret bitstream is embedded. During embedding, invalid bitstream expansion is reduced according to the decoding matrix. Experimental results show that the method achieves better visual quality and less image bitstream expansion for the same amount of embedded data.
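The coefficient rearrangement described above can be sketched as a simple transposition: each block contributes one column, so that every row collects the coefficients of one frequency across all blocks. The sketch assumes each block is already given as a flat list of quantized coefficients in a fixed frequency order; the function names are hypothetical, and the frequency selection and embedding steps of the paper are not reproduced.

```python
def rearrange(blocks):
    """Row f holds frequency f of every block; column j holds block j."""
    n_freq = len(blocks[0])
    return [[block[f] for block in blocks] for f in range(n_freq)]


def restore(matrix):
    """Inverse of rearrange: recover the per-block coefficient lists."""
    n_blocks = len(matrix[0])
    return [[row[j] for row in matrix] for j in range(n_blocks)]
```

Since `restore(rearrange(blocks))` returns the original blocks, the layout change loses no information, which is a precondition for a reversible scheme.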
An automatic fine crack recognition algorithm for airport pavement under significant noises
LI Haifeng, WU Zhilong, NIE Jingjing
2020, 42(11): 2020-2029. doi:
Cracks on airport pavement are extremely fine, and depth-camera-based crack detection is subject to interference from both the complex apparent structure of the pavement and severe vibration of the platform. To handle this problem, a main profile modeling algorithm combining L2 regularization with a dynamic-threshold greedy strategy is proposed to achieve accurate millimeter-level crack detection. Firstly, the main profile of the pavement is modeled under an L2 regularization constraint, which overcomes the overfitting caused by the complex apparent structure. Secondly, an improved greedy algorithm based on a dynamic threshold is proposed to suppress noise interference by iteratively removing abnormal points caused by platform vibration. Finally, based on the constructed main profile model, the multi-direction main profiles of the airport pavement are extracted and fused, and the crack depth and morphology information are used to extract the cracks. Experiments on real airport pavement data show that the proposed algorithm reconstructs the main profile of the pavement accurately, detects fine cracks successfully, and achieves better crack detection performance than existing techniques.
Traffic road sign recognition based on SqueezeNet model with deep residual network and GRU
HUO Aiqing, ZHANG Wenle, LI Haoping
2020, 42(11): 2030-2036. doi:
Existing traffic road sign recognition methods are all based on convolutional neural networks. As the number of network layers increases, recognition accuracy improves, but efficiency drops and the number of parameters grows. Therefore, an improved SqueezeNet model combining a deep residual network with a GRU neural network (SqueezeNetIRGRU) is proposed. To enhance learning efficiency, the ELU function is used as the activation function. To avoid vanishing gradients when the network is too deep, a deep residual network is introduced to guarantee the stability of the model, and a GRU neural network, which can memorize important past features, is utilized. Experiments were performed on the CIFAR-10 and GTSRB datasets, and the recognition accuracy rates are above 99.13% and 88.25% respectively. The experimental results show that the SqueezeNetIRGRU model not only greatly reduces the number of parameters, but also outperforms the compared models in convergence, stability and recall rate.
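For reference, the ELU activation mentioned above is commonly defined as x for positive inputs and α(eˣ − 1) otherwise. The sketch below uses that standard definition with α = 1 as an assumed default; the abstract does not state which α the model uses.

```python
import math


def elu(x, alpha=1.0):
    """Exponential linear unit: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def relu(x):
    """Rectified linear unit, shown for comparison."""
    return max(0.0, x)
```

Unlike ReLU, ELU returns small negative values for negative inputs and saturates at −α, so negative activations still carry a gradient.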
Free parameter optimization of the cubic Cardinal spline function
LI Juncheng, LIU Chengzhi
2020, 42(11): 2037-2041. doi:
To determine the free parameter of the cubic Cardinal spline function reasonably, the optimization of this parameter in interpolation problems is discussed. Firstly, the influence of the free parameter on the curve shape of the cubic Cardinal spline function is analyzed. Secondly, schemes for computing the optimal free parameter are given for the two cases of data interpolation and function approximation, yielding the cubic Cardinal spline functions with minimal quadratic average oscillation and minimal approximation error respectively. When a cubic Cardinal spline function with good shape preservation or good approximation is needed, the optimal free parameter can be selected by the proposed schemes.
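How a single free parameter shapes a cubic Cardinal spline can be sketched with the common tension form, in which the tangent at each data point is (1 − tension)/2 times the difference of its two neighbours and each segment is a cubic Hermite curve. This parameterization, uniform knot spacing and the function name are assumptions for illustration; the paper's own definition of the free parameter may differ.

```python
def cardinal_segment(p0, p1, p2, p3, t, tension=0.0):
    """Value at t in [0, 1] on the segment from p1 to p2 (uniform knots assumed)."""
    s = (1.0 - tension) / 2.0
    m1 = s * (p2 - p0)  # tangent at p1
    m2 = s * (p3 - p1)  # tangent at p2
    # Cubic Hermite basis functions.
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * p1 + h10 * m1 + h01 * p2 + h11 * m2
```

Every choice of the parameter interpolates the data points; only the shape between them changes. With `tension=0.0` the curve is a Catmull-Rom spline, and with `tension=1.0` the tangents vanish at the data points.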
Multi-source image fusion with SPCNN and SR based on image features
ZHANG Lixia, ZENG Guangping, XUAN Zhaocheng
2020, 42(11): 2042-2049. doi:
To highlight the different features of different input images, an SPCNN model whose parameters are set automatically from image features is proposed and combined with sparse representation to fuse multi-source images. The fusion process has four steps. Firstly, the source images are decomposed into high-frequency and low-frequency coefficients by NSST. Secondly, each high-frequency coefficient is fired by the SPCNN model with parameters set automatically from its inherent characteristics, and fusion is completed according to the total number of firings and the weighted fusion strategy. Thirdly, the low-frequency coefficients are fused by sparse representation. Finally, the fused image is reconstructed by inverse NSST. The experimental results show that the proposed method is superior to five other classical methods and that the fused image conforms to the human visual perception system, with clear structure and distinct details.
Research on health management system of large-caliber artillery based on deep learning
ZHANG Yuan, JIANG Huancheng
2020, 42(11): 2050-2058. doi:
Large-caliber artillery can restrict enemy movement over the maximum range at the least cost, and it is a critical fire suppression weapon on the battlefield. However, because of its harsh working environment, large-caliber artillery performs unstably in missions. Building on a research project on a health management system for large-caliber artillery that monitors and records the working status of the artillery in real time, this paper proposes a design for deep-learning-based failure prediction and analysis of large-caliber artillery, combined with expert analysis and other health management methods. The unsupervised, efficient feature extraction capability of the deep belief network and the supervised classification capability of the multilayer perceptron are used to build a deep learning fault prediction model, which predicts the failure state of large-caliber artillery and provides technical support for its pre-maintenance, thereby improving its reliability.
Entity relationship extraction fusing self-attention mechanism and CNN
YAN Xiong, DUAN Yuexing, ZHANG Zehua
2020, 42(11): 2059-2066. doi:
At present, neural network models play an important role in entity relationship extraction tasks. A convolutional neural network can extract features automatically, but it is limited because its convolution kernels use a fixed window size to extract the contextual semantic information of words in a sentence. Therefore, this paper proposes a new relation extraction method fusing self-attention and a convolutional neural network. The self-attention mechanism is applied to the original word vectors to obtain the relationships between the words in the sequence. The resulting input word vectors express richer semantic information, which compensates for the shortcomings of automatic feature extraction by the convolutional neural network. Experimental results on the SemEval-2010 Task 8 dataset show that adding the self-attention mechanism improves entity relationship extraction.
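The way self-attention lets every word vector draw on all other words in the sentence can be sketched with scaled dot-product attention. The version below omits the learned query, key and value projections (it uses the input vectors directly), which is a simplifying assumption; the abstract does not give the paper's exact formulation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = the input vectors."""
    d = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return outputs
```

Each output vector is a weighted average over the whole sequence, so it is not restricted to a fixed window the way a convolution kernel is.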
An improved wavelet threshold-CEEMDAN algorithm for ECG signal denoising
ZHANG Peiling, LI Xiaozhen, CUI Shuaihua
2020, 42(11): 2067-2072. doi:
Electrocardiogram (ECG) signal denoising has always been a hot research issue. To eliminate the noise in ECG signals, a denoising method based on complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) and an improved wavelet threshold function is proposed. Firstly, the ECG signal is decomposed by CEEMDAN to obtain a set of intrinsic mode functions (IMFs) ordered from high frequency to low frequency. Secondly, using the correlation coefficient method, wavelet denoising with the improved threshold is applied to the high-frequency IMFs. For the low-frequency IMFs, a fixed threshold is set, and the IMFs below the threshold are considered to be baseline drift and removed. Finally, the signal is reconstructed from the denoised IMFs and the retained IMFs. The experimental results show that the proposed method is more effective than empirical mode decomposition (EMD) wavelet denoising and ensemble empirical mode decomposition (EEMD) wavelet denoising.
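The abstract does not give the improved threshold function itself. As background, the two classical wavelet threshold functions that such improvements usually start from can be sketched as follows; the function names and the choice of threshold value are illustrative assumptions.

```python
def hard_threshold(w, t):
    """Keep a coefficient unchanged if its magnitude exceeds t, else zero it."""
    return w if abs(w) > t else 0.0


def soft_threshold(w, t):
    """Zero small coefficients and shrink the remaining ones toward zero by t."""
    if abs(w) <= t:
        return 0.0
    return w - t if w > 0 else w + t
```

Hard thresholding is discontinuous at ±t, while soft thresholding introduces a constant bias of t on large coefficients.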