  • Journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Current Issue

    • Implementation and optimization of high-precision summation and dot product algorithms on Phytium processor
      HUANG Chun, JIANG Hao, GU Tong-xiang, QI Jin, LIU Wen-chao
      2021, 43(01): 1-8.
      In large-scale, long-running numerical calculations, the cumulative effect of rounding errors in floating-point operations may make numerical results unreliable. Summation and dot product are the most basic operations in floating-point numerical computation. They are called frequently in large-scale scientific computing, and the accuracy of their results is very important. Targeting the domestic Phytium processor and building on OpenBLAS, this paper uses error-free transformation techniques to design efficient assembly kernel functions, and implements and optimizes high-precision summation and dot product algorithms. Numerical experiments show that the accuracy of our high-precision algorithms matches that of the original algorithms executed in twice the working precision, which verifies their effectiveness. In the single-threaded case, the running times of our algorithms are 1.57 and 1.76 times those of the original algorithms, so efficiency is not significantly reduced while accuracy is improved. In the multi-threaded case, they run in almost the same time as the original algorithms, which reflects their efficiency. Theoretical error analysis further ensures the reliability of our algorithms.
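      The assembly kernels themselves are not shown in the abstract; as a minimal Python sketch of the error-free transformation idea they build on (function names are ours, not the authors'): Knuth's TwoSum returns an addition's result together with its exact rounding error, and accumulating those errors separately yields results as accurate as plain summation carried out in twice the working precision.

      ```python
      import math

      def two_sum(a, b):
          """Error-free transformation (Knuth): s + e == a + b exactly in floats."""
          s = a + b
          t = s - a
          e = (a - (s - t)) + (b - t)
          return s, e

      def sum2(xs):
          """Compensated summation: accumulate the exact local errors separately."""
          s = comp = 0.0
          for x in xs:
              s, e = two_sum(s, x)
              comp += e
          return s + comp

      def dot2(xs, ys):
          """Compensated dot product; math.fma (Python 3.13+) gives the exact
          product error. On older Pythons, Dekker's splitting would be needed."""
          s = comp = 0.0
          for x, y in zip(xs, ys):
              p = x * y
              ep = math.fma(x, y, -p)      # exact rounding error of the product
              s, es = two_sum(s, p)
              comp += es + ep
          return s + comp
      ```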



      Availability analysis of datacenter peak power shaving battery
      LU Yu, ZHANG Lu, HOU Xiao-feng, ZHENG Wen-li, LI Chao
      2021, 43(01): 9-16.
      Research shows that datacenter backup batteries have great potential for peak power shaving. Using batteries for peak power shaving can greatly improve the power usage efficiency of a datacenter, thereby saving substantial construction costs for power infrastructure. However, due to accelerated battery aging, batteries often need to be replaced several times during a datacenter's life cycle, which makes battery cost an important part of datacenter cost; under the more advanced distributed backup architecture, batteries account for an even greater share. Therefore, how to use batteries more economically has become a key cost-saving issue. This paper proposes a revenue model for predicting battery availability, which can evaluate whether a battery considered aged in the traditional sense still has practical value, and how to balance performance degradation against backup reliability when using aging batteries. It also proposes an optimized battery control method that reduces the cost of datacenter backup power.
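      The abstract does not give the revenue model's form; purely to make the tradeoff concrete, the toy function below (entirely our own, with hypothetical parameters) weighs an aged battery's remaining peak-shaving value against the backup-reliability risk of keeping it for another year.

      ```python
      def keep_battery_value(capacity_frac, shaving_saving_per_year,
                             shortfall_prob, outage_cost, annual_replacement_cost):
          """Toy model, not the paper's: positive value => keeping the aged
          battery one more year beats replacing it now."""
          saving = capacity_frac * shaving_saving_per_year  # degraded cells shave less peak
          risk = shortfall_prob * outage_cost               # aged cells may miss backup duty
          return saving - risk + annual_replacement_cost    # replacement deferred a year

      # Example: a battery at 70% capacity can still be worth keeping.
      print(keep_battery_value(0.7, 5000.0, 0.01, 100000.0, 2000.0))
      ```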


      Performance optimization of quantum circuit simulator QuEST on multi-GPU platform
      ZHANG Liang, CHANG Xu, QIN Zhi-kai, SHEN Li
      2021, 43(01): 17-23.
      In current quantum computing research, quantum circuit simulators are an important research tool and have long been valued by researchers. QuEST is an open-source general-purpose quantum circuit simulator that runs flexibly on multiple platforms, such as a single CPU node, multiple CPU nodes, and a single GPU. The inherent parallelism of quantum circuit simulation makes it well suited to GPUs, where it obtains large performance gains; the drawback is its huge memory consumption, so a single GPU is limited by memory capacity and cannot simulate systems with more qubits. This paper designs and implements a multi-GPU version of the QuEST simulator, which solves the insufficient-memory problem of a single GPU and can simulate more qubits. Moreover, it achieves 7x to 9x performance acceleration over the single-CPU version and 3x over the multi-CPU version.
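      The memory constraint is easiest to see in the distributed state-vector picture: 2^n amplitudes split evenly over devices. The numpy sketch below (our own layout convention, not QuEST's code; a real multi-GPU build would move data with peer-to-peer copies or NCCL rather than Python list swaps) shows why a Pauli-X on a "low" qubit stays device-local while one on a "high" qubit forces an exchange between partner devices.

      ```python
      import numpy as np

      def apply_x(chunks, target):
          """Apply Pauli-X on qubit `target` to a state vector split into equal
          power-of-two chunks, chunk i holding amplitudes whose top address bits
          equal i."""
          m = chunks[0].size
          local_bits = m.bit_length() - 1
          if target < local_bits:                   # pair lives on the same device
              stride = 1 << target
              for c in chunks:
                  for j in range(0, c.size, 2 * stride):
                      lo = c[j:j + stride].copy()
                      c[j:j + stride] = c[j + stride:j + 2 * stride]
                      c[j + stride:j + 2 * stride] = lo
          else:                                     # pair lives on the partner device
              bit = 1 << (target - local_bits)
              for i in range(len(chunks)):
                  p = i ^ bit
                  if p > i:                         # whole-chunk exchange
                      chunks[i], chunks[p] = chunks[p], chunks[i]

      state = np.zeros(16, dtype=complex); state[0] = 1.0   # |0000> on 4 qubits
      chunks = list(np.split(state, 4))                     # 4 "devices"
      apply_x(chunks, target=3)                             # high qubit: pure exchange
      ```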




      Comparison of design schemes of a MobileNetV2 neural network processor
      CHEN Yong-hao, XIAO Jia-le, SU Tao
      2021, 43(01): 24-32.
      Aiming at the linear bottleneck structure of MobileNetV2, we study design schemes for a dedicated processor chip. Based on the Layer Fusion mode and a configurable block structure, we design a pipeline structure for bottleneck convolution as well as a corresponding analysis framework. A design space is then proposed accordingly, and a software simulator is used to traverse the space and compare the performance of the schemes in it, from which we derive rules for optimal parameter selection. The validity of the conclusions is verified by hardware behavioral simulation. The study can help system-on-chip designers select or design a suitable MobileNetV2 processor IP scheme based on their own resource constraints and performance requirements, and it also provides some inspiration for future automatic processor design.
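      For reference, the linear bottleneck the pipeline must cover is MobileNetV2's inverted residual block: a 1x1 expansion, a 3x3 depthwise convolution, and a 1x1 linear projection. The PyTorch sketch below shows the standard block structure, not the paper's hardware design.

      ```python
      import torch.nn as nn

      class InvertedResidual(nn.Module):
          """MobileNetV2 linear bottleneck: 1x1 expand -> 3x3 depthwise -> 1x1 linear project."""
          def __init__(self, c_in, c_out, stride=1, expand=6):
              super().__init__()
              c_mid = c_in * expand
              self.use_res = stride == 1 and c_in == c_out
              self.block = nn.Sequential(
                  nn.Conv2d(c_in, c_mid, 1, bias=False),
                  nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
                  nn.Conv2d(c_mid, c_mid, 3, stride, 1, groups=c_mid, bias=False),
                  nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
                  nn.Conv2d(c_mid, c_out, 1, bias=False),
                  nn.BatchNorm2d(c_out),        # linear: no activation after projection
              )
          def forward(self, x):
              y = self.block(x)
              return x + y if self.use_res else y
      ```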




      FD-LSTM: A fault analysis model based on large-scale system logs
      FANG Jiao-li, ZUO Ke, HUANG Chun, LIU Jie, LI Sheng-guo, LU Kai
      2021, 43(01): 33-41.
      Reliability research is a classic problem in high-performance computing. With the continuous development of process and integration technology, system scale has grown exponentially, which brings great challenges to reliability research, especially failure analysis. This paper collects 203,510,247 failure log entries from the operation of an independently developed high-performance computing system between January 28, 2016 and December 6, 2016. Firstly, K-Means clustering is used to classify the faults and analyze their distribution characteristics. Secondly, based on the clustering results, a time-based fault analysis model, FD-LSTM, is designed; after training on structured logs, it predicts when and where different fault types will occur. The results show that the prediction accuracy of FD-LSTM reaches 80.56%. Compared with traditional fault analysis models, in both time and space prediction, the log-based time-series model FD-LSTM has practical guiding significance for improving the accuracy of fault analysis, enhancing the efficiency of machine operation and maintenance, improving the rationality of collaborative whole-system design, and other aspects.
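      The abstract does not detail FD-LSTM's architecture; the sketch below shows only the two-stage idea it describes (cluster logs into fault classes, then learn a time series over them), with a hypothetical feature file and hyperparameters of our choosing.

      ```python
      import numpy as np
      from sklearn.cluster import KMeans
      import torch
      import torch.nn as nn

      # Stage 1: cluster structured log records into fault classes
      # ("log_features.npy" and n_clusters=8 are placeholders, not the paper's values).
      feats = np.load("log_features.npy")
      classes = KMeans(n_clusters=8).fit_predict(feats)

      # Stage 2: predict next-step fault counts per class from a sliding window.
      class FaultLSTM(nn.Module):
          def __init__(self, n_classes=8, hidden=64):
              super().__init__()
              self.lstm = nn.LSTM(n_classes, hidden, batch_first=True)
              self.head = nn.Linear(hidden, n_classes)

          def forward(self, x):          # x: (batch, window, n_classes) fault counts
              out, _ = self.lstm(x)
              return self.head(out[:, -1])
      ```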




      A model parallel training optimization algorithm for hybrid heterogeneous platforms 
      GAO Kai, GUO Zhen-hua, CHEN Yong-fang, WANG Li, ZHAO Ya-qian, ZHAO Kun
      2021, 43(01): 42-48.
      With the development of hybrid heterogeneous platforms, different types of acceleration devices have appeared. How to make full use of these different devices, and how to deploy deep learning models across multiple computing devices to train large, complex models, is becoming more and more important. Data parallelism (DP) is the most widely used parallelization strategy, but as the number of devices in data-parallel training grows, the communication overhead between devices becomes a bottleneck. In addition, differences in device performance change the total batch size processed in each step, which causes a loss of accuracy, that is, more training epochs are needed to converge to the desired accuracy. These factors affect the overall training time as well as the operating efficiency of some devices. Beyond data parallelism, each training step can also be accelerated by model parallelism (MP). This paper proposes a model parallel training optimization algorithm for hybrid heterogeneous platforms. First, to address the uneven distribution of device performance on such platforms, it proposes a mixed layer-wise and channel-wise model partitioning strategy, and combines some low-performance devices to shorten the pipeline and ease communication pressure. Then, to optimize the pipelining between devices, by analyzing how pipeline fill time and device utilization affect overall training time, it proposes a micro-batch division method that balances the two. Experiments show that the proposed algorithm achieves better speedup than the traditional model parallel algorithm: training speedup increases by about 4% on a heterogeneous platform with a single device type, and the platform's training speedup increases by about 7% compared with the previous optimization method.
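      A simplified cost model makes the micro-batch tradeoff visible: more micro-batches shrink the relative pipeline fill/drain "bubble" but add per-micro-batch overhead. The sketch below is our own GPipe-style approximation, not the paper's cost model, which also accounts for device performance utilization.

      ```python
      def pipeline_time(num_micro, stage_times, overhead_per_micro=0.0):
          """Synchronous pipeline: the slowest stage sets the steady-state rate,
          plus a fill/drain bubble of (num_stages - 1) slots."""
          slot = max(stage_times) + overhead_per_micro
          return (num_micro + len(stage_times) - 1) * slot

      def best_micro_batches(batch_size, full_batch_stage_times, overhead_per_micro):
          """Pick the batch divisor minimizing total time; splitting into m
          micro-batches divides each stage's per-micro work by m."""
          divisors = [m for m in range(1, batch_size + 1) if batch_size % m == 0]
          return min(divisors,
                     key=lambda m: pipeline_time(m,
                                                 [t / m for t in full_batch_stage_times],
                                                 overhead_per_micro))
      ```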




      Intel Cascade Lake architecture CPU evaluation with SPEC CPU2017
      DU Qi, HUANG Hui, GONG Sheng, LIU Xin-wa, HUANG Chun
      2021, 43(01): 49-57.
      SPEC CPU2017, the next generation of the industry-standard benchmark suite, is one of the objective and trusted benchmarks for evaluating CPU performance today. This paper uses SPEC CPU2017 to test the Intel Xeon Gold 6252N CPU (Intel Cascade Lake architecture) under different memory frequencies and copy counts, and with Turbo on and off, and summarizes application performance across the configuration combinations. It also tests, for comparison, the Intel Xeon E5-2692 V2 CPU (Intel Ivy Bridge architecture) and the Intel Xeon E5-2620 V3 CPU (Intel Haswell architecture). By introducing the concept of PBR, we analyze how the added hardware features of the three architectures affect application performance.



      YH-ACT: Parallel analysis code of thermohydraulics
      LIU Jie, GONG Chun-ye, YANG Bo, GUO Xiao-wei, GAN Xin-biao, LI Sheng-guo, LI Chao, CHEN Xu-guang, XIAO Tiao-jie, MU Li-an, SONG Min, ZHAO Dong-yong, JU Yu-zhong
      2021, 43(01): 58-69.
      Commercial CFD programs have been widely used in the thermal-hydraulic simulation of reactors, but they cannot fully meet reactors' application requirements. Open-source CFD programs have some applications, but compared with commercial ones there are still gaps in physical model coverage, calculation accuracy, calculation efficiency, and ease of use. Better meeting the needs of thermohydraulics analysis requires more comprehensive physical models, higher calculation accuracy, and better parallel computing efficiency, so it is necessary to develop independent thermohydraulic CFD software. This paper describes the design, implementation, and test results of YH-ACT, a parallel analysis code of thermohydraulics. Three typical cases are selected, and the software's correctness is verified by comparison with the simulation results of the typical commercial software Fluent. The parallel computing scale reaches 400 nodes with 9,600 processes. The speedup is 111.7 with a parallel efficiency of 27.9% for the steady-state model, and 37.2 with a parallel efficiency of 9.3% for the transient model.
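      The reported efficiencies are consistent with defining parallel efficiency as speedup over node count (our reading; the abstract does not state the baseline):

      ```latex
      \eta = \frac{S}{N}, \qquad
      \eta_{\text{steady}} = \frac{111.7}{400} \approx 27.9\%, \qquad
      \eta_{\text{transient}} = \frac{37.2}{400} \approx 9.3\%.
      ```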



      Wireless data center network: Advances, challenges and perspectives
      HAN Biao, WANG Tao, WANG Bao-sheng
      2021, 43(01): 70-81.
      With increasingly dynamic traffic, the complexity of wired data center network architectures poses tremendous challenges to network expansion, energy consumption management, and operation and maintenance. High-speed wireless technology offers high bandwidth, dynamic connectivity, and flexible controllability. It has become a potential data center networking solution that can alleviate the long-standing problem of traffic hotspots in data centers and reduce the time, energy, and cost of deploying and maintaining optical cables. This paper first introduces current trends in data center network architecture, then analyzes and compares the advantages and disadvantages of millimeter-wave, terahertz, and free-space optical communication as candidate high-speed wireless technologies. Next, we review typical wireless data center network architectures and investigate the challenges of designing and deploying them. Finally, we point out future perspectives for wireless data center networks.





      A real-time HMAC-SM3 acceleration engine for large network traffic
      LI Dan-feng, WANG Fei, ZHAO Guo-hong
      2021, 43(01): 82-88.
      Keywords: Hash-based message authentication code (HMAC); SM3; field programmable gate array (FPGA); message authentication
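      Only keywords are available for this entry. For orientation, the generic HMAC construction (RFC 2104) over any block hash is sketched below; SM3 uses a 64-byte block like SHA-256, and whether hashlib exposes "sm3" depends on the underlying OpenSSL build (an assumption here). The paper's real-time engine is of course an FPGA hardware design, not this software form.

      ```python
      import hashlib

      def hmac_digest(key: bytes, msg: bytes, hash_name: str = "sm3",
                      block_size: int = 64) -> bytes:
          """Generic HMAC per RFC 2104 over a pluggable hash."""
          h = lambda data: hashlib.new(hash_name, data).digest()
          if len(key) > block_size:
              key = h(key)                      # long keys are hashed first
          key = key.ljust(block_size, b"\x00")  # then zero-padded to the block size
          ipad = bytes(b ^ 0x36 for b in key)
          opad = bytes(b ^ 0x5C for b in key)
          return h(opad + h(ipad + msg))
      ```
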
      Construction of a class of linear codes with three or four weights
      XUE Wen-fang, WANG Wei-qiong, LI Ya-wei
      2021, 43(01): 89-94.
      Linear codes with few weights have important applications in secret sharing schemes, authentication codes, association schemes, and strongly regular graphs. A class of three-weight or four-weight linear codes is constructed from Boolean functions. The parameters and weight distributions of these codes are determined via the theory of character sums and the Walsh spectrum of Boolean functions. The proposed three-weight codes can be used to construct secret sharing schemes and association schemes. The dual codes of the proposed codes are optimal or almost optimal with respect to the sphere-packing bound.
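      For reference, two standard tools the abstract invokes (textbook definitions, not taken from the paper): the Walsh spectrum of a Boolean function f on F_2^n, and the sphere-packing bound against which the dual codes are judged.

      ```latex
      W_f(a) = \sum_{x \in \mathbb{F}_2^n} (-1)^{f(x) + a \cdot x}, \quad a \in \mathbb{F}_2^n;
      \qquad
      q^{k} \sum_{i=0}^{\lfloor (d-1)/2 \rfloor} \binom{n}{i} (q-1)^{i} \le q^{n}
      \ \text{for an } [n, k, d]_q \text{ code.}
      ```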



      A deep learning model of small object detection based on attention mechanism
      WU Xiang-ning, HE Peng, DENG Zhong-gang, LI Jia-qi, WANG Wen, CHEN Miao
      2021, 43(01): 95-104.
      Small target detection identifies targets of small pixel size in images. Traditional target recognition algorithms have poor generalization ability, and general deep convolutional neural networks easily lose the features of small targets, so neither is ideal for small target recognition. To address these problems, a deep learning model for small target detection based on an attention mechanism is proposed. The model adds channel attention and spatial attention to the ResNet-101 backbone and the region proposal network: the channel attention module performs feature re-weighting in the channel dimension, and the spatial attention module focuses features in the spatial dimension, improving the capture of small targets. In addition, the model uses data augmentation and multi-scale feature fusion to ensure effective small-target feature extraction. Experiments on ship recognition in a remote sensing image dataset show that the attention modules improve small target detection performance.
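      The abstract does not specify the modules' exact form; the PyTorch sketch below shows a common (CBAM-style) realization of channel re-weighting and spatial focusing, which may differ from the paper's design.

      ```python
      import torch
      import torch.nn as nn

      class ChannelAttention(nn.Module):
          """Re-weight channels from pooled global statistics."""
          def __init__(self, c, r=16):
              super().__init__()
              self.mlp = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(), nn.Linear(c // r, c))
          def forward(self, x):                                  # x: (N, C, H, W)
              w = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
              return x * w[..., None, None]

      class SpatialAttention(nn.Module):
          """Focus spatial locations from channel-pooled maps."""
          def __init__(self, k=7):
              super().__init__()
              self.conv = nn.Conv2d(2, 1, k, padding=k // 2)
          def forward(self, x):
              s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
              return x * torch.sigmoid(self.conv(s))
      ```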





      Human motion recognition based on deformable convolutional neural network
      WANG Xue-jiao, ZHI Min
      2021, 43(01): 105-111.
      To address the low accuracy of human motion recognition in complex scenes, an improved recognition system based on the deformable convolutional network (DCN) and the deformable part model (DPM) is constructed. Firstly, the number of DPM component filters is increased from 5 to 8 and combined with the branch-and-bound method, improving accuracy by about 11% and speed by about 3 times. Secondly, DCN is used to sample points of interest according to the movement of the human body. Then, the improved DPM and DCN are fused before deformable pooling. Finally, the input is classified by a fully connected layer. Experimental results show that the method recognizes human motion more quickly and accurately on the human motion dataset.
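      The core operation a DCN adds is sampling each kernel tap at a learned fractional offset. Below is a minimal numpy sketch of that sampling step (our construction for illustration, not the paper's network).

      ```python
      import numpy as np

      def bilinear(img, y, x):
          """Sample img at fractional (y, x), assuming 0 <= y <= H-1, 0 <= x <= W-1."""
          h, w = img.shape
          y0, x0 = int(y), int(x)
          y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
          wy, wx = y - y0, x - x0
          return ((1 - wy) * (1 - wx) * img[y0, x0] + (1 - wy) * wx * img[y0, x1]
                  + wy * (1 - wx) * img[y1, x0] + wy * wx * img[y1, x1])

      def deform_sample(feat, py, px, offsets):
          """Gather a 3x3 neighborhood around (py, px) at learned fractional offsets
          (offsets: (3, 3, 2) array of (dy, dx) from a separate conv branch).
          A full deformable conv multiplies these samples by kernel weights and sums."""
          out = np.empty((3, 3))
          for i in range(3):
              for j in range(3):
                  dy, dx = offsets[i, j]
                  y = np.clip(py + i - 1 + dy, 0, feat.shape[0] - 1)
                  x = np.clip(px + j - 1 + dx, 0, feat.shape[1] - 1)
                  out[i, j] = bilinear(feat, y, x)
          return out
      ```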


      A survey of single image super-resolution reconstruction based on deep learning
      LI Bin, YU Xia-qiong, WANG Ping, FU Rui-gang, ZHANG Hong
      2021, 43(01): 112-124.
      Single image super-resolution (SISR) refers to recovering a high-resolution image from a single low-resolution image. Since deep learning entered the field, deep networks have been able to learn the mapping between low-resolution and high-resolution training images on their own, showing better reconstruction performance than traditional methods; deep learning has therefore become dominant in super-resolution. This paper examines existing deep super-resolution network models in terms of reconstruction mode, network structure, and loss function. By comparing the similarities and differences between models, it analyzes the advantages and disadvantages of different model designs and their applicable scenarios. It also compares the reconstruction results of different network models on benchmark test datasets and concludes with potential research directions.



      Unsupervised learning for face sketch-photo synthesis using generative adversarial network
      CHEN Jin-long, LIU Xiong-fei, ZHAN Shu
      2021, 43(01): 125-133.
      Research on face verification has driven the demand and interest of law enforcement agencies and the digital entertainment industry in transferring sketches to photo-realistic images. However, sketch-photo synthesis remains a significant challenge despite the rapid development of neural networks for image-to-image generation. Existing approaches still face inextricable limitations due to the lack of paired training data and the striking differences between sketches and photos. To solve this problem, a new framework is proposed to translate face sketches to photo-realistic images in an unsupervised fashion. Compared with current unsupervised image-to-image translation methods, the network leverages an additional semantic consistency loss to keep the input's semantic information in the output, and replaces pixel-wise cycle-consistency with a perceptual loss to generate sharper images for face sketch-photo synthesis. The network also employs PGGAN's generator, trained with a GAN loss for realistic output and a cycle-consistency loss that keeps corresponding input and output consistent. Experiments on two open-source datasets verify the effectiveness of the proposal in both subjective evaluation and objective metrics.
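      The loss combination the abstract describes can be sketched as follows (weights, and the pretrained feature and semantic extractors `feat` and `sem`, are hypothetical stand-ins, not the paper's values).

      ```python
      import torch
      import torch.nn.functional as F

      def generator_loss(d_fake, feat, sem, x_sketch, x_cycled, y_photo,
                         w_gan=1.0, w_perc=10.0, w_sem=1.0):
          """Illustrative combined objective:
          - GAN term pushes generated photos toward the real manifold;
          - perceptual term replaces pixel-wise cycle-consistency, comparing the
            cycled sketch to the input in a pretrained feature space;
          - semantic term keeps the input's layout in the output."""
          gan = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
          perc = F.l1_loss(feat(x_cycled), feat(x_sketch))
          semc = F.l1_loss(sem(y_photo), sem(x_sketch))
          return w_gan * gan + w_perc * perc + w_sem * semc
      ```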



      Enhancing information transfer in neural machine translation
      SHI Xiao-jing, NING Qiu-yi, JI Bai-jun, DUAN Xiang-yu
      2021, 43(01): 134-141.
      In neural machine translation (NMT), multi-layer network structures can significantly improve translation performance, but they suffer from an inherent degeneracy of information transfer across layers. To alleviate this problem, this paper proposes an information transfer enhancement method that fuses layer and sublayer information. A "retention gate" mechanism controls the weight of the fused information, which is aggregated with the output of the current layer and then serves as the input of the next layer, enabling fuller information transfer between layers. Experiments were carried out on the state-of-the-art NMT model Transformer. Results on the Chinese-English and German-English tasks show that the method improves BLEU by 0.66 and 0.42, respectively, over the baseline system.
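      One possible reading of the "retention gate" (our sketch, not the paper's exact formulation): a learned sigmoid gate mixes the fused output of earlier (sub)layers with the current layer's output, and the mixture becomes the next layer's input.

      ```python
      import torch
      import torch.nn as nn

      class RetentionGate(nn.Module):
          def __init__(self, d_model):
              super().__init__()
              self.gate = nn.Linear(2 * d_model, d_model)

          def forward(self, fused_history, layer_out):
              # g close to 1 retains more of the fused history of earlier layers.
              g = torch.sigmoid(self.gate(torch.cat([fused_history, layer_out], dim=-1)))
              return g * fused_history + (1 - g) * layer_out   # next layer's input
      ```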




      Anonymity of dynamic trajectory based on genetic algorithm
      JIA Jun-jie, QIN Hai-tao
      2021, 43(01): 142-150.
      Most existing trajectory privacy protection technologies protect the static trajectory data of moving objects, but ignore the privacy disclosure risk of their dynamic trajectories. To solve this problem, this paper studies dynamic trajectory anonymity based on a genetic algorithm. The proposed algorithm uses the genetic algorithm's global search capability to build a trajectory behavior model from the moving object's current historical trajectories, predicts the object's trajectory from this model, and continually updates the model with newly predicted trajectories, achieving higher prediction accuracy. To protect the moving object's private information, k-anonymity is used to generate dummy trajectories for each newly predicted trajectory. Experiments show that, compared with existing trajectory anonymity algorithms, the proposed algorithm protects trajectory privacy while further improving the quality of the trajectory data.
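      A generic GA skeleton of the kind such a search would use is sketched below; individuals are trajectories as lists of (x, y) points, and `fitness` stands in for the paper's unspecified anonymity/utility objective.

      ```python
      import random

      def evolve(pop, fitness, n_gen=100, cx=0.8, mut=0.05):
          """Generic GA loop: truncation selection, one-point crossover,
          waypoint-jitter mutation (all parameters illustrative)."""
          for _ in range(n_gen):
              pop.sort(key=fitness, reverse=True)
              elite = pop[: len(pop) // 2]                  # truncation selection
              children = []
              while len(children) < len(pop) - len(elite):
                  a, b = random.sample(elite, 2)
                  if random.random() < cx:                  # one-point crossover
                      cut = random.randrange(1, len(a))
                      child = a[:cut] + b[cut:]
                  else:
                      child = a[:]
                  if random.random() < mut:                 # jitter one waypoint
                      i = random.randrange(len(child))
                      x, y = child[i]
                      child[i] = (x + random.uniform(-1, 1), y + random.uniform(-1, 1))
                  children.append(child)
              pop = elite + children
          return max(pop, key=fitness)
      ```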





      Early warning of cyber-crime based on social viewpoint modeling
      WEI Mo-ji, ZHAO Yan-qing, ZHU Shi-wei, LI Chen
      2021, 43(01): 151-160.
      Cyberspace has become the "second battlefield" of security prevention and control, and providing technical support for crime early warning on this new battlefield has become one of the most important parts of security work. A modeling method based on social viewpoints is proposed by analyzing the text features of cyber-crime. Firstly, a top-level ontology of cyber-crime is constructed by extracting knowledge from domain experts' case-solving experience, and domain ontologies extending it are constructed according to policing categories. Secondly, topic crawlers collect opinions from official websites, classified by the subjects that police focus on, and the corresponding social viewpoints are built. Finally, through ontology instance reasoning, each social media post is matched to its corresponding social viewpoint, and a similarity calculation yields an early-warning judgment. Analysis of five sets of results from emotion-based and viewpoint-based cyber-crime early warning experiments shows that cyber-crime text is not emotionally sensitive, whereas the viewpoint-based modeling method can effectively predict real cyber-crimes.



      Experiment on path decision behavior in the context of social information
      YU Hao, CHEN Jian
      2021, 43(01): 161-169.
      To address the lack of quantitative analysis of how social information influences route selection, and based on behavioral experiment theory, route selection experiments are designed under three scenarios (no traffic information, partial traffic information, and complete traffic information) and implemented with the z-Tree and z-leaf software. The experiments show that: (1) with no traffic information, subjects tend to choose the shortest possible path; (2) with partial traffic information, subjects tend to choose the best road section from the current node, and path selection results are better overall than with no information, although some individuals' route selection takes longer; (3) with complete traffic information, both the overall and the individual path choices reach the best outcome.





      An elitist-archive-based differential evolutionary algorithm for multi-objective clustering
      ZHANG Ming-zhu, CAO Jie, WANG Bin
      2021, 43(01): 170-179.
      Determining the number of clusters is a basic yet challenging problem in clustering analysis. On one hand, the optimal number of clusters varies according to different evaluation criteria, user preferences, or demands, so it makes sense to provide the user with multiple clustering results for different numbers of clusters. On the other hand, increasing the number of clusters without any penalty usually improves within-cluster compactness while deteriorating between-cluster separation. Selecting an appropriate number of clusters is therefore, in fact, a multi-objective optimization problem: choosing a balanced solution among a set of tradeoffs between minimizing the number of clusters and maximizing the compactness or separation of clusters. Accordingly, to deal with clustering problems with an unknown number of clusters, we directly take the number of clusters as one optimization objective and simultaneously optimize it against another objective reflecting within-cluster compactness, using a newly designed multi-objective differential evolutionary algorithm with an elitist archive. The proposed algorithm obtains a nearly Pareto-optimal set, containing multiple clustering results for distinct numbers of clusters, in a single run. Experiments on several datasets and comparative experiments demonstrate the practicability and effectiveness of the proposed algorithm.
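      A minimal sketch of the algorithmic skeleton (DE/rand/1/bin plus an elitist archive of non-dominated solutions) is shown below; the real algorithm's clustering encoding, archive pruning, and parameters are not specified in the abstract, so these are our assumptions.

      ```python
      import numpy as np

      def dominates(f1, f2):
          """Minimization: f1 dominates f2 if no worse everywhere, better somewhere."""
          return all(a <= b for a, b in zip(f1, f2)) and any(a < b for a, b in zip(f1, f2))

      def de_with_archive(objective, pop, n_gen=200, F=0.5, CR=0.9):
          """DE/rand/1/bin with an elitist archive. `objective(x)` returns a tuple
          such as (number_of_clusters, compactness), both minimized, following
          the abstract's two objectives."""
          archive, arch_f = [], []
          for _ in range(n_gen):
              for i in range(len(pop)):
                  a, b, c = pop[np.random.choice(len(pop), 3, replace=False)]
                  trial = np.where(np.random.rand(pop.shape[1]) < CR,
                                   a + F * (b - c), pop[i])     # mutation + crossover
                  ft = objective(trial)
                  if dominates(ft, objective(pop[i])):
                      pop[i] = trial                            # greedy replacement
                  if not any(dominates(f, ft) for f in arch_f): # update elitist archive
                      keep = [k for k, f in enumerate(arch_f) if not dominates(ft, f)]
                      archive = [archive[k] for k in keep] + [trial.copy()]
                      arch_f = [arch_f[k] for k in keep] + [ft]
          return archive, arch_f
      ```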