Computer Engineering & Science

Development status analysis of China HPC 2018

YUAN Guoxing1,ZHANG Yunquan2，YUAN Liang2

2018, 40(12): 2097-2102. doi:

Abstract ( 147 )

PDF (857KB) ( 314 ) 　　

According to the latest China High Performance Computer (HPC) TOP100 rank list released by CCF TCHPC in late October 2018, we discuss and analyze the development status of domestic HPCs from the aspects of overall performance, manufacturers, application areas, and deployment sites. We also provide an outlook for the future development of domestic HPC.

A GPU-based high-performance optimization method

of sparse convolutional neural networks

FANG Cheng,XING Zuocheng,CHEN Xuhao,ZHANG Yang

2018, 40(12): 2103-2111. doi:

Abstract ( 342 )

PDF (1126KB) ( 339 ) 　　

Review attachment

As an important branch of neural networks, the convolutional neural network (CNN) is currently more suitable for learning and expressing image features than other neural network methods. With the continuous development of the CNN, there are more challenges. The parameters scale of the CNN is growing larger, which makes the demand for computation enormous. There are many ways to compress CNN scale, however, the compressed CNN usually introduces a number of sparse data structures. These sparse data structures can hurt the performance of the CNN on GPU. In order to solve this problem, we adopt the direct sparse convolution algorithm proposed in 2017 to accelerate GPU’s processing of sparse data. According to the characteristics of this algorithm, we transform convolution operation into an inner product of the sparse vector and dense vector on GPU platform. Our optimization makes full use of the sparse data and network structure to allocate threads for task scheduling, and uses data locality to manage memory replacement. It enables the GPU to deal with the operation on the convolution layer efficiently in the sparse CNN. Compared with the cuBLAS, our proposal achieves a speedup of 1.07×~1.23×, 1.17×~3.51×and 1.32×~5.00× on AlexNet, GoogleNet and ResNet respectively. Compared with the cuSPARSE, our method achieves a speedup of 1.31×~1.42×, 1.09×~2.00×and 1.07×~3.22× on AlexNet, GoogleNet, and ResNet respectively.

Design and implementation of a

Docker dynamic scheduling algorithm

LIU Shuyang，ZHANG Manyi，CAO Qiang

2018, 40(12): 2112-2119. doi:

Abstract ( 322 )

PDF (968KB) ( 328 ) 　　

Review attachment

As a container’s runtime infrastructure, Docker can efficiently deploy, execute, and manage containers. However, the existing Dockercontainer management mechanism based on prerun static configuration cannot dynamically allocate resources according to the characteristics of different application categories and runtime resource requirements. Therefore, based on the analysis of experiments on the resource usage and performance of Docker when running different workloads, we design and implement a runtimebased Docker dynamic scheduling algorithm to prioritize realtime application container service while ensuring the performance of batchapplication containers. The algorithm can also recommend the most suitable application container that can be created according to current running status of nodes, thus maximizing the resource utilization of nodes. Experiments show that this algorithm does not introduce significant performance overhead. When resource competition occurs among containers, the service time for realtime application containers can be prolonged by 87.5%, with a 2.9% performance overhead at most for the batchapplication containers that running at the same time. The algorithm recommendation mechanism increases the number of container instances that can run on nodes by 2.3 times, resulting in a maximum of 9.3% performance loss for batchapplication containers.

Fault recovery evolution technique based on FPGA

WANG Jie，KANG Junjie，ZHOU Kuanjiu

2018, 40(12): 2120-2125. doi:

Abstract ( 137 )

PDF (574KB) ( 265 ) 　　

Review attachment

The selfrepairing feature of evolution hardware can effectively recover repairable faults of circuit systems. Due to the slow rate and low success rate in circuit evolution process, how to accomplish the evolution within the time constraint becomes a major challenge. We propose a realtime faulttolerant system based on evolution hardware, in which fault analysis trees are used to monitor circuit faults in real time，a fault compensation mechanism to maintain the normal operation of the system, and the evolution hardware technique is adopted to repair circuit faults so as to realize online realtime repair of faults. We test the faulttolerant system on FPGA through random fault injection. Several evolutionary algorithms are used to verify the selfrepairing capability of the faulttolerant system. The results show that the repair rate of the fault circuit can reach 95% under the realtime constraint and the stability and reliability of the system are improved.

An FPGA-based HEVC post-processing

CNN hardware accelerator

XIA Jun1，Qian Lei2，YAN Wei3，CHAI Zhilei1

2018, 40(12): 2126-2132. doi:

Abstract ( 214 )

PDF (642KB) ( 372 ) 　　

Review attachment

Aiming at the shortcomings of the post-processing CNN algorithm running on the common platform according to the high-efficiency video code standard, we propose a postprocessing convolutional neural network hardware parallel architecture based on field programmable gate array (FPGA) to improve the overall parallelism of the convolution module and the hardware flow of the module by optimizing the concurrent data input and output buffering process. Experiments on 176×144 video streams on the Xilinx ZCU102 show that the proposed CNN hardware accelerator can achieve an equivalent computational performance of 360.5G floating-point operation per second. The computation speed can satisfy 81.01 FPS, which is 76.67 times faster than that of the Intel i7-4790K with a clock frequency of 4Ghz. The speedup is 32.50 times faster than the NVIDIA GeForce GTX 750Ti. In the calculation of energy efficiency ratio, the proposal’s power consumption is 12.095W, 512.9 times of that of the Intel i74790K and 125.78 times that of the NVIDIA GeForce GTX 750Ti.

A tree topology generation algorithm for

wireless content distribution networks

HUANG Hongtao1,WU Jigang1,ZHENG Lulu1,MIAO Yuqing2

2018, 40(12): 2133-2140. doi:

Abstract ( 200 )

PDF (680KB) ( 342 ) 　　

In the wireless content distribution network, network topology can be constructed as a number of minimum spanning trees rooted at the base station and the WiFi access points, and the depth of spanning trees and the degree of each node are limited to reduce the transmission pressure of the backbone network. This minimum spanning tree problem with depth and degree constraints is an NP-complete problem. Aiming ta this problem, we propose an efficient heuristic algorithm to generate high-quality approximate spanning trees. The proposed heuristic algorithm constructs spanning trees by selecting edges with lowest weight and connected to the current spanning trees from the edges connected to service nodes, and adding them to the spanning trees without violating the depth and degree constraints. Then, the approximate solution is further optimized by using the customized tabu search algorithm and simulated annealing algorithm. Experimental results show that the tabu search algorithm is more efficient than existing genetic algorithms under the given constraints. For the cases that the depth constraint is 4 and the degree constraint is 10, the solution of the genetic algorithm can be improved up to 18.5%. The proposed algorithms run 10 times faster than the genetic algorithm.

Design and implementation of a 4-channel

Loongson-3B server based on CC-NUMA architecture

ZHANG Peng

2018, 40(12): 2141-2145. doi:

Abstract ( 213 )

PDF (545KB) ( 341 ) 　　

Review attachment

Focusing on highperformance computing, highbandwidth communications, and autonomous and controllable requirements of servers in specific areas, we analyze the characteristics of Loongson3B3000 processor architecture, and design a 4way Loongson3B3000 highperformance server core module based on CCNUMA parallel processing architecture. Thanks to the TOE chip, the network response efficiency is improved, and the consumption of processor resources of the 10G Ethernet interface is greatly reduced, so the overall performance of the server is effectively improved. Testing and verification show that the proposed server can achieve high-performance parallel computing capability and 10G Ethernet communication capability, and domestic components can reach more than 95%.

A low computational optimization method for

spatiotemporal voxel encoding and decoding storage

GU Qinghua,MA Long,LU Caiwu

2018, 40(12): 2146-2155. doi:

Abstract ( 138 )

PDF (1040KB) ( 329 ) 　　

Review attachment

Regarding the large storage space occupied by the encoding and decoding data of spatiotemporal grid voxels, we propose a low computational method to store the encoding and decoding of spatiotemporal voxels based on hexadecimal tree index structure, and establish a mathematical model of encoding and decoding to realize the identification and location index of voxels. We employ the automatic encoding and decoding method of 3DGIS to achieve the conversion of encoding and decoding storage representation of spatiotemporal voxels. Secondly, using the Galois finite field theory, we build a binary encoding and decoding matrix of grid voxels as well as a low computational optimization algorithm to realize calculation optimization in the storage process of voxel encoding and decoding. Finally, taking the deposit block data of a mine as an example, the grid voxel codec model, storage representation conversion and low computational optimization algorithm are put into practical application. The proposal is compared with the Morton code of the octree index structure. The results show that the proposal can effectively reduce computational cost of encoding and decoding storage by about 30%, as well as the spatiotemporal efficiency for storing grid voxels.

A multilevel dynamic trusted measurement

model based on information flow

ZE Kai1，CHEN Dan1,2，ZHUANG Yi1

2018, 40(12): 2156-2163. doi:

Abstract ( 176 )

PDF (1156KB) ( 337 ) 　　

System runtime environment and multiple external factors together with internal multientity information flow mutual interference can break system credibility, and result in unexpected outputs. Existing research mainly aims at the integrity measurement of entities under the initialized trusted hardware environment, failing to consider the trusted influence brought by the confidentiality, and the frequency of the trusted measurement of entities cannot be synchronized with the progress. We propose a multilevel dynamic trusted measurement model based on information flow theory. By using the basic idea of intransitive noninterference theory of information flow as reference and introducing a trusted proxy module, we design a multilevel security access control policy, hence the trusted measurement of entities can be measured dynamically from aspects of entity integrity and confidentiality. We describe the formal description and trusted proof of the model and verify the model through an abstract system example. Compared with existing research, it has a better realtime measurement performance, and it is a contextaware fine-grain trusted measurement model.

A DDoS attack detection method based on

HMM time series prediction and chaos model

DONG Zhe1，TANG Xiangyan1，CHENG Jieren1,2，ZHANG Chen1，LIN Fusheng1

2018, 40(12): 2164-2172. doi:

Abstract ( 234 )

PDF (643KB) ( 419 ) 　　

The distributed denial of service (DDoS) attack is one of the most destructive attacks in the network environment. Existing attack detection algorithms based on machine learning often use the eigenvalues of a time to be classified to perform classification. However, the correlation with the features of its adjacent time is not taken into account. The false positive rate and false negative rate therefore are high. We propose a DDoS attack detection method based on hidden Markov model (HMM) time series prediction and chaos model. Aiming at the burstiness of mass attack traffic, we firstly define the network traffic weighted features (NTWF) and network flow average rate (NFAR) to describe the features of network traffic. Then, we use the hierarchical clustering algorithm to classify training sets to get the hidden layer state (HLS) sequences. We employ the NTWF sequence and HLS sequence to conduct supervised learning of the HMM, and predict the NTWF sequence by the state transition matrix and confusion matrix obtained before. Finally, we analyze the prediction error of NTWF sequences by the chaotic model, which is combined with the NFARbased rules, to distinguish attack behavior. Experimental results show that compared with similar methods, the propose method has lower false positive rate and false negative rate.

A flow analysis and planning method

based on dynamic DCVS network model

WU Bo1,WANG Bin1,XUE Jie2,LIU Hui1,XIONG Xin1

2018, 40(12): 2173-2182. doi:

Abstract ( 132 )

PDF (1315KB) ( 314 ) 　　

In order to solve the resource waste problem caused by lack of flow planning in the destination coded vehicle system (DCVS), we present a flow analysis and planning method based on dynamic DCVS network model. Firstly, we build a static DCVS network model using graph theory. Then, according to the package flow demands during a particular time period the flow analysis and planning method based on minimum time and maximum flow algorithm is applied to realize the dynamic DCVS resources scheduling. Experimental results show that this method can accomplish the overall flow analysis and planning of DCVS in a very short period, and achieve a reliable prediction for remote airport baggage transportation, thus providing an effective reference for DCVS resource control.

Formal analysis and improvement of RCIA:

An ultra-lightweight RFID- mutual authentication protocol

ZHONG Xiaomei，XIAO Meihua，LI Wei，CHEN Jia，LI Yanan

2018, 40(12): 2183-2192. doi:

Abstract ( 209 )

PDF (804KB) ( 358 ) 　　

Radio frequency identification (RFID) is a noncontact automatic identification technology in the Internet of things. It is widely used in the construction of RFID system for the interconnection of things. RCIA is an ultralightweight RFID mutual authentication protocol (UMAP), which provides high security and claims to be resistant to desynchronization attack. Formal method is a powerful tool to analyze the security of cryptographic protocols. We analyze the RFID tag authentication protocol RCIA based on formal method. Model checker SPIN is used to verify the authenticity and consistency of the RCIA protocol. The results show that the RCIA protocol is vulnerable to synchronization attack, for which we propose a patching scheme based on the key synchronization mechanism to improve the protocol. Formal analysis and verification results of the improved protocol show that the improved RCIA protocol has higher security. An abstract protocol modeling method we proposed in this paper is used to build an abstract protocol model of RCIA protocol formally, which has important reference for the formal analysis of such ultralightweight RFID mutual authentication protocol. The proposed vulnerability patching scheme based on key synchronization is proved to be effective against desynchronization attack, and it can be applied to the design and analysis of such ultralightweight RFID mutual authentication protocol.

Analysis and improvement of an ID-based

partially blind signature scheme

CAO Suzhen1,DAI Wenjie1,WANG Caifen1,WANG Xiuya1,SUN Han1,ZUO Weiping2

2018, 40(12): 2193-2197. doi:

Abstract ( 122 )

PDF (423KB) ( 284 ) 　　

Partially blind signature is designed to adress the contradiction between anonymity and controllability, and it can protect the privacy of users and trace user identity when it is necessary. The problem is that the public information can be tampered by malicious parties, which exists in partially blind signature schemes based on identity. Security analysis of the Liu scheme shows that the user can modify public information illegally. On this basis, we propose an improved IDbased partially blind signature. Based on the discrete logarithm problem, this scheme can satisfy the requirement of partial blindness while being capable of resisting against existential unforgeability attack of the adaptive chosen message under the random oracle model. The new scheme does not use the bilinear pairing operation with higher computational cost, and avoids public information tampering. Compared with existing schemes, it improves security and efficiency significantly.

A monocular-SLAM algorithm based on

fusion of line feature and optical flow

JIA Zhe，LENG Jianwei

2018, 40(12): 2198-2204. doi:

Abstract ( 253 )

PDF (736KB) ( 365 ) 　　

In order to solve the problem of localization and mapping of mobile robots, we propose a new simultaneou localization and mapping (SLAM) algorithm based on the fusion of line feature and optical flow. Firstly, the mainstream visual SLAM algorithms use points as features, resulting in that the point cloud map is sparse and it is difficult to accurately express the environmental structure information. Aiming at this problem, we take straight lines instead of points as a feature to construct the map, and employ the graph optimization method to improve the accuracy of localization and map construction. Secondly, the processing speed of the localization system cannot achieve realtime requirement, we introduce the optical flow method to realize realtime localization. Experimental results demonstrate that the map construction based on line features has higher mapping accuracy, and the fusion algorithm overcomes the shortcomings of poor localization accuracy of the optical flow and the low processing speed of the feature matching method, providing more accurate realtime localization output. And it is robust to circumstances such as illumination change and low scene texture.

A multi-scale wide line detection method

#br# QU Zhiguo,TAN Xiansi,LIN Qiang,WANG Hong,ZHANG Wei

QU Zhiguo,TAN Xiansi,LIN Qiang,WANG Hong,ZHANG Wei

2018, 40(12): 2205-2210. doi:

Abstract ( 164 )

PDF (651KB) ( 303 ) 　　

We propose a multiscale wide line detector based on isotropic nonlinear filtering to overcome the shortcomings of the basic wide line detector. The basic detector extracts wide lines using a single-scale circular mask, as a result, it can only detect lines thinner than the mask radius and has poor detection performance at intersections of two wide lines. Different from that, the multiscale wide line detector computes the response with several circular masks of different radii, and chooses the maximum response among the normalized detection results on different scales. Finally, poststeps such as thresholding and filtering are designed to remove interference and obtain the final results. Simulated and real images are adopted for performance test of the multiscale wide line detector. Experimental results demonstrate that it outperforms the basic wide line detector and can effectively detect wide lines in images.

A clearness algorithm for smog images

based on dark channel prior

MA Xiao,SHAO Limin,XU Guanlei

2018, 40(12): 2211-2218. doi:

Abstract ( 175 )

PDF (1170KB) ( 417 ) 　　

The traditional dehazing algorithm based on dark channel prior is stable and has good haze removal effect, but its running time is very long. To balance the operation time and the clearness of the image, we propose a new clearness algorithm for smog images based on the traditional dark channel prior based de-hazing algorithm. The clearness algorithm replaces the fixed size of local area in the traditional dehazing algorithm by a changing value that varies with the size of the image when seeking the dark channel of the original smog image, thus enhancing the adaptability of the algorithm. We set a threshold for the atmospheric light to prevent the overall image from transitioning to the white field due to the too high estimated value of atmospheric light. We use the guided filtering algorithm instead of the soft matting algorithm to improve algorithm efficiency. Finally, we employ the auto-color algorithm to adjust the color lighting distribution of the dehazed image. Experimental results show that the image outputted by the proposed algorithm has good contrast and clearness, high color fidelity, and reasonable lighting distribution. Besides, the algorithm is stable and has short running time, thus realizing a balance between running time and clearness of the image.

A defect detection method

based on saliency discrimination

ZHOU Zhou,HAUNG Qian,HU Zhihui

2018, 40(12): 2219-2223. doi:

Abstract ( 165 )

PDF (514KB) ( 348 ) 　　

X ray imaging has many problems such as noise, penumbra and scattering, which results in blurred edge of the image and the uneven background gray level and seriously affects the accuracy of defect identification. We propose a new defect detection method for Xray images based on LoG edge detection and local contrast to carry out saliency discrimination. On the basis of the salient edge detection based on the double thresholds of LoG edge detection, the local background of undetermined defects is obtained by the isotropic diffusion method. We also set a third threshold by utilizing the contrast saliency between undetermined defects and the local background to remove the false defects, so that the defects can be accurately extracted and the contour and area of defects can be determined at the same time. Experimental results show that the algorithm has high accuracy for defect identification and can be used for online real-time detection systems.

Multi-strategy candidate construction and entity linking

YANG Ziyi,SHENG Chen,KONG Fang,ZHOU Guodong

2018, 40(12): 2224-2233. doi:

Abstract ( 126 )

PDF (801KB) ( 355 ) 　　

Aiming at the candidate set construction problem in entity linking systems, we propose a multistrategy candidate set construction method. We use a variety of strategies to extract the complete mentions in the context, reduce the number of candidates and improve the recall of the correct entity in order to construct a high quality candidate set. We conduct experiments on the TAC2014 English entity linking corpus and then analyze the results. The optimal construction strategy is chosen. Experimental results prove that the proposed method can improve the recall and precision of candidate sets. We further validate that the quality of the candidate sets has a significant effect on the performance of the whole entity linking system. Compared with the baseline algorithm, the candidate set extracted by the optimal candidate set construction strategy can improve the performance of the whole entity linking system by 3.7%.

An uncertain ant colony clustering algorithm

based on approximate backbone

for landslide hazard prediction

LIU Weiming1,2,LI Zhongli1,MAO Yimin1

2018, 40(12): 2234-2242. doi:

Abstract ( 113 )

PDF (831KB) ( 442 ) 　　

The uncertain factor rainfall is hard to accurately handle and the ant colony clustering algorithm is easy to get caught into suboptimal solution and the searching speed is low in searching space. In order to improve the prediction accuracy of landslide hazard, we propose an uncertain ant colony clustering algorithm based on approximate backbone. Firstly, it utilizes the Gauss point probability model to describe the uncertain data and measure their similarity. Secondly, we introduce the pheromone redistribution and adaptive dynamic variables to update the local pheromone and global pheromone for improving the algorithm's searching speed, and load the genetic algorithm to prevent it from falling into local optimum early. Finally, combining the approximate backbone theory, we build an uncertain ant colony clustering algorithm model based on approximate backbone, which reduces the iteration times and obtains the clustering solution rapidly. Experiments on UCI true datasets and landslide experiment datasets of the Baota district of Yan'an show that the proposed method achieves a higher clustering quality and the prediction accuracy reaches 93.3%, which verifies its feasibility.

An adaptive table compression method

for STR algorithms optimization

LI Shaoxing,LI Zhanshan,YU Haihong

2018, 40(12): 2243-2251. doi:

Abstract ( 94 )

PDF (578KB) ( 262 ) 　　

Table constraint, also called extensional constraint, is the most general form of finite domain constraint in the field of constraint programming. Table compressing methods can greatly reduce space consumption by compact representation of tuple set and accelerate the GAC algorithm. Cartesian product representation and short support are the two most common table compression methods in table constraint. We find that the compression ratio of the two table compression methods on the same problem is the main factor that affects their optimization results, and propose an adaptive table compression method based on STR algorithm. It adaptively selects the table compression method with larger compression ratio when solving the problem. We apply the adaptive table compression method to the STR2 algorithm, and propose an STR2-adaptive algorithm, which can cover the advantages of the two table compression methods. Experimental results show that the STR2-adaptive algorithm can adaptively select the best table compression method on most instances and effectively reduces space cost and CPU running time of the STR2 algorithm. Then we extend the adaptive table compression method to the STRbit algorithm using efficient bit vector representation, and we propose the STRbit-adaptive algorithm. Experimental results show that the efficiency of STRbit-adaptive algorithm generally has higher efficiency than that of the STRbit algorithm.

Current Issue

Author center

Review center

Online journal