  • A journal of the China Computer Federation (CCF)
  • A Chinese science and technology core journal
  • A Chinese core journal

Current Issue

    • 2020, No. 10 (HPC Special Issue), published 25 October 2020
    • A regional shared and high concurrent storage architecture based on NVMeoF storage pool
      LI Qiong, SONG Zhen-long, YUAN Yuan, XIE Xu-chao
      2020, 42(10, HPC Special Issue): 1711-1719.

      In the era of exascale computing and big data, High Performance Computing (HPC) systems have been widely deployed as the infrastructure for big data analytics in order to leverage their parallel computing capabilities. As the I/O patterns in HPC systems grow increasingly complicated and heterogeneous, breaking through the I/O bottleneck is a challenging and urgent task. In recent years, flash-based storage arrays and storage servers have been gradually deployed in HPC storage systems. However, conventional shared storage architectures, I/O software stacks, and storage networking designs were primarily built for Hard Disk Drives (HDDs), which induces severe overhead in the I/O path and prevents HPC storage systems from taking full advantage of the performance benefits of Non-Volatile Memory (NVM). To achieve low I/O latency, high concurrent I/O throughput, and high burst I/O bandwidth, this paper proposes a regional shared and highly concurrent storage architecture. We design an NVMeoF-based burst I/O storage pool (NV-BSP), which implements key techniques such as virtualized storage pool resource management and NVMeoF network storage communication over the Tianhe high-speed interconnect. It scales both horizontally and vertically, and can effectively support burst I/O acceleration and low-latency remote access for specific computing tasks. In addition, we propose a Quality-of-Service (QoS) control strategy for storage systems running mixed HPC and big data applications. Experimental results on a prototype system show that NV-BSP achieves scalable write performance as the number of I/O handling threads increases. Compared with the built-in MD-RAID in Linux, NV-BSP obtains higher I/O bandwidth. Compared with a node-local storage pool, the I/O latencies of NVMeoF-based remote storage increase by only 59.25 μs for reads and 54.03 μs for writes. By disaggregating storage from computation, NV-BSP significantly improves system scalability and reliability while delivering performance comparable to local storage.
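The abstract above does not specify the QoS mechanism itself; one common building block for bandwidth QoS in mixed-workload storage systems is a per-application token bucket. The sketch below is a hypothetical illustration of that general idea, not NV-BSP's actual strategy; the class name, parameters, and rates are invented.

```python
import time

class TokenBucket:
    """Hypothetical per-application bandwidth limiter: tokens are bytes of
    I/O credit, refilled at rate_bps and capped at `burst` bytes."""

    def __init__(self, rate_bps, burst, now=time.monotonic):
        self.rate = rate_bps
        self.burst = burst
        self.tokens = burst        # start with a full burst allowance
        self.now = now             # injectable clock, useful for testing
        self.last = now()

    def try_submit(self, nbytes):
        """Admit an I/O request of nbytes if enough credit has accrued."""
        t = self.now()
        self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

With `rate_bps` set per application class, a burst is admitted immediately up to `burst` bytes, while sustained traffic is throttled to the configured rate.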


      Implementation of scalable communication framework on TH-express interconnection
      XIE Min, ZHANG Wei, ZHOU En-qiang, DONG Yong
      2020, 42(10, HPC Special Issue): 1720-1729.
      Open-source communication frameworks define standard communication APIs between the parallel programming model and the interconnection network, providing high-performance communication operations that are independent of the characteristics of the underlying network. Their purpose is to improve the efficiency of developing programming models on new interconnection networks. The performance and scalability problems of communication frameworks on the TH-express interconnect are solved through the design and implementation of new multi-channel data transfer protocols. Performance tests show that the open-source communication frameworks have low software overhead and provide high-performance data transfer very close to the design performance of the TH-express interconnect. This lays a good foundation for efficiently supporting parallel programming models and distributed computing frameworks on TH-express.

      Performance evaluation of large-scale HPC interconnection network topologies
      JIANG Ju-ping, DONG De-zun, TANG Hong, QI Xing-yun, CHANG Jun-sheng, PANG Zheng-bin
      2020, 42(10, HPC Special Issue): 1730-1736.
      The interconnection network is one of the most important infrastructures in High Performance Computing (HPC) systems. Network topologies determine the scalability of HPC interconnection networks. In this paper, we study several typical HPC topologies and conduct extensive performance evaluation using our in-house large-scale interconnection network simulator. We compare the performance of different topologies under varying configuration settings, including traffic patterns and routing algorithms.



      A fusion network architecture for computing and interconnection
      LU Ping-jing, LAI Ming-che, WANG Bo-chao, CHANG Jun-sheng
      2020, 42(10, HPC Special Issue): 1737-1741.
      A fusion network architecture for computing and interconnection (FIC) is proposed, which integrates the network interface into the processor core via an interconnection bus. In FIC, an XBAR structure with single-cycle forwarding (in the absence of packet conflicts) is first designed, which is suitable for diverse routing algorithms and traffic workloads. Secondly, a link-layer protocol is designed for reliable packet transmission, and a scrambling coding scheme is designed for the PCS to enhance signal transmission quality and reduce latency by increasing bit-stream transition density. Experimental results show that, compared with a traditional network interconnected by PCI-E, FIC improves bandwidth by 30% and reduces transmission latency by 16.7%. FIC possesses the advantages of high bandwidth and low latency, and provides a feasible solution for deeply fused interconnection network systems.
      The review of state-of-the-art processor architectures for high performance computing
      WANG Yao-hua, GUO Yang
      2020, 42(10, HPC Special Issue): 1742-1748.
      High Performance Computing (HPC) delivers huge processing performance and plays a key role in national welfare and people's livelihood. As the source of computation power, high performance processors largely determine the overall performance of an HPC system, and are being intensively studied in many countries to maintain a competitive edge in the HPC domain. This paper reviews state-of-the-art processor architectures from mainstream processor designers such as NVIDIA, Intel, and AMD. We carry out the review in terms of computation resource organization, memory subsystem design, and the interconnect technology among cores. Based on this analysis, we summarize the mainstream trends in high performance processor design. We believe this paper can serve as a reference for future research on HPC-oriented processor designs.


      A data queue scheduling method supporting multi-priority and multi-output channels and its hardware implementation 
      XU Jin-bo, CHANG Jun-sheng, LI Yan
      2020, 42(10, HPC Special Issue): 1749-1756.
      To meet the requirements of mapping multiple input data streams to multiple output channels in ASIC (Application-Specific Integrated Circuit) design, this paper proposes a data queue scheduling method that supports multiple priorities and multiple output channels. Firstly, the proposed method can be used in a wide range of applications: it can either achieve load balancing by scheduling in a random mode, or differentiate Quality of Service (QoS) by configuring different priorities. In the random mode, multiple output channels in the idle state receive input data in round-robin fashion. In the QoS mode, all input sources and output channels are divided into different priorities, so that a given output channel only receives data from input sources of the corresponding priority. Secondly, the method has a low hardware implementation cost, because multiple output channels share a single arbiter instead of using multiple individual arbiters. The proposed method is applied in the network interface chip design of the Tianhe supercomputer system to optimize data queue scheduling for the software/hardware interface, and the design is tested in the verification environment. Current test results show that, compared with a traditional single-output queue scheduler, the proposed method increases the scheduling time cost by only 3‰ to 2% and the hardware resource cost by about 1.5%, but achieves a two-fold speedup in processing direct memory read transactions. Meanwhile, when configured in QoS mode, the execution time ratio between high-priority and low-priority threads is about 1∶3, and the configuration remains flexible.
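The two scheduling modes described above can be modeled in a few lines of software. The sketch below is an illustrative model only (the paper's design is a hardware arbiter shared by the output channels); the class and channel names are invented, and the round-robin mode is simplified to ignore channel idle/busy state.

```python
from collections import deque
from itertools import cycle

class QueueScheduler:
    """Software model of the two modes: with priorities=None, inputs are
    load-balanced round-robin across channels; with a channel->priority
    map, each channel accepts only inputs of its own priority (QoS mode)."""

    def __init__(self, channels, priorities=None):
        self.channels = list(channels)
        self.priorities = priorities            # None => round-robin mode
        self._rr = cycle(self.channels)
        self.output = {c: deque() for c in self.channels}

    def dispatch(self, item, priority=None):
        """Route one input item to an output channel."""
        if self.priorities is None:
            # Load-balancing mode: rotate over the output channels.
            self.output[next(self._rr)].append(item)
            return
        # QoS mode: only a channel of matching priority accepts the item.
        for c in self.channels:
            if self.priorities[c] == priority:
                self.output[c].append(item)
                return
        raise ValueError(f"no channel for priority {priority}")
```

Sharing one dispatcher across all channels mirrors the paper's single-arbiter design point, as opposed to one arbiter per output queue.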


      Performance model and experimental implementation of entangled photon source
      WANG Dong-yang, WU Jun-jie, LIU Ying-wen, YANG Xue-jun
      2020, 42(10, HPC Special Issue): 1757-1764.

      Quantum computing is a frontier research field in high performance computing, and the photonic system is one of the important ways to implement it. In optical quantum computing, the entangled photon source generates the photons that encode quantum information, and its performance directly affects the quality and quantity of quantum bits. This paper systematically studies the entangled photon source based on Spontaneous Parametric Down-Conversion (SPDC), which is the most popular technique for this purpose, analyzes a performance model for the purity and brightness of entangled photons generated by type-I SPDC, and derives the key condition for obtaining multi-photon entanglement from two-photon entanglement. Using barium metaborate crystals, this paper also realizes two-photon entanglement based on type-I SPDC and verifies the performance model experimentally. The model and the experimental results provide theoretical and technical support for generating more quantum bits in optical quantum computing systems.
      Improving the performance of BeeGFS parallel file system
      SONG Zhen-long, LI Xiao-fang, LI Qiong, XIE Xu-chao, WEI Deng-ping, DONG Yong, WANG Rui-bo
      2020, 42(10, HPC Special Issue): 1765-1773.
      As we embark on a new era of big data and Artificial Intelligence (AI), supercomputing centers and data centers raise an ever-increasing demand for high-performance storage systems, from petabyte scale to exabyte scale. In recent years, High-Performance Computing (HPC) systems have been widely used for big data and AI applications. The I/O patterns of emerging AI applications are characterized by small batched file accesses, which makes HPC storage system design increasingly complicated. The Parallel File System (PFS), primarily designed for bandwidth-oriented applications, is one of the most effective ways to manage data for HPC systems. However, existing PFSs are not capable of providing high performance for AI applications. This paper focuses on investigating and improving the performance of BeeGFS, an emerging PFS for HPC systems. We propose a Key-Value (KV)-based metadata management module to improve the IOPS of metadata accesses, introduce asynchronous I/O and multi-threading technologies into the parallel I/O processing module to improve I/O concurrency, and employ a multi-track communication mechanism to increase networking bandwidth. Our experimental results show that the modified BeeGFS significantly improves the performance of both metadata and data accesses, achieving up to 2 times the score of the original BeeGFS under the IO500 benchmark.
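The abstract does not give the KV schema used for metadata; one common layout, shown hypothetically below, keys each directory entry by (parent inode, name), so lookup and create become single KV operations and readdir becomes a prefix scan. All names here are invented, and a production system would use an on-disk KV engine (e.g. an LSM-tree) rather than a Python dict.

```python
class KVMetadataStore:
    """Toy key-value metadata table: key = (parent_inode, name),
    value = attribute dict. A dict stands in for the real KV engine."""

    def __init__(self):
        self.kv = {}
        self.next_ino = 2          # inode 1 reserved for the root directory

    def create(self, parent_ino, name, **attrs):
        key = (parent_ino, name)
        if key in self.kv:
            raise FileExistsError(name)
        ino = self.next_ino
        self.next_ino += 1
        self.kv[key] = {"ino": ino, **attrs}    # one KV put per create
        return ino

    def lookup(self, parent_ino, name):
        return self.kv[(parent_ino, name)]       # one KV get per lookup

    def readdir(self, parent_ino):
        # Range scan over keys sharing the parent prefix.
        return sorted(n for (p, n) in self.kv if p == parent_ino)
```

Because each metadata operation maps to one KV access, IOPS is bounded by the KV engine rather than by a POSIX directory walk, which is the general motivation for KV-based metadata designs.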



      A survey on the process management interface for exascale computing systems
      ZHANG Kun, ZHANG Wei, LU Kai, DONG Yong, DAI Yi-qin
      2020, 42(10, HPC Special Issue): 1774-1783.
      With the continuous development of high-performance computing, system scale keeps increasing, and the number of nodes and processor cores has expanded to a new level. On hyperscale systems, the startup time of parallel applications becomes an important factor that limits operating efficiency and reduces ease of use. The process management interface deploys a communication channel for each process during the parallel application startup phase, for the subsequent communication of the processes. In exascale systems, traditional process management interfaces cannot quickly obtain communication information at the startup phase, resulting in long startup times and reduced system performance. We first introduce the role of the process management interface in the parallel program startup process, then focus on PMIx, the process management interface for exascale systems: we compare and discuss its role in improving the startup of large-scale parallel programs, analyze its optimizations for system performance, and discuss future development directions.


      A memory initialization optimization algorithm for operating systems
      HE Sen, CHI Wan-qing
      2020, 42(10, HPC Special Issue): 1784-1790.
      During the startup of the operating system kernel, the initialization of NUMA memory nodes with large holes causes a significant time loss. Especially when the kernel is started on a low-frequency simulation platform, this effect becomes more pronounced and the time loss is greatly enlarged. In response to this problem, a NUMA node initialization optimization algorithm is proposed. The algorithm identifies and skips memory holes during memory node initialization, achieving efficient memory initialization during kernel startup. A comparative experiment between this algorithm and the current kernel initialization algorithm shows that it significantly improves the initialization speed of NUMA nodes with huge memory holes, thereby increasing the startup speed of the operating system.
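The core idea — walking the firmware memory map and touching only usable RAM, instead of scanning the node's whole address span page by page — can be sketched schematically. This is an illustrative model, not the kernel patch itself; the region-list format and names are invented.

```python
# Schematic model of hole-skipping initialization during NUMA node setup.
PAGE = 4096

def init_node_pages(memory_map):
    """memory_map: list of (start, end, kind) with kind in
    {"ram", "hole", "reserved"}. Returns the number of pages initialized.
    Non-RAM regions are skipped entirely -- the key optimization."""
    initialized = 0
    for start, end, kind in memory_map:
        if kind != "ram":
            continue
        initialized += (end - start) // PAGE   # touch only usable pages
    return initialized
```

For a node whose address span is mostly hole, the initialization work drops from being proportional to the span to being proportional to the actual RAM, which is where the startup-time saving comes from.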


      Compilation-oriented code analysis and optimization for Matrix DSP
      XUN Chang-qing, CHEN Zhao-yun, WEN Mei, SUN Hai-yan, MA Yi-min
      2020, 42(10, HPC Special Issue): 1791-1800.
      Digital Signal Processors (DSPs) are widely used in numerous fields such as image processing, automation control, and signal processing. The Matrix DSPs, which are developed in-house, adopt a typical vectorized architecture of Single Instruction Multiple Data (SIMD) plus Very Long Instruction Word (VLIW). It is therefore a prominent challenge to implement efficient vectorized programming and optimization for such an architecture. Based on the characteristics of the Matrix DSP and its compilation performance, the analysis and optimization methods commonly used in kernels are summarized. Furthermore, the example of General Matrix Multiplication (GEMM) shows that execution performance can be improved by up to an order of magnitude. Based on this summary of optimization methods, some follow-up thoughts and discussions are offered from the perspectives of compiler optimization and efficient programming.

      An automated monitoring system for large-scale supercomputers
      YANG Jie, ZENG Ling-bo, PENG Yun-yong, JIANG Qian-qian, DU Liang
      2020, 42(10, HPC Special Issue): 1801-1806.
      The number of nodes in large-scale cluster systems is increasing, their internal structure is becoming more and more complex, and the pressure on cluster availability and stability is also increasing. In order to address the availability and stability problems of large-scale clusters and the difficulty of system management, operation, and maintenance, an automated monitoring system for large-scale clusters is implemented. The system is deployed on a large-scale cluster: by collecting monitoring data from each cluster component and using microservices to process the data, real-time monitoring of the cluster components is realized.


      Design and implementation of the maintenance and management platform powered by Magic Cube-3 high-performance computer
      ZHAO Qi-qi
      2020, 42(10, HPC Special Issue): 1807-1814.
      With the progress of science and technology, high-performance computers, as important infrastructure for scientific research, have provided strong support for the development of various industries. It is administrators' wish and responsibility to guarantee that high-performance computers operate stably and efficiently. This paper mainly introduces the maintenance and management platform powered by the "Magic Cube-3" supercomputer, covering the platform's structural design, the underlying data collection interfaces and methods, and the functions the platform provides, including system monitoring, automatic detection, and data analysis. The platform enables administrators to directly observe the operating status of the computers and to promptly detect and handle malfunctions. By collecting and analyzing data from multiple perspectives, administrators can find the bottlenecks that slow down operating efficiency, providing a scientific basis for decision-making in subsequent optimization and upgrades.





      Research progresses of large-scale parallel computing for high-order CFD on the Tianhe supercomputer
      XU Chuan-fu, CHE Yong-gang, LI Da-li, WANG Yong-xian, WANG Zheng-hua
      2020, 42(10, HPC Special Issue): 1815-1826.
      The rapid progress of High Performance Computing (HPC) technology provides a solid foundation for large-scale complex Computational Fluid Dynamics (CFD) applications. In recent years, heterogeneous architectures have evolved into one of the most important approaches for building large-scale HPC systems. Heterogeneous HPC systems, with their differing processing capabilities, memory availability, and communication latencies, make the development and optimization of large-scale CFD applications exceptionally difficult. NUDT is a research base for HPC systems in China, and its CFD application team has long been devoted to parallelizing and optimizing large-scale complex CFD software on the Tianhe/Yinhe series supercomputers. The team has tackled key technologies such as heterogeneous collaborative parallel computing and preliminarily realized the convergence of HPC and CFD. Owing to these efforts, several important in-house CFD codes in China have been ported to and run efficiently on the Tianhe/Yinhe series supercomputers. This paper summarizes important research progress on large-scale parallel computing for high-order CFD on Tianhe-2, and analyzes some problems of CFD application development on the forthcoming exascale supercomputers.


      Quick bleeding point detection in WCE image based on multi-scale convolutional neural network
      XIE Xue-jiao, LU Feng, LI Shu-zhan, ZHOU Dao
      2020, 42(10, HPC Special Issue): 1827-1832.
      With the wide application of Wireless Capsule Endoscopy (WCE) in the detection of gastrointestinal diseases, screening a small number of lesion images out of the massive imaging data places a heavy burden on doctors. To address the problems in automatic WCE image detection, such as inconspicuous color and texture features, lesions easily confused with healthy organs, fuzzy detail features, lesions of different sizes, and high impurity levels, we propose a residual-based multi-scale fully convolutional neural network to classify and detect lesions in WCE images. By introducing skip connections from residual learning networks and the multi-scale convolution kernels of the Inception network, the model can effectively extract the detailed features of various lesions in the image. Experimental results show that the model reaches a sensitivity of 98.05%, a specificity of 97.67%, and an accuracy of 97.84%, outperforming the classical deep residual network ResNet50 and the standard multi-scale Inception-v4. The model has a high recognition rate, fast convergence, and improved computing performance. In short, the algorithm balances the efficiency and performance of bleeding point detection and has strong practicability.


      Design and optimization of CCFD overlapping grid parallel algorithm 
      LIU Xia-zhen, YUAN Wu, MA Wen-peng, HU Xiao-dong, LU Zhong-hua, ZHANG Jian
      2020, 42(10, HPC Special Issue): 1833-1841.
      This paper introduces the efficient parallel implementation of the parallel computational fluid dynamics software CCFD on the overlapping grid method, which includes: the development of a new hole mapping model with local data characteristics, and the identification method of cell attributes suitable for the model. The implicit hole optimization method for the distributed system is studied, and a combination parameter of grid cell criteria is proposed. A two-level load balancing strategy that takes into account both the amount of calculation and communication is designed, and the impact of the interpolation of the overlapping area on the calculation of the weight of the directed graph is considered. A communication mode based on blocks is considered, and the communication data structure and the sending and receiving process are tuned. The numerical simulation results show that the CCFD overlapping grid method has good parallel efficiency and scalability. 



      Parallel optimization of Tend_lin application on the Sunway TaihuLight supercomputer
      JIANG Shang-zhi, TANG Sheng-lin, GAO Xi-ran, HUA Rong, CHEN Li, LIU Ying
      2020, 42(10, HPC Special Issue): 1842-1851.

      Numerical simulation of the global atmospheric circulation is one of the main tools for understanding the formation and dynamic behavior of the global climate, and porting and optimizing such a complex application onto large-scale heterogeneous platforms is a great challenge. Tend_lin is the hot spot of the dynamic core of IAP AGCM-4 (the 4th-generation IAP atmospheric general circulation model), and it has a low compute-to-communication ratio. This paper ports Tend_lin to the Sunway TaihuLight, a large-scale heterogeneous computing platform, using two different parallel programming interfaces. The paper describes how the program is parallelized with AceMesh, a data-driven parallel programming interface: the task parallelization of computation loops and MPI communication, how the sharing of communication resources is relaxed, and the task mapping differences between a single-level task graph and a nested task graph. Experimental results show that the AceMesh version attains speedups of more than 2 times over the OpenACC version when using 16 to 1,024 processes. The paper analyzes and explains the reasons for the performance improvement.
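AceMesh's internals are not described in the abstract; the general data-driven idea — running each task once the tasks it depends on have completed — can be sketched with a topological scheduler (Kahn's algorithm). The names below are invented, and the sequential execution is a simplification: a real runtime would overlap independent tasks across cores.

```python
from collections import deque

def run_task_graph(tasks, deps):
    """tasks: {name: callable}; deps: {name: set of prerequisite names}.
    Runs each task after all of its predecessors (topological order)."""
    indeg = {t: len(deps.get(t, ())) for t in tasks}
    succ = {t: [] for t in tasks}
    for t, prereqs in deps.items():
        for p in prereqs:
            succ[p].append(t)
    ready = deque(t for t, d in indeg.items() if d == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()                     # task becomes runnable: execute it
        order.append(t)
        for s in succ[t]:              # retire edges, releasing successors
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    if len(order) != len(tasks):
        raise ValueError("cycle in task graph")
    return order
```

In a data-driven runtime the `ready` queue feeds a thread pool instead of a loop, so computation tasks and communication tasks whose dependencies are satisfied can run concurrently.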





      Cloudization and workflow supported new-generation supercomputing application platform
      KANG Bo, MA Qing-zhen, SI Dao-jun, MENG Xiang-fei
      2020, 42(10, HPC Special Issue): 1852-1858.
      Supercomputing has changed from simply providing hardware computing services to establishing an application service ecosystem around supercomputing. Completeness, convenience, and professionalism have become the outstanding features of, and requirements for, industrial service platforms. Through cloudization, the algorithms and software tools originally deployed on the supercomputer are pushed to a cloud platform and provided as interactive services. Through workflows, software is converted into workflow components, enabling free customization to meet the needs of different application ecosystems. Petroleum seismic data processing is an important application field of supercomputing in China. Using technical elements such as a plug-in mechanism, nested workflows, and a unified user view, a comprehensive service platform is established to satisfy the needs of the oil industry. Petroleum geophysical exploration is taken as an example to introduce the practical effect of cloudization and workflows in industry. Practice shows that a cloudized, workflow-based industry application service platform is an important direction for future supercomputing services, and an important reference for promoting supercomputing in industrial services.
      HMAC-SHA1 password recovery based on multi-core FPGA
      FENG Feng, ZHOU Qing-lei, LI Bin
      2020, 42(10, HPC Special Issue): 1859-1868.
      HMAC-SHA1 is a widely used user password authentication mechanism, and efficient password recovery for HMAC-SHA1 is of great significance. For password recovery, FPGAs have advantages over traditional CPU and GPU platforms, so this paper uses a multi-core FPGA to perform password recovery for HMAC-SHA1. The HMAC-SHA1 password processing algorithm is analyzed, and its core operation, SHA1, is implemented and optimized through pipelining, critical-path shortening, and the introduction of the Carry-Save Adder (CSA). The HMAC-SHA1 password processing operator is implemented using full-pipeline and state-machine modes. Finally, the password recovery architecture is designed and implemented. Experimental results show that the SHA1 implementation in this paper achieves a throughput of 245.76 Gbps, and the password recovery speed on a single board with a quadruple-core FPGA is 72 times that of a CPU and 2.6 times that of a GPU.
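The Carry-Save Adder mentioned above is a standard trick for shortening the adder critical path in SHA1 rounds: it compresses three addends into a sum word and a carry word using only bitwise logic (constant depth in hardware), deferring carry propagation to a single final adder. A bit-level sketch of the invariant:

```python
MASK = 0xFFFFFFFF  # 32-bit words, as in SHA1

def csa(a, b, c):
    """Carry-save adder: reduce three operands to two such that
    a + b + c == s + carry (mod 2**32), with no carry chain."""
    s = a ^ b ^ c                                 # per-bit sum, no carries
    carry = ((a & b) | (a & c) | (b & c)) << 1    # per-bit majority, shifted
    return s & MASK, carry & MASK
```

Chaining CSAs reduces the multi-operand addition in each SHA1 round to a single carry-propagate addition at the end, which is what lets the pipelined design reach a higher clock rate.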
      A task offloading and resource allocation algorithm under multiple constraints in mobile edge computing
      TONG Zhao, YE Feng, LIU Bi-lan, DENG Xiao-mei, MEI Jing, LIU Hong
      2020, 42(10, HPC Special Issue): 1869-1879.
      With the popularization of the Internet of Things and vehicular networks, data at the near-user end (the data source) has grown explosively. To deal effectively with this rapidly growing data, mobile edge computing has emerged as a new computing model: some resources in the cloud center are sunk to the edge of the network, so that data can be processed there. How to offload tasks efficiently and allocate resources reasonably is a hot issue in mobile edge computing research. However, most existing studies ignore the security of edge data and computing nodes; only by guaranteeing the security of data and information can mobile edge computing develop comprehensively. Therefore, on the basis of data security and combined with deep reinforcement learning, a task offloading and resource allocation algorithm under multiple constraints is proposed. Experimental results show that, compared with several classic algorithms, the proposed algorithm effectively improves the task offloading success rate and the task execution success rate, reduces local energy consumption, and better meets users' QoS requirements.
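As a contrast to the learned policy described above, the basic offloading trade-off — transmission time plus remote compute time versus local compute time, gated by a security check on the edge node — can be written as a simple rule. This is a toy baseline with invented parameter names, not the paper's deep-reinforcement-learning algorithm.

```python
def should_offload(task_cycles, data_bits, f_local, f_edge, bw, node_trusted):
    """Toy offloading rule: offload when the edge node is trusted AND
    finishes sooner, i.e. upload time + remote compute < local compute.
    f_local/f_edge are CPU frequencies (cycles/s), bw is link bandwidth
    (bits/s). All parameters are illustrative."""
    if not node_trusted:
        return False                          # security constraint first
    t_local = task_cycles / f_local           # run on the device
    t_edge = data_bits / bw + task_cycles / f_edge  # ship data, run remotely
    return t_edge < t_local
```

A DRL agent effectively learns a richer version of this decision jointly with resource allocation, under energy and deadline constraints, instead of applying the fixed threshold shown here.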