Computer Engineering & Science

State of the art analysis of China HPC 2013 and
system benchmarking technology

YUANG Guoxing1,SUN Jiachang2,ZHANG Linbo3,ZHANG Yunquan4

2013, 35(11): 1-5. doi:

Abstract ( 174 )

PDF (1063KB) ( 407 ) 　　

Based on the latest China HPC TOP100 list, released by SAMSS in early November, this paper analyzes in depth the state of the art of high performance computers in China in terms of total performance, manufacturers, application areas, etc. In addition, benchmarking technology for HPC systems, especially Linpack and HPCG, is briefly analyzed.

The analysis of development on
high performance computer system and platform

CHI Xuebin,GU Beibei,WU Hong,WANG Long,ZHU Peng

2013, 35(11): 6-13. doi:

Abstract ( 203 )

PDF (660KB) ( 500 ) 　　

In recent years, as five national supercomputing centers have been established, more and more attention has been paid to operating cost rather than acquisition cost in the field of high performance computers. Surveying the technology and the trends in development and application, the paper analyzes high performance computer systems, chips, other hardware facilities, and the soft environment of supercomputing centers, and discusses problems concerning the sustainable development of supercomputing centers in the future.

Latencyaware thread scheduling
scheme for threadlevel speculation

LI Yanhua,ZHANG Youhui,WANG Wei,ZHENG Weimin

2013, 35(11): 14-21. doi:

Abstract ( 151 )

PDF (800KB) ( 330 ) 　　

With the advent of largescale chipmultiprocessors (CMPs), more and more cores are integrated on a single chip. On the first hand, there always will be some idle cores. And on the other hand, with the energy consumption limit, cores integrated on the chip are relatively simple. ThreadLevel Speculation (TLS) remains a promising technique for exploiting the idle hardware resources to improve the performance of a sequential program. However, the usual distributed design of largescale CMPs, like the nonuniform cache architecture (NUCA), introduces some nonuniform architectureproperties which significantly increase the overhead of TLS execution (L2 cache access overhead, task squashing overhead and reexecution overhead). Some stateoftheart multithread scheduling algorithms work poorly for TLS because of ignoring these TLSrelative characteristics. The proposed latencyaware thread scheduling algorithm for threadlevel speculation, uses the memory access statistics gained in the profiling, compiling and realtime executing stages, to calculate the CDG (Center of Data Gravity) of the program, and then schedules the speculative threads to the cores around the CDG. At the same time, the proposed thread scheduling algorithm makes good use of the data remained in the cache by the committed and squashed threads. Evaluation results show that latencyaware thread scheduling algorithm observed 16.8% performance speedup over priority scheduling, and 10.1% performance speedup over clusteredthread scheduling.

A novel adaptive router microarchitecture

XIAO Canwen,DAI Zefu,ZHANG Minxuan

2013, 35(11): 22-26. doi:

Abstract ( 147 )

PDF (777KB) ( 279 ) 　　

The router chip is a key component of an interconnection network. A novel router microarchitecture is presented that supports the fully adaptive Dimensional Bubble Routing Algorithm (DBRA). According to the characteristics of DBRA, the design of the router's input buffer, arbiter and switch is optimized. The area and delay of the router chip are evaluated with Design Compiler under the TSMC 40nm process. The results show that the novel router chip reaches a higher frequency more easily than a router chip based on Duato’s methodology.

Clusteringbased largescale haplotype phasing algorithm

PAN Weihua,CHEN Bo,XU Yun

2013, 35(11): 27-33. doi:

Abstract ( 136 )

PDF (628KB) ( 286 ) 　　

Largescale haplotype phasing is an important fundamental problem in genetic analysis. To overcome the weakness of existing algorithms, we introduce the concept of clustering into original WinHAP algorithm and propose the Clutering based WinHAP algorithm. This algorithm improves original WinHAP in computing speed and memory without decreasing the precision, and its memory has nothing to do with the number of sequences. Thus, it is suited to very large datasets. The algorithm is parallelized under SIMD shared memory model and greedy task designing strategy is devised. The experiment reveals a nearlinear speedup with respect to the sequential algorithm.

Design and implementation of largepoint 1D FFT on GPU

HE Tao1,2,ZHU Daiyin1

2013, 35(11): 34-41. doi:

Abstract ( 260 )

PDF (1345KB) ( 354 ) 　　

Considering the GPU’s powerful computing performance and advanced parallel processor architecture, a concurrent design method that maps the parallel FFT algorithm onto the CUDA architecture is studied. This method follows optimization principles for GPU platforms, such as reducing global memory accesses, coalescing global memory accesses, using shared memory efficiently, and keeping computation intensive. A large-point 1D FFT is then implemented on an NVIDIA Tesla C2075 GPU, which is based on the NVIDIA Fermi architecture. Experimental results show that this method is superior to the CUFFT library when the number of points is not larger than 256K, and that it runs two times faster than the CUFFT 4.0 library, which shows that the new method is feasible and effective.

Optimization and design of high radix asymmetric crossbar

WANG Yongqing1,WANG Kefei1,XIAO Liquan1,LIU Lu1,PANG Zhengbin2

2013, 35(11): 42-47. doi:

Abstract ( 136 )

PDF (863KB) ( 279 ) 　　

Head-of-line (HOL) blocking limits the throughput of high-radix switches. An efficient architecture, OE-ASC, is proposed for high-radix switches. This architecture takes advantage of two cost-effective strategies for dealing with the HOL blocking problem. The first is the asymmetric crossbar (ASC), through which an N×N switch can be formed from N/m smaller m×N asymmetric crossbars. The second is the odd-even queue scheme, through which the resources of the switch (mainly memory queues) can be used more efficiently. A tile-based microarchitecture of a 32×32 high-radix switch is proposed. Clock-cycle-accurate simulation shows that the impact of HOL blocking is almost entirely eliminated and that the switch throughput can be as high as 98.6%. The proposed OE-ASC architecture shows up to a 7.9% throughput improvement over ASC with buffer depth 16, and can achieve comparable performance with half the buffers.

Technique of hiding multilevel communication
latency for SAR imaging algorithm on cluster platform

DU Jing,AO Fujiang,GUO Jin,ZHOU Ying

2013, 35(11): 48-53. doi:

Abstract ( 145 )

PDF (1036KB) ( 319 ) 　　

Realtime Synthetic Aperture Radar (SAR) imaging techniques have been attracting many interests in military and
remote sensing fields. SAR imaging algorithm features massive data and computation, which
presents a huge demand for high performance computing, and thus it is suitable for using a
representative high performance architecture, such as the cluster platform, to accelerate SAR
imaging programs. For the cluster system with distributed memory, the communication latency is
an allimportant optimization factor to improve the performance of parallel program.
Therefore, based on the cluster platform, the paper studies on the techniques of hiding the
multilevel communication latency for SAR imaging algorithm. Especially, three important
techniques are researched, including thread safety queue, nonblocking communication, and
multithread blocking communication. Moreover, the optimal communication size for hiding
communication latency is achieved. The experimental results show that the SAR imaging program
optimized by hiding communication latency can have high net utilization, and achieve obvious
performance improvement.
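How a thread-safe queue can hide communication latency is easiest to see in a small producer-consumer sketch. The code below is an illustration under stated assumptions, not the paper's implementation: `receive_blocks` stands in for a real receive loop (for example MPI or sockets), the sum of squares stands in for the imaging computation, and both function names are hypothetical.

```python
import queue
import threading
from typing import Iterable, List, Optional


def receive_blocks(blocks: Iterable[List[float]],
                   buf: "queue.Queue[Optional[List[float]]]") -> None:
    """Communication thread: a stand-in for a loop that receives data blocks."""
    for block in blocks:
        buf.put(block)   # waits when the queue is full, which bounds memory use
    buf.put(None)        # sentinel: no more data


def process_stream(blocks: Iterable[List[float]], depth: int = 2) -> List[float]:
    """Compute on one block while the receiver thread fetches later blocks."""
    buf: "queue.Queue[Optional[List[float]]]" = queue.Queue(maxsize=depth)
    receiver = threading.Thread(target=receive_blocks, args=(blocks, buf))
    receiver.start()
    results = []
    while True:
        block = buf.get()            # thread-safe hand-off from the receiver
        if block is None:
            break
        results.append(sum(v * v for v in block))  # placeholder for imaging work
    receiver.join()
    return results


if __name__ == "__main__":
    print(process_stream([[1.0, 2.0], [3.0]]))
```

The `depth` parameter plays the role of a communication buffer size: a deeper queue lets more transfers run ahead of the computation at the cost of memory.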

Fault monitoring and management system
for multiple computing clusters

ZHANG Yi,CHEN Liang,PANG Jian

2013, 35(11): 54-61. doi:

Abstract ( 134 )

PDF (1013KB) ( 377 ) 　　

With the increasing number and scale of high performance computing cluster systems, system maintenance becomes more difficult and the workload grows. The software system introduced in the paper works across multiple Linux clusters with different hardware and software environments, automatically monitors the important operating states and indexes of the clusters through command line scripts and programs, and promptly sends fault messages to the Windows terminals of system administrators by means of socket communication. Results demonstrate that this system improves the efficiency of system maintenance and shortens the response time of fault handling. Using a database, it also records and manages fault event data, thus standardizing the fault handling process.

Large scale particle cluster identification and analysis

SHEN Weichao1,2，CAO Liqiang2，XIA Fang1,2

2013, 35(11): 62-67. doi:

Abstract ( 133 )

PDF (897KB) ( 316 ) 　　

Cluster identification is a common problem of cluster analysis in the post-processing of molecular dynamics numerical simulation data. A parallel cluster identification algorithm and a parallel cluster analysis tool are designed and implemented for the visual data output by the JASMIN particle numerical simulation program. The tool provides three parallel modes: time dimension parallel, space dimension parallel and spatial-temporal parallel. Through a virtual patch index structure, a breadth-first search algorithm accelerated by the PIC grid can work directly on data from multiple patches to identify particle clusters. The parallel cluster analysis tool shows good parallel scalability when applied to actual numerical simulation data with tens of millions of particles.
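A serial sketch can illustrate grid-accelerated breadth-first cluster identification. It assumes that two particles belong to the same cluster when they are linked by a chain of neighbours closer than a cutoff distance, and it bins particles into cubic cells so that each search only visits adjacent cells. The function name `identify_clusters` and the cutoff criterion are assumptions; the patch index structure and the parallel modes of the actual tool are not modelled.

```python
from collections import defaultdict, deque
from itertools import product
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Cell = Tuple[int, int, int]


def identify_clusters(points: List[Point], cutoff: float) -> List[List[int]]:
    """Return clusters as sorted lists of particle indices."""
    def cell_of(p: Point) -> Cell:
        return (int(p[0] // cutoff), int(p[1] // cutoff), int(p[2] // cutoff))

    # Cells have edge length `cutoff`, so neighbours of a particle can only be
    # in its own cell or in one of the 26 adjacent cells.
    cells: Dict[Cell, List[int]] = defaultdict(list)
    for i, p in enumerate(points):
        cells[cell_of(p)].append(i)

    cutoff2 = cutoff * cutoff
    visited = [False] * len(points)
    clusters: List[List[int]] = []
    for seed in range(len(points)):
        if visited[seed]:
            continue
        visited[seed] = True
        members: List[int] = []
        frontier = deque([seed])
        while frontier:                       # breadth-first search from the seed
            i = frontier.popleft()
            members.append(i)
            cx, cy, cz = cell_of(points[i])
            for dx, dy, dz in product((-1, 0, 1), repeat=3):
                for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                    if visited[j]:
                        continue
                    d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                    if d2 <= cutoff2:
                        visited[j] = True
                        frontier.append(j)
        clusters.append(sorted(members))
    return clusters


if __name__ == "__main__":
    pts = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
    print(identify_clusters(pts, 0.6))
```

Without the cell grid, each search step would compare a particle against every other particle; with it, the work per particle depends only on the local density.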

Dynamic aggregation of noncontractual
earth observation datasources

HUANG Keying1,3,GAO Yue2,LI Guoqing1

2013, 35(11): 68-75. doi:

Abstract ( 146 )

PDF (1100KB) ( 394 ) 　　

It is difficult to use traditional technology to realize data aggregation and data sharing for the Internet, which contains a large number of free, open and valuable non-contractual earth observation data sources. These data sources are characterized by web page query entrances, massive data hidden in back-end network databases, diverse data sharing platforms, and different kinds of spatial data platforms that need to interconnect. Considering these problems, a passive aggregation architecture for non-contractual heterogeneous distributed data sources is proposed, based on deep web crawler technology. Meanwhile, we design a data source identification standard, a non-contractual data source discovery mechanism, a non-contractual data source search tree building mode, a non-contractual data source indexing mechanism, and asynchronous data source update rules. Using this mechanism, we archive 5 data sources of large data sharing systems, including three widely used data resources (NASA, USGS and ASAR), and form a tool set for the automatic aggregation and update of earth observation data resources. Eventually, through a unified query interface, users can obtain non-contractual earth observation data resource information.

A fine-grain data-level parallel algorithm
for fractional differential equations

GONG Chun-ye1,2,3,BAO Wei-min1,MIN Chang-wan1,ZHANG Ye-chen1,LIU Jie3

2013, 35(11): 76-79. doi:

Abstract ( 218 )

PDF (497KB) ( 409 ) 　　

The paper proposes a fine-grain data-level parallel algorithm for the Riesz space fractional diffusion equation with an explicit finite difference method and implements it with the CUDA parallel programming model on a GPU. The details of the basic CUDA kernels for these operations and the optimization of the production of grid points are described. The experimental results show that the parallel algorithm compares well with the exact analytic solution and runs more than four times faster on an NVIDIA Quadro FX 5800 GPU than the parallel CPU solution on a multi-core Intel Xeon E5540 CPU.
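To make the explicit scheme concrete, here is a minimal serial sketch of one time step for the equation u_t = K ∂^α u/∂|x|^α with 1 < α ≤ 2. The abstract does not state which discretization the authors use; the shifted Grünwald-Letnikov approximation, the zero Dirichlet boundaries, and the names `gl_weights` and `explicit_step` are all assumptions made for illustration. No stability check on the time step is included.

```python
import math
from typing import List


def gl_weights(alpha: float, n: int) -> List[float]:
    """Grünwald-Letnikov weights g_k = (-1)^k * binom(alpha, k) for k = 0..n."""
    g = [1.0]
    for k in range(1, n + 1):
        g.append(g[-1] * (1.0 - (alpha + 1.0) / k))
    return g


def explicit_step(u: List[float], alpha: float, K: float, tau: float, h: float) -> List[float]:
    """One forward-Euler step; boundary values u[0] and u[-1] are held at zero."""
    n = len(u) - 1
    g = gl_weights(alpha, n + 1)
    # Riesz derivative = -(left + right) / (2 cos(pi * alpha / 2)), scaled by h^-alpha.
    coeff = -K * tau / (2.0 * math.cos(math.pi * alpha / 2.0) * h ** alpha)
    new = [0.0] * len(u)
    for i in range(1, n):  # interior points are independent of each other
        left = sum(g[k] * u[i - k + 1] for k in range(i + 2))
        right = sum(g[k] * u[i + k - 1] for k in range(n - i + 2))
        new[i] = u[i] + coeff * (left + right)
    return new


if __name__ == "__main__":
    print(gl_weights(1.8, 4))
    print(explicit_step([0.0, 1.0, 0.0], 2.0, 1.0, 0.1, 1.0))
```

Because every interior point is updated only from values of the previous time level, each loop iteration could be assigned to its own GPU thread, which is the kind of data-level parallelism the abstract refers to. For α = 2 the weights reduce to 1, -2, 1 and the step becomes the classical explicit scheme for the heat equation.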

Load balance scheduling for performance
asymmetric multicore processors

XU Yuan-chao1,2,TAN Xu2,3,FAN Ling-jun2,3,SUN Wei-zhen1,ZHANG Zhi-min2

2013, 35(11): 80-86. doi:

Abstract ( 141 )

PDF (909KB) ( 382 ) 　　

Given the same chip area, a performance-asymmetric multicore processor outperforms a symmetric multicore processor in potential performance per watt. Realizing this potential requires appropriate operating system scheduling. To address this issue, a comprehensive heterogeneity-aware load balance policy is proposed, based on the Linux scheduling framework. This policy ensures load balance first and does not need a threshold to classify programs. Experimental results show that both load balance and heterogeneity awareness are guaranteed.

A novel MapReduce parallel model
in hybrid computing environment

TANG Bing1,HE Hai-wu2

2013, 35(11): 87-93. doi:

Abstract ( 136 )

PDF (726KB) ( 328 ) 　　

A novel MapReduce computation model for hybrid computing environments is proposed. Using this model, high performance cluster nodes and heterogeneous desktop PCs on the Internet or an intranet can be integrated to form a hybrid computing environment, where MapReduce tasks can be executed to process large-scale datasets. In this way, the computation and storage capabilities of large numbers of desktop PCs are fully utilized. Similar to the design of Hadoop, this model is composed of a storage layer and a task layer. The paper briefly introduces the architecture of the model and describes the core HybridDFS and the MapReduce algorithms. Then, a prototype system is designed and implemented, and its performance is evaluated. Evaluation results show that the proposed hybrid computation model not only achieves reliable MapReduce computation but also reduces the computation cost, making it a potentially effective computation model.

PEAK:Parallel evolutionary administration
framework towards cluster of wimpy nodes

ZHANG Lu-fei,WU Dong,XIE Xiang-hui

2013, 35(11): 94-99. doi:

Abstract ( 163 )

PDF (813KB) ( 368 ) 　　

Ant II is a new cluster architecture for low-power data-intensive computing, which consists of large numbers of low-power embedded CPUs and local flash storage. The key contributions of this paper are the principles of a parallel administration framework for clusters of wimpy nodes and the design and implementation of PEAK, an evolutionary, self-healing, hot-pluggable, scalable, highly available, and high-performance distributed storage system and computing platform. It is developed in Erlang, a naturally parallel programming language that supports supervision trees, hot code swapping and sandboxing. It uses a decentralized Dynamo architecture, which provides great scalability and availability using chain replication on a consistent hashing ring. It builds on a distributed meta-service management framework and can therefore evolve easily. It is not only a purely log-structured store that provides the basis for high performance on flash storage, but also an analytic tool using a MapReduce query language. The evaluation demonstrates that PEAK balances computation and I/O capabilities so as to enable efficient, massively parallel access to data.
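Placing a replication chain on a consistent hashing ring can be sketched in a few lines. The sketch below is written in Python rather than Erlang and is not PEAK's code: the `HashRing` class, the use of MD5 for ring positions, and the single ring position per node (no virtual nodes) are assumptions for illustration.

```python
import bisect
import hashlib
from typing import List


def _position(key: str) -> int:
    """Map a string to a point on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes: List[str]) -> None:
        if not nodes:
            raise ValueError("the ring needs at least one node")
        self._ring = sorted((_position(n), n) for n in nodes)
        self._points = [p for p, _ in self._ring]

    def chain(self, key: str, replicas: int = 3) -> List[str]:
        """Return the replication chain (head first, tail last) for a key:
        the first `replicas` nodes found clockwise from the key's position."""
        start = bisect.bisect_right(self._points, _position(key))
        count = min(replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
    print(ring.chain("some-key"))
```

In chain replication, writes typically enter at the head and propagate to the tail, while reads are served by the tail. Because a key's chain depends only on its clockwise successors, adding or removing a node elsewhere on the ring leaves that chain unchanged.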

An NVIDIA Kepler based acceleration of PIC method

WEN Min-hua1,James LIN1,2,Simon Chong Wee See1,3

2013, 35(11): 100-104. doi:

Abstract ( 153 )

PDF (540KB) ( 277 ) 　　

The PIC (Particle-In-Cell) method is widely used in computational plasma physics. However, a large number of computational particles have to be simulated in order to obtain high accuracy, which requires large computing capacity. Therefore, it is necessary to accelerate the PIC method in order to reduce the time cost. An NVIDIA Kepler GPU based PIC algorithm is designed and implemented using CUDA (Compute Unified Device Architecture). The most time-consuming parts of the PIC method, namely collision and mover, are ported onto the GPU platform. In our experiments, NVIDIA's newly released Kepler K20 is used to evaluate the performance, and a maximum speedup of 30x is achieved compared with an Intel Sandy Bridge E5-2650.

Testing and analysis of typical examples for
ANSYS and Abaqus software GPU-accelerated performance

WANG Hui,GUO Pei-qing,CHEN Xiao-long

2013, 35(11): 105-110. doi:

Abstract ( 641 )

PDF (790KB) ( 609 ) 　　

In the field of HPC, CPU/GPU co-processing has become one of the effective approaches to obtaining computing results quickly. The latest versions of the typical structural mechanics software packages ANSYS and Abaqus adopt CPU/GPU co-processing to improve solving efficiency. In this paper, typical structural problems are run on NVIDIA Tesla M2090 GPUs and the "Hummingbird" supercomputing platform of the Shanghai Supercomputing Center to compare and analyze the solving efficiency of ANSYS and Abaqus before and after GPU acceleration. Results indicate that when the parallel scale is less than 16 cores, GPU acceleration reduces solution time to varying degrees. However, the acceleration performance shows a decreasing trend with increasing parallel scale. In addition, multi-GPU collaborative solving has no obvious effect on acceleration performance. Therefore, in practical applications, the choice of parallel approach and co-processing model should take the problem type and the current hardware architecture into account.

Survey and evaluation of high-radix topology
in high performance interconnection network

LEI Fei,DONG De-zun,CHAI Yan-tao,WANG Ke-fei,LI Cun-lu

2013, 35(11): 111-118. doi:

Abstract ( 268 )

PDF (1094KB) ( 694 ) 　　

With the increasing peak performance of high performance computers, the high performance interconnection network faces more design challenges. Advancements in high-bandwidth serial transport technology and pin bandwidth pave the way for high-radix networks. How to exploit the additional design choices offered by high-radix routers is the key to high performance network topology. The paper studies several typical high-radix topologies. We analyze the performance and cost of these high-radix networks theoretically and compare them with common low-radix topologies. A high-radix interconnection network simulator, named xNetSim, is developed to evaluate these topologies. xNetSim is built on the OMNeT++ platform. In the simulation, the network load is varied and the throughput and latency of the different topologies are evaluated. Finally, the performance gaps between these high-radix interconnection network topologies are briefly analyzed.

Research of virtualization of multitask oriented
general purpose computation on graphic processing unit

ZHANG Yun-zhou,YUAN Jia-bin,L Xiang-wen

2013, 35(11): 119-125. doi:

Abstract ( 228 )

PDF (818KB) ( 494 ) 　　

With the enrichment of hardware functions and the gradual maturing of the software development environment, the GPU is widely used in the field of general purpose computing, and GPU clusters are increasingly used for scientific computing on huge amounts of data. However, a GPU consumes more power than a CPU, so GPU clusters have large power consumption if every cluster node hosts a GPU. Virtualization technology makes it possible to use a GPU for general purpose computing in a virtual machine. To use the GPU efficiently and in accordance with its features, a multitask-oriented GPU virtualization solution is proposed, which supports dynamic scheduling and multi-user concurrency. Based on existing GPU virtualization solutions, we establish a CUDA management end to manage the GPU resources, taking into account the generality of inter-domain communication between virtual machines and the task turnaround time. In order to achieve load balance and shorten the turnaround time, we define an integrated load evaluation value. By designing large-scale matrix operations, we verify the feasibility and efficiency of GPU virtualization as applied in the designed system.

Study of settlement detection based on
high resolution remote sensing images

ZHANG Ning-xin,CHEN Zhong,GUO Li-li,XIE Ting

2013, 35(11): 126-133. doi:

Abstract ( 127 )

PDF (3027KB) ( 342 ) 　　

The study of settlement information extraction has important practical significance for territorial planning and human security. Existing settlement detection algorithms may detect sparse vegetation as settlements, which degrades detection accuracy. In order to obtain an efficient and high-accuracy detection method, the paper presents an improved model based on analysis of the rotation-invariant texture of remote sensing images. First, the morphological top-hat is applied to strengthen the spectrum, and the interference of scattered vegetation is effectively suppressed. Second, by introducing an asynchronous communication model based on MPI (Message Passing Interface) and OpenMP, the improved algorithm is parallelized, which improves its efficiency. Experiments show that this parallel algorithm improves detection accuracy and robustness and effectively solves the computational problem of large remote sensing images.
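The morphological top-hat step can be illustrated with a small pure-Python sketch. It assumes the white top-hat (the image minus its morphological opening) with a square structuring element clipped at the image border; the abstract does not specify the variant or the structuring element, and the function names are hypothetical.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]


def _window_reduce(img: Image, radius: int,
                   reduce_fn: Callable[[Sequence[float]], float]) -> Image:
    """Apply min (erosion) or max (dilation) over a (2*radius+1)^2 window."""
    rows, cols = len(img), len(img[0])
    out: Image = []
    for r in range(rows):
        row = []
        for c in range(cols):
            window = [img[i][j]
                      for i in range(max(0, r - radius), min(rows, r + radius + 1))
                      for j in range(max(0, c - radius), min(cols, c + radius + 1))]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def white_top_hat(img: Image, radius: int) -> Image:
    """Image minus its opening: keeps bright details smaller than the window."""
    opened = _window_reduce(_window_reduce(img, radius, min), radius, max)
    return [[p - o for p, o in zip(img_row, open_row)]
            for img_row, open_row in zip(img, opened)]


if __name__ == "__main__":
    image = [[1.0] * 5 for _ in range(5)]
    image[2][2] = 9.0                      # one small bright feature
    for line in white_top_hat(image, 1):
        print(line)
```

Each output pixel depends only on a local window of the input, so an image can be split into tiles with a halo of `radius` pixels and processed in parallel, which fits the MPI and OpenMP decomposition the abstract mentions.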
