Computer Engineering & Science

A survey on high-radix router architecture

YANG Wen-xiang,DONG De-zun,LEI Fei,LI Cun-lu,WU Ji,SUN Kai-xuan

2016, 38(08): 1517-1523. doi:

Abstract ( 246 )

PDF (569KB) ( 726 ) 　　

As the scale of high performance network increases, the design of high-radix router architecture is becoming a focus and a hotspot in the field of high performance computing. Using high-radix routers, the network can gain lower transmission latency, lower cost of network construction and lower power, and improve network reliability simultaneously. Over the past decade, high-radix routers have developed quickly. We present related research on high-radix routers and predict future development trends, mainly introducing the structured design represented by YARC and the “network within a network” approach. Our future study focuses on solving a number of problems such as buffer and arbitration, and we use some new techniques such as optical interconnection to design better architectures.Key words:high performance computing;high-radix router;structured design;optical interconnection

Integrated I/O hardware compression accelerators of Hadoop system architecture

LEI Li1,QIAN Bin-hai1,GUO Jun1,GU Xiong-li2,LIU Peng1

2016, 38(08): 1524-1529. doi:

Abstract ( 186 )

PDF (521KB) ( 481 ) 　　

With the development of big data, Hadoop systems become an important tool, but I/O operations impede their performance improvement in practical applications. Hadoop usually decreases its’ I/O operations by using software to compress data. However, data compression by software is slower than hardware accelerators. When Hadoop runs on Java virtual machines, it cannot directly call I/O hardware accelerators. To avoid getting data from the Hadoop system and transferring the data to I/O hardware accelerators, a compressor and decompressor class of Hadoop and a C++ dynamic linking library are employed in the Hadoop system. Experimental results show that both techniques can integrate I/O hardware accelerators into the Hadoop system frame work, the efficiency of I/O hardware compressor is 15.9Byte/s/Hz, and the performance of the Hadoop system can be improved by two times.

Design and implementation of large sparse matrix vector multiplication on FPGA over GF(2)

SU Jin-zhu，WU Gu-ming,JIA Xun

2016, 38(08): 1530-1535. doi:

Abstract ( 258 )

PDF (580KB) ( 568 ) 　　

As the kernel part of Wiedemannn algorithm, the sparse matrix vector multiplication (SpMV) is the main step for solving large sparse system of linear equations. In order to solve the problem of repeated SpMV computation, we propose a torus network architecture for large spare matrix vector multiplication based on FPGA over GF(2). The implementation simplifies the design of the algorithm, improves the algorithm parallelism and the utilization of Block RAM on chip and GTX, and obtains a speedup of 2.65 times in comparison with the partial reconfiguration design.

A columnar storage structure based on group sorting of key columns

XU Tao，GU Yu,WANG Dong-sheng

2016, 38(08): 1536-1541. doi:

Abstract ( 144 )

PDF (580KB) ( 478 ) 　　

As the main storage medium for massive data, disks have the advantages of large capacity and low cost. However, the I/O bandwidth of disks lags far behind the growing speed of data, which thus becomes the performance bottleneck of big data management systems. Therefore, optimizing the storage structure to improve the efficiency of writing and reading has become one important challenge in the era of big data. In this paper, we present a columnar storage structure based on key columns group sorting called KCGSStore. According to the groups of the key columns, the tables are divided into pools and the records in the same pool have the same value or value range. All pools belonging to one group are merged, and then the key columns are orderly arranged by taking the pool as the unit. In this way, irrelevant column values can be effectively filtered when executing SQL commands so as to reduce the amount of data being read. Consequently, the query performance can be improved. Meanwhile, using the pool matrix, we can reorganize the records at very little cost of time and space. Evaluation results show that compared with the ORCFile and the Parquet, the KCGSStore is superior in many aspects, including storage space, data loading and SQL querying.

A novel NoC topology for 3D many-core microprocessors

CHEN Ji-cheng,WANG Hong-wei,ZHANG Chuang

2016, 38(08): 1542-1549. doi:

Abstract ( 187 )

PDF (937KB) ( 412 ) 　　

Review attachment

3D microprocessors have advantages of high-level integration, short global interconnection and more connecting parts. However the traditional 3D topology cannot take full advantage of the characteristics of 3D integrated circuits in the vertical direction. It is difficult for these structures to meet the requirements of large-scale many-core processors, such as short diameter, high-bandwidth and highly scalability. We propose a 3D on-chip interconnection network topology, called V-Spidergon. In the horizontal layer, the V-Spidergon adopts the Spidergon topology in the horizontal layer and a domain-based interconnected structure in the vertically direction. The entire vertical network is divided into multi-level domains according to the hierarchical domain thought, and every domain interconnects with each other. Simulation results show that the time delays in 8 layers, 16 layers and 32 layers of the V-Spidergon are lower than those of the 3D-Mesh by 15.1%, 28.5%, 55.7% respectively, and are also lower than those of the NoC-Bus by 11.5% 32.7% and 77.6% respectively; under injections of 15% and 100% load rates, the average time delay of the V-Spidergon does not increase with the growth of horizontal layers.

A high efficient and low-latency resource scheduling method for Spark on Web service

DING Jing-jing，ZHANG Gong-xuan

2016, 38(08): 1550-1556. doi:

Abstract ( 193 )

PDF (588KB) ( 452 ) 　　

The processing speed of Spark which is a big data processing structure is highly influenced by resource scheduling modes and whether we can utilize the resource sufficiently. Taking memories and the number of free cores into consideration, we propose a new scalable resource scheduling method. In this method, we monitor the resource utilization of nodes in real time and examine CPU processing speed and CPU residual utilization. This method can be used to optimize Spark Web service so as to meet the requirements of high concurrent request and low latency response and efficiently reduce the imbalance of resource utilization, thus improving resource utilization. Experimental results show that our method can obtain better results.

A cloud-networking-oriented negotiation mechanism for cloud services

YANG Wan-lin1,WANG Xing-wei1,ZHANG Shuang2,HUANG Min3

2016, 38(08): 1557-1562. doi:

Abstract ( 106 )

PDF (602KB) ( 406 ) 　　

With the rapid development of cloud computing, more and more users begin to utilize services provided by cloud service providers. As a new research field of cloud computing, cloud networking can provide services across cloud service providers. When an individual cloud service provider cannot meet users’ service requirements, cloud service providers can cooperate to provide better services for users. We propose a cloud-networking-oriented cloud service negotiation mechanism to achieve interaction and negotiation among cloud service providers by using cloud networking and the improved classical contract net model jointly. To effectively select cooperators, we set up an acquaintance set for each cloud service provider. Experimental results show that the mechanism can effectively improve cooperation efficiency among cloud service providers and meet users' requirements better.

An improved ant colony optimization algorithm for#br# memory access scheduling

TIAN Shuo，DOU Qiang,WANG Yong，ZHANG Hong-guang，ZHOU Chao-bing,LI Shi-ming

2016, 38(08): 1563-1567. doi:

Abstract ( 159 )

PDF (681KB) ( 501 ) 　　

Review attachment

Memory access scheduling approaches are complicated since they not only depend on circuit timing parameters but also on memory access patterns. Based on the analysis of the characteristics of DRAM and memory access scheduling strategies, we propose an improved ant colony optimization algorithm using DDR3 timing for memory access scheduling. We evaluate the algorithm on four different traces.Compared with the greedy scheduling algorithm, the proposal can effectively reduce the overall average delay and improve bandwidth utilization.

Using multi-dimensional hierarchical Cache replacement policy to reduce the write traffic to PCM memory

RUAN Shen-chen,WANG Hai-xia,WANG Dong-sheng

2016, 38(08): 1568-1573. doi:

Abstract ( 188 )

PDF (652KB) ( 376 ) 　　

Searching for novel storage materials is a current hotspot. The phase change memory (PCM) draws wide attention due to the advantages of low power consumption, high density and non-volatility, however, it incurs finite endurance. So it is necessary to consider how to reduce the write traffic to memory. As for this problem, a optimizing cache replacement policies to reduce the amount of dirty blocks evicted from cache is an efficient method. The existing works mainly achieve the goal of protecting dirty blocks by setting higher protection priority for them when they are inserted in or re-referenced, however, they make no distinction between dirty blocks and clean ones in the process of demotion, which may leads to the result that dirty blocks are preferentially evicted even though there are still many clean blocks. We propose a novel cache replacement policy called multilayer ark for cache (MAC), which sets insurmountable boundaries between the dirty cache blocks and the clean ones through a multidimensional hierarchy structure, and dirty blocks therefore can be protected more effectively. Simulation results show that compared with the LRU replacement policy, the MAC can averagely reduce the writes to memory by about 25.12% with low hardware cost while the program performance is hardly affected.

A high performance and energy efficient KV accelerator for FPGA-based heterogeneous computing

SUN Zheng-zheng,LAN Ya-zhu,FU Bin-zhang

2016, 38(08): 1574-1580. doi:

Abstract ( 58 )

PDF (765KB) ( 251 ) 　　

The flourish of new applications, such as network function virtualization (NFV), has brought higher requirements on high performance and energy efficient Key-Value (KV) lookups. Traditionally, KV operations can be accelerated by software-based HASH tables or dedicated TCAM chips. Software-based solutions are cost efficient but can lead to much worse performance with the increase of data collisions. TCAM-based solutions, on the other hand, have sound performance but suffer high additional system cost and power consumption. Recently, FPGA-based heterogeneous computing becomes more and more popular, so it is quite reasonable to exploit the provided FPGA resources to accelerate the software-based KV operations. To this end, we discuss how to implement high scalable and energy efficient TCAM logics with RAMs on FPGA in this paper. Compared with the traditional TCAM architecture, the proposed FPGA-based TCAM is highly scalable and enables dynamical configurability of the range of lookups so that the power consumption can be reduced significantly. To validate the proposed architecture, we implemented it on Xilinx Virtex-7 FPGA. Experimental results show that the throughput can be as high as 234 Mpps and the latency as low as 25.56ns. Compared with traditional software-based solutions, the throughput is improved by 780 times and the latency is improved by 240 times.

Study and implementation of an elasticity evaluation model in cloud computing

DAI Rong-qian1,ZUO De-cheng1,ZHANG Zhan1,LI Shi-lei2

2016, 38(08): 1581-1587. doi:

Abstract ( 173 )

PDF (695KB) ( 565 ) 　　

Currently more and more enterprises transfer their core business and data to the cloud, and a number of the business needs elasticity service to deal with the real-time changing of workload. So there is an urgent need to evaluate elasticity, however, we lack a relatively overall evaluation method. To solve this problem, we conduct a comprehensive analysis from the perspectives of resource allocation, QoS and resource allocation time, and propose an evaluation method that can be applied to both providers and users. Moreover, we design computation models for resource allocation and resource allocation time to improve the existing penalty model. Finally, TPC-W benchmarks are tested on the CloudStack platform with two elasticity scaling strategies, namely auto-scaling and scale-out, which verifies the proposed method.

A decomposition-based multi-objective workflow scheduling algorithm in cloud environment

LI Ke-wu,ZHANG Gong-xuan,ZHU Zhao-meng

2016, 38(08): 1588-1594. doi:

Abstract ( 168 )

PDF (800KB) ( 436 ) 　　

Cloud service providers can provide large-scale virtual computing resources to users, but at the same time they are also facing a scheduling problem, that is how to schedule the computing resources with minimum cost (including makespan, monetary cost, resource utilization rate, etc.) to complete workflow execution. To address the workflow scheduling problem in IaaS, we propose a multi-objective workflow scheduling algorithm based on decomposition which optimizes both the makespan and the monetary cost simultaneously. This algorithm combines the designs of the list-based heuristic algorithm and the multi-objective evolutionary algorithm. Using a decomposition method, it decomposes the multi-objective optimization problem into a set of single objective optimization problems. Then the algorithm focuses on solving single-objective problems, in which way the scheduling process becomes more effective. The experiments based on the real-world applications published by the Pegasus project show that the proposed algorithm can achieve better Pareto fronts and at the same time has lower time complexity in comparison with the MOHEFT algorithm and the NSGA-II* algorithm.

Resource-aware scheduler based on bee colony algorithm with fast convergence

JIANG Tao1,YUAN Jing-ling1,CHEN Min-cheng1,SONG Hua-ming2

2016, 38(08): 1595-1601. doi:

Abstract ( 189 )

PDF (684KB) ( 443 ) 　　

In order to effectively deal with massive data, correlation analysis, business forecast and so on, the distributed cloud computing platform Hadoop has emerged. But with the wide application of Hadoop, the job scheduling problems become prominent. Many existing job schedulers have defects such as complexity of parameters setting, long start time and so on. Taking full advantages of the artificial bee colony algorithm in terms of strong self-organizing ability and fast convergence, we design and implement a resource-aware scheduler of Hadoop which can detect real-time internal resources utilization. Compared to those classical job schedulers, the proposed scheduler has advantages of few parameters and fast starting speed. Experiments on benchmark programs show that the proposal can schedule resource intensive work faster than the original one by 10% ~ 20% on heterogeneous clusters.

Impact of “one testing after multiple bondings” on the testing process in “mid-bond tests”

QIN Zhen-lu1，2,FANG Fang1,WANG Wei1，2,ZHU Xia1，2,GUO Er-hui3,REN Fu-ji1，2

2016, 38(08): 1602-1608. doi:

Abstract ( 121 )

PDF (628KB) ( 408 ) 　　

With the continuous development of semiconductor technology, 3D chip technology has become a hot research highlight, and the process of chip testing has a new requirement for “mid-bond tests”. But for the “mid-bond test”, “one testing after every bonding” makes some dies repeat testing, thus resulting in an increase of test time. We put forward a "one testing after multiple bondings" testing process which considers the effect of test power consumption and "theoretical manufacturing cost". Besides, we also introduce a breadth first traversal algorithm, which combines the relevant parameters of ITC'02 circuit. The combination reflects our proposal's practical significance in actual production.

A sensor transaction scheduling algorithm on multiprocessor platforms

BAI Tian1,LI Guo-hui2

2016, 38(08): 1609-1614. doi:

Abstract ( 198 )

PDF (509KB) ( 487 ) 　　

How to schedule sensor transactions to maintain data validity is an important research topic for cyber physical systems. Previous studies are generally restricted to uni-processor platforms. We therefore propose a sensor transaction scheduling algorithm on multiprocessor platforms. The algorithm appropriately allocates and adjusts the processors’ resources for each update instance to satisfy temporal consistency constraints. It calculates the global repeating schedule segment off-line in advance to reduce the runtime overhead. The schedulability analysis of the algorithm is also given. Experimental results show that the proposed algorithm performs well in terms of scheduling success ratio and update workload.

An overhead-aware energy-efficient real-time scheduling algorithm based on dynamic slack reclamation

ZHANG Dong-song1，WANG Jue1，ZHAO Zhi-feng1，WU Fei2，SUN Xian-kun2

2016, 38(08): 1625-1632. doi:

Abstract ( 149 )

PDF (723KB) ( 365 ) 　　

Review attachment

To meet the changeable reality of task sets for the runtime system and needs of non-ignorable switching overhead for processor state, we propose an overhead-aware energy-efficient real-time scheduling algorithm called a dynamic slack reclamation based overhead-aware energy-efficient real-time scheduling in multiprocessor systems (DSROM) for periodic tasks deployed on multi-core and multiprocessor systems. The main idea of the algorithm is to implement energy-efficient scheduling for real-time tasks at the initial time of each TL plane, and to reclaim dynamic slack time at the earlier completion time of a periodic task in each TL plane. Consequently, the algorithm can obtain a reasonable tradeoff between real-time constraint and energy-saving while guaranteeing the optimal feasibility of periodic tasks. Extensive simulation results demonstrate that the DSROM can guarantee the optimal feasibility of periodic tasks and save more energy on average than the existing algorithms when the total workload of the system exceeds a threshold, saving energy by about 20% at most.

A QoS heterogeneous multicast routing mechanism based on species evolutionary algorithm

LU Peng-fei 1，WANG Xing-wei2，LI Fu-liang1，MA Lian-bo2

2016, 38(08): 1633-1639. doi:

Abstract ( 127 )

PDF (565KB) ( 442 ) 　　

Review attachment

With the emergence of a large number of new network applications, traditional network technologies cannot meet the requirements of current applications on bandwidth, latency, error rate and etc. In this case, IP over DWDM optical Internet has become a research hotspot with unique characteristics and advantages. In this paper, we put forward a heterogeneous multicast routing mechanism for IP over DWDM optical Internet based on the species evolutionary algorithm. The scheme employs the theory of probability to solve the uncertainty problem of networks’state parameters information. We adopt the fuzzy method to support flexible QoS and design a bandwidth pricing method which considers the benefits of both users and network service providers. Simulation results show that the proposed mechanism can achieve good comprehensive performance indexes and effectively solve the QoS heterogeneous multicast routing problem in the IP over DWDM optical Internet.

Cognitive engine research based on simulated annealing and particle swarm optimization algorithm

XUE Meng-meng，MA Yong-tao，LIU Jing-hao

2016, 38(08): 1640-1646. doi:

Abstract ( 137 )

PDF (1035KB) ( 451 ) 　　

Review attachment

One of the basic functions of cognitive engine is to adjust adaptive wireless parameters and use multi-objective optimization strategies to achieve reliable communication in a dynamic environment according to complex wireless environments and service needs. Currently many studies focus on the genetic algorithm (GA) and its improvement, however, its slow convergence is not conducive to complex and high real-time systems. We propose an algorithm, called simulated annealing particle swarm optimization (SABPSO), which combines the simulated annealing algorithm and the particle swarm algorithm to search jointly for a better solution in an alternately iterative manner. It can effectively improve the convergence speed and overcome PSO's shortage of falling into local extreme values easily, thus enhancing the ability of global optimization. Finally, multi-carrier system simulation results in different communication scenarios show that the SABPSO outperforms the basic algorithms in convergence rate and average fitness.

An intelligent response mechanism of round-trip time delay changes in TCP transmission

CHEN Ting-ping,YU Wan-rong,WU Chun-qing

2016, 38(08): 1647-1653. doi:

Abstract ( 144 )

PDF (962KB) ( 456 ) 　　

Review attachment

To enhance the TCP transmission performance in high packet loss rate environment, we propose an intelligent response mechanism of round-trip time delay changes. The variation of round-trip time delay is normalized to a standard delay factor, and then the standard delay factor is used to amend the congestion window growth and the reduction rate, which makes the increasing rate of the congestion window adaptable to the round-trip time delay changes and can distinguish random packet losses from network congestion. The mechanism is implemented as a LINUX kernel module, which enables its quick deployment to all congestion control mechanisms based on AIMD. Simulation results show that the average throughput of the proposed mechanism is 57% higher than the cubic algorithm, so it can effectively improve the bandwidth utilization in high packet loss rate environment.

Trusted virtualization technology in embedded system and its application

ZHANG Ling-li，ZHANG Gong-xuan，WANG Tian-shu，CHENG Xiang

2016, 38(08): 1654-1660. doi:

Abstract ( 129 )

PDF (879KB) ( 451 ) 　　

Review attachment

Embedded systems have increasingly extensive applications in all fields. However, traditional security enhancement methods are unable to deal with various security issues. How to enhance the security of embedded systems becomes an urgent problem to be solved. In order to improve the security of embedded systems and promote their applications, we design and implement a trusted computing platform framework based on virtual trusted cryptography module, and the virtual trusted cryptography module and the trusted enhancement technology based on it are realized. We also propose and implement a session authentication method based on the virtual trusted cryptography module, extending the trust chain from the hardware operating system layer to the application software layer in the virtual domain. Experimental results show that the combination of the virtual trusted cryptography module and the physical trusted cryptography module can effectively ensure the security of embedded systems, virtual domains and applications.

Current Issue

Author center

Review center

Online journal