  • A journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Current Issue

    • Papers
      Research progress and trends of
      big data from a database perspective
      LI Zhanhuai1,WANG Guoren2,ZHOU Aoying3
      2013, 35(10): 1-11. doi:

      "Big Data" is one of the hottest topics in 2012. We try to detangle Big Data from the view of database researcher, and describe the concept of Big Data and the relationship between Big Data and traditional databases. Revisiting database research in the Big Data scene includes reinvestigating the concept of database and essential issues of database research, discussing the relationship between Hadoop and Big Data, database research and Big Data research as well. Through tracking the inspiration and development of Hadoop, we try to explain why it has been such a big deal in Big Data. The basic ideas of the report are: (1) Big Data is a general concept. Classification of Big Data is helpful to have a deep understanding of it; (2) Big Data research closely correlates to its applications; (3) Hadoop is an enlightening exploration for database research going back to file system; (4) the philosophy and methodology of Big Data is consistent with those of traditional databases.

      A survey on storage architectures and core
      algorithms for big data management on new storages 
      JIN Peiquan,HAO Xingjun,YUE Lihua
      2013, 35(10): 12-24. doi:

      Big data has been a hot topic in both academia and industry. At the same time, new storage media, such as flash memory and phase-change memory, are greatly changing the design and application of both software and hardware in modern computer systems. Big data management has to deal with many challenges, such as energy and performance, whereas new storage media are superior to traditional magnetic disks in many respects, including I/O latency and energy consumption. Therefore, researchers study big data management on new storage media and expect to solve critical issues in big data management, although many issues remain to be explored. We summarize the state-of-the-art studies in big data management over new storage media, and try to answer some key questions, e.g., "What new challenges and issues arise when new storage media are introduced into big data management?" and "Can we solve, or partially solve, the key issues in big data management by using new storage media?". In particular, we first discuss the special features of new storage media, and then present recent research advances in storage architectures and core algorithms for big data management over new storage media. Finally, some future research directions in this area are proposed, which are expected to provide useful references for future studies on big data management and new-storage-based data management.

      Reviewing the big data solution based on Hadoop ecosystem    
      CHEN Jirong,LE Jiajin
      2013, 35(10): 25-35. doi:

      Addressing big data requires dealing with three crucial problems: big data storage, big data analysis, and big data management. Firstly, the definitions of big data and of the Hadoop ecosystem are summarized. Secondly, how to face big data is discussed from the two aspects of commercial products and the Hadoop ecosystem. The paper focuses on reviewing the big data solution based on the Hadoop ecosystem: (1) HDFS, HBase, and OpenTSDB deal with storage problems; (2) Hadoop MapReduce (Hive) and HadoopDB handle analysis problems; and (3) Sqoop and Ganglia address management problems. For each component, its architecture, principles, and features are analyzed, and for defects or problems existing in some key components, we propose solutions, ideas, and viewpoints based on our research progress. It is predicted that the Hadoop ecosystem is the preferable solution for small and medium-sized enterprises.

      A maximum coverage method of
      trustworthy services based on social network    
      ZHANG Peiyun1,2,HUANG Bo3,Gong Xiuwen1
      2013, 35(10): 36-43. doi:

      Aiming at the problem that nodes and services involved in service coverage over social networks may not be credible, a trustworthy service coverage model is built. The model represents the relationship between social network nodes and services. After distinguishing excellent nodes from ordinary nodes, it enhances the credibility and the extent of service coverage through the excellent nodes. An optimal-path-finding algorithm is designed to find the optimal coverage path, which ensures node connectivity. With the excellent nodes used as source nodes, a service coverage algorithm is given to achieve maximum coverage of trustworthy services within a specified coverage radius. We evaluate the performance of our approach under the social network service coverage model. The experimental results demonstrate the effectiveness and efficiency of our approach.
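
      Illustration (not the authors' algorithm): a short Python sketch of a generic greedy maximum-coverage procedure over a social graph, in which each candidate source node covers the nodes reachable within a given radius and sources are picked greedily to maximize newly covered trustworthy nodes. The graph encoding and all names are assumptions.

      from collections import deque

      def nodes_within_radius(graph, source, radius):
          """BFS: the set of nodes reachable from source within `radius` hops."""
          seen = {source}
          frontier = deque([(source, 0)])
          while frontier:
              node, dist = frontier.popleft()
              if dist == radius:
                  continue
              for nbr in graph.get(node, ()):
                  if nbr not in seen:
                      seen.add(nbr)
                      frontier.append((nbr, dist + 1))
          return seen

      def greedy_trustworthy_coverage(graph, excellent_nodes, trustworthy, radius, k):
          """Greedily pick up to k excellent source nodes that maximize the number
          of trustworthy nodes covered within the given radius."""
          covered, chosen = set(), []
          for _ in range(k):
              best, best_gain = None, 0
              for src in excellent_nodes:
                  if src in chosen:
                      continue
                  reach = nodes_within_radius(graph, src, radius) & trustworthy
                  gain = len(reach - covered)
                  if gain > best_gain:
                      best, best_gain = src, gain
              if best is None:
                  break
              chosen.append(best)
              covered |= nodes_within_radius(graph, best, radius) & trustworthy
          return chosen, covered

      # Toy usage: adjacency lists; nodes 0 and 4 are "excellent", node 3 is untrusted.
      g = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1], 4: [5], 5: [4]}
      print(greedy_trustworthy_coverage(g, {0, 4}, {0, 1, 2, 4, 5}, radius=2, k=2))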

      Interaction relation based user
      tag prediction in Microblogging site         
      WANG Xiang1,JIA Yan1,ZHOU Bin1,CHEN Ruhua1,HAN Yi2
      2013, 35(10): 44-50. doi:

      In today's social networks, which take users as their core, tags are important for users to mark or classify resources. On the Sina microblogging website, users can freely tag themselves to indicate their interests and characteristics. User tags play an important role in network marketing, recommender systems, and advertisement serving. To address the issue that most users on the Sina microblogging website have no tags or only a few tags, a tag prediction method is proposed based on an interaction graph generated from the interactions between users. Experimental results on randomly generated test datasets show that our tag prediction method performs better than the most commonly used method.
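
      A minimal Python sketch of the general idea (not the paper's exact method): candidate tags for a user are scored by aggregating the tags of that user's neighbors in the interaction graph, weighted by interaction strength. The graph format, weights, and tag sets are assumptions.

      from collections import Counter

      def predict_tags(user, interaction_graph, user_tags, top_n=3):
          """interaction_graph: {user: {neighbor: interaction_weight}}
          user_tags: {user: set_of_tags}.
          Returns the top_n tags with the highest neighbor-weighted score."""
          scores = Counter()
          for neighbor, weight in interaction_graph.get(user, {}).items():
              for tag in user_tags.get(neighbor, ()):
                  if tag not in user_tags.get(user, ()):   # only predict missing tags
                      scores[tag] += weight
          return [tag for tag, _ in scores.most_common(top_n)]

      # Toy usage: "u0" interacts heavily with "u1" and only lightly with "u2".
      graph = {"u0": {"u1": 5.0, "u2": 1.0}}
      tags = {"u0": set(), "u1": {"machine learning", "python"}, "u2": {"travel"}}
      print(predict_tags("u0", graph, tags))   # e.g. ['machine learning', 'python', 'travel']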

      Optimal selection algorithm in quality inspection plan
      of big marine data based on Block-Nested-Loops               
      HUANG Dongmei,CHEN Kuo,WANG Zhenhua,SHI Lili
      2013, 35(10): 51-57. doi:

      Big marine data has several typical characteristics, such as large volume, multiple sources, multiple dimensions, and multiple types. How to design an optimal quality inspection plan quickly and control ocean data in a timely manner is becoming increasingly important for applications of big marine data. Based on the skyline operator, a method is proposed to select the optimal quality inspection plan for big marine data. Firstly, the residual of the acceptance quality probability of each quality inspection plan for big ocean data is calculated with a hypergeometric distribution model. Secondly, the optimal quality inspection plan is selected with the Block-Nested-Loops (BNL) algorithm, which compares the residuals of the acceptance quality probabilities of the quality inspection plans one by one. Finally, the proposed method is verified by inspecting the quality of big marine data collected by monitoring sites in a certain sea area.
      Key words: big marine data; quality inspection; block-nested-loops algorithm; residuals
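
      To make the selection criterion concrete, here is an illustrative Python computation (assumed notation, not the paper's code) of the acceptance probability of a single-sampling plan under the hypergeometric distribution, and of the residual against a target probability that a skyline/BNL comparison would then rank plans by.

      from math import comb

      def acceptance_probability(lot_size, defectives, sample_size, acceptance_number):
          """P(accept) = P(X <= c) when a sample of size n is drawn without replacement
          from a lot of N items containing D defectives (hypergeometric distribution)."""
          total = comb(lot_size, sample_size)
          return sum(
              comb(defectives, k) * comb(lot_size - defectives, sample_size - k)
              for k in range(acceptance_number + 1)
          ) / total

      def residual(plan, lot_size, defectives, target=0.95):
          """Residual of a plan (n, c): distance of its acceptance probability from the target."""
          n, c = plan
          return abs(acceptance_probability(lot_size, defectives, n, c) - target)

      # Toy usage: a lot of 1000 records with 20 assumed defective; rank two plans by residual.
      plans = [(50, 1), (80, 2)]
      print(sorted(plans, key=lambda p: residual(p, lot_size=1000, defectives=20)))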

      Research of Redisbased distributed
      storage method for massive small files 
      LIU Gaojun,WANG Diao
      2013, 35(10): 58-64. doi:

      As an important vehicle for transmitting and storing information, small files are widely used in many fields, and the requirements on their reliability and access speed keep growing. To address the inefficiency of small file storage, we combine the large-file storage advantages of the distributed storage system HDFS with Redis caching and propose a fast small-file merging scheme. Small files are merged into a SequenceFile, which is then stored in HDFS. Loads are balanced by load coefficients determined through multiple linear regression analysis, and the efficiency of file access is guaranteed by the cache. In the experiments, a corresponding file platform is constructed to analyze and compare uploading, access, deletion, and memory footprint against traditional direct upload. Compared with the traditional way of uploading files to HDFS, the improved small-file treatment ensures file reliability and makes user operations on small files faster.
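
      A simplified, self-contained Python sketch of the merging idea: in the paper the merged container is a SequenceFile in HDFS and the index is cached in Redis, while here a local file and an in-memory dict stand in for both, so all names are illustrative.

      import os

      def merge_small_files(small_file_paths, container_path):
          """Append each small file to one container file and record (offset, length)
          per file name; in the paper's setting the container would be an HDFS
          SequenceFile and the index entries would be cached in Redis."""
          index = {}
          with open(container_path, "wb") as container:
              for path in small_file_paths:
                  data = open(path, "rb").read()
                  index[os.path.basename(path)] = (container.tell(), len(data))
                  container.write(data)
          return index

      def read_small_file(container_path, index, name):
          """Random access to one merged file via its (offset, length) index entry."""
          offset, length = index[name]
          with open(container_path, "rb") as container:
              container.seek(offset)
              return container.read(length)

      # Toy usage: merge two small files, then fetch one back by name.
      for name, text in [("a.txt", b"hello"), ("b.txt", b"world")]:
          open(name, "wb").write(text)
      idx = merge_small_files(["a.txt", "b.txt"], "merged.bin")
      print(read_small_file("merged.bin", idx, "b.txt"))   # b'world'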

      ECLHadoop: efficient big data processing strategy
      based on Hadoop for electronic commerce logistics    
      WEI Feifei
      2013, 35(10): 65-71. doi:

      With the rapid development of cloud computing, more and more electronic commerce applications are confronted with the problem of processing big data, such as the big data from social media posted by the customers of electronic commerce logistics. In order to improve big data processing efficiency in electronic commerce logistics, an efficient Hadoop-based big data processing strategy, named ECLHadoop, is designed. In ECLHadoop, closely related data blocks are placed on the same nodes, which helps reduce the MapReduce I/O cost, especially at the shuffle stage. Simulation results show that ECLHadoop improves big data computing efficiency for data-intensive analysis in the electronic commerce logistics service.
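
      A minimal Python sketch of the placement idea (not the ECLHadoop implementation): blocks that share a relatedness key, e.g. the same customer or logistics order, are hashed to the same node, so a later MapReduce job grouping on that key shuffles less data. The key choice and node list are assumptions.

      import hashlib

      def place_block(relatedness_key, nodes):
          """Map every block sharing a relatedness key to the same node via a stable hash."""
          digest = hashlib.md5(relatedness_key.encode("utf-8")).hexdigest()
          return nodes[int(digest, 16) % len(nodes)]

      # Toy usage: blocks belonging to the same customer land on the same node.
      nodes = ["node-1", "node-2", "node-3"]
      for block_id, customer_id in [("b1", "cust-42"), ("b2", "cust-42"), ("b3", "cust-7")]:
          print(block_id, "->", place_block(customer_id, nodes))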

      Software testing architecture design
      based on Hadoop cloud computing platform        
      PAN Hui,ZHU Xinzhong,ZHAO Jianmin,XU Huiying
      2013, 35(10): 72-78. doi:

      Cloud computing can play an important role in software testing technology because of its high reliability and the benefits of its on-demand access service pattern. It can also provide a new, efficient, and low-cost solution for problems that are hard to address when deploying software testing. A novel idea called "testing as a service" is proposed, which adopts the idea of "software as a service" and extends its advantages to testing. Testing as a service is realized as a layered software testing architecture based on the Hadoop cloud computing platform. A comparative experiment is conducted to evaluate the load efficiency. The results demonstrate fast response times and good performance in distributing the workload when requests are received from multiple users.

      Granularity transform query on association rules
      mined from uniformly distributed uncertain data
      CHEN Aidong,LIU Guohua,XIAO Rui,WAN Xiaomei,SHI Danni
      2013, 35(10): 79-88. doi:

      Cloud computing provides a platform for association rule mining and querying over big data. Data often contains artificially added uncertainty to prevent information disclosure. How to allow users to transparently query the results of association rules mined from uncertain data is an urgent problem in querying big data mining results. Uncertain big data prepared for sharing acquires a uniformly distributed characteristic by generalizing precise data; this characteristic is not conducive to exact queries but offers convenience for querying the association rule mining result set. Firstly, the association rule library is built by the UFIDM algorithm, and R-tree indexes are constructed separately for the generalized identifiers and the sensitive attributes in order to improve query efficiency. Secondly, a generalization-value granularity transform method and the UARS query algorithm are proposed on this basis. Finally, theoretical analysis and experimental results demonstrate the feasibility and effectiveness of the algorithm.

      Multitenant data dynamic migration strategy
      of SaaS application in cloud          
      REN Xiaojun1,ZHENG Yongqing1,2,KONG Lanju1
      2013, 35(10): 89-97. doi:

      In order to maximize profit, SaaS application service providers are motivated to pack many tenants onto a single data node. Moreover, the resource footprint of each tenant is dynamic, which can lead to hot spots where the resource usage of a data node becomes overloaded and the Service Level Agreement (SLA) is violated. An effective solution is to perform data migration to prevent this from happening. Unfortunately, traditional database migration techniques for tenant data do not meet the characteristics of multi-tenancy well. Therefore, a multi-tenant dynamic data migration strategy for SaaS applications in the cloud is proposed, which is aware of multi-tenancy characteristics. In order to ensure continuous data access by tenants to the source data node and the target data node, a single-write dual-read mode is adopted, so that QoS is not affected. We also extend the traditional two-phase commit protocol to ensure data consistency between the source data node and the target data node. The effectiveness of our multi-tenant dynamic data migration strategy is evaluated by experiments.
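
      An illustrative Python sketch of the single-write dual-read idea during migration (not the paper's implementation; the store interfaces are assumptions): while a tenant's data is being copied, writes go only to the source node, and reads may be served from either copy. One possible read policy is shown; the paper's extended two-phase commit is what actually keeps the two copies consistent.

      class TenantRouter:
          """Routes one tenant's reads and writes while its data migrates."""

          def __init__(self, source, target):
              self.source = source      # dict-like store on the source data node
              self.target = target      # dict-like store on the target data node
              self.migrating = True

          def write(self, key, value):
              # Single write: during migration all writes go to the source node only;
              # the migration job (re)copies changed rows to the target node.
              self.source[key] = value

          def read(self, key):
              # Dual read: serve from the target when the record has been copied,
              # otherwise fall back to the source node.
              if self.migrating and key in self.target:
                  return self.target[key]
              return self.source[key]

          def finish_migration(self):
              # After the copy commits, the target becomes the tenant's primary node.
              self.migrating = False
              self.source, self.target = self.target, self.source

      # Toy usage.
      router = TenantRouter(source={"t1:order:1": "paid"}, target={})
      router.write("t1:order:2", "shipped")     # goes to the source node
      print(router.read("t1:order:1"))          # not copied yet -> served from source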

      A resource-associated cloud service selection mechanism
      ZHAO Pengfei1,WANG Zhijian1,2,YE Feng1,2,DU Jingjing1
      2013, 35(10): 98-103. doi:

      The emergence of cloud computing pushes Web services onto a broader platform. On a cloud computing platform, the dynamic allocation of virtual resources makes the running environment of Web services more varied. When the platform is overloaded, services may fail due to a lack of resources, or even bring the system down. From the perspective of users and system security, resource status becomes a primary factor to be considered in the selection process, so choosing a safe and appropriate service based only on QoS is no longer effective. To solve this problem, both QoS and the resource status of the virtual machines are taken into account. Based on information about the CPU and memory usage of the virtual resources, the proposed service selection algorithm selects a service. Experimental results show that the service selection algorithm that takes resource state information into account can obtain the best service under the corresponding state and answer customers' requests more quickly.
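
      A minimal Python sketch of a combined ranking (the weights, metrics, and scaling are assumptions, not the paper's model): each candidate gets a score mixing its normalized QoS with the free CPU and memory headroom of the virtual machine hosting it, and the highest-scoring service is selected.

      def service_score(qos, cpu_usage, mem_usage, w_qos=0.5, w_cpu=0.25, w_mem=0.25):
          """qos in [0, 1] (higher is better); cpu_usage and mem_usage in [0, 1] (lower is better)."""
          return w_qos * qos + w_cpu * (1.0 - cpu_usage) + w_mem * (1.0 - mem_usage)

      def select_service(candidates):
          """candidates: list of (name, qos, cpu_usage, mem_usage). Returns the best name."""
          return max(candidates, key=lambda c: service_score(c[1], c[2], c[3]))[0]

      # Toy usage: s2 has slightly lower QoS but runs on a far less loaded VM, so it wins.
      candidates = [("s1", 0.90, 0.95, 0.90), ("s2", 0.85, 0.30, 0.40)]
      print(select_service(candidates))   # 's2'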

      A QoSQoE correlation model for streaming
      media services based on cloud model         
      ZHOU Xiaomao,DU Haiqing,LIU Yong
      2013, 35(10): 104-109. doi:

      As streaming media services become widely used, the relationship between QoS indexes and user-perceived QoE is becoming increasingly important. Nevertheless, today's mapping models do not take the various indexes into consideration together, and they ignore the fuzziness and randomness in both the objective indexes and the subjective user perception. A new evaluation model based on the cloud model is introduced: each QoS index is described by a one-dimensional cloud, and then the corresponding multi-dimensional standard judging clouds and the multi-dimensional index cloud of the system are established. The results are obtained by comparing the similarity between the standard clouds and the index cloud of the system. The experiment shows that this model can reflect the impact of each index on users' perception with both the one-dimensional and the multi-dimensional index correlation models.
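
      For context, a Python sketch of the standard one-dimensional forward normal cloud generator on which such models are usually built (the paper's multi-dimensional construction and similarity comparison are not reproduced): each cloud drop is a sample x with a membership degree derived from the expectation Ex, entropy En, and hyper-entropy He.

      import math
      import random

      def forward_normal_cloud(ex, en, he, n_drops=1000):
          """Generate cloud drops (x, membership) for a 1-D normal cloud C(Ex, En, He)."""
          drops = []
          for _ in range(n_drops):
              en_prime = abs(random.gauss(en, he)) or 1e-12   # per-drop entropy, kept positive
              x = random.gauss(ex, en_prime)                  # drop position
              mu = math.exp(-(x - ex) ** 2 / (2 * en_prime ** 2))  # membership degree
              drops.append((x, mu))
          return drops

      # Toy usage: a "delay around 100 ms" QoS index cloud; drops near Ex get high membership.
      for x, mu in forward_normal_cloud(ex=100.0, en=10.0, he=1.0, n_drops=5):
          print(f"x = {x:6.1f}  membership = {mu:.3f}")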

      Efficient Topk query algorithm on massdistributed data      
      WEI Xianquan,ZHENG Hongyuan,DING Qiulin
      2013, 35(10): 110-115. doi:

      To address the shortcomings of existing distributed Top-k query algorithms, a novel top-k algorithm, named ECHT, is proposed, which is appropriate for massive distributed data. Taking the data distribution into account, the ECHT algorithm designs a new error-limited histogram. On the one hand, it avoids poor performance on unevenly distributed data; on the other hand, it improves the accuracy of the threshold value, thus further reducing network bandwidth consumption. In addition, ECHT performs early pruning: pruning before large amounts of data are transmitted brings better performance by avoiding much useless data transmission. Experiments on real datasets demonstrate the viability and superior performance of the new algorithm.
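
      As a generic illustration of the setting (not the ECHT algorithm itself): when every record is stored at exactly one site, the union of the local top-k lists is guaranteed to contain the global top-k, so each site ships only k records; histogram- and threshold-based schemes such as the one described above reduce the shipped data further. A short Python sketch, with an assumed data layout:

      import heapq

      def local_topk(records, k):
          """records: iterable of (item, score) pairs held at one site."""
          return heapq.nlargest(k, records, key=lambda r: r[1])

      def distributed_topk(sites, k):
          """Merge each site's local top-k; correct when each item lives at one site."""
          candidates = []
          for site_records in sites:
              candidates.extend(local_topk(site_records, k))
          return heapq.nlargest(k, candidates, key=lambda r: r[1])

      # Toy usage: three sites, global top-2 by score.
      sites = [
          [("a", 0.9), ("b", 0.2)],
          [("c", 0.8), ("d", 0.7)],
          [("e", 0.1)],
      ]
      print(distributed_topk(sites, k=2))   # [('a', 0.9), ('c', 0.8)]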

      A dynamic optimization approach
      for service agent plan library      
      XU Qianyuan,CAO Jian,WANG Lei
      2013, 35(10): 116-124. doi:

      A service agent provides integrated and more powerful services by relying on the multiple services it manages, and it can improve the intelligence of the service computing environment. The capability of a service agent is based on a set of service plans, which are organized into a plan library. In order to react to service requests, a service agent may use a predefined plan directly, compose existing plans, or generate a new plan on demand with a search algorithm. Therefore, which plans to store and how to update the plan models in the library are very important for improving the efficiency and lowering the space cost of a service agent. A suffix-tree-based optimization approach is proposed, building on structure-tree-based plan representations. The algorithms, their complexity analysis, and experiments are also presented. The experimental results show that the mechanism can improve efficiency by discovering common plans.
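
      A much-simplified Python stand-in for the suffix-tree step (illustration only, with an assumed plan encoding): count how many plans share each contiguous action subsequence; fragments shared by many plans are candidates for being stored once in the plan library. A real suffix tree finds these without the quadratic enumeration used below.

      from collections import Counter

      def shared_fragments(plans, min_len=2, min_support=2):
          """plans: list of action sequences (tuples of action names).
          Returns contiguous fragments that occur in at least min_support plans."""
          support = Counter()
          for plan in plans:
              fragments_in_plan = set()
              for i in range(len(plan)):
                  for j in range(i + min_len, len(plan) + 1):
                      fragments_in_plan.add(plan[i:j])
              support.update(fragments_in_plan)     # count each fragment once per plan
          return [frag for frag, cnt in support.items() if cnt >= min_support]

      # Toy usage: two service plans sharing the ("authenticate", "query") fragment.
      plans = [
          ("authenticate", "query", "aggregate", "reply"),
          ("authenticate", "query", "cache", "reply"),
      ]
      print(shared_fragments(plans))   # [('authenticate', 'query')]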

      Design and implementation of
      a least spare time scheduler for Hadoop       
      YANG Hao,TENG Fei,LI Tianrui,LI Zhao
      2013, 35(10): 125-130. doi:

      As an open-source platform for cloud computing, Hadoop is widely used in many fields, such as natural language processing, machine learning, and large-scale image processing. With the increase in the types of cloud services, cloud users have stronger real-time requirements. Most existing schedulers are designed to shorten the response time and cannot guarantee a specific deadline. The Least Spare-time Scheduler (LSS) is designed and implemented to improve the performance of hard real-time jobs in Hadoop. The spare time is estimated dynamically, and LSS updates the priorities of the jobs in the job queue in real time. Experimental results show that LSS can improve the success ratio of the cluster in dealing with hard real-time jobs.
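
      A minimal Python sketch of the priority rule (assumed field names, not the Hadoop scheduler code): a job's spare time is its deadline minus the current time minus its estimated remaining execution time, and the job with the least spare time is scheduled first.

      import time

      class Job:
          def __init__(self, name, deadline, estimated_remaining):
              self.name = name
              self.deadline = deadline                        # absolute time in seconds
              self.estimated_remaining = estimated_remaining  # re-estimated as tasks finish

          def spare_time(self, now):
              return self.deadline - now - self.estimated_remaining

      def next_job(jobs, now=None):
          """Pick the job with the least spare time; ties are broken by name."""
          now = time.time() if now is None else now
          return min(jobs, key=lambda j: (j.spare_time(now), j.name))

      # Toy usage: job B has a later deadline but much more remaining work, so it runs first.
      jobs = [Job("A", deadline=100.0, estimated_remaining=20.0),    # spare time = 80
              Job("B", deadline=150.0, estimated_remaining=140.0)]   # spare time = 10
      print(next_job(jobs, now=0.0).name)   # 'B'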

      Application of artificial fish-school algorithm
      in overlapping community detection        
      WANG Yiping,SUN Ming
      2013, 35(10): 131-136. doi:

      With the emergence of big data even in small businesses, treating complex networks as a model of complex systems has become very popular, and community detection is one of the most important issues. However, most existing community detection algorithms assume that communities do not overlap. Aiming at the common phenomenon of overlapping communities, an overlapping community detection algorithm, named AFSCDA, is proposed based on the artificial fish-school algorithm. In the initialization phase, a label propagation algorithm is applied to the optimization variables of each artificial fish to adjust the coding and avoid illegal communities. A modified form of the modularity function Q is used as the fitness function. In the experiments, the algorithm is applied to three classic datasets with known community structures to demonstrate its effectiveness, its higher accuracy, and its ability to detect potential community structures in networks quickly.
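
      For reference, a plain Python implementation of the (non-overlapping) Newman modularity Q from which such fitness functions are derived; the modified form used for overlapping communities in the paper is not reproduced here. The graph and community encodings are assumptions.

      def modularity(adjacency, communities):
          """adjacency: {node: set_of_neighbors} for an undirected, unweighted graph.
          communities: {node: community_id}.  Returns Newman's modularity Q."""
          two_m = sum(len(nbrs) for nbrs in adjacency.values())   # 2 * number of edges
          degree = {node: len(nbrs) for node, nbrs in adjacency.items()}
          q = 0.0
          for i in adjacency:
              for j in adjacency:
                  if communities[i] != communities[j]:
                      continue
                  a_ij = 1.0 if j in adjacency[i] else 0.0
                  q += a_ij - degree[i] * degree[j] / two_m
          return q / two_m

      # Toy usage: two triangles joined by one edge, split into their natural communities.
      adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1, 4, 5}, 4: {3, 5}, 5: {3, 4}}
      comm = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
      print(round(modularity(adj, comm), 3))   # about 0.357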

      An ensemble multilabel classification
      method using feature selection           
      LI Ling1,LIU Huawen2,MA Zongjie2,ZHAO Jianmin1
      2013, 35(10): 137-143. doi:

      Similar to traditional learning methods, multi-label learning also suffers from problems such as overfitting and the curse of dimensionality, which arise from the high dimensionality of data. Although many multi-label learning algorithms have been proposed, the issue of high dimensionality has not yet received enough attention. To address this problem, we exploit the correlation between features and labels using conditional mutual information and then perform feature selection on the data. Furthermore, a new ensemble learning algorithm for multi-label data is proposed. Experimental results on several multi-label datasets show that the proposed algorithm outperforms well-established multi-label learning algorithms in most cases.
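
      A compact Python illustration of the feature-selection step for discrete features (plain mutual information is used here for brevity; the paper uses conditional mutual information with an ensemble on top, and all names below are assumptions): features are ranked by their mutual information with a label and the top ones are kept.

      import math
      from collections import Counter

      def mutual_information(xs, ys):
          """I(X; Y) in bits for two equal-length sequences of discrete values."""
          n = len(xs)
          px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
          return sum(
              (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
              for (x, y), c in pxy.items()
          )

      def top_features(feature_columns, label_column, k=2):
          """feature_columns: {feature_name: list_of_values}. Rank features by MI with the label."""
          ranked = sorted(feature_columns,
                          key=lambda f: mutual_information(feature_columns[f], label_column),
                          reverse=True)
          return ranked[:k]

      # Toy usage: f1 matches the label exactly, f2 is uncorrelated noise.
      features = {"f1": [0, 0, 1, 1], "f2": [0, 1, 0, 1]}
      label = [0, 0, 1, 1]
      print(top_features(features, label, k=1))   # ['f1']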

      Lowmemory and reducedcomplexity discrete
      wavelet  transform with decomposed flipping structure        
      LI Jingxu1,ZHANG Xiongming2
      2013, 35(10): 144-148. doi:

      By decomposing the Flipping Structure (FS) into an even phase and an odd phase, this paper proposes an on-the-fly implementation of the DWT, called the Decomposed Flipping Structure (DFS), which is characterized by a low memory budget, low computational complexity, and a balanced workload. For the CDF 9/7 wavelet filter bank popular in wavelet-based image/video coding schemes, the DFS has the same computational complexity as the FS, while the memory requirement is reduced from six memory cells to five. The experimental results show that the proposed implementation achieves speedups of about 44% and 14% compared with the traditional lifting scheme and the on-the-fly LS, respectively.
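
      For reference, a plain Python single-level CDF 9/7 forward transform using the conventional lifting scheme, which is the baseline the paper compares against (the decomposed flipping structure itself is not reproduced). The coefficients are the usual published constants; the boundary handling here is simple symmetric extension, and the normalization convention is one of several in common use.

      # CDF 9/7 lifting coefficients (Daubechies & Sweldens factorization).
      ALPHA, BETA = -1.586134342, -0.05298011854
      GAMMA, DELTA = 0.8829110762, 0.4435068522
      K = 1.149604398

      def cdf97_forward(signal):
          """One level of the forward CDF 9/7 DWT via lifting on an even-length signal.
          Returns (lowpass, highpass)."""
          assert len(signal) % 2 == 0 and len(signal) >= 4
          s = list(signal[0::2])   # even samples -> will become the lowpass band
          d = list(signal[1::2])   # odd samples  -> will become the highpass band
          half = len(d)

          def at(seq, i):          # clamp indices that fall off either end (simple extension)
              return seq[min(max(i, 0), len(seq) - 1)]

          for i in range(half):    # predict 1
              d[i] += ALPHA * (s[i] + at(s, i + 1))
          for i in range(half):    # update 1
              s[i] += BETA * (at(d, i - 1) + d[i])
          for i in range(half):    # predict 2
              d[i] += GAMMA * (s[i] + at(s, i + 1))
          for i in range(half):    # update 2
              s[i] += DELTA * (at(d, i - 1) + d[i])
          return [K * v for v in s], [v / K for v in d]

      # Toy usage: a constant signal puts (almost) all of its energy in the lowpass band.
      low, high = cdf97_forward([1.0] * 8)
      print([round(v, 4) for v in low], [round(v, 4) for v in high])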

      Design and implementation of
      PCB soft startup of Tianhe computer system      
      SONG Fei,WANG Fayuan,HU Shiping,YAO Xinan
      2013, 35(10): 149-153. doi:

      This paper designs and implements a soft startup for circuit boards with a 12 V power supply in the Tianhe high-performance computer system. Using the latest control techniques, we design a hot-swap circuit for large currents and carry out the corresponding tests. This technique has been applied in designing various PCBs in the Tianhe computer system.