• 中国计算机学会会刊
  • 中国科技核心期刊
  • 中文核心期刊

Computer Engineering & Science

Previous Articles     Next Articles

A columnar storage structure based on group sorting of key columns   

XU Tao,GU Yu,WANG Dong-sheng   

  1. (Department of Computer Science and Technology,Tsinghua University,Beijing 100084,China)
  • Received:2016-04-13 Revised:2016-06-19 Online:2016-08-25 Published:2016-08-25

Abstract:

As the main storage medium for massive data, disks have the advantages of large capacity and low cost. However, the I/O bandwidth of disks lags far behind the growing speed of data, which thus becomes the performance bottleneck of big data management systems. Therefore, optimizing the storage structure to improve the efficiency of writing and reading has become one important challenge in the era of big data. In this paper, we present a columnar storage structure based on key columns group sorting called KCGSStore. According to the groups of the key columns, the tables are divided into pools and the records in the same pool have the same value or value range. All pools belonging to one group are merged, and then the key columns are orderly arranged by taking the pool as the unit. In this way, irrelevant column values can be effectively filtered when executing SQL commands so as to reduce the amount of data being read. Consequently, the query performance can be improved. Meanwhile, using the pool matrix, we can reorganize the records at very little cost of time and space. Evaluation results show that compared with the ORCFile and the Parquet, the KCGSStore is superior in many aspects, including storage space, data loading and SQL querying.

Key words: Hadoop, columnar storage, group ranking, big data