• Journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)


A columnar storage structure based on group sorting of key columns

XU Tao, GU Yu, WANG Dong-sheng

  1. (Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)
  • Received: 2016-04-13 Revised: 2016-06-19 Online: 2016-08-25 Published: 2016-08-25
  • Supported by:

    National Natural Science Foundation of China (61373025, 61303002)

Abstract:

Disks, the main storage medium for massive data, offer large capacity and low cost, but disk I/O bandwidth lags far behind the rate of data growth and is increasingly the performance bottleneck of big data management systems. Optimizing the storage structure to improve read and write efficiency is therefore an important challenge for such systems in the era of big data. We present KCGSStore, a hybrid columnar storage structure based on group sorting of key columns. A relational table is partitioned into storage pools according to groups of the key columns, so that all records in a pool share the same value or value range on the key columns; the pools are then merged column by column. After merging, the key columns are ordered pool by pool, so a conditional query can skip irrelevant column values, which reduces the amount of data read and improves query performance. A pool-number index allows records to be reassembled at little cost in time and space. Experimental results show that, compared with ORCFile and Parquet, KCGSStore achieves varying degrees of improvement in storage space, data loading, and SQL query performance.

Key words: Hadoop, columnar storage, group sorting, big data
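
The grouping, column-wise pool merging, and pool-based filtering described in the abstract can be illustrated with a small in-memory sketch. This is not the authors' implementation: the abstract does not specify the on-disk layout or the Hadoop integration, and the names `build_pools`, `merge_pools`, `query`, and `pool_index` are hypothetical, chosen only for illustration. The sketch assumes a single key column and exact-value grouping by default.

```python
from collections import defaultdict


def build_pools(rows, key, bucket=lambda v: v):
    """Group rows into pools by key-column value.

    `bucket` (hypothetical) maps a key value to a pool label, so it can
    also express grouping by value range instead of by exact value.
    """
    pools = defaultdict(list)
    for row in rows:
        pools[bucket(row[key])].append(row)
    return pools


def merge_pools(pools, columns):
    """Merge pools column by column, in sorted pool order.

    Returns the column store plus a pool index of (pool_label, start, end)
    half-open offsets, which is enough to reassemble records later.
    """
    store = {c: [] for c in columns}
    pool_index = []
    offset = 0
    for label in sorted(pools):
        records = pools[label]
        for c in columns:
            store[c].extend(r[c] for r in records)
        pool_index.append((label, offset, offset + len(records)))
        offset += len(records)
    return store, pool_index


def query(store, pool_index, wanted, select):
    """Read only the column slices of pools whose label satisfies `wanted`."""
    out = []
    for label, start, end in pool_index:
        if wanted(label):  # whole pools that cannot match are skipped
            for i in range(start, end):
                out.append(tuple(store[c][i] for c in select))
    return out


rows = [
    {"city": "B", "id": 1, "amt": 10},
    {"city": "A", "id": 2, "amt": 20},
    {"city": "B", "id": 3, "amt": 30},
]
store, pool_index = merge_pools(build_pools(rows, "city"), ["city", "id", "amt"])
print(query(store, pool_index, lambda k: k == "B", ["id", "amt"]))
# prints [(1, 10), (3, 30)]
```

Because every record in a pool shares the same key value (or range), the predicate is evaluated once per pool rather than once per record, and the offsets in the pool index align the same positions across all columns.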