A columnar storage structure based on group sorting of key columns

Computer Engineering & Science

XU Tao, GU Yu, WANG Dong-sheng

(Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)

Received: 2016-04-13 Revised: 2016-06-19 Online: 2016-08-25 Published: 2016-08-25
Funding:
National Natural Science Foundation of China (61373025, 61303002)

Abstract

Abstract:

As the main storage medium for massive data, disks offer large capacity and low cost. However, disk I/O bandwidth lags far behind the growth rate of data and is increasingly the performance bottleneck of big data management systems. Optimizing the storage structure to improve read and write efficiency is therefore an important challenge for data management systems in the era of big data. In this paper, we present KCGSStore, a hybrid columnar storage structure based on group sorting of key columns. The relational table is partitioned into storage pools according to groups of key-column values, so that all records in a pool share the same value or value range on the key columns, and the pools are then merged column by column. After merging, the key columns are ordered pool by pool, so a conditional query can filter out irrelevant column values, which reduces the amount of data read and improves query performance. Meanwhile, using a pool-number index (the pool matrix), records can be reassembled at a small cost in time and space. Experimental results show that, compared with the ORCFile and Parquet storage structures, KCGSStore achieves varying degrees of improvement in storage space, data loading, and SQL querying.

Key words: Hadoop, columnar storage, group sorting, big data
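The pool idea described in the abstract can be illustrated with a minimal in-memory sketch. This is not the KCGSStore implementation: the function names `build_store` and `select_where_key_equals`, the `(key_value, start, end)` layout of the pool index, and the sample table are all assumptions made for illustration, and only exact-value pools on a single key column are shown.

```python
from collections import defaultdict


def build_store(rows, key_column):
    """Group rows into pools by key value, then merge the pools column by column.

    Hypothetical helper: the real on-disk layout is not given in the abstract.
    """
    pools = defaultdict(list)
    for row in rows:
        pools[row[key_column]].append(row)

    columns = {name: [] for name in rows[0]}
    pool_index = []  # one (key_value, start, end) entry per pool, ordered by key
    for key_value in sorted(pools):
        start = len(columns[key_column])
        for row in pools[key_value]:
            for name, value in row.items():
                columns[name].append(value)
        pool_index.append((key_value, start, len(columns[key_column])))
    return columns, pool_index


def select_where_key_equals(columns, pool_index, wanted, projected):
    """Read only the slices of the projected columns that belong to matching pools."""
    result = []
    for key_value, start, end in pool_index:
        if key_value != wanted:
            continue  # the whole pool is skipped without touching its values
        slices = [columns[name][start:end] for name in projected]
        result.extend(zip(*slices))  # reassemble records from aligned column slices
    return result


# Assumed sample table with "region" as the key column.
rows = [
    {"region": "north", "item": "a", "qty": 3},
    {"region": "south", "item": "b", "qty": 5},
    {"region": "north", "item": "c", "qty": 7},
    {"region": "east", "item": "d", "qty": 2},
]
columns, pool_index = build_store(rows, "region")
matches = select_where_key_equals(columns, pool_index, "north", ["item", "qty"])
```

In this sketch the per-pool offsets play the role of the index used for record reassembly: because every column stores its pools in the same order, the same `start` and `end` positions select aligned values in each projected column.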

XU Tao, GU Yu, WANG Dong-sheng. A columnar storage structure based on group sorting of key columns [J]. Computer Engineering & Science.
