A columnar storage structure based on group sorting of key columns

Abstract

As the main storage medium for massive data, disks offer large capacity at low cost. However, disk I/O bandwidth has not kept pace with the growth of data volume, which makes it the performance bottleneck of big data management systems. Optimizing the storage structure to improve read and write efficiency has therefore become an important challenge in the era of big data. In this paper, we present KCGSStore, a columnar storage structure based on group sorting of key columns. Tables are divided into pools according to the groups of the key columns, so that the records in one pool share the same key value or value range. All pools belonging to one group are merged, and the key columns are then arranged in order with the pool as the unit. As a result, irrelevant column values can be filtered out effectively when SQL commands are executed, which reduces the amount of data read and improves query performance. In addition, the pool matrix allows records to be reorganized at very little cost in time and space. Evaluation results show that KCGSStore outperforms ORCFile and Parquet in storage space, data loading, and SQL querying.
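The pool-based filtering described in the abstract can be illustrated with a minimal sketch. It is not the KCGSStore implementation: the fixed-width range grouping, the per-pool min/max bounds, the function names `build_pools` and `scan`, and the sample table are all assumptions made for illustration, and the pool matrix is not modeled.

```python
from collections import defaultdict


def build_pools(rows, key, width):
    """Group rows into pools by key-value range and store each pool column-wise.

    Assumption: a pool covers a fixed-width range of the key column.
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key] // width].append(row)

    pools = []
    for bucket_id in sorted(buckets):  # pools are ordered by key range
        members = sorted(buckets[bucket_id], key=lambda r: r[key])
        columns = {name: [r[name] for r in members] for name in members[0]}
        pools.append({
            "min": columns[key][0],   # smallest key in the pool
            "max": columns[key][-1],  # largest key in the pool
            "columns": columns,
        })
    return pools


def scan(pools, key, low, high, column):
    """Return `column` values where low <= key <= high, plus the number of pools read."""
    values, pools_read = [], 0
    for pool in pools:
        # Skip pools whose key range cannot contain a matching record.
        if pool["max"] < low or pool["min"] > high:
            continue
        pools_read += 1
        keys = pool["columns"][key]
        wanted = pool["columns"][column]
        values.extend(v for k, v in zip(keys, wanted) if low <= k <= high)
    return values, pools_read


# Hypothetical sample table.
rows = [
    {"user_id": 17, "city": "Beijing", "amount": 30},
    {"user_id": 3, "city": "Wuhan", "amount": 12},
    {"user_id": 25, "city": "Xi'an", "amount": 7},
    {"user_id": 8, "city": "Beijing", "amount": 55},
    {"user_id": 12, "city": "Wuhan", "amount": 41},
]
pools = build_pools(rows, key="user_id", width=10)
amounts, pools_read = scan(pools, "user_id", 10, 19, "amount")
print(amounts, pools_read, len(pools))
```

In this sketch, a range predicate on the key column touches only one of the three pools, and only the key and requested columns of that pool are read, which is the kind of reduction in data read that the abstract attributes to group sorting.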

Keywords: Hadoop, columnar storage, group sorting, big data

XU Tao, GU Yu, WANG Dong-sheng. A columnar storage structure based on group sorting of key columns [J]. Computer Engineering & Science.

[1]	CHEN Qiaoan1,LI Feng1,CAO Yue1,LONG Mingsheng1,2. Parameter optimization for Spark jobs based on runtime data analysis [J]. J4, 20160101, 38(01): 11-19.
[2]	SU Li,SUN Yanmeng,ZHANG Bowei,YANG Xianbo,ZHU Ying. A correlator implementation method based on Hadoop+CUDA [J]. J4, 20160101, 38(01): 46-51.
[3]	YANG Hao-yi, CHEN Wei, YAO Ze-huan, TAN Yu-song, LI Fei. Antifungal drug discovery base on transcriptome data of cell response [J]. Computer Engineering & Science, 2023, 45(02): 246-251.
[4]	GE Xu-ran, LIU Yang, CHEN Zhi-guang, XIAO Nong. A parallel large dataset generator based on MPI [J]. Computer Engineering & Science, 2022, 44(07): 1152-1161.
[5]	LIU Shi-yuan, LI Yun-chun, CHEN Chen, YANG Hai-long. Architecture combining active & passive performance evaluation methods and its implementation for big data storage [J]. Computer Engineering & Science, 2022, 44(04): 584-593.
[6]	YANG Bo-ai, ZHAO Shan, LIU Fang. A survey on serverless computing [J]. Computer Engineering & Science, 2022, 44(04): 611-619.
[7]	Lv Gao-feng, WANG Yu-peng, YANG Rong-jia, TANG Zhu. Design of aggregation-based FlowRadar acceleration model for network data collection [J]. Computer Engineering & Science, 2022, 44(02): 220-226.
[8]	ZHANG Yuan-ming, YU Jia-rui, LU Jia-wei, GAO Fei, XIAO Gang. A parallel processing approach for video big data based on Spark Streaming framework [J]. Computer Engineering & Science, 2021, 43(10): 1736-1743.
[9]	HUANG Shan, , FANG Liu-yi, , XU Hao-tong, DUAN Xiao-dong, . Task scheduling optimization of Flink in container environment [J]. Computer Engineering & Science, 2021, 43(07): 1173-1184.
[10]	. Identity calibration of E-commerce big data based on long short-term memory network [J]. Computer Engineering & Science, 2021, 43(03): 407-415.
[11]	ZHAO Jun-sheng, WANG Xin-yu, YIN Yu-jie, ZHANG Lin. A distributed retrieval method based on Mongolian news domain ontology [J]. Computer Engineering & Science, 2021, 43(03): 560-570.
[12]	LI Qiong, SONG Zhen-long, YUAN Yuan, XIE Xu-chao. A regional shared and high concurrent storage architecture based on NVMeoF storage pool [J]. Computer Engineering & Science, 2020, 42(10高性能专刊): 1711-1719.
[13]	MA Man-fu1,2,YUN Xin-miao1,2,LI Yong1,2,LIU Yuan-zhe1,2,WANG Chang-qing3. Social stratification behavior in virtual space [J]. Computer Engineering & Science, 2020, 42(05): 803-811.
[14]	LIN Lian-hai1,TIAN Li-qin1,2,CAI Ming-kai1,LI Sheng-hong1. A variance toss algorithm for parameter reduction of soft set [J]. Computer Engineering & Science, 2020, 42(02): 250-258.
[15]	QIN Fu-dian，LI Jing. Influence and exploration of big data on university teaching and research [J]. Computer Engineering & Science, 2019, 41(增刊S1): 238-241.
