A novel big data order-preserving matching algorithm based on similarity filtration

Computer Engineering & Science

JIANG Wen-chao1, LIN De-xi1, SUN Ao-bing2, WU Xiao-qiang2

(1. School of Computer, Guangdong University of Technology, Guangzhou 510006, China; 2. Institute of Guangdong Electronics Industry, Dongguan 523808, China)

Received: 2016-12-26 Revised: 2017-03-21 Online: 2017-07-25 Published: 2017-07-25
Funding:
National Natural Science Foundation of China (61672171); Natural Science Foundation of Guangdong Province (2016A030313703); Science and Technology Planning Project of Guangdong Province (2015B010109001, 2016B030305002, 2016B030306003); Industry-University-Research Cooperation Project of Guangdong Province (2015B090901051); Innovation Team Project of Guangdong Province (201001D0104726115)

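Abstract

Abstract (translated from the Chinese):

With the arrival of the big data era, fast order-preserving matching and retrieval of data has become a key problem that many big data applications urgently need to solve. Through abstraction and reduction, a data object can be represented as a point set or sequence with several attributes, which turns the data matching problem into a matching problem over character or numeric sequences. This paper proposes an order-preserving matching and retrieval algorithm based on similarity filtering. The algorithm has three steps. (1) Data transformation: the original sequence is converted to a binary sequence according to the trend of its amplitude changes. For every element of the sequence, the binary code is defined by the relationship among three points, namely the element and its preceding and following neighbors, so that it accurately reflects whether the three adjacent points form a convex increase (decrease) or a concave increase (decrease). (2) Data reduction: to simplify the similarity computation between a candidate sequence and the pattern sequence, a reduction method based on the proportion of amplitude change maps both the candidate sequence and the pattern sequence onto a fixed interval. (3) Similarity computation: to distinguish convex or concave increases (decreases) of different magnitudes, the sum of the absolute differences between corresponding points of the candidate sequence and the pattern sequence is used as the similarity criterion. On this basis a fast matching method based on similarity filtering is proposed, which finds the set of subsequences whose trend agrees with that of the pattern sequence and ranks them by similarity. Theoretical analysis and experimental results show that: (1) the algorithm has sublinear time complexity; (2) the algorithm effectively solves the problem that the algorithm of Chhabra et al. cannot control the oscillation amplitude of the data, and also the problem that a data sequence and the pattern sequence follow the same rule piecewise yet are dissimilar overall; (3) the algorithm resolves the omission of matching results caused by the sorting of matched sequences in the algorithm of Chhabra et al. The method not only matches more substrings with a consistent trend, and does so more accurately, but also ranks the candidate substrings by their similarity to the pattern, providing a basis for subsequent precise data retrieval.

Keywords: big data applications, pattern matching, order-preserving matching, similarity filtering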

Abstract:

Order-preserving data matching is a key problem in big data applications. Through abstraction or reduction, data matching can be transformed into the matching of character or number sequences. We present a novel order-preserving matching algorithm based on similarity filtration, which consists of three steps: data transformation, data reduction, and similarity computation. First, to reflect convex growth (descent) or concave growth (descent), the data is transformed into a binary string according to the relationship among three neighboring numbers. Second, to compute the similarity more accurately, the data array and the pattern array are both reduced to the fixed interval [0,1]. Finally, the similarity is computed from the differences between corresponding points of the data array and the pattern array, and the matches are sorted by it. Theoretical analysis shows that the time complexity is O(n), which is lower than that of the algorithm presented by Cho et al. Furthermore, our algorithm overcomes deficiencies of the algorithm of Cho et al., including the uncontrolled min-max values and the piecewise (subsection) inconsistency. Based on the similarity computation, all matching substrings can be sorted for data retrieval or search in big data applications.

Key words: big data application, pattern matching, order-preserving matching, similarity filtration
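
The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact binary encoding, so the two-bit code per interior point used here (one bit for rising or falling, one bit for convex or concave) is an assumption, as are the min-max reduction to [0,1] and all function names. The filter is a plain left-to-right scan for readability, so it does not reproduce the time complexity claimed in the abstract.

```python
from typing import List, Sequence, Tuple


def encode_trend(seq: Sequence[float]) -> List[Tuple[int, int]]:
    """Step 1 (assumed encoding): one (rising, convex) pair per interior point.

    rising = 1 if the next value is larger than the current one.
    convex = 1 if the slope does not decrease across the point,
             i.e. seq[i+1] - seq[i] >= seq[i] - seq[i-1].
    """
    codes = []
    for i in range(1, len(seq) - 1):
        rising = 1 if seq[i + 1] > seq[i] else 0
        convex = 1 if (seq[i + 1] - seq[i]) >= (seq[i] - seq[i - 1]) else 0
        codes.append((rising, convex))
    return codes


def reduce_to_unit_interval(seq: Sequence[float]) -> List[float]:
    """Step 2 (assumed min-max reduction): map a sequence onto [0, 1]."""
    lo, hi = min(seq), max(seq)
    if hi == lo:
        return [0.0] * len(seq)  # a constant sequence has no amplitude
    return [(x - lo) / (hi - lo) for x in seq]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Step 3: sum of absolute differences between corresponding points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_matches(text: Sequence[float],
                 pattern: Sequence[float]) -> List[Tuple[int, float]]:
    """Return (offset, distance) pairs, most similar window first."""
    m = len(pattern)
    if m < 3:
        raise ValueError("pattern needs at least three points")
    p_codes = encode_trend(pattern)
    p_norm = reduce_to_unit_interval(pattern)
    t_codes = encode_trend(text)  # t_codes[j] describes text[j + 1]
    hits = []
    for start in range(len(text) - m + 1):
        # Filter: the window's interior codes must equal the pattern's codes.
        if t_codes[start:start + m - 2] != p_codes:
            continue
        window = reduce_to_unit_interval(text[start:start + m])
        hits.append((distance(window, p_norm), start))
    hits.sort()  # smaller distance means higher similarity
    return [(start, score) for score, start in hits]


pattern = [1, 2, 4, 3]
text = [10, 20, 40, 30, 5, 6, 8, 7.5, 1]
print(rank_matches(text, pattern))
```

In this example two windows pass the trend filter, at offsets 0 and 4. The window at offset 0 is a scaled copy of the pattern and ranks first with distance 0; the window at offset 4 has the same trend codes but a different overall shape, so it ranks second. This mirrors the abstract's point that ranking by similarity separates candidates that agree only piecewise.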

JIANG Wen-chao, LIN De-xi, SUN Ao-bing, WU Xiao-qiang. A novel big data order-preserving matching algorithm based on similarity filtration[J]. Computer Engineering & Science.
