• 中国计算机学会会刊
  • 中国科技核心期刊
  • 中文核心期刊

计算机工程与科学

• 论文 • 上一篇    下一篇

一种新的基于相似度过滤的大数据保序匹配与检索算法

姜文超1 ,林德熙1,孙傲冰2,伍小强2   

  1. (1.广东工业大学计算机学院,广东 广州 510006;2.广东电子工业研究院,广东 东莞 523808)
  • 收稿日期:2016-12-26 修回日期:2017-03-21 出版日期:2017-07-25 发布日期:2017-07-25
  • 基金资助:

    国家自然科学基金(61672171);广东省自然科学基金(2016A030313703);广东省科技计划(2015B010109001,2016B030305002,2016B030306003);广东省产学研合作项目(2015B090901051);广东省创新团队项目(201001D0104726115)

A novel big data order-preserving matching
algorithm based on similarity filtration
 

JIANG Wen-chao1,LIN De-xi1,SUN Ao-bing2,WU Xiao-qiang2   

  1. (1.School of  Computer,Guangdong University of Technology,Guangzhou 510006;
    2.Institute of Guangdong Electronics Industry,Dongguan 523808,China)

     
  • Received:2016-12-26 Revised:2017-03-21 Online:2017-07-25 Published:2017-07-25

摘要:

伴随大数据时代的到来,数据快速保序匹配与检索成为众多大数据应用急需解决的关键问题,通过抽象与归约等措施,数据对象可抽象为具有若干属性的点集或序列,从而将数据匹配问题转化为字符或数字序列匹配问题。提出一种基于相似度过滤的数据保序匹配与检索算法,算法分三步:(1)数据转换,基于幅值变化趋势将原始序列转换为二进制,对序列中任何一个字符,通过判断包括其前后邻居在内的三个点的关系定义二进制序列,准确反映相邻三点之间的凸增长(降低)或凹增长(降低)关系;(2)数据归约,为方便候选序列与模式序列之间的相似度计算,运用基于幅度变化比例的数据归约方法,将候选序列与模式序列均归约到固定区间;(3)相似度计算,为区分不同趋势的凸增长(降低)或凹增长(降低)幅度,通过计算候选序列与模式序列对应点之间的差值绝对值之和作为相似度判断依据,提出基于相似度过滤的快速匹配方法,寻找与模式序列变化趋势一致的子序列集合,并按照相似度大小排序。理论分析与实验结果表明:(1)该算法具有亚线性时间复杂度;(2)该算法能有效解决Chhabra 等人算法对数据震荡幅度失控的问题,同时解决数据序列与模式序列分段规律但整体不相似的问题;(3)解决了Chhabra等人算法中对匹配序列排序造成的匹配结果疏漏问题。该方法不仅能更准确、更多地匹配出变化趋势一致的子字符串,同时将多个候选子串根据与模式之间的相似度进行排序,为进一步的数据精确检索提供判断依据。

关键词: 大数据应用, 模式匹配, 保序匹配, 相似度过滤

Abstract:

Data order-preserving matching is a key problem in big data applications. Data matching can be transformed into character or number matching through abstraction or reduction. We present a novel data order-preserving matching algorithm based on similarity filtration which includes three steps: data transformation, data reduction and similarity computation. Firstly, to reflect the relation of convex growth (descent) or concave growth (descent), the data is transformed into a binary string according to the relationship among the three neighbor numbers. Secondly, to compute the similarity more accurately, the data array and pattern array are both reduced into stable interval [0,1]. Finally, according to the variety range of the relevant nodes between data array and pattern array, the similarity can be computed and sorted. Theory analysis shows that the time complex is O(n), which is lower than the algorithm presented by Cho et al. Furthermore, our algorithm can overcome the deficiencies of the algorithm presented by Cho et al. including the incontrollable min-max values and the subsection inconsistency. Based on the similarity computation, all the sub-strings can be sorted for data retrieval or searching in big data applications.

Key words: big data application, pattern matching, order-preserving matching, similarity filtration