• 中国计算机学会会刊
  • 中国科技核心期刊
  • 中文核心期刊

J4 ›› 2013, Vol. 35 ›› Issue (10): 110-115.

• 论文 • 上一篇    下一篇

一种海量分布式数据Top-k查询算法

魏贤全,郑洪源,丁秋林   

  1. (南京航空航天大学计算机科学与技术学院,江苏 南京 210016)
  • 收稿日期:2013-05-20 修回日期:2013-08-20 出版日期:2013-10-25 发布日期:2013-10-25

Efficient Topk query algorithm on massdistributed data      

WEI Xianquan,ZHENG Hongyuan,DING Qiulin   

  1. (College of Computer Science and Technology,Nanjing University of Aeronautics and Astronautics,Nanjing 210016,China)
  • Received:2013-05-20 Revised:2013-08-20 Online:2013-10-25 Published:2013-10-25

摘要:

针对现有分布式环境下Topk查询算法的不足,提出了一种适用于海量分布式数据的Topk查询算法(ECHT)。该算法充分考虑了数据分布情况,提出了一种改进的限定误差直方图描述数据分布算法,避免了节点数据分布不均时Topk查询算法的低效性;另一方面,提高了Topk算法的阈值计算精度,从而进一步降低了网络带宽消耗。此外,提出了一种早裁剪思想,在大量数据传输之前提前进行数据裁剪,避免了大量无用数据的传输。实验表明,ECHT算法在网络带宽消耗和网络响应时间方面均优于同类算法。

关键词: 海量数据, Topk, 早裁剪, 改进限定误差直方图

Abstract:

For solving the shortage of existing distributed Topk query algorithms, a novel topk algorithm (named ECHT algorithm) is proposed, which is appropriate for massive distributed data. Taking care of the data distribution, ECHT algorithm designs a new algorithm of errorlimited histogram. For one thing, it avoids poor performance on uneven data distribution. For the other, it improves the accuracy of the threshold value, thus further reducing network bandwidth consumption. In addition, ECHT performs early clipping. Clipping before the transmission of large amounts of data priors brings better performance due to avoiding a lot of useless data transmission. The experiments are performed with the real datasets, demonstrating the viability and superior performance of the new algorithm.

Key words: massive data;Topk;early clipping;new error limited histogram