• 中国计算机学会会刊
  • 中国科技核心期刊
  • 中文核心期刊

计算机工程与科学 ›› 2024, Vol. 46 ›› Issue (07): 1193-1201.

• 高性能计算 • 上一篇    下一篇

基于分区层次图的海量高维数据学习索引构建方法

华悦琳1,2,3,周晓磊2,3,范强2,3,王芳潇2,3,严浩2,3   

  1. (1.南京信息工程大学计算机学院、网络空间安全学院,江苏 南京 210044;
    2.国防科技大学第六十三研究所,江苏 南京 210007;3.国防科技大学大数据与决策实验室,湖南 长沙 410073)

  • 收稿日期:2023-10-12 修回日期:2023-11-21 接受日期:2024-07-25 出版日期:2024-07-25 发布日期:2024-07-18

Learning indexing method for massive high-dimensional data based on partitioned hierarchical graph

HUA Yue-lin1,2,3,ZHOU Xiao-lei2,3,FAN Qiang2,3,WANG Fang-xiao2,3,YAN Hao2,3   

  1. (1.School of Computer Science,Nanjing University of Information Science & Technology,Nanjing 210044;
    2.The 63rd Research Institute,National University of Defense Technology,Nanjing 210007;
    3.Laboratory for Big Data and Decision,National University of Defense Technology,Changsha 410073,China)
  • Received:2023-10-12 Revised:2023-11-21 Accepted:2024-07-25 Online:2024-07-25 Published:2024-07-18

摘要: 学习索引是破解海量高维数据近似最近邻搜索问题的关键。然而,现有学习索引技术结果仅局限于单个分区中,且依赖于近邻图的构建。随着数据维度和规模的增长,索引难以对分区边界数据进行精确判断,并且构建时间复杂度增大,可扩展性难以保障。针对上述问题,提出了基于分区层次图的学习索引方法PBO-HNSW。该方法对分区边界数据进行重新分配,并行构建分布式图索引结构,从而有效应对近似最近邻搜索问题所面临的挑战。实验结果表明,该方法能够在百万级海量高维数据上实现毫秒级的索引构建。当召回率为0.93时,PBO-HNSW方法构建时间仅为基线方法的36.4%。

关键词: 近似最近邻搜索, 学习索引, 层次可导航小世界图, 分区学习, 索引结构

Abstract: Learning to index is the key to solving the problem of approximate nearest neighbor search in massive high-dimensional data. However, existing learning to index techniques are limited to individual partitions and rely on the construction of neighborhood graph. As the dimensionality and scale of data grow, indexing struggles to accurately judge boundary data of partitions, leading to increased construction time complexity and challenges in scalability. To address the above problems, a learn to index method based on partitioned hierarchical graphs, PBO-HNSW is proposed. The method redistributes partition boundary data and constructs distributed graph index structures in parallel. It effectively addresses the challenges faced by the approximate nearest neighbor search problem. Experimental results show that PBO-HNSW method is able to achieve millisecond index construction on millions of massive high-dimensional data. When the recall is 0.93, the construction time of the PBO-HNSW method is only 36.4%  of baseline methods.


Key words: approximate nearest neighbor search, learning to index, hierarchical navigable small world (HNSW), partition learning, index structure