• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2013, Vol. 35 ›› Issue (5): 166-172.

• Papers •

Research and design of a search engine for digital works based on Lucene

WU Jieming, HAN Yunhui, JI Dandan

  1. (Information Engineering Institute, North China University of Technology, Beijing 100144, China)
  • Received: 2012-08-24 Revised: 2012-11-02 Online: 2013-05-25 Published: 2013-05-25
  • Funding:

    Supported by a National Key Technology Support Program project of the Ministry of Science and Technology (2012BAH04f03) and a Scientific Research Base, Scientific Research Innovation Platform project (PXM2013_014212_000011)

Abstract:

Building on the Lucene full-text retrieval toolkit, this paper analyzes the mainstream Chinese word segmentation algorithms and Lucene's relevance sorting algorithm, and proposes an improved segmentation algorithm and an improved relevance sorting algorithm. Using inverted indexing, retrieval techniques, distributed storage, and parallel computing, it also analyzes and designs a search engine for massive digital-works information, giving users fast and accurate search over that information. The experiments compare segmentation speed and segmentation quality, as well as the response time, hit count, precision, and recall of keyword searches. The results show that the system substantially improves search speed while preserving the accuracy of the search results.

Key words: Lucene; segmentation algorithm; index; relevance sorting algorithm; distributed
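
The pipeline named in the abstract (word segmentation, inverted indexing, relevance ranking) can be illustrated with a minimal sketch. It makes several assumptions: it uses forward maximum matching, a common baseline for dictionary-based Chinese segmentation, and a simple TF-IDF style score. Neither is the improved algorithm the paper proposes, and nothing here calls Lucene's API. The function names, the toy dictionary, and the sample documents are all hypothetical.

```python
import math
from collections import defaultdict


def fmm_segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position, take the longest
    dictionary word; fall back to a single character if none matches."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_index(docs, dictionary):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in fmm_segment(text, dictionary):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(query, index, dictionary, n_docs):
    """Rank documents by a simple TF-IDF style score (illustrative only).
    Ties are broken by ascending document id."""
    scores = defaultdict(float)
    for term in fmm_segment(query, dictionary):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += math.sqrt(tf) * idf
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical toy data, not taken from the paper.
dictionary = {"数字", "作品", "搜索", "引擎", "搜索引擎"}
docs = {1: "数字作品搜索引擎", 2: "数字作品", 3: "搜索引擎设计"}
index = build_index(docs, dictionary)

print(fmm_segment(docs[1], dictionary))
print(search("搜索引擎", index, dictionary, len(docs)))
```

A production system built on Lucene would delegate segmentation to an analyzer and scoring to the library's similarity model; the sketch only shows how the three stages fit together.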