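• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science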

• Article

Automatic patent query expansion based on word embedding

LIU Meng-lan (刘梦兰)1,2, LIU Bin (刘斌)1,2, PENG Zhi-yong (彭智勇)1,2

  (1. State Key Laboratory of Software Engineering, Wuhan University, Wuhan 430072, Hubei, China;
   2. School of Computer, Wuhan University, Wuhan 430072, Hubei, China)
  • Received: 2017-07-10 Revised: 2017-09-15 Online: 2017-12-25 Published: 2017-12-25
  • Funding:

    Hubei Provincial Science and Technology Support Program (2015BAA127)

Abstract:

Patent retrieval differs markedly from general text retrieval. A patent document comprises distinct parts such as claims, abstract, and full text, so retrieval methods designed for ordinary text cannot simply be applied to it. Patent retrieval typically suffers from low recall, for two reasons. First, patent texts are highly specialized and use complex terminology, so the keywords a user enters often fail to capture the search intent, which leads to unsatisfactory results. Second, patent drafters deliberately coin unusual vocabulary, so relevant patents fail to be retrieved. Many existing methods aim to improve recall in patent retrieval, but many problems remain unsolved and retrieval effectiveness still needs improvement. We propose an automatic patent query expansion method based on word embedding. Building on the word embeddings, we construct a keyword query network for the patent domain and apply a dense subgraph discovery algorithm to find the set of expansion terms, which improves the effectiveness of the expansion terms. Extensive experiments on the CLEF-IP 2012 dataset show that the proposed algorithm keeps the acquisition of expansion term sets flexible and effective and further improves the recall of patent retrieval.
 

Key words: patent retrieval, query expansion, word embedding, deep learning
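
The abstract names three steps (word embeddings, a keyword network, dense subgraph discovery) without giving their details. The Python sketch below illustrates one way such a pipeline could fit together; it is not the authors' implementation. Everything specific in it is an assumption made for illustration: the toy three-dimensional vectors in `TOY_VECTORS`, the cosine-similarity threshold of 0.9 used to draw edges, the function names, and the choice of a greedy peeling heuristic (repeatedly removing the lowest-degree node) as the dense subgraph step, with query terms protected from removal.

```python
import math
from itertools import combinations

# Hypothetical 3-dimensional toy vectors. A real system would use embeddings
# trained on patent text; the source does not specify the embedding model.
TOY_VECTORS = {
    "engine":   (0.90, 0.10, 0.00),
    "motor":    (0.85, 0.20, 0.05),
    "turbine":  (0.80, 0.30, 0.10),
    "rotor":    (0.75, 0.35, 0.10),
    "banana":   (0.00, 0.10, 0.95),
    "contract": (0.10, 0.90, 0.20),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_keyword_graph(vectors, threshold):
    """Link two terms when their cosine similarity reaches the threshold."""
    graph = {term: set() for term in vectors}
    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def density(adj):
    """Edges per node of an undirected graph stored as adjacency sets."""
    if not adj:
        return 0.0
    edges = sum(len(nbrs) for nbrs in adj.values()) / 2
    return edges / len(adj)


def densest_subgraph_with(graph, query_terms):
    """Greedy peeling: drop the lowest-degree non-query node at each step
    and remember the node set with the highest edges-per-node density."""
    adj = {node: set(nbrs) for node, nbrs in graph.items()}  # working copy
    best_nodes, best_density = set(adj), density(adj)
    while True:
        removable = [n for n in adj if n not in query_terms]
        if not removable:
            break
        # Ties on degree are broken alphabetically to keep runs deterministic.
        victim = min(removable, key=lambda n: (len(adj[n]), n))
        for nbr in adj.pop(victim):
            adj[nbr].discard(victim)
        current = density(adj)
        if current > best_density:
            best_nodes, best_density = set(adj), current
    return best_nodes


def expand_query(query_terms, vectors=TOY_VECTORS, threshold=0.9):
    """Return expansion terms: the dense subgraph minus the query itself."""
    query = {t for t in query_terms if t in vectors}
    graph = build_keyword_graph(vectors, threshold)
    return sorted(densest_subgraph_with(graph, query) - query)


if __name__ == "__main__":
    print(expand_query(["engine"]))
```

With the toy data, the query term `engine` is expanded with `motor`, `rotor`, and `turbine`, while the unrelated terms `banana` and `contract` are peeled away because they have no edges. The peeling heuristic does not guarantee that the returned terms are connected to the query in larger graphs, so a practical system would likely restrict the graph to the neighborhood of the query terms first.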