Automatic patent query expansion based on word embedding

Computer Engineering & Science

LIU Meng-lan1,2, LIU Bin1,2, PENG Zhi-yong1,2

(1. State Key Laboratory of Software Engineering, Wuhan University, Wuhan 430072, China; 2. School of Computer, Wuhan University, Wuhan 430072, China)

Received: 2017-07-10 Revised: 2017-09-15 Online: 2017-12-25 Published: 2017-12-25
Funding: Hubei Province Science and Technology Support Program (2015BAA127)

Abstract:

Patent retrieval differs substantially from ordinary text retrieval. A patent document consists of distinct parts such as the claims, the abstract and the full text, so retrieval methods designed for ordinary text cannot simply be applied to patents. Patent retrieval typically suffers from low recall, for two reasons. First, patent text is highly specialized and uses complex terminology, so the keywords a user enters often fail to capture the search intent, which leads to unsatisfactory results. Second, patent drafters deliberately coin unusual vocabulary, so relevant patents go unretrieved. Many existing methods aim to improve the recall of patent retrieval, but many problems remain open and retrieval effectiveness still needs improvement. We propose an automatic patent query expansion method based on word embedding. On top of the word embeddings, a keyword query network is constructed, and a dense subgraph discovery algorithm is then used to find the set of expansion terms, which improves the effectiveness of the expansion terms. Extensive experiments on the CLEF-IP 2012 dataset show that the proposed algorithm keeps the selection of expansion terms flexible and effective, and further improves the recall of patent retrieval.

Key words: patent retrieval, query expansion, word embedding, deep learning
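The pipeline named in the abstract (word embeddings, then a keyword network, then dense subgraph discovery) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy 2-D vectors in `TOY_VECTORS`, the cosine threshold of 0.9, the function names, and the use of a greedy peeling heuristic as the dense subgraph step. The abstract does not say which dense subgraph algorithm the authors use.

```python
import math
from itertools import combinations

# Hypothetical 2-D "embeddings"; real word vectors would have hundreds of dimensions.
TOY_VECTORS = {
    "battery": (1.0, 0.0),
    "storage_battery": (0.95, 0.05),
    "accumulator": (0.9, 0.1),
    "cell": (0.8, 0.2),
    "electrode": (0.8, 0.6),
    "bicycle": (0.1, 0.9),
    "wheel": (0.0, 1.0),
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_keyword_graph(vectors, threshold):
    """Link two terms when their cosine similarity reaches the threshold."""
    graph = {term: set() for term in vectors}
    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def reachable(graph, seeds):
    """Terms connected to any seed term through graph edges."""
    seen, stack = set(seeds), list(seeds)
    while stack:
        for neighbour in graph[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen

def densest_subgraph(graph, keep):
    """Greedy peeling: drop the lowest-degree vertex outside `keep` one at a
    time and remember the vertex set with the highest edges/vertices ratio."""
    adj = {v: set(n) for v, n in graph.items()}
    edges = sum(len(n) for n in adj.values()) // 2
    best_nodes = set(adj)
    best_density = edges / len(adj) if adj else 0.0
    while True:
        removable = [v for v in adj if v not in keep]
        if not removable:
            return best_nodes
        victim = min(removable, key=lambda v: (len(adj[v]), v))
        edges -= len(adj[victim])
        for neighbour in adj.pop(victim):
            adj[neighbour].discard(victim)
        density = edges / len(adj) if adj else 0.0
        if density > best_density:
            best_nodes, best_density = set(adj), density

def expand_query(query_terms, vectors, threshold=0.9):
    """Return expansion terms from the dense region around the query terms."""
    seeds = {t for t in query_terms if t in vectors}
    if not seeds:
        return []
    graph = build_keyword_graph(vectors, threshold)
    component = reachable(graph, seeds)
    subgraph = {v: graph[v] for v in component}
    return sorted(densest_subgraph(subgraph, seeds) - seeds)

print(expand_query(["battery"], TOY_VECTORS))
# -> ['accumulator', 'cell', 'storage_battery']
```

In this toy run, "electrode" is linked only to "cell", so peeling removes it and the expansion set is the tightly connected group around "battery". Choosing a dense subgraph rather than simply taking the nearest neighbours of each query term favours candidates that are also similar to one another, which is one plausible reading of how such a method could improve the effectiveness of expansion terms.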

LIU Meng-lan, LIU Bin, PENG Zhi-yong. Automatic patent query expansion based on word embedding[J]. Computer Engineering & Science.
