• Journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (05): 903-910.

• Artificial Intelligence and Data Mining •

  • Funding:
    Key-Area Research and Development Program of Guangdong Province (2019B010139002); Key-Area Research and Development Program of Guangzhou (202007010004)

A multi-scale semantic collaborative patent text classification model based on RoBERTa

MEI Xia-feng,WU Xiao-ling,HUANG Ze-min,LING Jie   

  1. (School of Computer Science and Technology,Guangdong University of Technology,Guangzhou 510006,China)
  • Received:2021-05-06 Revised:2021-10-15 Accepted:2023-05-25 Online:2023-05-25 Published:2023-05-16


Abstract: In patent text classification, static word vector tools such as word2vec discard the contextual information of words, and most existing models have limited feature extraction ability. To address these problems, a multi-scale semantic collaborative patent text classification model based on RoBERTa, named RoBERTa-MCNN-BiSRU++-AT, is proposed. RoBERTa learns a context-dependent dynamic semantic representation for each word, solving the problem that static word vectors cannot represent polysemous words. The multi-scale semantic collaboration model uses convolutional layers to capture multi-scale local semantic features of the text, and then models contextual semantics at different levels with a bidirectional built-in-attention simple recurrent unit (BiSRU++). The multi-scale output features are concatenated, and an attention mechanism assigns higher weights to the key features that contribute most to the classification result. Experiments were carried out on the patent text dataset published by the National Information Center. The results show that, compared with ALBERT-BiGRU and BiLSTM-ATT-CNN, RoBERTa-MCNN-BiSRU++-AT improves classification accuracy by 2.7% and 5.1% respectively at the department level, and by 6.7% and 8.4% respectively at the major class level. RoBERTa-MCNN-BiSRU++-AT can thus effectively improve classification accuracy for patents at different hierarchy levels.
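As background for the recurrent component named in the abstract, the following is a minimal NumPy sketch of a bidirectional simple recurrent unit (the standard SRU formulation with forget and highway gates). It is an illustration only, not the authors' BiSRU++ (which additionally has built-in attention); all weight shapes and the parameter layout are assumptions made for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sru_forward(x, Wx, Wf, bf, Wr, br):
    """Run a single-direction SRU over a sequence.

    x: (T, d) input sequence; Wx, Wf, Wr: (d, d); bf, br: (d,).
    Returns hidden states h of shape (T, d).
    """
    T, d = x.shape
    c = np.zeros(d)                  # internal cell state
    h = np.zeros((T, d))
    cand = x @ Wx                    # candidate transform, computed for all steps at once
    f = sigmoid(x @ Wf + bf)         # forget gates
    r = sigmoid(x @ Wr + br)         # reset (highway) gates
    for t in range(T):
        # Lightweight recurrence: only element-wise ops depend on the previous step.
        c = f[t] * c + (1.0 - f[t]) * cand[t]
        # Highway connection mixes the transformed state with the raw input.
        h[t] = r[t] * np.tanh(c) + (1.0 - r[t]) * x[t]
    return h

def bisru_forward(x, params_fwd, params_bwd):
    """Bidirectional SRU: run a second cell on the reversed sequence and concatenate."""
    h_f = sru_forward(x, *params_fwd)
    h_b = sru_forward(x[::-1], *params_bwd)[::-1]
    return np.concatenate([h_f, h_b], axis=-1)   # (T, 2d)
```

Because the recurrence involves only element-wise operations, the matrix multiplications can be batched across time steps, which is what makes the SRU faster to train than an LSTM or GRU of comparable capacity.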

Key words: patent text classification, semantic collaboration, simple recurrent unit, RoBERTa model