• Official journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (08): 1498-1507.

• Artificial Intelligence and Data Mining •

Entity recognition of support policy text based on RoBERTa-wwm-BiLSTM-CRF

YU Jin-ping (喻金平)1, ZHU Wei-feng (朱伟锋)1, LIAO Lie-fa (廖列法)2

  1. School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 314000, China
  2. School of Software Engineering, Jiangxi University of Science and Technology, Nanchang 330000, China
  • Received: 2022-03-27 Revised: 2022-05-05 Accepted: 2023-08-25 Online: 2023-08-25 Published: 2023-08-21
  • Funding:
    National Natural Science Foundation of China (71462018, 71761018)

Abstract: Support policies help enterprises obtain government support such as funding subsidies and tax reductions, and thereby help them develop. Entity boundaries in support policy texts are difficult to delimit, and traditional word vectors cannot handle polysemy. To address these problems, a named entity recognition model for support policy texts based on RoBERTa-wwm-BiLSTM-CRF is proposed. First, the model uses the pre-trained language model RoBERTa-wwm to obtain dynamic word vectors, which can represent the polysemy of words. Second, a BiLSTM network further extracts the contextual information and semantic features of support policy texts. Finally, a conditional random field (CRF) yields the best prediction sequence. The proposed model achieves an F1 score of 91.7% on a self-built support policy dataset of 5,512 sentences. The results show that the model can effectively recognize named entities in support policy texts, thereby improving the efficiency with which enterprises screen policies.

Key words: support policy text, pre-trained language model, named entity recognition, dynamic word vector, enterprise support
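
The final step described in the abstract, obtaining the best prediction sequence through a conditional random field, is commonly implemented with Viterbi decoding. The following is a minimal pure-Python sketch of that decoding step only, not the authors' implementation. The tag set (`O`, `B-ORG`, `I-ORG`), the emission scores (which in the full model would come from the RoBERTa-wwm and BiLSTM layers), and the transition weights are all hypothetical values chosen for illustration.

```python
from typing import List, Tuple

def viterbi_decode(
    emissions: List[List[float]],    # emissions[t][j]: score of tag j at token t
    transitions: List[List[float]],  # transitions[i][j]: score of moving from tag i to tag j
    start: List[float],              # start[j]: score of beginning a sequence with tag j
    end: List[float],                # end[j]: score of finishing a sequence with tag j
) -> Tuple[float, List[int]]:
    """Return the highest-scoring tag path and its score."""
    num_tags = len(start)
    # score[j]: best score of any partial path that ends in tag j
    score = [start[j] + emissions[0][j] for j in range(num_tags)]
    backpointers: List[List[int]] = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for reaching tag j at this position
            best_prev = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_prev] + transitions[best_prev][j] + emit[j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    final = [score[j] + end[j] for j in range(num_tags)]
    last = max(range(num_tags), key=lambda j: final[j])
    # Follow the backpointers from the last token to the first
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return final[last], path

# Hypothetical tag set and scores (assumptions, not taken from the paper).
TAGS = ["O", "B-ORG", "I-ORG"]
FORBIDDEN = -1e4  # large penalty that rules out invalid transitions such as O -> I-ORG
transitions = [
    [0.0, 0.0, FORBIDDEN],  # from O
    [0.0, 0.0, 1.0],        # from B-ORG
    [0.0, 0.0, 0.5],        # from I-ORG
]
start = [0.0, 0.0, FORBIDDEN]  # a sequence may not start with I-ORG
end = [0.0, 0.0, 0.0]
emissions = [
    [2.0, 1.8, 0.1],  # token 1: O scores slightly higher than B-ORG
    [0.5, 0.2, 2.5],  # token 2: strongly I-ORG
    [2.0, 0.1, 0.3],  # token 3: strongly O
]

best_score, best_path = viterbi_decode(emissions, transitions, start, end)
best_tags = [TAGS[j] for j in best_path]
# Per-token argmax ignores transitions and can produce an invalid tag sequence.
greedy_tags = [TAGS[max(range(len(TAGS)), key=lambda j: row[j])] for row in emissions]
print(best_tags)    # ['B-ORG', 'I-ORG', 'O']
print(greedy_tags)  # ['O', 'I-ORG', 'O']
```

In this toy input, per-token argmax yields `O, I-ORG, O`, which opens an entity with an inside tag. The transition scores make the decoder prefer `B-ORG, I-ORG, O`, which illustrates how a CRF layer can help with the entity boundary problem the abstract mentions.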