• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2021, Vol. 43 ›› Issue (10): 1873-1879.

• Artificial Intelligence and Data Mining •

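Biomedical named entity recognition based on BERT and BiLSTM-CRF

XU Li, LI Jian-hua

  1. (School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China)
  • Received: 2020-04-12 Revised: 2020-09-14 Accepted: 2021-10-25 Online: 2021-10-25 Published: 2021-10-22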
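  • Funding:
    National Science and Technology Major Project for Significant New Drugs Development (2018ZX09735002); National Key Research and Development Program of China (2016YFA0502304)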

Biomedical named entity recognition based on BERT and BiLSTM-CRF

XU Li, LI Jian-hua

  1. (School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China)
  • Received: 2020-04-12 Revised: 2020-09-14 Accepted: 2021-10-25 Online: 2021-10-25 Published: 2021-10-22

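Abstract: In the biomedical field, named entity recognition methods that represent semantics with static word vectors have low accuracy. To address this problem, a model is proposed that combines the pre-trained language model BERT with BiLSTM for biomedical named entity recognition. First, BERT performs semantic extraction to generate dynamic word vectors, and part-of-speech and chunking features are added to improve model accuracy. Next, the word vectors are fed into a BiLSTM model for further training to capture contextual features. Finally, a CRF performs sequence decoding and outputs the most probable result. The model achieves an average F1 score of 89.45% on the BC4CHEMD, BC5CDR-chem, and NCBI-disease datasets. The experimental results show that the proposed model effectively improves the accuracy of biomedical named entity recognition.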


Keywords: biomedicine; named entity recognition; pre-trained language model; part-of-speech analysis; chunking analysis

Abstract: In the biomedical field, named entity recognition methods based on static word vectors achieve low precision. To solve this problem, a method that combines the pre-trained model BERT with BiLSTM-CRF for biomedical named entity recognition is proposed. Firstly, BERT is used for semantic extraction and the generation of dynamic word vectors, and part-of-speech and chunking features are added to improve the model's precision. Secondly, the word vectors are sent to the BiLSTM model for further training to obtain context features. Finally, the CRF is used to decode the sequence and output the result with maximum probability. The average F1 score of this model reaches 89.45% on the BC4CHEMD, BC5CDR-chem and NCBI-disease datasets. Experimental results show that the proposed model effectively improves the precision of biomedical named entity recognition.


Key words: biomedicine, named entity recognition, pre-trained language model, part of speech, chunking
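
The final step in the abstract, CRF decoding that outputs the maximum-probability tag sequence, is commonly done with the Viterbi algorithm. The sketch below is a minimal pure-Python illustration, not the authors' implementation. The function name `viterbi_decode`, the three-tag BIO set, and all scores are hypothetical values chosen for illustration. In a real model the per-token emission scores would come from the BiLSTM layer and the transition scores would be learned by the CRF.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag path for one sentence.

    emissions[t][tag] is the score of `tag` at token t; transitions[(p, c)]
    is the score of moving from tag p to tag c. Scores are unnormalized
    log-scores, so the best-scoring path is also the most probable one.
    """
    # best[t][tag]: best score of any path that ends in `tag` at token t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t-1][tag]: previous tag on that best path
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical example: three tokens, BIO tags, made-up scores.
tags = ["O", "B", "I"]
transitions = {(p, c): 0.0 for p in tags for c in tags}
transitions[("O", "I")] = -10.0  # penalize an entity that starts with I
emissions = [
    {"O": 2.0, "B": 0.0, "I": 0.0},
    {"O": 0.0, "B": 1.0, "I": 1.5},
    {"O": 0.0, "B": 0.0, "I": 2.0},
]
path = viterbi_decode(emissions, transitions, tags)
print(path)  # ['O', 'B', 'I']
```

Picking the best tag per token independently would give O, I, I here, which is not a valid BIO sequence. The transition penalty steers the decoder to O, B, I instead, which is the kind of label-consistency benefit a CRF layer provides over per-token classification.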