
Computer Engineering & Science ›› 2024, Vol. 46 ›› Issue (08): 1513-1520.

• Artificial Intelligence and Data Mining •


Long text semantic similarity calculation combining hybrid feature extraction and deep learning

XU Jie1, SHAO Yu-bin1, DU Qing-zhi1, LONG Hua1,2, MA Di-nan2

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504, China;
  2. Yunnan Provincial Key Laboratory of Media Convergence, Kunming 650228, China
  • Received: 2023-02-27; Revised: 2023-05-09; Accepted: 2024-08-25; Online: 2024-08-25; Published: 2024-09-02
  • Supported by: Yunnan Provincial Key Laboratory of Media Convergence Project (220235205)


Abstract: Text semantic similarity calculation is a crucial task in natural language processing, but current research mostly focuses on short texts rather than long texts. Compared with short texts, long texts are semantically rich, but their semantic information tends to be scattered. To address this, a feature extraction model is proposed to extract the main semantic information from a long text. The extracted semantic information is fed into a pre-trained BERT model using an overlapping sliding-window approach to obtain text vector representations. A bidirectional long short-term memory (BiLSTM) network then models the contextual semantic relationships within the long text, mapping it into a semantic space, and a linear layer further enhances the model's representation ability. Finally, the model is fine-tuned by maximizing the inner product of similar semantic vectors under a cross-entropy loss. Experimental results show that the model achieves F1 scores of 0.84 and 0.91 on the CNSE and CNSS datasets, respectively, outperforming the baseline models.
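
The PyTorch sketch below illustrates the pipeline the abstract describes: overlapping sliding windows over the long text, a pre-trained BERT encoder applied per window, a BiLSTM over the window vectors, and a linear projection into the semantic space, with the inner product of two document vectors trained under a cross-entropy loss. It is a minimal illustration based only on the abstract; the checkpoint name, window and stride sizes, hidden dimensions, and mean pooling over windows are all assumptions, not the authors' actual configuration.

```python
# Minimal sketch of the described pipeline (all hyper-parameters assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel


def sliding_windows(token_ids, window=510, stride=255):
    """Split a long token-id list into overlapping windows
    (overlap = window - stride), per the sliding-window step."""
    out, start = [], 0
    while start < len(token_ids):
        out.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
        start += stride
    return out


class LongTextEncoder(nn.Module):
    """BERT per window -> BiLSTM over window vectors -> linear projection."""

    def __init__(self, bert_name="bert-base-chinese", lstm_hidden=256, out_dim=128):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)  # assumed checkpoint
        # BiLSTM models the sequential relationship between window vectors.
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * lstm_hidden, out_dim)   # map into semantic space

    def forward(self, input_ids, attention_mask):
        # input_ids: (num_windows, seq_len) -- the windows of one document
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        cls_vecs = hidden[:, 0]                        # one [CLS] vector per window
        seq_out, _ = self.lstm(cls_vecs.unsqueeze(0))  # (1, num_windows, 2*hidden)
        return self.proj(seq_out.mean(dim=1)).squeeze(0)  # (out_dim,)


# Training signal: inner product of the two documents' semantic vectors,
# pushed up for similar pairs via binary cross-entropy (label 1.0 = similar).
def pair_loss(encoder, a_ids, a_mask, b_ids, b_mask, label):
    sim = torch.dot(encoder(a_ids, a_mask), encoder(b_ids, b_mask))
    return F.binary_cross_entropy_with_logits(sim, label)
```

Here label would be a scalar float tensor, e.g. torch.tensor(1.0) for a similar pair; batching and window padding are omitted for brevity.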


Key words: long text semantic similarity, feature extraction, pre-trained BERT model, semantic space