• Journal of the China Computer Federation (CCF)
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2024, Vol. 46 ›› Issue (08): 1513-1520.

• Artificial Intelligence and Data Mining •

Long text semantic similarity calculation combining hybrid feature extraction and deep learning

XU Jie1, SHAO Yu-bin1, DU Qing-zhi1, LONG Hua1,2, MA Di-nan2

  (1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504;
   2. Yunnan Provincial Key Laboratory of Media Convergence, Kunming 650228, China)
  • Received: 2023-02-27  Revised: 2023-05-09  Accepted: 2024-08-25  Online: 2024-08-25  Published: 2024-09-02

Abstract: Text semantic similarity calculation is a crucial task in natural language processing, but most existing similarity research focuses on short texts rather than long texts. Compared with short texts, long texts are semantically rich, but their semantic information tends to be scattered. To address this problem, a feature extraction method is proposed to extract the main semantic information from long texts. The extracted semantic information is then fed into a pre-trained BERT model using a sliding-window overlap approach to obtain text vector representations. A bidirectional long short-term memory (BiLSTM) network is then used to model the contextual semantic relationships of long texts and map them into a semantic space, and a linear layer is added to further enhance the model's representation ability. Finally, fine-tuning is performed by maximizing the inner product of semantically similar vectors and minimizing the cross-entropy loss. Experimental results show that this method achieves F1 scores of 0.84 and 0.91 on the CNSE and CNSS datasets, respectively, outperforming the baseline models.
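To make the described pipeline concrete, the following is a minimal sketch (not the authors' released code) of how sliding-window BERT encoding, a BiLSTM over window vectors, and a linear projection into a semantic space could be assembled with PyTorch and Hugging Face Transformers. The model name, window size, stride, and hidden dimensions are illustrative assumptions rather than values from the paper.

```python
# Hedged sketch of the pipeline described in the abstract: overlapping-window
# BERT encoding of a long text, a BiLSTM over the window vectors, and a linear
# projection into a semantic space. All hyperparameters are assumptions.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class LongTextEncoder(nn.Module):
    def __init__(self, model_name="bert-base-chinese",
                 window=256, stride=128, lstm_hidden=256, out_dim=256):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.bert = AutoModel.from_pretrained(model_name)
        self.window, self.stride = window, stride
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * lstm_hidden, out_dim)  # linear layer over BiLSTM output

    def forward(self, text: str) -> torch.Tensor:
        # Split the (already feature-extracted) long text into overlapping token windows.
        ids = self.tokenizer(text, add_special_tokens=False)["input_ids"]
        chunks = [ids[i:i + self.window] for i in range(0, max(len(ids), 1), self.stride)]
        cls_id, sep_id = self.tokenizer.cls_token_id, self.tokenizer.sep_token_id
        # Encode each window with BERT and keep its [CLS]-position vector.
        vecs = []
        for chunk in chunks:
            input_ids = torch.tensor([[cls_id] + chunk + [sep_id]])
            with torch.no_grad():
                out = self.bert(input_ids=input_ids)
            vecs.append(out.last_hidden_state[:, 0])       # (1, hidden)
        seq = torch.stack(vecs, dim=1)                      # (1, n_windows, hidden)
        # Model cross-window context with a BiLSTM, then project into the semantic space.
        lstm_out, _ = self.bilstm(seq)
        return self.proj(lstm_out.mean(dim=1)).squeeze(0)   # (out_dim,)

if __name__ == "__main__":
    encoder = LongTextEncoder()
    v1 = encoder("第一篇长新闻文本……")
    v2 = encoder("第二篇长新闻文本……")
    # Similarity as the normalized inner product of the two semantic vectors.
    print(torch.dot(v1 / v1.norm(), v2 / v2.norm()).item())
```

In a training setup, the similarity score from the inner product would feed a cross-entropy objective over same/different-event labels, matching the fine-tuning strategy mentioned in the abstract; that training loop is omitted here.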


Key words: long text semantic similarity, feature extraction, pre-trained BERT model, semantic space