• Journal of the China Computer Federation
  • Core Journal of Chinese Science and Technology
  • Chinese Core Journal

Computer Engineering & Science ›› 2026, Vol. 48 ›› Issue (2): 363-371.

• Artificial Intelligence and Data Mining •

Article pair matching model based on multi-feature fusion of pre-trained language models

LU Shunyi, HE Qing

  1. (College of Big Data and Information Engineering, Guizhou University, Guiyang 550025, China)

  • Online: 2026-02-25 Published: 2026-03-10
  • Funding:
    National Natural Science Foundation of China (62166006); Guizhou Provincial Science and Technology Program (黔科合支撑[2023]一般093); Guizhou Provincial Natural Science Foundation (黔科合支撑[2023]一般 251)

Abstract: To address the difficulty that traditional text semantic matching methods have in capturing deep semantic features and interaction relationships between texts, this paper proposes MF-APM, an article pair matching model based on multi-feature fusion of pre-trained language models. First, a data augmentation strategy prunes each article's content so that only the key sentences are retained. Second, the augmented news documents are fed into a Longformer model with a Siamese network architecture to extract deep features of the article content, and document matching information is obtained through attention-based feature fusion. Third, BERT interactively encodes the news headlines, and the resulting encoded vectors are passed to a multi-head attention mechanism to extract deep interactive features, yielding headline interaction information. Finally, the semantic features of the headline interaction information and the document interaction information are fused by max-pooling feature fusion to predict the relationship between the text pair. In addition, PolyLoss replaces the traditional binary cross-entropy loss function during training, which reduces the complexity of hyperparameter tuning. MF-APM is compared with other matching models on two datasets, CNSE and CNSS. Experimental results show that, compared with the baseline models, MF-APM improves accuracy by at least 0.41 and 1.59 percentage points on CNSE and CNSS, respectively, and improves F1-score by at least 4.64 and 1.66 percentage points, effectively enhancing the accuracy of the article pair matching task.

Key words: pre-trained language model, long text matching, multi-head attention mechanism, attention feature fusion, PolyLoss function
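
The abstract does not state which PolyLoss variant MF-APM uses or how it is configured. As a minimal sketch, assuming the Poly-1 form (cross-entropy plus one extra term weighted by epsilon) applied to a binary match/no-match prediction, the loss could be computed as follows. The function names, the clipping constant, and the default epsilon are illustrative assumptions and are not taken from the paper.

```python
import math


def poly1_binary_loss(prob_match: float, label: int, epsilon: float = 1.0) -> float:
    """Poly-1 loss for a single binary example (assumed variant).

    prob_match: predicted probability that the article pair matches.
    label: 1 if the pair matches, 0 otherwise.
    epsilon: weight of the extra first-order term; 0 recovers cross-entropy.
    """
    # Clip the probability so that log() never receives 0.
    clip = 1e-12
    p = min(max(prob_match, clip), 1.0 - clip)

    # p_t is the probability assigned to the true class.
    p_t = p if label == 1 else 1.0 - p

    cross_entropy = -math.log(p_t)
    return cross_entropy + epsilon * (1.0 - p_t)


def mean_poly1_loss(probs, labels, epsilon: float = 1.0) -> float:
    """Average the Poly-1 loss over a batch of predictions."""
    losses = [poly1_binary_loss(p, y, epsilon) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)
```

With epsilon set to 0 the function reduces to ordinary binary cross-entropy, so under this assumption epsilon is the only additional hyperparameter relative to the loss it replaces.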