• An official journal of the China Computer Federation (CCF)
  • A Chinese Science and Technology Core Journal
  • A Chinese Core Journal

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (06): 1133-1140.

• Artificial Intelligence and Data Mining •

Tibetan long text classification by fusing denoising fine-tuning and graph attention mechanism

JING Rong1, WAN Fucheng1,2, HUANG Rui1, YU Hongzhi1,2, MA Ning1,2

  (1. Key Laboratory of Linguistic and Cultural Computing, Ministry of Education,
     Northwest Minzu University, Lanzhou 730030, China;
   2. Key Laboratory of Minzu Languages and Cultures Intelligent Information Processing
     of Gansu Province (Northwest Minzu University), Lanzhou 730030, China)
  • Received: 2024-08-27  Revised: 2024-09-06  Online: 2025-06-25  Published: 2025-06-26
  • Funding:
    National Natural Science Foundation of China (62366046); Science and Technology Program of Gansu Province (24JRRA154); Fundamental Research Funds of Northwest Minzu University (31920240102)

Abstract: In Tibetan long text classification tasks, the issue of long-distance dependencies is particularly prominent. Meanwhile, multilingual pre-trained models exhibit certain biases when handling Tibetan text classification. To address these issues, this paper proposes a Tibetan long text classification method based on the pre-trained language model CINO-Large that fuses denoising fine-tuning with a graph attention mechanism. First, the incomplete-trust loss function In-trust is introduced into CINO-Large to enhance the model's generalization on downstream tasks through a task-adaptive loss. Second, a sliding window and a linear classifier are introduced into graph construction to selectively add document-document edges, improving the feature distinguishability among nodes. Finally, a graph attention network (GAT) is used to capture the importance of different nodes in the graph and complete the Tibetan long text classification task. On the long news texts of the TNCC dataset, the proposed model achieves a classification accuracy of 71.66%. Compared with CINO-Large, its accuracy, precision, and F1 score are improved by 1.77%, 2.67%, and 2.03%, respectively, and on some hard-to-classify subcategories the F1 score improves by approximately 20%.
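
To make the denoising fine-tuning step more concrete, the sketch below shows one commonly cited formulation of an incomplete-trust (In-trust) style loss: a standard cross-entropy term plus a term that partially trusts the model's own prediction, L = α·L_CE + β·L_DCE with L_DCE = −Σ p·log(δ·p + (1−δ)·q), where p is the model distribution and q the (possibly noisy) one-hot label. This is a minimal illustrative sketch; the function name in_trust_loss, the hyperparameters alpha, beta, delta, and their default values are assumptions, not the exact configuration used with CINO-Large in this paper.

```python
import torch
import torch.nn.functional as F

def in_trust_loss(logits, labels, alpha=1.0, beta=1.0, delta=0.5):
    """Cross-entropy combined with a term that partially trusts the model's
    own prediction, intended to soften the impact of noisy labels."""
    ce = F.cross_entropy(logits, labels)                         # standard CE term
    p = F.softmax(logits, dim=-1)                                # model distribution p
    q = F.one_hot(labels, num_classes=logits.size(-1)).float()   # (possibly noisy) one-hot labels q
    mixed = delta * p + (1.0 - delta) * q                        # convex mix of prediction and label
    dce = -(p * torch.log(mixed.clamp_min(1e-12))).sum(dim=-1).mean()
    return alpha * ce + beta * dce

# Example: logits from a classification head (e.g., on top of CINO-Large),
# with 12 classes as an illustration of a TNCC-sized label set.
logits = torch.randn(4, 12, requires_grad=True)
labels = torch.tensor([0, 3, 7, 11])
loss = in_trust_loss(logits, labels)
loss.backward()
print(loss.item())
```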

Key words: pre-trained model, denoising fine-tuning, graph attention mechanism, Tibetan long text classification
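
As a further illustration of the graph-construction and attention steps summarized in the abstract, the snippet below sketches how sliding-window, classifier-filtered document-document edges and a graph attention layer might fit together. It is a hypothetical sketch only: the window size, the "same predicted label" criterion for keeping an edge, and the single-head GAT layer are assumptions standing in for the paper's actual edge-selection rule and multi-head configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_doc_edges(doc_feats, linear_clf, window=5):
    """Return an (N, N) adjacency matrix with self-loops; two documents inside
    the sliding window are connected only if a linear classifier assigns them
    the same coarse label (illustrative edge-selection criterion)."""
    preds = linear_clf(doc_feats).argmax(dim=-1)          # coarse labels from a linear head
    n = doc_feats.size(0)
    adj = torch.eye(n)
    for i in range(n):
        for j in range(i + 1, min(i + window, n)):
            if preds[i] == preds[j]:                      # selectively add a doc-doc edge
                adj[i, j] = adj[j, i] = 1.0
    return adj

class GATLayer(nn.Module):
    """Minimal single-head graph attention layer (in the style of Velickovic et al., 2018)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared node projection
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention scoring vector

    def forward(self, x, adj):
        h = self.W(x)                                     # (N, out_dim)
        n = h.size(0)
        # e_ij = LeakyReLU(a^T [h_i || h_j]) for every node pair
        hi = h.unsqueeze(1).expand(n, n, -1)
        hj = h.unsqueeze(0).expand(n, n, -1)
        e = F.leaky_relu(self.a(torch.cat([hi, hj], dim=-1)).squeeze(-1), 0.2)
        e = e.masked_fill(adj == 0, float("-inf"))        # attend only along graph edges
        alpha = torch.softmax(e, dim=-1)                  # attention weights over neighbours
        return F.elu(alpha @ h)                           # aggregated node representations

# Example: 8 documents with 1024-dim encoder features (CINO-Large hidden size),
# 12 output classes as an illustration.
feats = torch.randn(8, 1024)
clf = nn.Linear(1024, 12)
adj = build_doc_edges(feats, clf, window=4)
out = GATLayer(1024, 12)(feats, adj)                      # per-document class scores
print(out.shape)                                          # torch.Size([8, 12])
```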