• Official journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2024, Vol. 46 ›› Issue (04): 635-646.

• Computer Network and Information Security •

Semi-supervised website topic classification based on heterogeneous graph neural network

WANG Xie-zhong1, CHEN Xu1, JING Yong-jun1, WANG Shu-yang2

  1. (1. School of Computer Science and Engineering, North Minzu University, Yinchuan 750000;
    2. School of Electrical and Information Engineering, North Minzu University, Yinchuan 750000, China)
  • Received: 2023-09-06 Revised: 2023-10-17 Accepted: 2024-04-25 Online: 2024-04-25 Published: 2024-04-18
  • Supported by:
    Key Research and Development Program of Ningxia Hui Autonomous Region (2023BDE02017); Fundamental Research Funds for the Central Universities, North Minzu University (2022PT_S04)

Abstract: The rapid growth in the number of Internet websites has made it difficult for existing methods to classify the topics of specific websites accurately. For example, URL-based methods cannot handle topic information that is not reflected in the URL, while methods based on web page content are limited by data sparsity and by the difficulty of capturing semantic relationships. To address this, HGNN-SWT, a semi-supervised website topic classification method based on a heterogeneous graph neural network, is proposed. The method not only uses website text features to compensate for the shortcomings of relying on URL features alone, but also uses a heterogeneous graph to model the sparse relationships between website texts and words, improving classification performance by processing the node and edge relationships in the graph. In addition, a neighbor node sampling method based on random walks is introduced, which takes into account both the local features of nodes and the global graph structure, and a feature fusion strategy is proposed to capture contextual relationships and feature interactions in website text data. Experiments on a self-built Chinaz Website dataset show that HGNN-SWT achieves higher accuracy than existing methods on the website topic classification task.

Key words: website topic, heterogeneous graph neural network, semi-supervised, feature fusion
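
The abstract does not spell out how the heterogeneous graph is built or how neighbors are sampled, so the following is only a minimal sketch of the general idea, not the HGNN-SWT implementation. It assumes a bipartite graph with two node types (`site` and `word`), an edge whenever a word occurs in a site's text, and a random walk with restart that keeps the most frequently visited neighbors of each type. The function names (`build_graph`, `sample_neighbors`), the parameters, and the toy data are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def build_graph(docs):
    """Build an undirected site-word graph from {site_id: [tokens]}.

    Nodes are (type, name) tuples; this two-type layout is an assumption.
    """
    adj = defaultdict(set)
    for site, tokens in docs.items():
        for tok in tokens:
            s, w = ("site", site), ("word", tok)
            adj[s].add(w)
            adj[w].add(s)
    # Sorted lists keep the sampling reproducible for a fixed seed.
    return {node: sorted(nbrs) for node, nbrs in adj.items()}


def sample_neighbors(graph, start, num_walks=20, walk_len=10,
                     restart=0.3, top_k=3, seed=0):
    """Random walk with restart from `start`.

    Returns, for each node type, up to `top_k` most-visited nodes.
    """
    if start not in graph:
        raise KeyError(f"unknown node: {start!r}")
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(num_walks):
        cur = start
        for _ in range(walk_len):
            if rng.random() < restart:
                cur = start  # jump back to the start node
                continue
            cur = rng.choice(graph[cur])
            if cur != start:
                visits[cur] += 1
    by_type = defaultdict(list)
    for node, _count in visits.most_common():
        if len(by_type[node[0]]) < top_k:
            by_type[node[0]].append(node)
    return dict(by_type)


# Toy data (hypothetical): tokenized text of three websites.
docs = {
    "news.example": ["news", "sports", "finance"],
    "shop.example": ["sale", "cart", "finance"],
    "blog.example": ["news", "travel"],
}
graph = build_graph(docs)
neighbors = sample_neighbors(graph, ("site", "news.example"))
print(neighbors)
```

Short walks mostly reach the start node's own words (local features), while longer walks pass through shared words to other sites (global structure). Grouping the sampled nodes by type is one common way to prepare per-type inputs for a later aggregation or fusion step.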