• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学) ›› 2021, Vol. 43 ›› Issue (07): 1308-1315.

• Artificial Intelligence and Data Mining •

A text classification method combining word statistical characteristics and semantic information

ZHANG Li, MA Jing

  1. (School of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, Jiangsu, China)
  • Received: 2020-03-03 Revised: 2020-07-21 Accepted: 2021-07-25 Online: 2021-07-25 Published: 2021-08-17
  • Funding:
    National Natural Science Foundation of China (71373123); Fundamental Research Funds for the Central Universities, Forward-Looking Development Strategy Research Project (NW2018004)

Abstract: To better represent the semantic information of text and improve classification accuracy, this paper improves the feature weight calculation method and fuses a feature vector with a semantic vector for text representation. First, text features are extracted based on a text complex network. Second, statistical features of the network nodes are used to improve TF-IDF weighting, yielding the feature vector. Third, an LSTM is used to extract the semantic vector. Finally, the feature vector and the semantic vector are fused so that the new text representation is more discriminative. Experiments on online news data show that improving the feature weight calculation, which introduces semantic and structural information into the feature vector, and fusing the feature vector with the semantic vector further enrich the text information and improve text classification performance.


Key words: text classification, text complex network, feature weight, long short-term memory (LSTM)
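
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which node statistics, weighting formula, or fusion operator the authors use, so everything specific here is an assumption for illustration only: the network is a sliding-window word co-occurrence graph, the node statistic is degree, the weight is a smoothed TF-IDF scaled by `1 + degree / max_degree`, and fusion is plain concatenation. The function names (`cooccurrence_degrees`, `weighted_tfidf`, `fuse`) are hypothetical, and the semantic vectors are hard-coded stand-ins for what an LSTM would produce.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_degrees(tokens, window=2):
    """Build an undirected co-occurrence network; return each word's degree."""
    neighbors = defaultdict(set)
    for i, w in enumerate(tokens):
        # Link w to the next `window` tokens (assumed network construction).
        for v in tokens[i + 1 : i + window + 1]:
            if v != w:
                neighbors[w].add(v)
                neighbors[v].add(w)
    return {w: len(neighbors[w]) for w in set(tokens)}


def weighted_tfidf(docs, vocab, window=2):
    """Smoothed TF-IDF scaled by a degree-based factor (hypothetical weighting)."""
    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        deg = cooccurrence_degrees(doc, window)
        max_deg = max(deg.values(), default=0) or 1  # avoid division by zero
        vec = []
        for w in vocab:
            if w not in tf:
                vec.append(0.0)
                continue
            idf = math.log((1 + n_docs) / (1 + df[w])) + 1.0
            tfidf = (tf[w] / len(doc)) * idf
            vec.append(tfidf * (1.0 + deg[w] / max_deg))
        vectors.append(vec)
    return vectors


def fuse(feature_vec, semantic_vec):
    """Fuse the two representations by concatenation (assumed fusion operator)."""
    return list(feature_vec) + list(semantic_vec)


docs = [["stock", "market", "rises", "stock"], ["team", "wins", "match"]]
vocab = sorted({w for d in docs for w in d})
features = weighted_tfidf(docs, vocab)
semantic = [[0.1, 0.2], [0.3, 0.4]]  # stand-ins for LSTM-derived vectors
fused = [fuse(f, s) for f, s in zip(features, semantic)]
```

Each fused vector has one entry per vocabulary word plus the semantic dimensions, and could then be passed to any standard classifier. Concatenation keeps the statistical and semantic parts separable, which is one simple design choice among several the paper might have made.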