• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

Research on the optimization and visualization application of the CTM model in text classification

MA Chang-lin, YANG Zheng-liang, XIE Luo-di

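  1. (School of Computer, Central China Normal University, Wuhan 430079, Hubei, China)
  • Received: 2016-09-20 Revised: 2016-11-03 Published: 2017-03-25 Released: 2017-03-25
  • Funding:

    National Natural Science Foundation of China (61003192)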

Optimization and visualization application of CTM model in text classification

MA Chang-lin, YANG Zheng-liang, XIE Luo-di

  1. (School of Computer, Central China Normal University, Wuhan 430079, China)
  • Received: 2016-09-20 Revised: 2016-11-03 Online: 2017-03-25 Published: 2017-03-25

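Abstract:

Automatically extracting relevant information from massive volumes of text has become a major technical challenge. Text classification, an important approach to this problem, has attracted wide attention, and text representation is a key factor in classification performance. We therefore adopt the correlated topic model (CTM) for text representation, which preserves the completeness of the information while capturing the correlations between topics. Based on this model, we optimize the number of topics and the feature extraction: the optimal number of topics is determined by combining perplexity with the log-likelihood function, and a principal component analysis algorithm based on mutual information is introduced for optimal feature extraction, reducing data dimensionality and feature redundancy. The experiments are analyzed and visualized using the R language.

 

Keywords: text classification, CTM model, feature extraction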

Abstract:

Automatically extracting relevant information from massive amounts of text has become a major challenge. As an effective way to address this problem, text classification has attracted much attention, and text representation is a critical factor affecting classification results. We use the correlated topic model (CTM) for text representation, because it reflects the correlations between topics while preserving the integrity of the information. Based on this model, we optimize both the number of topics and feature selection. The optimal number of topics is determined by combining perplexity with the log-likelihood function. For feature selection, we adopt a principal component analysis algorithm based on mutual information, which reduces the data dimension and the redundancy of text features. The R language is used to visualize the experimental results.
 

Key words: text classification, CTM model, feature selection
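
The abstract does not say how perplexity and the log-likelihood function are combined when choosing the number of topics. The following minimal sketch assumes the common definition, perplexity = exp(-log-likelihood / number of tokens), and picks the candidate topic count with the lowest held-out perplexity. The function names `perplexity` and `select_num_topics` and the toy numbers are hypothetical and are not taken from the paper, which reports its experiments in R.

```python
import math


def perplexity(log_likelihood, n_tokens):
    """Per-token perplexity: exp(-log-likelihood / number of tokens)."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return math.exp(-log_likelihood / n_tokens)


def select_num_topics(heldout_loglik, n_tokens):
    """Return (best_k, scores), where best_k minimises perplexity.

    heldout_loglik maps each candidate topic count k to the held-out
    log-likelihood of a topic model fitted with k topics (assumed to be
    computed elsewhere, for example by a CTM implementation).
    """
    scores = {k: perplexity(ll, n_tokens) for k, ll in heldout_loglik.items()}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Toy values for illustration only; they are NOT results from the paper.
toy_loglik = {10: -7600.0, 20: -7300.0, 30: -7100.0, 40: -7250.0}
best_k, scores = select_num_topics(toy_loglik, n_tokens=1000)
print(best_k, round(scores[best_k], 2))
```

Under this definition, and with the same held-out token count for every candidate, the lowest perplexity and the highest log-likelihood select the same topic count, so the two criteria act as a consistency check on each other rather than as competing objectives.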