A Probabilistic Topic Model Based Noise Processing Method for Text Classification
Received date: 2009-01-13
Revised date: 2009-05-18
Online published: 2010-06-25
Funding
National Natural Science Foundation of China (60775037); National 863 Program of China (2009AA01Z123)
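Lin Yanggang (林洋港), Chen Enhong (陈恩红). A Probabilistic Topic Model Based Noise Processing Method for Text Classification[J]. Computer Engineering & Science (计算机工程与科学), 2010, 32(7): 89-92. DOI: 10.3969/j.issn.1007130X.2010.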
The performance of text classification depends directly on the quality of the training corpus. In practical applications, noise samples are unavoidable in the training corpus and degrade the effectiveness of text classification. To address this, a novel noise processing method based on a probabilistic topic model is proposed for text classification. In this method, noise samples are first filtered according to their class entropy. The data is then smoothed using the generative process of the topic model to further weaken the influence of noise samples, while the original size of the training corpus is preserved. Experimental results on real-world data show that the proposed method is robust to the distribution of noise samples and performs relatively well on data sets with a high noise ratio.
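The abstract does not say how class entropy is computed or which threshold is used, so the following is only a minimal sketch of the filtering step under stated assumptions. It assumes each training sample already has an estimated probability distribution over classes (for example, one derived from a topic model), takes class entropy to be the Shannon entropy of that distribution, and treats samples above a cutoff as noise. The names `class_entropy`, `filter_by_entropy`, the `threshold` value, and the sample data are all hypothetical and do not come from the paper.

```python
import math


def class_entropy(probs):
    """Shannon entropy (in bits) of one sample's class distribution.

    Assumption: `probs` is a list of class probabilities that sums to 1.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)


def filter_by_entropy(samples, class_probs, threshold):
    """Keep samples whose class entropy does not exceed `threshold`.

    Assumption: a high entropy means the sample is ambiguous across
    classes and is therefore treated as a likely noise sample.
    """
    kept = []
    for sample, probs in zip(samples, class_probs):
        if class_entropy(probs) <= threshold:
            kept.append(sample)
    return kept


# Hypothetical data: three documents with estimated class distributions.
docs = ["doc_a", "doc_b", "doc_c"]
posteriors = [
    [0.90, 0.05, 0.05],  # confident: low entropy
    [0.34, 0.33, 0.33],  # near uniform: high entropy
    [0.70, 0.20, 0.10],  # moderately confident
]

clean = filter_by_entropy(docs, posteriors, threshold=1.2)
print(clean)
```

With this illustrative threshold, the near-uniform document is dropped and the other two are kept. The smoothing step, which according to the abstract restores the original corpus size through the topic model's generative process, is not shown here because the abstract gives no detail about it.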