• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2008, Vol. 30 ›› Issue (6): 128-130.

• Paper

A Language-Independent Text Classification Method

陈林 杨丹   

  • Online:2008-06-01 Published:2010-05-19

Abstract:

This paper proposes a language-independent text classification method that requires no word segmentation. Unlike traditional text classification models, the method applies n-gram language models at the character level, so classification needs no word segmentation and avoids explicit feature selection and extensive preprocessing. We systematically study the key factors in the language model and their influence on classification results, and describe the evaluation method in detail. The method has been implemented for both Chinese and English, and experimental results show that it achieves good classification performance.

Key words: text classification, n-gram model, language
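
The abstract does not give the model details, so the following is only a minimal sketch of how character-level n-gram classification can work in general, not the authors' implementation. It assumes one character n-gram model per class, add-one smoothing, a uniform class prior, and a decision by highest log-likelihood. The class name `CharNgramClassifier`, the helper `char_ngrams`, the choice n = 2, and the toy training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Yield one character n-gram per character, padding the start with spaces."""
    padded = " " * (n - 1) + text
    for i in range(len(text)):
        yield padded[i:i + n]


class CharNgramClassifier:
    """Illustrative per-class character n-gram model (assumed design, add-one smoothing)."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)    # label -> n-gram counts
        self.context_counts = defaultdict(Counter)  # label -> (n-1)-character context counts
        self.vocab = set()                          # all characters seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.vocab.update(text)
            for gram in char_ngrams(text, self.n):
                self.ngram_counts[label][gram] += 1
                self.context_counts[label][gram[:-1]] += 1
        return self

    def log_likelihood(self, text, label):
        v = len(self.vocab) + 1  # one extra slot for characters never seen in training
        total = 0.0
        for gram in char_ngrams(text, self.n):
            numerator = self.ngram_counts[label][gram] + 1
            denominator = self.context_counts[label][gram[:-1]] + v
            total += math.log(numerator / denominator)
        return total

    def predict(self, text):
        # Uniform class prior assumed: pick the class whose model scores the text highest.
        return max(self.ngram_counts, key=lambda label: self.log_likelihood(text, label))


# Toy data (hypothetical): the same code handles English and Chinese without segmentation.
train_texts = [
    "the team won the football match",
    "球队赢得了足球比赛",
    "the bank raised interest rates",
    "银行提高了利率",
]
train_labels = ["sports", "sports", "finance", "finance"]

clf = CharNgramClassifier(n=2).fit(train_texts, train_labels)
print(clf.predict("足球比赛"))
print(clf.predict("interest rates"))
```

Because the model only ever sees characters, nothing in the sketch depends on where word boundaries fall, which is the property that removes the need for segmentation. The n-gram order and the smoothing scheme are the kind of factors that would need tuning on real data.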