J4 ›› 2008, Vol. 30 ›› Issue (6): 128-130.
• Articles •
陈林 杨丹
Abstract:
This paper proposes a language-independent approach to text classification that requires no word segmentation. Unlike traditional text classification models, the approach applies n-gram language models at the character level, so classification needs no word segmentation and avoids explicit feature selection and extensive preprocessing. We systematically study the key factors in the language model and their influence on classification results, and describe the evaluation method in detail. The method has been implemented for both Chinese and English and achieves good classification performance in both languages.
Key words: text classification, n-gram model, language
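
The idea summarized in the abstract can be illustrated with a minimal sketch: train one character-level n-gram model per class, then assign a document to the class whose model gives it the highest likelihood. The abstract does not state the n-gram order, the smoothing scheme, or the class priors used in the paper, so everything below is an assumption made for illustration: the class name `CharNgramClassifier`, bigrams (`n=2`), add-one smoothing, uniform class priors, and the toy training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class CharNgramClassifier:
    """One character-level n-gram model per class (illustrative sketch only)."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)    # label -> (context, char) counts
        self.context_counts = defaultdict(Counter)  # label -> context counts
        self.vocab = set()                          # characters seen in training

    def _ngrams(self, text):
        # Pad the start so the first characters also get a full context.
        padded = "\x02" * (self.n - 1) + text
        for i in range(len(text)):
            yield padded[i:i + self.n - 1], padded[i + self.n - 1]

    def fit(self, texts, labels):
        # Raw characters are counted directly: no word segmentation,
        # no feature selection, no language-specific preprocessing.
        for text, label in zip(texts, labels):
            self.vocab.update(text)
            for context, char in self._ngrams(text):
                self.ngram_counts[label][(context, char)] += 1
                self.context_counts[label][context] += 1
        return self

    def log_likelihood(self, text, label):
        # Add-one smoothing (an assumption); the extra +1 in the
        # denominator reserves probability mass for unseen characters.
        v = len(self.vocab) + 1
        total = 0.0
        for context, char in self._ngrams(text):
            num = self.ngram_counts[label][(context, char)] + 1
            den = self.context_counts[label][context] + v
            total += math.log(num / den)
        return total

    def predict(self, text):
        # Uniform class priors are assumed, so the best class is simply
        # the one whose model scores the text highest.
        return max(self.ngram_counts,
                   key=lambda label: self.log_likelihood(text, label))


# Toy data mixing English and Chinese; the same code handles both.
train_texts = [
    "the team won the football match",
    "球队赢得了足球比赛",
    "the bank raised interest rates",
    "银行提高了贷款利率",
]
train_labels = ["sports", "sports", "finance", "finance"]

clf = CharNgramClassifier(n=2).fit(train_texts, train_labels)
print(clf.predict("足球比赛"))
print(clf.predict("interest rates"))
```

Because the model only ever sees characters, the same code path serves a segmented language such as English and an unsegmented one such as Chinese. A realistic setup would need far more training data, and would likely need a higher n-gram order and a stronger smoothing method than add-one.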
陈林, 杨丹. A language-independent text classification method [J]. J4, 2008, 30(6): 128-130.
Link to this article: http://joces.nudt.edu.cn/CN/
http://joces.nudt.edu.cn/CN/Y2008/V30/I6/128